Linux-XFS Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH 00/24] xfs: rework inode flushing to make inode reclaim fully asynchronous
@ 2020-05-22  3:50 Dave Chinner
  2020-05-22  3:50 ` [PATCH 01/24] xfs: remove logged flag from inode log item Dave Chinner
                   ` (25 more replies)
  0 siblings, 26 replies; 91+ messages in thread
From: Dave Chinner @ 2020-05-22  3:50 UTC (permalink / raw)
  To: linux-xfs

Hi folks,

Inode flushing requires that we first lock an inode, then check it,
then lock the underlying buffer, flush the inode to the buffer and
finally add the inode to the buffer to be unlocked on IO completion.
We then walk all the other cached inodes in the buffer range and
optimistically lock and flush them to the buffer without blocking.

This cluster write effectively repeats the same code we do with the
initial inode, except now it has to special case that initial inode
that is already locked. Hence we have multiple copies of very
similar code, and it is a result of inode cluster flushing being
based on a specific inode rather than grabbing the buffer and
flushing all available inodes to it.

The problem with this at the moment is that we we can't look up the
buffer until we have guaranteed that an inode is held exclusively
and it's not going away while we get the buffer through an imap
lookup. Hence we are kinda stuck locking an inode before we can look
up the buffer.

This is also a result of inodes being detached from the cluster
buffer except when IO is being done. This has the further problem
that the cluster buffer can be reclaimed from memory and then the
inode can be dirtied. At this point cleaning the inode requires a
read-modify-write cycle on the cluster buffer. If we then are put
under memory pressure, cleaning that dirty inode to reclaim it
requires allocating memory for the cluster buffer and this leads to
all sorts of problems.

We used synchronous inode writeback in reclaim as a throttle that
provided a forwards progress mechanism when RMW cycles were required
to clean inodes. Async writeback of inodes (e.g. via the AIL) would
immediately exhaust remaining memory reserves trying to allocate
inode cluster after inode cluster. The synchronous writeback of an
inode cluster allowed reclaim to release the inode cluster and have
it freed almost immediately which could then be used to allocate the
next inode cluster buffer. Hence the IO based throttling mechanism
largely guaranteed forwards progress in inode reclaim. By removing
the requirement for require memory allocation for inode writeback
filesystem level, we can issue writeback asynchrnously and not have
to worry about the memory exhaustion anymore.

Another issue is that if we have slow disks, we can build up dirty
inodes in memory that can then take hours for an operation like
unmount to flush. A RMW cycle per inode on a slow RAID6 device can
mean we only clean 50 inodes a second, and when there are hundreds
of thousands of dirty inodes that need to be cleaned this can take a
long time. PInning the cluster buffers will greatly speed up inode
writeback on slow storage systems like this.

These limitations all stem from the same source: inode writeback is
inode centric, And they are largely solved by the same architectural
change: make inode writeback cluster buffer centric.  This series is
makes that architectural change.

Firstly, we start by pinning the inode backing buffer in memory
when an inode is marked dirty (i.e. when it is logged). By tracking
the number of dirty inodes on a buffer as a counter rather than a
flag, we avoid the problem of overlapping inode dirtying and buffer
flushing racing to set/clear the dirty flag. Hence as long as there
is a dirty inode in memory, the buffer will not be able to be
reclaimed. We can safely do this inode cluster buffer lookup when we
dirty an inode as we do not hold the buffer locked - we merely take
a reference to it and then release it - and hence we don't cause any
new lock order issues.

When the inode is finally cleaned, the reference to the buffer can
be removed from the inode log item and the buffer released. This is
done from the inode completion callbacks that are attached to the
buffer when the inode is flushed.

Pinning the cluster buffer in this way immediately avoids the RMW
problem in inode writeback and reclaim contexts by moving the memory
allocation and the blocking buffer read into the transaction context
that dirties the inode.  This inverts our dirty inode throttling
mechanism - we now throttle the rate at which we can dirty inodes to
rate at which we can allocate memory and read inode cluster buffers
into memory rather than via throttling reclaim to rate at which we
can clean dirty inodes.

Hence if we are under memory pressure, we'll block on memory
allocation when trying to dirty the referenced inode, rather than in
the memory reclaim path where we are trying to clean unreferenced
inodes to free memory.  Hence we no longer have to guarantee
forwards progress in inode reclaim as we aren't doing memory
allocation, and that means we can remove inode writeback from the
XFS inode shrinker completely without changing the system tolerance
for low memory operation.

Tracking the buffers via the inode log item also allows us to
completely rework the inode flushing mechanism. While the inode log
item is in the AIL, it is safe for the AIL to access any member of
the log item. Hence the AIL push mechanisms can access the buffer
attached to the inode without first having to lock the inode.

This means we can essentially lock the buffer directly and then
call xfs_iflush_cluster() without first going through xfs_iflush()
to find the buffer. Hence we can remove xfs_iflush() altogether,
because the two places that call it - the inode item push code and
inode reclaim - no longer need to flush inodes directly.

This can be further optimised by attaching the inode to the cluster
buffer when the inode is dirtied. i.e. when we add the buffer
reference to the inode log item, we also attach the inode to the
buffer for IO processing. This leads to the dirty inodes always
being attached to the buffer and hence we no longer need to add them
when we flush the inode and remove them when IO completes. Instead
the inodes are attached when the node log item is dirtied, and
removed when the inode log item is cleaned.

With this structure in place, we no longer need to do
lookups to find the dirty inodes in the cache to attach to the
buffer in xfs_iflush_cluster() - they are already attached to the
buffer. Hence when the AIL pushes an inode, we just grab the buffer
from the log item, and then walk the buffer log item list to lock
and flush the dirty inodes attached to the buffer.

This greatly simplifies inode writeback, and removes another memory
allocation from the inode writeback path (the array used for the
radix tree gang lookup). And while the radix tree lookups are fast,
walking the linked list of dirty inodes is faster.

There is followup work I am doing that uses the inode cluster buffer
as a replacement in the AIL for tracking dirty inodes. This part of
the series is not ready yet as it has some intricate locking
requirements. That is an optimisation, so I've left that out because
solving the inode reclaim blocking problems is the important part of
this work.

In short, this series simplifies inode writeback and fixes the long
standing inode reclaim blocking issues without requiring any changes
to the memory reclaim infrastructure.

Note: dquots should probably be converted to cluster flushing in a
similar way, as they have many of the same issues as inode flushing.

Thoughts, comments and improvemnts welcome.

-Dave.



Dave Chinner (24):
  xfs: remove logged flag from inode log item
  xfs: add an inode item lock
  xfs: mark inode buffers in cache
  xfs: mark dquot buffers in cache
  xfs: mark log recovery buffers for completion
  xfs: call xfs_buf_iodone directly
  xfs: clean up whacky buffer log item list reinit
  xfs: fold xfs_istale_done into xfs_iflush_done
  xfs: use direct calls for dquot IO completion
  xfs: clean up the buffer iodone callback functions
  xfs: get rid of log item callbacks
  xfs: pin inode backing buffer to the inode log item
  xfs: make inode reclaim almost non-blocking
  xfs: remove IO submission from xfs_reclaim_inode()
  xfs: allow multiple reclaimers per AG
  xfs: don't block inode reclaim on the ILOCK
  xfs: remove SYNC_TRYLOCK from inode reclaim
  xfs: clean up inode reclaim comments
  xfs: attach inodes to the cluster buffer when dirtied
  xfs: xfs_iflush() is no longer necessary
  xfs: rename xfs_iflush_int()
  xfs: rework xfs_iflush_cluster() dirty inode iteration
  xfs: factor xfs_iflush_done
  xfs: remove xfs_inobp_check()

 fs/xfs/libxfs/xfs_inode_buf.c   |  27 +-
 fs/xfs/libxfs/xfs_inode_buf.h   |   6 -
 fs/xfs/libxfs/xfs_trans_inode.c | 108 +++++--
 fs/xfs/xfs_buf.c                |  44 ++-
 fs/xfs/xfs_buf.h                |  49 +--
 fs/xfs/xfs_buf_item.c           | 205 +++++--------
 fs/xfs/xfs_buf_item.h           |   8 +-
 fs/xfs/xfs_buf_item_recover.c   |   5 +-
 fs/xfs/xfs_dquot.c              |  32 +-
 fs/xfs/xfs_dquot.h              |   1 +
 fs/xfs/xfs_dquot_item_recover.c |   2 +-
 fs/xfs/xfs_file.c               |   9 +-
 fs/xfs/xfs_icache.c             | 293 +++++-------------
 fs/xfs/xfs_inode.c              | 515 +++++++++++---------------------
 fs/xfs/xfs_inode.h              |   2 +-
 fs/xfs/xfs_inode_item.c         | 281 ++++++++---------
 fs/xfs/xfs_inode_item.h         |   9 +-
 fs/xfs/xfs_inode_item_recover.c |   2 +-
 fs/xfs/xfs_log_recover.c        |   5 +-
 fs/xfs/xfs_mount.c              |   4 -
 fs/xfs/xfs_mount.h              |   1 -
 fs/xfs/xfs_trans.h              |   3 -
 fs/xfs/xfs_trans_buf.c          |  15 +-
 fs/xfs/xfs_trans_priv.h         |  12 +-
 24 files changed, 680 insertions(+), 958 deletions(-)

-- 
2.26.2.761.g0e0b3e54be


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [PATCH 01/24] xfs: remove logged flag from inode log item
  2020-05-22  3:50 [PATCH 00/24] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
@ 2020-05-22  3:50 ` Dave Chinner
  2020-05-22  7:25   ` Christoph Hellwig
  2020-05-22 21:13   ` Darrick J. Wong
  2020-05-22  3:50 ` [PATCH 02/24] xfs: add an inode item lock Dave Chinner
                   ` (24 subsequent siblings)
  25 siblings, 2 replies; 91+ messages in thread
From: Dave Chinner @ 2020-05-22  3:50 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

This was used to track if the item had logged fields being flushed
to disk. We log everything in the inode these days, so this logic is
no longer needed. Remove it.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_inode.c      | 13 ++++---------
 fs/xfs/xfs_inode_item.c | 35 ++++++++++-------------------------
 fs/xfs/xfs_inode_item.h |  1 -
 3 files changed, 14 insertions(+), 35 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 64f5f9a440aed..ca9f2688b745d 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2658,7 +2658,6 @@ xfs_ifree_cluster(
 		list_for_each_entry(lip, &bp->b_li_list, li_bio_list) {
 			if (lip->li_type == XFS_LI_INODE) {
 				iip = (struct xfs_inode_log_item *)lip;
-				ASSERT(iip->ili_logged == 1);
 				lip->li_cb = xfs_istale_done;
 				xfs_trans_ail_copy_lsn(mp->m_ail,
 							&iip->ili_flush_lsn,
@@ -2687,7 +2686,6 @@ xfs_ifree_cluster(
 			iip->ili_last_fields = iip->ili_fields;
 			iip->ili_fields = 0;
 			iip->ili_fsync_fields = 0;
-			iip->ili_logged = 1;
 			xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,
 						&iip->ili_item.li_lsn);
 
@@ -3819,19 +3817,16 @@ xfs_iflush_int(
 	 *
 	 * We can play with the ili_fields bits here, because the inode lock
 	 * must be held exclusively in order to set bits there and the flush
-	 * lock protects the ili_last_fields bits.  Set ili_logged so the flush
-	 * done routine can tell whether or not to look in the AIL.  Also, store
-	 * the current LSN of the inode so that we can tell whether the item has
-	 * moved in the AIL from xfs_iflush_done().  In order to read the lsn we
-	 * need the AIL lock, because it is a 64 bit value that cannot be read
-	 * atomically.
+	 * lock protects the ili_last_fields bits.  Store the current LSN of the
+	 * inode so that we can tell whether the item has moved in the AIL from
+	 * xfs_iflush_done().  In order to read the lsn we need the AIL lock,
+	 * because it is a 64 bit value that cannot be read atomically.
 	 */
 	error = 0;
 flush_out:
 	iip->ili_last_fields = iip->ili_fields;
 	iip->ili_fields = 0;
 	iip->ili_fsync_fields = 0;
-	iip->ili_logged = 1;
 
 	xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,
 				&iip->ili_item.li_lsn);
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index ba47bf65b772b..b17384aa8df40 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -528,8 +528,6 @@ xfs_inode_item_push(
 	}
 
 	ASSERT(iip->ili_fields != 0 || XFS_FORCED_SHUTDOWN(ip->i_mount));
-	ASSERT(iip->ili_logged == 0 || XFS_FORCED_SHUTDOWN(ip->i_mount));
-
 	spin_unlock(&lip->li_ailp->ail_lock);
 
 	error = xfs_iflush(ip, &bp);
@@ -690,30 +688,24 @@ xfs_iflush_done(
 			continue;
 
 		list_move_tail(&blip->li_bio_list, &tmp);
-		/*
-		 * while we have the item, do the unlocked check for needing
-		 * the AIL lock.
-		 */
+
+		/* Do an unlocked check for needing the AIL lock. */
 		iip = INODE_ITEM(blip);
-		if ((iip->ili_logged && blip->li_lsn == iip->ili_flush_lsn) ||
+		if (blip->li_lsn == iip->ili_flush_lsn ||
 		    test_bit(XFS_LI_FAILED, &blip->li_flags))
 			need_ail++;
 	}
 
 	/* make sure we capture the state of the initial inode. */
 	iip = INODE_ITEM(lip);
-	if ((iip->ili_logged && lip->li_lsn == iip->ili_flush_lsn) ||
+	if (lip->li_lsn == iip->ili_flush_lsn ||
 	    test_bit(XFS_LI_FAILED, &lip->li_flags))
 		need_ail++;
 
 	/*
-	 * We only want to pull the item from the AIL if it is
-	 * actually there and its location in the log has not
-	 * changed since we started the flush.  Thus, we only bother
-	 * if the ili_logged flag is set and the inode's lsn has not
-	 * changed.  First we check the lsn outside
-	 * the lock since it's cheaper, and then we recheck while
-	 * holding the lock before removing the inode from the AIL.
+	 * We only want to pull the item from the AIL if it is actually there
+	 * and its location in the log has not changed since we started the
+	 * flush.  Thus, we only bother if the inode's lsn has not changed.
 	 */
 	if (need_ail) {
 		xfs_lsn_t	tail_lsn = 0;
@@ -721,8 +713,7 @@ xfs_iflush_done(
 		/* this is an opencoded batch version of xfs_trans_ail_delete */
 		spin_lock(&ailp->ail_lock);
 		list_for_each_entry(blip, &tmp, li_bio_list) {
-			if (INODE_ITEM(blip)->ili_logged &&
-			    blip->li_lsn == INODE_ITEM(blip)->ili_flush_lsn) {
+			if (blip->li_lsn == INODE_ITEM(blip)->ili_flush_lsn) {
 				/*
 				 * xfs_ail_update_finish() only cares about the
 				 * lsn of the first tail item removed, any
@@ -740,14 +731,13 @@ xfs_iflush_done(
 	}
 
 	/*
-	 * clean up and unlock the flush lock now we are done. We can clear the
+	 * Clean up and unlock the flush lock now we are done. We can clear the
 	 * ili_last_fields bits now that we know that the data corresponding to
 	 * them is safely on disk.
 	 */
 	list_for_each_entry_safe(blip, n, &tmp, li_bio_list) {
 		list_del_init(&blip->li_bio_list);
 		iip = INODE_ITEM(blip);
-		iip->ili_logged = 0;
 		iip->ili_last_fields = 0;
 		xfs_ifunlock(iip->ili_inode);
 	}
@@ -768,16 +758,11 @@ xfs_iflush_abort(
 
 	if (iip) {
 		xfs_trans_ail_delete(&iip->ili_item, 0);
-		iip->ili_logged = 0;
-		/*
-		 * Clear the ili_last_fields bits now that we know that the
-		 * data corresponding to them is safely on disk.
-		 */
-		iip->ili_last_fields = 0;
 		/*
 		 * Clear the inode logging fields so no more flushes are
 		 * attempted.
 		 */
+		iip->ili_last_fields = 0;
 		iip->ili_fields = 0;
 		iip->ili_fsync_fields = 0;
 	}
diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h
index 60b34bb66e8ed..4de5070e07655 100644
--- a/fs/xfs/xfs_inode_item.h
+++ b/fs/xfs/xfs_inode_item.h
@@ -19,7 +19,6 @@ struct xfs_inode_log_item {
 	xfs_lsn_t		ili_flush_lsn;	   /* lsn at last flush */
 	xfs_lsn_t		ili_last_lsn;	   /* lsn at last transaction */
 	unsigned short		ili_lock_flags;	   /* lock flags */
-	unsigned short		ili_logged;	   /* flushed logged data */
 	unsigned int		ili_last_fields;   /* fields when flushed */
 	unsigned int		ili_fields;	   /* fields to be logged */
 	unsigned int		ili_fsync_fields;  /* logged since last fsync */
-- 
2.26.2.761.g0e0b3e54be


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [PATCH 02/24] xfs: add an inode item lock
  2020-05-22  3:50 [PATCH 00/24] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
  2020-05-22  3:50 ` [PATCH 01/24] xfs: remove logged flag from inode log item Dave Chinner
@ 2020-05-22  3:50 ` Dave Chinner
  2020-05-22  6:45   ` Amir Goldstein
                     ` (2 more replies)
  2020-05-22  3:50 ` [PATCH 03/24] xfs: mark inode buffers in cache Dave Chinner
                   ` (23 subsequent siblings)
  25 siblings, 3 replies; 91+ messages in thread
From: Dave Chinner @ 2020-05-22  3:50 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

The inode log item is kind of special in that it can be aggregating
new changes in memory at the same time time existing changes are
being written back to disk. This means there are fields in the log
item that are accessed concurrently from contexts that don't share
any locking at all.

e.g. updating ili_last_fields occurs at flush time under the
ILOCK_EXCL and flush lock at flush time, under the flush lock at IO
completion time, and is read under the ILOCK_EXCL when the inode is
logged.  Hence there is no actual serialisation between reading the
field during logging of the inode in transactions vs clearing the
field in IO completion.

We currently get away with this by the fact that we are only
clearing fields in IO completion, and nothing bad happens if we
accidentally log more of the inode than we actually modify. Worst
case is we consume a tiny bit more memory and log bandwidth.

However, if we want to do more complex state manipulations on the
log item that requires updates at all three of these potential
locations, we need to have some mechanism of serialising those
operations. To do this, introduce a spinlock into the log item to
serialise internal state.

This could be done via the xfs_inode i_flags_lock, but this then
leads to potential lock inversion issues where inode flag updates
need to occur inside locks that best nest inside the inode log item
locks (e.g. marking inodes stale during inode cluster freeing).
Using a separate spinlock avoids these sorts of problems and
simplifies future code.

This does not touch the use of ili_fields in the item formatting
code - that is entirely protected by the ILOCK_EXCL at this point in
time, so it remains untouched.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_trans_inode.c | 44 ++++++++++++++++++---------------
 fs/xfs/xfs_file.c               |  9 ++++---
 fs/xfs/xfs_inode.c              | 20 +++++++++------
 fs/xfs/xfs_inode_item.c         |  7 ++++++
 fs/xfs/xfs_inode_item.h         |  3 ++-
 5 files changed, 51 insertions(+), 32 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
index b5dfb66548422..510b996008221 100644
--- a/fs/xfs/libxfs/xfs_trans_inode.c
+++ b/fs/xfs/libxfs/xfs_trans_inode.c
@@ -81,15 +81,19 @@ xfs_trans_ichgtime(
  */
 void
 xfs_trans_log_inode(
-	xfs_trans_t	*tp,
-	xfs_inode_t	*ip,
-	uint		flags)
+	struct xfs_trans	*tp,
+	struct xfs_inode	*ip,
+	uint			flags)
 {
-	struct inode	*inode = VFS_I(ip);
+	struct xfs_inode_log_item *iip = ip->i_itemp;
+	struct inode		*inode = VFS_I(ip);
+	uint			iversion_flags = 0;
 
 	ASSERT(ip->i_itemp != NULL);
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
 
+	tp->t_flags |= XFS_TRANS_DIRTY;
+
 	/*
 	 * Don't bother with i_lock for the I_DIRTY_TIME check here, as races
 	 * don't matter - we either will need an extra transaction in 24 hours
@@ -102,15 +106,6 @@ xfs_trans_log_inode(
 		spin_unlock(&inode->i_lock);
 	}
 
-	/*
-	 * Record the specific change for fdatasync optimisation. This
-	 * allows fdatasync to skip log forces for inodes that are only
-	 * timestamp dirty. We do this before the change count so that
-	 * the core being logged in this case does not impact on fdatasync
-	 * behaviour.
-	 */
-	ip->i_itemp->ili_fsync_fields |= flags;
-
 	/*
 	 * First time we log the inode in a transaction, bump the inode change
 	 * counter if it is configured for this to occur. While we have the
@@ -120,13 +115,21 @@ xfs_trans_log_inode(
 	 * set however, then go ahead and bump the i_version counter
 	 * unconditionally.
 	 */
-	if (!test_and_set_bit(XFS_LI_DIRTY, &ip->i_itemp->ili_item.li_flags) &&
-	    IS_I_VERSION(VFS_I(ip))) {
-		if (inode_maybe_inc_iversion(VFS_I(ip), flags & XFS_ILOG_CORE))
-			flags |= XFS_ILOG_CORE;
+	if (!test_and_set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags) &&
+	    IS_I_VERSION(inode)) {
+		if (inode_maybe_inc_iversion(inode, flags & XFS_ILOG_CORE))
+			iversion_flags = XFS_ILOG_CORE;
 	}
 
-	tp->t_flags |= XFS_TRANS_DIRTY;
+	/*
+	 * Record the specific change for fdatasync optimisation. This
+	 * allows fdatasync to skip log forces for inodes that are only
+	 * timestamp dirty. We do this before the change count so that
+	 * the core being logged in this case does not impact on fdatasync
+	 * behaviour.
+	 */
+	spin_lock(&iip->ili_lock);
+	iip->ili_fsync_fields |= flags;
 
 	/*
 	 * Always OR in the bits from the ili_last_fields field.
@@ -135,8 +138,9 @@ xfs_trans_log_inode(
 	 * See the big comment in xfs_iflush() for an explanation of
 	 * this coordination mechanism.
 	 */
-	flags |= ip->i_itemp->ili_last_fields;
-	ip->i_itemp->ili_fields |= flags;
+	flags |= iip->ili_last_fields | iversion_flags;
+	iip->ili_fields |= flags;
+	spin_unlock(&iip->ili_lock);
 }
 
 int
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 403c90309a8ff..0abf770b77498 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -94,6 +94,7 @@ xfs_file_fsync(
 {
 	struct inode		*inode = file->f_mapping->host;
 	struct xfs_inode	*ip = XFS_I(inode);
+	struct xfs_inode_log_item *iip = ip->i_itemp;
 	struct xfs_mount	*mp = ip->i_mount;
 	int			error = 0;
 	int			log_flushed = 0;
@@ -137,13 +138,15 @@ xfs_file_fsync(
 	xfs_ilock(ip, XFS_ILOCK_SHARED);
 	if (xfs_ipincount(ip)) {
 		if (!datasync ||
-		    (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
-			lsn = ip->i_itemp->ili_last_lsn;
+		    (iip->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
+			lsn = iip->ili_last_lsn;
 	}
 
 	if (lsn) {
 		error = xfs_log_force_lsn(mp, lsn, XFS_LOG_SYNC, &log_flushed);
-		ip->i_itemp->ili_fsync_fields = 0;
+		spin_lock(&iip->ili_lock);
+		iip->ili_fsync_fields = 0;
+		spin_unlock(&iip->ili_lock);
 	}
 	xfs_iunlock(ip, XFS_ILOCK_SHARED);
 
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index ca9f2688b745d..57781c0dbbec5 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2683,9 +2683,11 @@ xfs_ifree_cluster(
 				continue;
 
 			iip = ip->i_itemp;
+			spin_lock(&iip->ili_lock);
 			iip->ili_last_fields = iip->ili_fields;
 			iip->ili_fields = 0;
 			iip->ili_fsync_fields = 0;
+			spin_unlock(&iip->ili_lock);
 			xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,
 						&iip->ili_item.li_lsn);
 
@@ -2721,6 +2723,7 @@ xfs_ifree(
 {
 	int			error;
 	struct xfs_icluster	xic = { 0 };
+	struct xfs_inode_log_item *iip = ip->i_itemp;
 
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
 	ASSERT(VFS_I(ip)->i_nlink == 0);
@@ -2758,7 +2761,9 @@ xfs_ifree(
 	ip->i_df.if_format = XFS_DINODE_FMT_EXTENTS;
 
 	/* Don't attempt to replay owner changes for a deleted inode */
-	ip->i_itemp->ili_fields &= ~(XFS_ILOG_AOWNER|XFS_ILOG_DOWNER);
+	spin_lock(&iip->ili_lock);
+	iip->ili_fields &= ~(XFS_ILOG_AOWNER|XFS_ILOG_DOWNER);
+	spin_unlock(&iip->ili_lock);
 
 	/*
 	 * Bump the generation count so no one will be confused
@@ -3814,20 +3819,19 @@ xfs_iflush_int(
 	 * know that the information those bits represent is permanently on
 	 * disk.  As long as the flush completes before the inode is logged
 	 * again, then both ili_fields and ili_last_fields will be cleared.
-	 *
-	 * We can play with the ili_fields bits here, because the inode lock
-	 * must be held exclusively in order to set bits there and the flush
-	 * lock protects the ili_last_fields bits.  Store the current LSN of the
-	 * inode so that we can tell whether the item has moved in the AIL from
-	 * xfs_iflush_done().  In order to read the lsn we need the AIL lock,
-	 * because it is a 64 bit value that cannot be read atomically.
 	 */
 	error = 0;
 flush_out:
+	spin_lock(&iip->ili_lock);
 	iip->ili_last_fields = iip->ili_fields;
 	iip->ili_fields = 0;
 	iip->ili_fsync_fields = 0;
+	spin_unlock(&iip->ili_lock);
 
+	/*
+	 * Store the current LSN of the inode so that we can tell whether the
+	 * item has moved in the AIL from xfs_iflush_done().
+	 */
 	xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,
 				&iip->ili_item.li_lsn);
 
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index b17384aa8df40..6ef9cbcfc94a7 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -637,6 +637,7 @@ xfs_inode_item_init(
 	iip = ip->i_itemp = kmem_zone_zalloc(xfs_ili_zone, 0);
 
 	iip->ili_inode = ip;
+	spin_lock_init(&iip->ili_lock);
 	xfs_log_item_init(mp, &iip->ili_item, XFS_LI_INODE,
 						&xfs_inode_item_ops);
 }
@@ -738,7 +739,11 @@ xfs_iflush_done(
 	list_for_each_entry_safe(blip, n, &tmp, li_bio_list) {
 		list_del_init(&blip->li_bio_list);
 		iip = INODE_ITEM(blip);
+
+		spin_lock(&iip->ili_lock);
 		iip->ili_last_fields = 0;
+		spin_unlock(&iip->ili_lock);
+
 		xfs_ifunlock(iip->ili_inode);
 	}
 	list_del(&tmp);
@@ -762,9 +767,11 @@ xfs_iflush_abort(
 		 * Clear the inode logging fields so no more flushes are
 		 * attempted.
 		 */
+		spin_lock(&iip->ili_lock);
 		iip->ili_last_fields = 0;
 		iip->ili_fields = 0;
 		iip->ili_fsync_fields = 0;
+		spin_unlock(&iip->ili_lock);
 	}
 	/*
 	 * Release the inode's flush lock since we're done with it.
diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h
index 4de5070e07655..1234e8cd3726d 100644
--- a/fs/xfs/xfs_inode_item.h
+++ b/fs/xfs/xfs_inode_item.h
@@ -18,7 +18,8 @@ struct xfs_inode_log_item {
 	struct xfs_inode	*ili_inode;	   /* inode ptr */
 	xfs_lsn_t		ili_flush_lsn;	   /* lsn at last flush */
 	xfs_lsn_t		ili_last_lsn;	   /* lsn at last transaction */
-	unsigned short		ili_lock_flags;	   /* lock flags */
+	spinlock_t		ili_lock;	   /* internal state lock */
+	unsigned short		ili_lock_flags;	   /* inode lock flags */
 	unsigned int		ili_last_fields;   /* fields when flushed */
 	unsigned int		ili_fields;	   /* fields to be logged */
 	unsigned int		ili_fsync_fields;  /* logged since last fsync */
-- 
2.26.2.761.g0e0b3e54be


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [PATCH 03/24] xfs: mark inode buffers in cache
  2020-05-22  3:50 [PATCH 00/24] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
  2020-05-22  3:50 ` [PATCH 01/24] xfs: remove logged flag from inode log item Dave Chinner
  2020-05-22  3:50 ` [PATCH 02/24] xfs: add an inode item lock Dave Chinner
@ 2020-05-22  3:50 ` Dave Chinner
  2020-05-22  7:45   ` Amir Goldstein
                     ` (2 more replies)
  2020-05-22  3:50 ` [PATCH 04/24] xfs: mark dquot " Dave Chinner
                   ` (22 subsequent siblings)
  25 siblings, 3 replies; 91+ messages in thread
From: Dave Chinner @ 2020-05-22  3:50 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Inode buffers always have write IO callbacks, so by marking them
directly we can avoid needing to attach ->b_iodone functions to
them. This avoids an indirect call, and makes future modifications
much simpler.

This is largely a rearrangement of the code at this point - no IO
completion functionality changes at this point, just how the
code is run is modified.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_buf.c       | 18 +++++++++++++-----
 fs/xfs/xfs_buf.h       | 39 ++++++++++++++++++++++++++-------------
 fs/xfs/xfs_buf_item.c  | 42 +++++++++++++++++++++++++++++++-----------
 fs/xfs/xfs_buf_item.h  |  1 +
 fs/xfs/xfs_inode.c     |  2 +-
 fs/xfs/xfs_trans_buf.c |  3 +++
 6 files changed, 75 insertions(+), 30 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 9c2fbb6bbf89d..6105b97028d6a 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -14,6 +14,8 @@
 #include "xfs_mount.h"
 #include "xfs_trace.h"
 #include "xfs_log.h"
+#include "xfs_trans.h"
+#include "xfs_buf_item.h"
 #include "xfs_errortag.h"
 #include "xfs_error.h"
 
@@ -1202,12 +1204,18 @@ xfs_buf_ioend(
 		bp->b_flags |= XBF_DONE;
 	}
 
-	if (bp->b_iodone)
+	/* inodes always have a callback on write */
+	if (!read && (bp->b_flags & _XBF_INODES)) {
+		xfs_buf_inode_iodone(bp);
+		return;
+	}
+
+	if (bp->b_iodone) {
 		(*(bp->b_iodone))(bp);
-	else if (bp->b_flags & XBF_ASYNC)
-		xfs_buf_relse(bp);
-	else
-		complete(&bp->b_iowait);
+		return;
+	}
+
+	xfs_buf_ioend_finish(bp);
 }
 
 static void
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index 050c53b739e24..b3e5d653d09f1 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -30,15 +30,19 @@
 #define XBF_STALE	 (1 << 6) /* buffer has been staled, do not find it */
 #define XBF_WRITE_FAIL	 (1 << 7) /* async writes have failed on this buffer */
 
-/* flags used only as arguments to access routines */
-#define XBF_TRYLOCK	 (1 << 16)/* lock requested, but do not wait */
-#define XBF_UNMAPPED	 (1 << 17)/* do not map the buffer */
+/* buffer type flags for write callbacks */
+#define _XBF_INODES	 (1 << 16)/* inode buffer */
 
 /* flags used only internally */
 #define _XBF_PAGES	 (1 << 20)/* backed by refcounted pages */
 #define _XBF_KMEM	 (1 << 21)/* backed by heap memory */
 #define _XBF_DELWRI_Q	 (1 << 22)/* buffer on a delwri queue */
 
+/* flags used only as arguments to access routines */
+#define XBF_TRYLOCK	 (1 << 30)/* lock requested, but do not wait */
+#define XBF_UNMAPPED	 (1 << 31)/* do not map the buffer */
+
+
 typedef unsigned int xfs_buf_flags_t;
 
 #define XFS_BUF_FLAGS \
@@ -50,12 +54,13 @@ typedef unsigned int xfs_buf_flags_t;
 	{ XBF_DONE,		"DONE" }, \
 	{ XBF_STALE,		"STALE" }, \
 	{ XBF_WRITE_FAIL,	"WRITE_FAIL" }, \
-	{ XBF_TRYLOCK,		"TRYLOCK" },	/* should never be set */\
-	{ XBF_UNMAPPED,		"UNMAPPED" },	/* ditto */\
+	{ _XBF_INODES,		"INODES" }, \
 	{ _XBF_PAGES,		"PAGES" }, \
 	{ _XBF_KMEM,		"KMEM" }, \
-	{ _XBF_DELWRI_Q,	"DELWRI_Q" }
-
+	{ _XBF_DELWRI_Q,	"DELWRI_Q" }, \
+	/* The following interface flags should never be set */ \
+	{ XBF_TRYLOCK,		"TRYLOCK" }, \
+	{ XBF_UNMAPPED,		"UNMAPPED" }
 
 /*
  * Internal state flags.
@@ -257,9 +262,23 @@ extern void xfs_buf_unlock(xfs_buf_t *);
 #define xfs_buf_islocked(bp) \
 	((bp)->b_sema.count <= 0)
 
+static inline void xfs_buf_relse(xfs_buf_t *bp)
+{
+	xfs_buf_unlock(bp);
+	xfs_buf_rele(bp);
+}
+
 /* Buffer Read and Write Routines */
 extern int xfs_bwrite(struct xfs_buf *bp);
 extern void xfs_buf_ioend(struct xfs_buf *bp);
+static inline void xfs_buf_ioend_finish(struct xfs_buf *bp)
+{
+	if (bp->b_flags & XBF_ASYNC)
+		xfs_buf_relse(bp);
+	else
+		complete(&bp->b_iowait);
+}
+
 extern void __xfs_buf_ioerror(struct xfs_buf *bp, int error,
 		xfs_failaddr_t failaddr);
 #define xfs_buf_ioerror(bp, err) __xfs_buf_ioerror((bp), (err), __this_address)
@@ -324,12 +343,6 @@ static inline int xfs_buf_ispinned(struct xfs_buf *bp)
 	return atomic_read(&bp->b_pin_count);
 }
 
-static inline void xfs_buf_relse(xfs_buf_t *bp)
-{
-	xfs_buf_unlock(bp);
-	xfs_buf_rele(bp);
-}
-
 static inline int
 xfs_buf_verify_cksum(struct xfs_buf *bp, unsigned long cksum_offset)
 {
diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 9e75e8d6042ec..8659cf4282a64 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -1158,20 +1158,15 @@ xfs_buf_iodone_callback_error(
 	return false;
 }
 
-/*
- * This is the iodone() function for buffers which have had callbacks attached
- * to them by xfs_buf_attach_iodone(). We need to iterate the items on the
- * callback list, mark the buffer as having no more callbacks and then push the
- * buffer through IO completion processing.
- */
-void
-xfs_buf_iodone_callbacks(
+static void
+xfs_buf_run_callbacks(
 	struct xfs_buf		*bp)
 {
+
 	/*
-	 * If there is an error, process it. Some errors require us
-	 * to run callbacks after failure processing is done so we
-	 * detect that and take appropriate action.
+	 * If there is an error, process it. Some errors require us to run
+	 * callbacks after failure processing is done so we detect that and take
+	 * appropriate action.
 	 */
 	if (bp->b_error && xfs_buf_iodone_callback_error(bp))
 		return;
@@ -1188,9 +1183,34 @@ xfs_buf_iodone_callbacks(
 	bp->b_log_item = NULL;
 	list_del_init(&bp->b_li_list);
 	bp->b_iodone = NULL;
+}
+
+/*
+ * This is the iodone() function for buffers which have had callbacks attached
+ * to them by xfs_buf_attach_iodone(). We need to iterate the items on the
+ * callback list, mark the buffer as having no more callbacks and then push the
+ * buffer through IO completion processing.
+ */
+void
+xfs_buf_iodone_callbacks(
+	struct xfs_buf		*bp)
+{
+	xfs_buf_run_callbacks(bp);
 	xfs_buf_ioend(bp);
 }
 
+/*
+ * Inode buffer iodone callback function.
+ */
+void
+xfs_buf_inode_iodone(
+	struct xfs_buf		*bp)
+{
+	xfs_buf_run_callbacks(bp);
+	xfs_buf_ioend_finish(bp);
+}
+
+
 /*
  * This is the iodone() function for buffers which have been
  * logged.  It is called when they are eventually flushed out.
diff --git a/fs/xfs/xfs_buf_item.h b/fs/xfs/xfs_buf_item.h
index c9c57e2da9327..a342933ad9b8d 100644
--- a/fs/xfs/xfs_buf_item.h
+++ b/fs/xfs/xfs_buf_item.h
@@ -59,6 +59,7 @@ void	xfs_buf_attach_iodone(struct xfs_buf *,
 			      struct xfs_log_item *);
 void	xfs_buf_iodone_callbacks(struct xfs_buf *);
 void	xfs_buf_iodone(struct xfs_buf *, struct xfs_log_item *);
+void	xfs_buf_inode_iodone(struct xfs_buf *);
 bool	xfs_buf_log_check_iovec(struct xfs_log_iovec *iovec);
 
 extern kmem_zone_t	*xfs_buf_item_zone;
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 57781c0dbbec5..607c9d9bb2b40 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3841,13 +3841,13 @@ xfs_iflush_int(
 	 * completion on the buffer to remove the inode from the AIL and release
 	 * the flush lock.
 	 */
+	bp->b_flags |= _XBF_INODES;
 	xfs_buf_attach_iodone(bp, xfs_iflush_done, &iip->ili_item);
 
 	/* generate the checksum. */
 	xfs_dinode_calc_crc(mp, dip);
 
 	ASSERT(!list_empty(&bp->b_li_list));
-	ASSERT(bp->b_iodone != NULL);
 	return error;
 }
 
diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c
index 08174ffa21189..552d0869aa0fe 100644
--- a/fs/xfs/xfs_trans_buf.c
+++ b/fs/xfs/xfs_trans_buf.c
@@ -626,6 +626,7 @@ xfs_trans_inode_buf(
 	ASSERT(atomic_read(&bip->bli_refcount) > 0);
 
 	bip->bli_flags |= XFS_BLI_INODE_BUF;
+	bp->b_flags |= _XBF_INODES;
 	xfs_trans_buf_set_type(tp, bp, XFS_BLFT_DINO_BUF);
 }
 
@@ -651,6 +652,7 @@ xfs_trans_stale_inode_buf(
 
 	bip->bli_flags |= XFS_BLI_STALE_INODE;
 	bip->bli_item.li_cb = xfs_buf_iodone;
+	bp->b_flags |= _XBF_INODES;
 	xfs_trans_buf_set_type(tp, bp, XFS_BLFT_DINO_BUF);
 }
 
@@ -675,6 +677,7 @@ xfs_trans_inode_alloc_buf(
 	ASSERT(atomic_read(&bip->bli_refcount) > 0);
 
 	bip->bli_flags |= XFS_BLI_INODE_ALLOC_BUF;
+	bp->b_flags |= _XBF_INODES;
 	xfs_trans_buf_set_type(tp, bp, XFS_BLFT_DINO_BUF);
 }
 
-- 
2.26.2.761.g0e0b3e54be


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [PATCH 04/24] xfs: mark dquot buffers in cache
  2020-05-22  3:50 [PATCH 00/24] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (2 preceding siblings ...)
  2020-05-22  3:50 ` [PATCH 03/24] xfs: mark inode buffers in cache Dave Chinner
@ 2020-05-22  3:50 ` Dave Chinner
  2020-05-22  7:46   ` Amir Goldstein
  2020-05-22 21:38   ` Darrick J. Wong
  2020-05-22  3:50 ` [PATCH 05/24] xfs: mark log recovery buffers for completion Dave Chinner
                   ` (21 subsequent siblings)
  25 siblings, 2 replies; 91+ messages in thread
From: Dave Chinner @ 2020-05-22  3:50 UTC (permalink / raw)
  To: linux-xfs

dquot buffers always have write IO callbacks, so by marking them
directly we can avoid needing to attach ->b_iodone functions to
them. This avoids an indirect call, and makes future modifications
much simpler.

This is largely a rearrangement of the code at this point - no IO
completion functionality changes at this point, just how the
code is run is modified.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_buf.c       | 12 +++++++++++-
 fs/xfs/xfs_buf.h       |  2 ++
 fs/xfs/xfs_buf_item.c  | 10 ++++++++++
 fs/xfs/xfs_buf_item.h  |  1 +
 fs/xfs/xfs_dquot.c     |  1 +
 fs/xfs/xfs_trans_buf.c |  1 +
 6 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 6105b97028d6a..77d40eb4a11db 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1204,17 +1204,27 @@ xfs_buf_ioend(
 		bp->b_flags |= XBF_DONE;
 	}
 
+	if (read)
+		goto out_finish;
+
 	/* inodes always have a callback on write */
-	if (!read && (bp->b_flags & _XBF_INODES)) {
+	if (bp->b_flags & _XBF_INODES) {
 		xfs_buf_inode_iodone(bp);
 		return;
 	}
 
+	/* dquots always have a callback on write */
+	if (bp->b_flags & _XBF_DQUOTS) {
+		xfs_buf_dquot_iodone(bp);
+		return;
+	}
+
 	if (bp->b_iodone) {
 		(*(bp->b_iodone))(bp);
 		return;
 	}
 
+out_finish:
 	xfs_buf_ioend_finish(bp);
 }
 
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index b3e5d653d09f1..cbde44ecb3963 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -32,6 +32,7 @@
 
 /* buffer type flags for write callbacks */
 #define _XBF_INODES	 (1 << 16)/* inode buffer */
+#define _XBF_DQUOTS	 (1 << 17)/* dquot buffer */
 
 /* flags used only internally */
 #define _XBF_PAGES	 (1 << 20)/* backed by refcounted pages */
@@ -55,6 +56,7 @@ typedef unsigned int xfs_buf_flags_t;
 	{ XBF_STALE,		"STALE" }, \
 	{ XBF_WRITE_FAIL,	"WRITE_FAIL" }, \
 	{ _XBF_INODES,		"INODES" }, \
+	{ _XBF_DQUOTS,		"DQUOTS" }, \
 	{ _XBF_PAGES,		"PAGES" }, \
 	{ _XBF_KMEM,		"KMEM" }, \
 	{ _XBF_DELWRI_Q,	"DELWRI_Q" }, \
diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 8659cf4282a64..a42cdf9ccc47d 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -1210,6 +1210,16 @@ xfs_buf_inode_iodone(
 	xfs_buf_ioend_finish(bp);
 }
 
+/*
+ * Dquot buffer iodone callback function.
+ */
+void
+xfs_buf_dquot_iodone(
+	struct xfs_buf		*bp)
+{
+	xfs_buf_run_callbacks(bp);
+	xfs_buf_ioend_finish(bp);
+}
 
 /*
  * This is the iodone() function for buffers which have been
diff --git a/fs/xfs/xfs_buf_item.h b/fs/xfs/xfs_buf_item.h
index a342933ad9b8d..27d13d29b5bbb 100644
--- a/fs/xfs/xfs_buf_item.h
+++ b/fs/xfs/xfs_buf_item.h
@@ -60,6 +60,7 @@ void	xfs_buf_attach_iodone(struct xfs_buf *,
 void	xfs_buf_iodone_callbacks(struct xfs_buf *);
 void	xfs_buf_iodone(struct xfs_buf *, struct xfs_log_item *);
 void	xfs_buf_inode_iodone(struct xfs_buf *);
+void	xfs_buf_dquot_iodone(struct xfs_buf *);
 bool	xfs_buf_log_check_iovec(struct xfs_log_iovec *iovec);
 
 extern kmem_zone_t	*xfs_buf_item_zone;
diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
index 55b95d45303b8..25592b701db40 100644
--- a/fs/xfs/xfs_dquot.c
+++ b/fs/xfs/xfs_dquot.c
@@ -1174,6 +1174,7 @@ xfs_qm_dqflush(
 	 * Attach an iodone routine so that we can remove this dquot from the
 	 * AIL and release the flush lock once the dquot is synced to disk.
 	 */
+	bp->b_flags |= _XBF_DQUOTS;
 	xfs_buf_attach_iodone(bp, xfs_qm_dqflush_done,
 				  &dqp->q_logitem.qli_item);
 
diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c
index 552d0869aa0fe..93d62cb864c15 100644
--- a/fs/xfs/xfs_trans_buf.c
+++ b/fs/xfs/xfs_trans_buf.c
@@ -788,5 +788,6 @@ xfs_trans_dquot_buf(
 		break;
 	}
 
+	bp->b_flags |= _XBF_DQUOTS;
 	xfs_trans_buf_set_type(tp, bp, type);
 }
-- 
2.26.2.761.g0e0b3e54be


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [PATCH 05/24] xfs: mark log recovery buffers for completion
  2020-05-22  3:50 [PATCH 00/24] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (3 preceding siblings ...)
  2020-05-22  3:50 ` [PATCH 04/24] xfs: mark dquot " Dave Chinner
@ 2020-05-22  3:50 ` Dave Chinner
  2020-05-22  7:41   ` Amir Goldstein
  2020-05-22 21:41   ` Darrick J. Wong
  2020-05-22  3:50 ` [PATCH 06/24] xfs: call xfs_buf_iodone directly Dave Chinner
                   ` (20 subsequent siblings)
  25 siblings, 2 replies; 91+ messages in thread
From: Dave Chinner @ 2020-05-22  3:50 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Log recovery has it's own buffer write completion handler for
buffers that it directly recovers. Convert these to direct calls by
flagging these buffers as being log recovery buffers. The flag will
get cleared by the log recovery IO completion routine, so it will
never leak out of log recovery.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_buf.c                | 10 ++++++++++
 fs/xfs/xfs_buf.h                |  2 ++
 fs/xfs/xfs_buf_item_recover.c   |  5 ++---
 fs/xfs/xfs_dquot_item_recover.c |  2 +-
 fs/xfs/xfs_inode_item_recover.c |  2 +-
 fs/xfs/xfs_log_recover.c        |  5 ++---
 6 files changed, 18 insertions(+), 8 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 77d40eb4a11db..b89685ce8519d 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -14,6 +14,7 @@
 #include "xfs_mount.h"
 #include "xfs_trace.h"
 #include "xfs_log.h"
+#include "xfs_log_recover.h"
 #include "xfs_trans.h"
 #include "xfs_buf_item.h"
 #include "xfs_errortag.h"
@@ -1207,6 +1208,15 @@ xfs_buf_ioend(
 	if (read)
 		goto out_finish;
 
+	/*
+	 * If this is a log recovery buffer, we aren't doing transactional IO
+	 * yet so we need to let it handle IO completions.
+	 */
+	if (bp->b_flags & _XBF_LOGRCVY) {
+		xlog_recover_iodone(bp);
+		return;
+	}
+
 	/* inodes always have a callback on write */
 	if (bp->b_flags & _XBF_INODES) {
 		xfs_buf_inode_iodone(bp);
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index cbde44ecb3963..c5fe4c48c9080 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -33,6 +33,7 @@
 /* buffer type flags for write callbacks */
 #define _XBF_INODES	 (1 << 16)/* inode buffer */
 #define _XBF_DQUOTS	 (1 << 17)/* dquot buffer */
+#define _XBF_LOGRCVY	 (1 << 18)/* log recovery buffer */
 
 /* flags used only internally */
 #define _XBF_PAGES	 (1 << 20)/* backed by refcounted pages */
@@ -57,6 +58,7 @@ typedef unsigned int xfs_buf_flags_t;
 	{ XBF_WRITE_FAIL,	"WRITE_FAIL" }, \
 	{ _XBF_INODES,		"INODES" }, \
 	{ _XBF_DQUOTS,		"DQUOTS" }, \
+	{ _XBF_LOGRCVY,		"LOG_RECOVERY" }, \
 	{ _XBF_PAGES,		"PAGES" }, \
 	{ _XBF_KMEM,		"KMEM" }, \
 	{ _XBF_DELWRI_Q,	"DELWRI_Q" }, \
diff --git a/fs/xfs/xfs_buf_item_recover.c b/fs/xfs/xfs_buf_item_recover.c
index 04faa7310c4f0..bfd50daa16606 100644
--- a/fs/xfs/xfs_buf_item_recover.c
+++ b/fs/xfs/xfs_buf_item_recover.c
@@ -419,8 +419,7 @@ xlog_recover_validate_buf_type(
 	if (bp->b_ops) {
 		struct xfs_buf_log_item	*bip;
 
-		ASSERT(!bp->b_iodone || bp->b_iodone == xlog_recover_iodone);
-		bp->b_iodone = xlog_recover_iodone;
+		bp->b_flags |= _XBF_LOGRCVY;
 		xfs_buf_item_init(bp, mp);
 		bip = bp->b_log_item;
 		bip->bli_item.li_lsn = current_lsn;
@@ -963,7 +962,7 @@ xlog_recover_buf_commit_pass2(
 		error = xfs_bwrite(bp);
 	} else {
 		ASSERT(bp->b_mount == mp);
-		bp->b_iodone = xlog_recover_iodone;
+		bp->b_flags |= _XBF_LOGRCVY;
 		xfs_buf_delwri_queue(bp, buffer_list);
 	}
 
diff --git a/fs/xfs/xfs_dquot_item_recover.c b/fs/xfs/xfs_dquot_item_recover.c
index 3400be4c88f08..a0a4b089e0cdd 100644
--- a/fs/xfs/xfs_dquot_item_recover.c
+++ b/fs/xfs/xfs_dquot_item_recover.c
@@ -153,7 +153,7 @@ xlog_recover_dquot_commit_pass2(
 
 	ASSERT(dq_f->qlf_size == 2);
 	ASSERT(bp->b_mount == mp);
-	bp->b_iodone = xlog_recover_iodone;
+	bp->b_flags |= _XBF_LOGRCVY;
 	xfs_buf_delwri_queue(bp, buffer_list);
 
 out_release:
diff --git a/fs/xfs/xfs_inode_item_recover.c b/fs/xfs/xfs_inode_item_recover.c
index dc3e26ff16c90..b67f1b7c5b65f 100644
--- a/fs/xfs/xfs_inode_item_recover.c
+++ b/fs/xfs/xfs_inode_item_recover.c
@@ -376,7 +376,7 @@ xlog_recover_inode_commit_pass2(
 	xfs_dinode_calc_crc(log->l_mp, dip);
 
 	ASSERT(bp->b_mount == mp);
-	bp->b_iodone = xlog_recover_iodone;
+	bp->b_flags |= _XBF_LOGRCVY;
 	xfs_buf_delwri_queue(bp, buffer_list);
 
 out_release:
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index ec015df55b77a..0aa823aeafca9 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -287,9 +287,8 @@ xlog_recover_iodone(
 	if (bp->b_log_item)
 		xfs_buf_item_relse(bp);
 	ASSERT(bp->b_log_item == NULL);
-
-	bp->b_iodone = NULL;
-	xfs_buf_ioend(bp);
+	bp->b_flags &= ~_XBF_LOGRCVY;
+	xfs_buf_ioend_finish(bp);
 }
 
 /*
-- 
2.26.2.761.g0e0b3e54be


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [PATCH 06/24] xfs: call xfs_buf_iodone directly
  2020-05-22  3:50 [PATCH 00/24] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (4 preceding siblings ...)
  2020-05-22  3:50 ` [PATCH 05/24] xfs: mark log recovery buffers for completion Dave Chinner
@ 2020-05-22  3:50 ` Dave Chinner
  2020-05-22  7:56   ` Amir Goldstein
  2020-05-22 21:53   ` Darrick J. Wong
  2020-05-22  3:50 ` [PATCH 07/24] xfs: clean up whacky buffer log item list reinit Dave Chinner
                   ` (19 subsequent siblings)
  25 siblings, 2 replies; 91+ messages in thread
From: Dave Chinner @ 2020-05-22  3:50 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

All unmarked dirty buffers should be in the AIL and have log items
attached to them. Hence when they are written, we will run a
callback to remove the item from the AIL if appropriate. Now that
we've handled inode and dquot buffers, all remaining calls are to
xfs_buf_iodone() and so we can hard code this rather than use an
indirect call.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_buf.c       | 24 +++++++++---------------
 fs/xfs/xfs_buf.h       |  6 +-----
 fs/xfs/xfs_buf_item.c  | 38 +++++++++-----------------------------
 fs/xfs/xfs_buf_item.h  |  2 +-
 fs/xfs/xfs_trans_buf.c |  9 +--------
 5 files changed, 21 insertions(+), 58 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index b89685ce8519d..5fc83637f7aff 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -658,7 +658,6 @@ xfs_buf_find(
 	 */
 	if (bp->b_flags & XBF_STALE) {
 		ASSERT((bp->b_flags & _XBF_DELWRI_Q) == 0);
-		ASSERT(bp->b_iodone == NULL);
 		bp->b_flags &= _XBF_KMEM | _XBF_PAGES;
 		bp->b_ops = NULL;
 	}
@@ -1194,10 +1193,13 @@ xfs_buf_ioend(
 	if (!bp->b_error && bp->b_io_error)
 		xfs_buf_ioerror(bp, bp->b_io_error);
 
-	/* Only validate buffers that were read without errors */
-	if (read && !bp->b_error && bp->b_ops) {
-		ASSERT(!bp->b_iodone);
-		bp->b_ops->verify_read(bp);
+	if (read) {
+		if (!bp->b_error && bp->b_ops)
+			bp->b_ops->verify_read(bp);
+		if (!bp->b_error)
+			bp->b_flags |= XBF_DONE;
+		xfs_buf_ioend_finish(bp);
+		return;
 	}
 
 	if (!bp->b_error) {
@@ -1205,9 +1207,6 @@ xfs_buf_ioend(
 		bp->b_flags |= XBF_DONE;
 	}
 
-	if (read)
-		goto out_finish;
-
 	/*
 	 * If this is a log recovery buffer, we aren't doing transactional IO
 	 * yet so we need to let it handle IO completions.
@@ -1229,13 +1228,8 @@ xfs_buf_ioend(
 		return;
 	}
 
-	if (bp->b_iodone) {
-		(*(bp->b_iodone))(bp);
-		return;
-	}
-
-out_finish:
-	xfs_buf_ioend_finish(bp);
+	/* dirty buffers always need to be removed from the AIL */
+	xfs_buf_dirty_iodone(bp);
 }
 
 static void
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index c5fe4c48c9080..a127d56d5eff0 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -18,6 +18,7 @@
 /*
  *	Base types
  */
+struct xfs_buf;
 
 #define XFS_BUF_DADDR_NULL	((xfs_daddr_t) (-1LL))
 
@@ -103,10 +104,6 @@ typedef struct xfs_buftarg {
 	struct ratelimit_state	bt_ioerror_rl;
 } xfs_buftarg_t;
 
-struct xfs_buf;
-typedef void (*xfs_buf_iodone_t)(struct xfs_buf *);
-
-
 #define XB_PAGES	2
 
 struct xfs_buf_map {
@@ -159,7 +156,6 @@ typedef struct xfs_buf {
 	xfs_buftarg_t		*b_target;	/* buffer target (device) */
 	void			*b_addr;	/* virtual address of buffer */
 	struct work_struct	b_ioend_work;
-	xfs_buf_iodone_t	b_iodone;	/* I/O completion function */
 	struct completion	b_iowait;	/* queue for I/O waiters */
 	struct xfs_buf_log_item	*b_log_item;
 	struct list_head	b_li_list;	/* Log items list head */
diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index a42cdf9ccc47d..c2e7d14e35c66 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -460,7 +460,6 @@ xfs_buf_item_unpin(
 			xfs_buf_do_callbacks(bp);
 			bp->b_log_item = NULL;
 			list_del_init(&bp->b_li_list);
-			bp->b_iodone = NULL;
 		} else {
 			xfs_trans_ail_delete(lip, SHUTDOWN_LOG_IO_ERROR);
 			xfs_buf_item_relse(bp);
@@ -936,11 +935,7 @@ xfs_buf_item_free(
 }
 
 /*
- * This is called when the buf log item is no longer needed.  It should
- * free the buf log item associated with the given buffer and clear
- * the buffer's pointer to the buf log item.  If there are no more
- * items in the list, clear the b_iodone field of the buffer (see
- * xfs_buf_attach_iodone() below).
+ * xfs_buf_item_relse() is called when the buf log item is no longer needed.
  */
 void
 xfs_buf_item_relse(
@@ -952,9 +947,6 @@ xfs_buf_item_relse(
 	ASSERT(!test_bit(XFS_LI_IN_AIL, &bip->bli_item.li_flags));
 
 	bp->b_log_item = NULL;
-	if (list_empty(&bp->b_li_list))
-		bp->b_iodone = NULL;
-
 	xfs_buf_rele(bp);
 	xfs_buf_item_free(bip);
 }
@@ -962,10 +954,7 @@ xfs_buf_item_relse(
 
 /*
  * Add the given log item with its callback to the list of callbacks
- * to be called when the buffer's I/O completes.  If it is not set
- * already, set the buffer's b_iodone() routine to be
- * xfs_buf_iodone_callbacks() and link the log item into the list of
- * items rooted at b_li_list.
+ * to be called when the buffer's I/O completes.
  */
 void
 xfs_buf_attach_iodone(
@@ -977,10 +966,6 @@ xfs_buf_attach_iodone(
 
 	lip->li_cb = cb;
 	list_add_tail(&lip->li_bio_list, &bp->b_li_list);
-
-	ASSERT(bp->b_iodone == NULL ||
-	       bp->b_iodone == xfs_buf_iodone_callbacks);
-	bp->b_iodone = xfs_buf_iodone_callbacks;
 }
 
 /*
@@ -1096,7 +1081,6 @@ xfs_buf_iodone_callback_error(
 		goto out_stale;
 
 	trace_xfs_buf_item_iodone_async(bp, _RET_IP_);
-	ASSERT(bp->b_iodone != NULL);
 
 	cfg = xfs_error_get_cfg(mp, XFS_ERR_METADATA, bp->b_error);
 
@@ -1182,28 +1166,24 @@ xfs_buf_run_callbacks(
 	xfs_buf_do_callbacks(bp);
 	bp->b_log_item = NULL;
 	list_del_init(&bp->b_li_list);
-	bp->b_iodone = NULL;
 }
 
 /*
- * This is the iodone() function for buffers which have had callbacks attached
- * to them by xfs_buf_attach_iodone(). We need to iterate the items on the
- * callback list, mark the buffer as having no more callbacks and then push the
- * buffer through IO completion processing.
+ * Inode buffer iodone callback function.
  */
 void
-xfs_buf_iodone_callbacks(
+xfs_buf_inode_iodone(
 	struct xfs_buf		*bp)
 {
 	xfs_buf_run_callbacks(bp);
-	xfs_buf_ioend(bp);
+	xfs_buf_ioend_finish(bp);
 }
 
 /*
- * Inode buffer iodone callback function.
+ * Dquot buffer iodone callback function.
  */
 void
-xfs_buf_inode_iodone(
+xfs_buf_dquot_iodone(
 	struct xfs_buf		*bp)
 {
 	xfs_buf_run_callbacks(bp);
@@ -1211,10 +1191,10 @@ xfs_buf_inode_iodone(
 }
 
 /*
- * Dquot buffer iodone callback function.
+ * Dirty buffer iodone callback function.
  */
 void
-xfs_buf_dquot_iodone(
+xfs_buf_dirty_iodone(
 	struct xfs_buf		*bp)
 {
 	xfs_buf_run_callbacks(bp);
diff --git a/fs/xfs/xfs_buf_item.h b/fs/xfs/xfs_buf_item.h
index 27d13d29b5bbb..96f994ec90915 100644
--- a/fs/xfs/xfs_buf_item.h
+++ b/fs/xfs/xfs_buf_item.h
@@ -57,10 +57,10 @@ bool	xfs_buf_item_dirty_format(struct xfs_buf_log_item *);
 void	xfs_buf_attach_iodone(struct xfs_buf *,
 			      void(*)(struct xfs_buf *, struct xfs_log_item *),
 			      struct xfs_log_item *);
-void	xfs_buf_iodone_callbacks(struct xfs_buf *);
 void	xfs_buf_iodone(struct xfs_buf *, struct xfs_log_item *);
 void	xfs_buf_inode_iodone(struct xfs_buf *);
 void	xfs_buf_dquot_iodone(struct xfs_buf *);
+void	xfs_buf_dirty_iodone(struct xfs_buf *);
 bool	xfs_buf_log_check_iovec(struct xfs_log_iovec *iovec);
 
 extern kmem_zone_t	*xfs_buf_item_zone;
diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c
index 93d62cb864c15..69e0ebe94a915 100644
--- a/fs/xfs/xfs_trans_buf.c
+++ b/fs/xfs/xfs_trans_buf.c
@@ -465,23 +465,16 @@ xfs_trans_dirty_buf(
 
 	ASSERT(bp->b_transp == tp);
 	ASSERT(bip != NULL);
-	ASSERT(bp->b_iodone == NULL ||
-	       bp->b_iodone == xfs_buf_iodone_callbacks);
 
 	/*
 	 * Mark the buffer as needing to be written out eventually,
 	 * and set its iodone function to remove the buffer's buf log
 	 * item from the AIL and free it when the buffer is flushed
-	 * to disk.  See xfs_buf_attach_iodone() for more details
-	 * on li_cb and xfs_buf_iodone_callbacks().
-	 * If we end up aborting this transaction, we trap this buffer
-	 * inside the b_bdstrat callback so that this won't get written to
-	 * disk.
+	 * to disk.
 	 */
 	bp->b_flags |= XBF_DONE;
 
 	ASSERT(atomic_read(&bip->bli_refcount) > 0);
-	bp->b_iodone = xfs_buf_iodone_callbacks;
 	bip->bli_item.li_cb = xfs_buf_iodone;
 
 	/*
-- 
2.26.2.761.g0e0b3e54be


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [PATCH 07/24] xfs: clean up whacky buffer log item list reinit
  2020-05-22  3:50 [PATCH 00/24] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (5 preceding siblings ...)
  2020-05-22  3:50 ` [PATCH 06/24] xfs: call xfs_buf_iodone directly Dave Chinner
@ 2020-05-22  3:50 ` Dave Chinner
  2020-05-22 22:01   ` Darrick J. Wong
  2020-05-23  8:50   ` Christoph Hellwig
  2020-05-22  3:50 ` [PATCH 08/24] xfs: fold xfs_istale_done into xfs_iflush_done Dave Chinner
                   ` (18 subsequent siblings)
  25 siblings, 2 replies; 91+ messages in thread
From: Dave Chinner @ 2020-05-22  3:50 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

When we've emptied the buffer log item list, it does a list_del_init
on itself to reset it's pointers to itself. This is unnecessary as
the list is already empty at this point. I'm guessing this was a
bandaid for a iodone item leak or list corruption at some point in
the past, and we've carried it ever since. Get rid of it.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_buf_item.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index c2e7d14e35c66..b7ffb117e141e 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -459,7 +459,6 @@ xfs_buf_item_unpin(
 		if (bip->bli_flags & XFS_BLI_STALE_INODE) {
 			xfs_buf_do_callbacks(bp);
 			bp->b_log_item = NULL;
-			list_del_init(&bp->b_li_list);
 		} else {
 			xfs_trans_ail_delete(lip, SHUTDOWN_LOG_IO_ERROR);
 			xfs_buf_item_relse(bp);
@@ -1165,7 +1164,6 @@ xfs_buf_run_callbacks(
 
 	xfs_buf_do_callbacks(bp);
 	bp->b_log_item = NULL;
-	list_del_init(&bp->b_li_list);
 }
 
 /*
-- 
2.26.2.761.g0e0b3e54be


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [PATCH 08/24] xfs: fold xfs_istale_done into xfs_iflush_done
  2020-05-22  3:50 [PATCH 00/24] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (6 preceding siblings ...)
  2020-05-22  3:50 ` [PATCH 07/24] xfs: clean up whacky buffer log item list reinit Dave Chinner
@ 2020-05-22  3:50 ` Dave Chinner
  2020-05-22 22:10   ` Darrick J. Wong
  2020-05-23  9:12   ` Christoph Hellwig
  2020-05-22  3:50 ` [PATCH 09/24] xfs: use direct calls for dquot IO completion Dave Chinner
                   ` (17 subsequent siblings)
  25 siblings, 2 replies; 91+ messages in thread
From: Dave Chinner @ 2020-05-22  3:50 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Having different io completion callbacks for different inode states
makes things complex. We can detect if the inode is stale via the
XFS_ISTALE flag in IO completion, so we don't need a special
callback just for this.

This means inodes only have a single iodone callback, and inode IO
completion is entirely buffer centric at this point. Hence we no
longer need to use a log item callback at all as we can just call
xfs_iflush_done() directly from the buffer completions and walk the
buffer log item list to complete the all inodes under IO.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_buf_item.c   | 35 ++++++++++++++++++----
 fs/xfs/xfs_inode.c      |  6 ++--
 fs/xfs/xfs_inode_item.c | 65 ++++++++++++++---------------------------
 fs/xfs/xfs_inode_item.h |  5 ++--
 4 files changed, 56 insertions(+), 55 deletions(-)

diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index b7ffb117e141e..e376f778bf57c 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -13,6 +13,8 @@
 #include "xfs_mount.h"
 #include "xfs_trans.h"
 #include "xfs_buf_item.h"
+#include "xfs_inode.h"
+#include "xfs_inode_item.h"
 #include "xfs_trans_priv.h"
 #include "xfs_trace.h"
 #include "xfs_log.h"
@@ -457,7 +459,8 @@ xfs_buf_item_unpin(
 		 * the AIL lock.
 		 */
 		if (bip->bli_flags & XFS_BLI_STALE_INODE) {
-			xfs_buf_do_callbacks(bp);
+			lip->li_cb(bp, lip);
+			xfs_iflush_done(bp);
 			bp->b_log_item = NULL;
 		} else {
 			xfs_trans_ail_delete(lip, SHUTDOWN_LOG_IO_ERROR);
@@ -1141,8 +1144,8 @@ xfs_buf_iodone_callback_error(
 	return false;
 }
 
-static void
-xfs_buf_run_callbacks(
+static inline bool
+xfs_buf_had_callback_errors(
 	struct xfs_buf		*bp)
 {
 
@@ -1152,7 +1155,7 @@ xfs_buf_run_callbacks(
 	 * appropriate action.
 	 */
 	if (bp->b_error && xfs_buf_iodone_callback_error(bp))
-		return;
+		return true;
 
 	/*
 	 * Successful IO or permanent error. Either way, we can clear the
@@ -1161,7 +1164,16 @@ xfs_buf_run_callbacks(
 	bp->b_last_error = 0;
 	bp->b_retries = 0;
 	bp->b_first_retry_time = 0;
+	return false;
+}
 
+static void
+xfs_buf_run_callbacks(
+	struct xfs_buf		*bp)
+{
+
+	if (xfs_buf_had_callback_errors(bp))
+		return;
 	xfs_buf_do_callbacks(bp);
 	bp->b_log_item = NULL;
 }
@@ -1173,7 +1185,20 @@ void
 xfs_buf_inode_iodone(
 	struct xfs_buf		*bp)
 {
-	xfs_buf_run_callbacks(bp);
+	struct xfs_buf_log_item *blip = bp->b_log_item;
+	struct xfs_log_item	*lip;
+
+	if (xfs_buf_had_callback_errors(bp))
+		return;
+
+	/* If there is a buf_log_item attached, run its callback */
+	if (blip) {
+		lip = &blip->bli_item;
+		lip->li_cb(bp, lip);
+		bp->b_log_item = NULL;
+	}
+
+	xfs_iflush_done(bp);
 	xfs_buf_ioend_finish(bp);
 }
 
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 607c9d9bb2b40..c75d625de7945 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2658,7 +2658,6 @@ xfs_ifree_cluster(
 		list_for_each_entry(lip, &bp->b_li_list, li_bio_list) {
 			if (lip->li_type == XFS_LI_INODE) {
 				iip = (struct xfs_inode_log_item *)lip;
-				lip->li_cb = xfs_istale_done;
 				xfs_trans_ail_copy_lsn(mp->m_ail,
 							&iip->ili_flush_lsn,
 							&iip->ili_item.li_lsn);
@@ -2691,8 +2690,7 @@ xfs_ifree_cluster(
 			xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,
 						&iip->ili_item.li_lsn);
 
-			xfs_buf_attach_iodone(bp, xfs_istale_done,
-						  &iip->ili_item);
+			xfs_buf_attach_iodone(bp, NULL, &iip->ili_item);
 
 			if (ip != free_ip)
 				xfs_iunlock(ip, XFS_ILOCK_EXCL);
@@ -3842,7 +3840,7 @@ xfs_iflush_int(
 	 * the flush lock.
 	 */
 	bp->b_flags |= _XBF_INODES;
-	xfs_buf_attach_iodone(bp, xfs_iflush_done, &iip->ili_item);
+	xfs_buf_attach_iodone(bp, NULL, &iip->ili_item);
 
 	/* generate the checksum. */
 	xfs_dinode_calc_crc(mp, dip);
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 6ef9cbcfc94a7..7049f2ae8d186 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -668,40 +668,34 @@ xfs_inode_item_destroy(
  */
 void
 xfs_iflush_done(
-	struct xfs_buf		*bp,
-	struct xfs_log_item	*lip)
+	struct xfs_buf		*bp)
 {
 	struct xfs_inode_log_item *iip;
-	struct xfs_log_item	*blip, *n;
-	struct xfs_ail		*ailp = lip->li_ailp;
+	struct xfs_log_item	*lip, *n;
+	struct xfs_ail		*ailp = bp->b_mount->m_ail;
 	int			need_ail = 0;
 	LIST_HEAD(tmp);
 
 	/*
-	 * Scan the buffer IO completions for other inodes being completed and
-	 * attach them to the current inode log item.
+	 * Pull the attached inodes from the buffer one at a time and take the
+	 * appropriate action on them.
 	 */
-
-	list_add_tail(&lip->li_bio_list, &tmp);
-
-	list_for_each_entry_safe(blip, n, &bp->b_li_list, li_bio_list) {
-		if (lip->li_cb != xfs_iflush_done)
+	list_for_each_entry_safe(lip, n, &bp->b_li_list, li_bio_list) {
+		iip = INODE_ITEM(lip);
+		if (xfs_iflags_test(iip->ili_inode, XFS_ISTALE)) {
+			list_del_init(&lip->li_bio_list);
+			xfs_iflush_abort(iip->ili_inode);
 			continue;
+		}
 
-		list_move_tail(&blip->li_bio_list, &tmp);
+		list_move_tail(&lip->li_bio_list, &tmp);
 
 		/* Do an unlocked check for needing the AIL lock. */
-		iip = INODE_ITEM(blip);
-		if (blip->li_lsn == iip->ili_flush_lsn ||
-		    test_bit(XFS_LI_FAILED, &blip->li_flags))
+		if (lip->li_lsn == iip->ili_flush_lsn ||
+		    test_bit(XFS_LI_FAILED, &lip->li_flags))
 			need_ail++;
 	}
-
-	/* make sure we capture the state of the initial inode. */
-	iip = INODE_ITEM(lip);
-	if (lip->li_lsn == iip->ili_flush_lsn ||
-	    test_bit(XFS_LI_FAILED, &lip->li_flags))
-		need_ail++;
+	ASSERT(list_empty(&bp->b_li_list));
 
 	/*
 	 * We only want to pull the item from the AIL if it is actually there
@@ -713,19 +707,13 @@ xfs_iflush_done(
 
 		/* this is an opencoded batch version of xfs_trans_ail_delete */
 		spin_lock(&ailp->ail_lock);
-		list_for_each_entry(blip, &tmp, li_bio_list) {
-			if (blip->li_lsn == INODE_ITEM(blip)->ili_flush_lsn) {
-				/*
-				 * xfs_ail_update_finish() only cares about the
-				 * lsn of the first tail item removed, any
-				 * others will be at the same or higher lsn so
-				 * we just ignore them.
-				 */
-				xfs_lsn_t lsn = xfs_ail_delete_one(ailp, blip);
+		list_for_each_entry(lip, &tmp, li_bio_list) {
+			if (lip->li_lsn == INODE_ITEM(lip)->ili_flush_lsn) {
+				xfs_lsn_t lsn = xfs_ail_delete_one(ailp, lip);
 				if (!tail_lsn && lsn)
 					tail_lsn = lsn;
 			} else {
-				xfs_clear_li_failed(blip);
+				xfs_clear_li_failed(lip);
 			}
 		}
 		xfs_ail_update_finish(ailp, tail_lsn);
@@ -736,9 +724,9 @@ xfs_iflush_done(
 	 * ili_last_fields bits now that we know that the data corresponding to
 	 * them is safely on disk.
 	 */
-	list_for_each_entry_safe(blip, n, &tmp, li_bio_list) {
-		list_del_init(&blip->li_bio_list);
-		iip = INODE_ITEM(blip);
+	list_for_each_entry_safe(lip, n, &tmp, li_bio_list) {
+		list_del_init(&lip->li_bio_list);
+		iip = INODE_ITEM(lip);
 
 		spin_lock(&iip->ili_lock);
 		iip->ili_last_fields = 0;
@@ -746,7 +734,6 @@ xfs_iflush_done(
 
 		xfs_ifunlock(iip->ili_inode);
 	}
-	list_del(&tmp);
 }
 
 /*
@@ -779,14 +766,6 @@ xfs_iflush_abort(
 	xfs_ifunlock(ip);
 }
 
-void
-xfs_istale_done(
-	struct xfs_buf		*bp,
-	struct xfs_log_item	*lip)
-{
-	xfs_iflush_abort(INODE_ITEM(lip)->ili_inode);
-}
-
 /*
  * convert an xfs_inode_log_format struct from the old 32 bit version
  * (which can have different field alignments) to the native 64 bit version
diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h
index 1234e8cd3726d..ef427cd03ffc3 100644
--- a/fs/xfs/xfs_inode_item.h
+++ b/fs/xfs/xfs_inode_item.h
@@ -25,15 +25,14 @@ struct xfs_inode_log_item {
 	unsigned int		ili_fsync_fields;  /* logged since last fsync */
 };
 
-static inline int xfs_inode_clean(xfs_inode_t *ip)
+static inline int xfs_inode_clean(struct xfs_inode *ip)
 {
 	return !ip->i_itemp || !(ip->i_itemp->ili_fields & XFS_ILOG_ALL);
 }
 
 extern void xfs_inode_item_init(struct xfs_inode *, struct xfs_mount *);
 extern void xfs_inode_item_destroy(struct xfs_inode *);
-extern void xfs_iflush_done(struct xfs_buf *, struct xfs_log_item *);
-extern void xfs_istale_done(struct xfs_buf *, struct xfs_log_item *);
+extern void xfs_iflush_done(struct xfs_buf *);
 extern void xfs_iflush_abort(struct xfs_inode *);
 extern int xfs_inode_item_format_convert(xfs_log_iovec_t *,
 					 struct xfs_inode_log_format *);
-- 
2.26.2.761.g0e0b3e54be


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [PATCH 09/24] xfs: use direct calls for dquot IO completion
  2020-05-22  3:50 [PATCH 00/24] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (7 preceding siblings ...)
  2020-05-22  3:50 ` [PATCH 08/24] xfs: fold xfs_istale_done into xfs_iflush_done Dave Chinner
@ 2020-05-22  3:50 ` Dave Chinner
  2020-05-22 22:13   ` Darrick J. Wong
  2020-05-23  9:16   ` Christoph Hellwig
  2020-05-22  3:50 ` [PATCH 10/24] xfs: clean up the buffer iodone callback functions Dave Chinner
                   ` (16 subsequent siblings)
  25 siblings, 2 replies; 91+ messages in thread
From: Dave Chinner @ 2020-05-22  3:50 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Similar to inodes, we can call the dquot IO completion functions
directly from the buffer completion code, removing another user of
log item callbacks for IO completion processing.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_buf_item.c | 18 +++++++++++++++++-
 fs/xfs/xfs_dquot.c    | 27 +++++++++++++++++++++++----
 fs/xfs/xfs_dquot.h    |  1 +
 3 files changed, 41 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index e376f778bf57c..57e5d37f6852e 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -15,6 +15,9 @@
 #include "xfs_buf_item.h"
 #include "xfs_inode.h"
 #include "xfs_inode_item.h"
+#include "xfs_quota.h"
+#include "xfs_dquot_item.h"
+#include "xfs_dquot.h"
 #include "xfs_trans_priv.h"
 #include "xfs_trace.h"
 #include "xfs_log.h"
@@ -1209,7 +1212,20 @@ void
 xfs_buf_dquot_iodone(
 	struct xfs_buf		*bp)
 {
-	xfs_buf_run_callbacks(bp);
+	struct xfs_buf_log_item *blip = bp->b_log_item;
+	struct xfs_log_item	*lip;
+
+	if (xfs_buf_had_callback_errors(bp))
+		return;
+
+	/* a newly allocated dquot buffer might have a log item attached */
+	if (blip) {
+		lip = &blip->bli_item;
+		lip->li_cb(bp, lip);
+		bp->b_log_item = NULL;
+	}
+
+	xfs_dquot_done(bp);
 	xfs_buf_ioend_finish(bp);
 }
 
diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
index 25592b701db40..1d7f34a9bc989 100644
--- a/fs/xfs/xfs_dquot.c
+++ b/fs/xfs/xfs_dquot.c
@@ -1043,9 +1043,8 @@ xfs_qm_dqrele(
  * from the AIL if it has not been re-logged, and unlocking the dquot's
  * flush lock. This behavior is very similar to that of inodes..
  */
-STATIC void
+static void
 xfs_qm_dqflush_done(
-	struct xfs_buf		*bp,
 	struct xfs_log_item	*lip)
 {
 	struct xfs_dq_logitem	*qip = (struct xfs_dq_logitem *)lip;
@@ -1086,6 +1085,27 @@ xfs_qm_dqflush_done(
 	xfs_dqfunlock(dqp);
 }
 
+void
+xfs_dquot_done(
+	struct xfs_buf		*bp)
+{
+	struct xfs_log_item	*lip;
+
+	while (!list_empty(&bp->b_li_list)) {
+		lip = list_first_entry(&bp->b_li_list, struct xfs_log_item,
+				       li_bio_list);
+
+		/*
+		 * Remove the item from the list, so we don't have any
+		 * confusion if the item is added to another buf.
+		 * Don't touch the log item after calling its
+		 * callback, because it could have freed itself.
+		 */
+		list_del_init(&lip->li_bio_list);
+		xfs_qm_dqflush_done(lip);
+	}
+}
+
 /*
  * Write a modified dquot to disk.
  * The dquot must be locked and the flush lock too taken by caller.
@@ -1175,8 +1195,7 @@ xfs_qm_dqflush(
 	 * AIL and release the flush lock once the dquot is synced to disk.
 	 */
 	bp->b_flags |= _XBF_DQUOTS;
-	xfs_buf_attach_iodone(bp, xfs_qm_dqflush_done,
-				  &dqp->q_logitem.qli_item);
+	xfs_buf_attach_iodone(bp, NULL, &dqp->q_logitem.qli_item);
 
 	/*
 	 * If the buffer is pinned then push on the log so we won't
diff --git a/fs/xfs/xfs_dquot.h b/fs/xfs/xfs_dquot.h
index fe3e46df604b4..f1924288986d3 100644
--- a/fs/xfs/xfs_dquot.h
+++ b/fs/xfs/xfs_dquot.h
@@ -174,6 +174,7 @@ void		xfs_qm_dqput(struct xfs_dquot *dqp);
 void		xfs_dqlock2(struct xfs_dquot *, struct xfs_dquot *);
 
 void		xfs_dquot_set_prealloc_limits(struct xfs_dquot *);
+void		xfs_dquot_done(struct xfs_buf *);
 
 static inline struct xfs_dquot *xfs_qm_dqhold(struct xfs_dquot *dqp)
 {
-- 
2.26.2.761.g0e0b3e54be


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [PATCH 10/24] xfs: clean up the buffer iodone callback functions
  2020-05-22  3:50 [PATCH 00/24] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (8 preceding siblings ...)
  2020-05-22  3:50 ` [PATCH 09/24] xfs: use direct calls for dquot IO completion Dave Chinner
@ 2020-05-22  3:50 ` Dave Chinner
  2020-05-22 22:26   ` Darrick J. Wong
  2020-05-23  9:19   ` Christoph Hellwig
  2020-05-22  3:50 ` [PATCH 11/24] xfs: get rid of log item callbacks Dave Chinner
                   ` (15 subsequent siblings)
  25 siblings, 2 replies; 91+ messages in thread
From: Dave Chinner @ 2020-05-22  3:50 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Now that we've sorted inode and dquot buffers, we can apply the same
cleanups to dirty buffers with buffer log items. They only have one
callback, too, so we don't need the log item callback. Collapse the
iodone functions and remove all the now unnecessary infrastructure
around callback processing.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_buf_item.c  | 139 +++++++++--------------------------------
 fs/xfs/xfs_buf_item.h  |   1 -
 fs/xfs/xfs_trans_buf.c |   2 -
 3 files changed, 29 insertions(+), 113 deletions(-)

diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 57e5d37f6852e..d44b3e3f46613 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -30,7 +30,7 @@ static inline struct xfs_buf_log_item *BUF_ITEM(struct xfs_log_item *lip)
 	return container_of(lip, struct xfs_buf_log_item, bli_item);
 }
 
-STATIC void	xfs_buf_do_callbacks(struct xfs_buf *bp);
+static void xfs_buf_item_done(struct xfs_buf *bp);
 
 /* Is this log iovec plausibly large enough to contain the buffer log format? */
 bool
@@ -462,9 +462,8 @@ xfs_buf_item_unpin(
 		 * the AIL lock.
 		 */
 		if (bip->bli_flags & XFS_BLI_STALE_INODE) {
-			lip->li_cb(bp, lip);
+			xfs_buf_item_done(bp);
 			xfs_iflush_done(bp);
-			bp->b_log_item = NULL;
 		} else {
 			xfs_trans_ail_delete(lip, SHUTDOWN_LOG_IO_ERROR);
 			xfs_buf_item_relse(bp);
@@ -973,46 +972,6 @@ xfs_buf_attach_iodone(
 	list_add_tail(&lip->li_bio_list, &bp->b_li_list);
 }
 
-/*
- * We can have many callbacks on a buffer. Running the callbacks individually
- * can cause a lot of contention on the AIL lock, so we allow for a single
- * callback to be able to scan the remaining items in bp->b_li_list for other
- * items of the same type and callback to be processed in the first call.
- *
- * As a result, the loop walking the callback list below will also modify the
- * list. it removes the first item from the list and then runs the callback.
- * The loop then restarts from the new first item int the list. This allows the
- * callback to scan and modify the list attached to the buffer and we don't
- * have to care about maintaining a next item pointer.
- */
-STATIC void
-xfs_buf_do_callbacks(
-	struct xfs_buf		*bp)
-{
-	struct xfs_buf_log_item *blip = bp->b_log_item;
-	struct xfs_log_item	*lip;
-
-	/* If there is a buf_log_item attached, run its callback */
-	if (blip) {
-		lip = &blip->bli_item;
-		lip->li_cb(bp, lip);
-	}
-
-	while (!list_empty(&bp->b_li_list)) {
-		lip = list_first_entry(&bp->b_li_list, struct xfs_log_item,
-				       li_bio_list);
-
-		/*
-		 * Remove the item from the list, so we don't have any
-		 * confusion if the item is added to another buf.
-		 * Don't touch the log item after calling its
-		 * callback, because it could have freed itself.
-		 */
-		list_del_init(&lip->li_bio_list);
-		lip->li_cb(bp, lip);
-	}
-}
-
 /*
  * Invoke the error state callback for each log item affected by the failed I/O.
  *
@@ -1025,8 +984,8 @@ STATIC void
 xfs_buf_do_callbacks_fail(
 	struct xfs_buf		*bp)
 {
+	struct xfs_ail		*ailp = bp->b_mount->m_ail;
 	struct xfs_log_item	*lip;
-	struct xfs_ail		*ailp;
 
 	/*
 	 * Buffer log item errors are handled directly by xfs_buf_item_push()
@@ -1036,9 +995,6 @@ xfs_buf_do_callbacks_fail(
 	if (list_empty(&bp->b_li_list))
 		return;
 
-	lip = list_first_entry(&bp->b_li_list, struct xfs_log_item,
-			li_bio_list);
-	ailp = lip->li_ailp;
 	spin_lock(&ailp->ail_lock);
 	list_for_each_entry(lip, &bp->b_li_list, li_bio_list) {
 		if (lip->li_ops->iop_error)
@@ -1051,22 +1007,11 @@ static bool
 xfs_buf_iodone_callback_error(
 	struct xfs_buf		*bp)
 {
-	struct xfs_buf_log_item	*bip = bp->b_log_item;
-	struct xfs_log_item	*lip;
-	struct xfs_mount	*mp;
+	struct xfs_mount	*mp = bp->b_mount;
 	static ulong		lasttime;
 	static xfs_buftarg_t	*lasttarg;
 	struct xfs_error_cfg	*cfg;
 
-	/*
-	 * The failed buffer might not have a buf_log_item attached or the
-	 * log_item list might be empty. Get the mp from the available
-	 * xfs_log_item
-	 */
-	lip = list_first_entry_or_null(&bp->b_li_list, struct xfs_log_item,
-				       li_bio_list);
-	mp = lip ? lip->li_mountp : bip->bli_item.li_mountp;
-
 	/*
 	 * If we've already decided to shutdown the filesystem because of
 	 * I/O errors, there's no point in giving this a retry.
@@ -1171,14 +1116,27 @@ xfs_buf_had_callback_errors(
 }
 
 static void
-xfs_buf_run_callbacks(
+xfs_buf_item_done(
 	struct xfs_buf		*bp)
 {
+	struct xfs_buf_log_item	*bip = bp->b_log_item;
 
-	if (xfs_buf_had_callback_errors(bp))
+	if (!bip)
 		return;
-	xfs_buf_do_callbacks(bp);
+
+	/*
+	 * If we are forcibly shutting down, this may well be off the AIL
+	 * already. That's because we simulate the log-committed callbacks to
+	 * unpin these buffers. Or we may never have put this item on AIL
+	 * because of the transaction was aborted forcibly.
+	 * xfs_trans_ail_delete() takes care of these.
+	 *
+	 * Either way, AIL is useless if we're forcing a shutdown.
+	 */
+	xfs_trans_ail_delete(&bip->bli_item, SHUTDOWN_CORRUPT_INCORE);
 	bp->b_log_item = NULL;
+	xfs_buf_item_free(bip);
+	xfs_buf_rele(bp);
 }
 
 /*
@@ -1188,19 +1146,10 @@ void
 xfs_buf_inode_iodone(
 	struct xfs_buf		*bp)
 {
-	struct xfs_buf_log_item *blip = bp->b_log_item;
-	struct xfs_log_item	*lip;
-
 	if (xfs_buf_had_callback_errors(bp))
 		return;
 
-	/* If there is a buf_log_item attached, run its callback */
-	if (blip) {
-		lip = &blip->bli_item;
-		lip->li_cb(bp, lip);
-		bp->b_log_item = NULL;
-	}
-
+	xfs_buf_item_done(bp);
 	xfs_iflush_done(bp);
 	xfs_buf_ioend_finish(bp);
 }
@@ -1212,59 +1161,29 @@ void
 xfs_buf_dquot_iodone(
 	struct xfs_buf		*bp)
 {
-	struct xfs_buf_log_item *blip = bp->b_log_item;
-	struct xfs_log_item	*lip;
-
 	if (xfs_buf_had_callback_errors(bp))
 		return;
 
 	/* a newly allocated dquot buffer might have a log item attached */
-	if (blip) {
-		lip = &blip->bli_item;
-		lip->li_cb(bp, lip);
-		bp->b_log_item = NULL;
-	}
-
+	xfs_buf_item_done(bp);
 	xfs_dquot_done(bp);
 	xfs_buf_ioend_finish(bp);
 }
 
 /*
  * Dirty buffer iodone callback function.
+ *
+ * Note that for things like remote attribute buffers, there may not be a buffer
+ * log item here, so processing the buffer log item must remain be optional.
  */
 void
 xfs_buf_dirty_iodone(
 	struct xfs_buf		*bp)
 {
-	xfs_buf_run_callbacks(bp);
+	if (xfs_buf_had_callback_errors(bp))
+		return;
+
+	xfs_buf_item_done(bp);
 	xfs_buf_ioend_finish(bp);
 }
 
-/*
- * This is the iodone() function for buffers which have been
- * logged.  It is called when they are eventually flushed out.
- * It should remove the buf item from the AIL, and free the buf item.
- * It is called by xfs_buf_iodone_callbacks() above which will take
- * care of cleaning up the buffer itself.
- */
-void
-xfs_buf_iodone(
-	struct xfs_buf		*bp,
-	struct xfs_log_item	*lip)
-{
-	ASSERT(BUF_ITEM(lip)->bli_buf == bp);
-
-	xfs_buf_rele(bp);
-
-	/*
-	 * If we are forcibly shutting down, this may well be off the AIL
-	 * already. That's because we simulate the log-committed callbacks to
-	 * unpin these buffers. Or we may never have put this item on AIL
-	 * because of the transaction was aborted forcibly.
-	 * xfs_trans_ail_delete() takes care of these.
-	 *
-	 * Either way, AIL is useless if we're forcing a shutdown.
-	 */
-	xfs_trans_ail_delete(lip, SHUTDOWN_CORRUPT_INCORE);
-	xfs_buf_item_free(BUF_ITEM(lip));
-}
diff --git a/fs/xfs/xfs_buf_item.h b/fs/xfs/xfs_buf_item.h
index 96f994ec90915..3f436efb0b67a 100644
--- a/fs/xfs/xfs_buf_item.h
+++ b/fs/xfs/xfs_buf_item.h
@@ -57,7 +57,6 @@ bool	xfs_buf_item_dirty_format(struct xfs_buf_log_item *);
 void	xfs_buf_attach_iodone(struct xfs_buf *,
 			      void(*)(struct xfs_buf *, struct xfs_log_item *),
 			      struct xfs_log_item *);
-void	xfs_buf_iodone(struct xfs_buf *, struct xfs_log_item *);
 void	xfs_buf_inode_iodone(struct xfs_buf *);
 void	xfs_buf_dquot_iodone(struct xfs_buf *);
 void	xfs_buf_dirty_iodone(struct xfs_buf *);
diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c
index 69e0ebe94a915..11cd666cd99a6 100644
--- a/fs/xfs/xfs_trans_buf.c
+++ b/fs/xfs/xfs_trans_buf.c
@@ -475,7 +475,6 @@ xfs_trans_dirty_buf(
 	bp->b_flags |= XBF_DONE;
 
 	ASSERT(atomic_read(&bip->bli_refcount) > 0);
-	bip->bli_item.li_cb = xfs_buf_iodone;
 
 	/*
 	 * If we invalidated the buffer within this transaction, then
@@ -644,7 +643,6 @@ xfs_trans_stale_inode_buf(
 	ASSERT(atomic_read(&bip->bli_refcount) > 0);
 
 	bip->bli_flags |= XFS_BLI_STALE_INODE;
-	bip->bli_item.li_cb = xfs_buf_iodone;
 	bp->b_flags |= _XBF_INODES;
 	xfs_trans_buf_set_type(tp, bp, XFS_BLFT_DINO_BUF);
 }
-- 
2.26.2.761.g0e0b3e54be


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [PATCH 11/24] xfs: get rid of log item callbacks
  2020-05-22  3:50 [PATCH 00/24] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (9 preceding siblings ...)
  2020-05-22  3:50 ` [PATCH 10/24] xfs: clean up the buffer iodone callback functions Dave Chinner
@ 2020-05-22  3:50 ` Dave Chinner
  2020-05-22 22:27   ` Darrick J. Wong
  2020-05-23  9:19   ` Christoph Hellwig
  2020-05-22  3:50 ` [PATCH 12/24] xfs: pin inode backing buffer to the inode log item Dave Chinner
                   ` (14 subsequent siblings)
  25 siblings, 2 replies; 91+ messages in thread
From: Dave Chinner @ 2020-05-22  3:50 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

They are not used anymore, so remove them from the log item and the
buffer iodone attachment interfaces.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_buf_item.c | 17 -----------------
 fs/xfs/xfs_buf_item.h |  3 ---
 fs/xfs/xfs_dquot.c    |  6 +++---
 fs/xfs/xfs_inode.c    |  5 +++--
 fs/xfs/xfs_trans.h    |  3 ---
 5 files changed, 6 insertions(+), 28 deletions(-)

diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index d44b3e3f46613..d855a2b7486c5 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -955,23 +955,6 @@ xfs_buf_item_relse(
 	xfs_buf_item_free(bip);
 }
 
-
-/*
- * Add the given log item with its callback to the list of callbacks
- * to be called when the buffer's I/O completes.
- */
-void
-xfs_buf_attach_iodone(
-	struct xfs_buf		*bp,
-	void			(*cb)(struct xfs_buf *, struct xfs_log_item *),
-	struct xfs_log_item	*lip)
-{
-	ASSERT(xfs_buf_islocked(bp));
-
-	lip->li_cb = cb;
-	list_add_tail(&lip->li_bio_list, &bp->b_li_list);
-}
-
 /*
  * Invoke the error state callback for each log item affected by the failed I/O.
  *
diff --git a/fs/xfs/xfs_buf_item.h b/fs/xfs/xfs_buf_item.h
index 3f436efb0b67a..ecfad1915a86b 100644
--- a/fs/xfs/xfs_buf_item.h
+++ b/fs/xfs/xfs_buf_item.h
@@ -54,9 +54,6 @@ void	xfs_buf_item_relse(struct xfs_buf *);
 bool	xfs_buf_item_put(struct xfs_buf_log_item *);
 void	xfs_buf_item_log(struct xfs_buf_log_item *, uint, uint);
 bool	xfs_buf_item_dirty_format(struct xfs_buf_log_item *);
-void	xfs_buf_attach_iodone(struct xfs_buf *,
-			      void(*)(struct xfs_buf *, struct xfs_log_item *),
-			      struct xfs_log_item *);
 void	xfs_buf_inode_iodone(struct xfs_buf *);
 void	xfs_buf_dquot_iodone(struct xfs_buf *);
 void	xfs_buf_dirty_iodone(struct xfs_buf *);
diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
index 1d7f34a9bc989..34fc1bcb1eefd 100644
--- a/fs/xfs/xfs_dquot.c
+++ b/fs/xfs/xfs_dquot.c
@@ -1191,11 +1191,11 @@ xfs_qm_dqflush(
 	}
 
 	/*
-	 * Attach an iodone routine so that we can remove this dquot from the
-	 * AIL and release the flush lock once the dquot is synced to disk.
+	 * Attach the dquot to the buffer so that we can remove this dquot from
+	 * the AIL and release the flush lock once the dquot is synced to disk.
 	 */
 	bp->b_flags |= _XBF_DQUOTS;
-	xfs_buf_attach_iodone(bp, NULL, &dqp->q_logitem.qli_item);
+	list_add_tail(&dqp->q_logitem.qli_item.li_bio_list, &bp->b_li_list);
 
 	/*
 	 * If the buffer is pinned then push on the log so we won't
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index c75d625de7945..c5529853f513c 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2690,7 +2690,8 @@ xfs_ifree_cluster(
 			xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,
 						&iip->ili_item.li_lsn);
 
-			xfs_buf_attach_iodone(bp, NULL, &iip->ili_item);
+			list_add_tail(&iip->ili_item.li_bio_list,
+						&bp->b_li_list);
 
 			if (ip != free_ip)
 				xfs_iunlock(ip, XFS_ILOCK_EXCL);
@@ -3840,7 +3841,7 @@ xfs_iflush_int(
 	 * the flush lock.
 	 */
 	bp->b_flags |= _XBF_INODES;
-	xfs_buf_attach_iodone(bp, NULL, &iip->ili_item);
+	list_add_tail(&iip->ili_item.li_bio_list, &bp->b_li_list);
 
 	/* generate the checksum. */
 	xfs_dinode_calc_crc(mp, dip);
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index 8308bf6d7e404..d27429e3a82a6 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -37,9 +37,6 @@ struct xfs_log_item {
 	unsigned long			li_flags;	/* misc flags */
 	struct xfs_buf			*li_buf;	/* real buffer pointer */
 	struct list_head		li_bio_list;	/* buffer item list */
-	void				(*li_cb)(struct xfs_buf *,
-						 struct xfs_log_item *);
-							/* buffer item iodone */
 							/* callback func */
 	const struct xfs_item_ops	*li_ops;	/* function list */
 
-- 
2.26.2.761.g0e0b3e54be


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [PATCH 12/24] xfs: pin inode backing buffer to the inode log item
  2020-05-22  3:50 [PATCH 00/24] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (10 preceding siblings ...)
  2020-05-22  3:50 ` [PATCH 11/24] xfs: get rid of log item callbacks Dave Chinner
@ 2020-05-22  3:50 ` Dave Chinner
  2020-05-22 22:39   ` Darrick J. Wong
  2020-05-23  9:34   ` Christoph Hellwig
  2020-05-22  3:50 ` [PATCH 13/24] xfs: make inode reclaim almost non-blocking Dave Chinner
                   ` (13 subsequent siblings)
  25 siblings, 2 replies; 91+ messages in thread
From: Dave Chinner @ 2020-05-22  3:50 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

When we dirty an inode, we are going to have to write it disk at
some point in the near future. This requires the inode cluster
backing buffer to be present in memory. Unfortunately, under severe
memory pressure we can reclaim the inode backing buffer while the
inode is dirty in memory, resulting in stalling the AIL pushing
because it has to do a read-modify-write cycle on the cluster
buffer.

When we have no memory available, the read of the cluster buffer
blocks the AIL pushing process, and this causes all sorts of issues
for memory reclaim as it requires inode writeback to make forwards
progress. Allocating a cluster buffer causes more memory pressure,
and results in more cluster buffers to be reclaimed, resulting in
more RMW cycles to be done in the AIL context and everything then
backs up on AIL progress. Only the synchronous inode cluster
writeback in the the inode reclaim code provides some level of
forwards progress guarantees that prevent OOM-killer rampages in
this situation.

Fix this by pinning the inode backing buffer to the inode log item
when the inode is first dirtied (i.e. in xfs_trans_log_inode()).
This may mean the first modification of an inode that has been held
in cache for a long time may block on a cluster buffer read, but
we can do that in transaction context and block safely until the
buffer has been allocated and read.

Once we have the cluster buffer, the inode log item takes a
reference to it, pinning it in memory, and attaches it to the log
item for future reference. This means we can always grab the cluster
buffer from the inode log item when we need it.

When the inode is finally cleaned and removed from the AIL, we can
drop the reference the inode log item holds on the cluster buffer.
Once all inodes on the cluster buffer are clean, the cluster buffer
will be unpinned and it will be available for memory reclaim to
reclaim again.

This avoids the issues with needing to do RMW cycles in the AIL
pushing context, and hence allows complete non-blocking inode
flushing to be performed by the AIL pushing context.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_inode_buf.c   |  3 +-
 fs/xfs/libxfs/xfs_trans_inode.c | 71 +++++++++++++++++++++++++--------
 fs/xfs/xfs_inode_item.c         | 63 ++++++++++++++++++++++++-----
 fs/xfs/xfs_trans_priv.h         | 12 +-----
 4 files changed, 112 insertions(+), 37 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index 6f84ea85fdd83..1af97235785c8 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -176,7 +176,8 @@ xfs_imap_to_bp(
 	}
 
 	*bpp = bp;
-	*dipp = xfs_buf_offset(bp, imap->im_boffset);
+	if (dipp)
+		*dipp = xfs_buf_offset(bp, imap->im_boffset);
 	return 0;
 }
 
diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
index 510b996008221..e130eb2994156 100644
--- a/fs/xfs/libxfs/xfs_trans_inode.c
+++ b/fs/xfs/libxfs/xfs_trans_inode.c
@@ -8,6 +8,8 @@
 #include "xfs_shared.h"
 #include "xfs_format.h"
 #include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
 #include "xfs_inode.h"
 #include "xfs_trans.h"
 #include "xfs_trans_priv.h"
@@ -71,13 +73,19 @@ xfs_trans_ichgtime(
 }
 
 /*
- * This is called to mark the fields indicated in fieldmask as needing
- * to be logged when the transaction is committed.  The inode must
- * already be associated with the given transaction.
+ * This is called to mark the fields indicated in fieldmask as needing to be
+ * logged when the transaction is committed.  The inode must already be
+ * associated with the given transaction.
  *
- * The values for fieldmask are defined in xfs_inode_item.h.  We always
- * log all of the core inode if any of it has changed, and we always log
- * all of the inline data/extents/b-tree root if any of them has changed.
+ * The values for fieldmask are defined in xfs_inode_item.h.  We always log all
+ * of the core inode if any of it has changed, and we always log all of the
+ * inline data/extents/b-tree root if any of them has changed.
+ *
+ * Grab and pin the cluster buffer associated with this inode to avoid RMW
+ * cycles at inode writeback time. Avoid the need to add error handling to every
+ * xfs_trans_log_inode() call by shutting down on read error.  This will cause
+ * transactions to fail and everything to error out, just like if we return a
+ * read error in a dirty transaction and cancel it.
  */
 void
 xfs_trans_log_inode(
@@ -122,21 +130,52 @@ xfs_trans_log_inode(
 	}
 
 	/*
-	 * Record the specific change for fdatasync optimisation. This
-	 * allows fdatasync to skip log forces for inodes that are only
-	 * timestamp dirty. We do this before the change count so that
-	 * the core being logged in this case does not impact on fdatasync
-	 * behaviour.
+	 * Record the specific change for fdatasync optimisation. This allows
+	 * fdatasync to skip log forces for inodes that are only timestamp
+	 * dirty. We do this before the change count so that the core being
+	 * logged in this case does not impact on fdatasync behaviour.
 	 */
 	spin_lock(&iip->ili_lock);
 	iip->ili_fsync_fields |= flags;
 
+	if (!iip->ili_item.li_buf) {
+		struct xfs_buf	*bp;
+		int		error;
+
+		/*
+		 * We hold the ILOCK here, so this inode is not going to be
+		 * flushed while we are here. Further, because there is no
+		 * buffer attached to the item, we know that there is no IO in
+		 * progress, so nothing will clear the ili_fields while we read
+		 * in the buffer. Hence we can safely drop the spin lock and
+		 * read the buffer knowing that the state will not change from
+		 * here.
+		 */
+		spin_unlock(&iip->ili_lock);
+		error = xfs_imap_to_bp(ip->i_mount, tp, &ip->i_imap, NULL,
+					&bp, 0);
+		if (error) {
+			xfs_force_shutdown(ip->i_mount, SHUTDOWN_META_IO_ERROR);
+			return;
+		}
+
+		/*
+		 * We need an explicit buffer reference for the log item, We
+		 * don't want the buffer attached to the transaction, so we have
+		 * to release the transaction reference we just gained.
+		 */
+		xfs_buf_hold(bp);
+		xfs_trans_brelse(tp, bp);
+
+		spin_lock(&iip->ili_lock);
+		iip->ili_item.li_buf = bp;
+	}
+
 	/*
-	 * Always OR in the bits from the ili_last_fields field.
-	 * This is to coordinate with the xfs_iflush() and xfs_iflush_done()
-	 * routines in the eventual clearing of the ili_fields bits.
-	 * See the big comment in xfs_iflush() for an explanation of
-	 * this coordination mechanism.
+	 * Always OR in the bits from the ili_last_fields field.  This is to
+	 * coordinate with the xfs_iflush() and xfs_iflush_done() routines in
+	 * the eventual clearing of the ili_fields bits.  See the big comment in
+	 * xfs_iflush() for an explanation of this coordination mechanism.
 	 */
 	flags |= iip->ili_last_fields | iversion_flags;
 	iip->ili_fields |= flags;
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 7049f2ae8d186..86173a52526fe 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -130,6 +130,8 @@ xfs_inode_item_size(
 	xfs_inode_item_data_fork_size(iip, nvecs, nbytes);
 	if (XFS_IFORK_Q(ip))
 		xfs_inode_item_attr_fork_size(iip, nvecs, nbytes);
+
+	ASSERT(iip->ili_item.li_buf);
 }
 
 STATIC void
@@ -439,6 +441,7 @@ xfs_inode_item_pin(
 	struct xfs_inode	*ip = INODE_ITEM(lip)->ili_inode;
 
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
+	ASSERT(lip->li_buf);
 
 	trace_xfs_inode_pin(ip, _RET_IP_);
 	atomic_inc(&ip->i_pincount);
@@ -450,6 +453,12 @@ xfs_inode_item_pin(
  * item which was previously pinned with a call to xfs_inode_item_pin().
  *
  * Also wake up anyone in xfs_iunpin_wait() if the count goes to 0.
+ *
+ * Note that unpin can race with inode cluster buffer freeing marking the buffer
+ * stale. In that case, flush completions are run from the buffer unpin call,
+ * which may happen before the inode is unpinned. If we lose the race, there
+ * will be no buffer attached to the log item, but the inode will be marked
+ * XFS_ISTALE.
  */
 STATIC void
 xfs_inode_item_unpin(
@@ -459,6 +468,7 @@ xfs_inode_item_unpin(
 	struct xfs_inode	*ip = INODE_ITEM(lip)->ili_inode;
 
 	trace_xfs_inode_unpin(ip, _RET_IP_);
+	ASSERT(lip->li_buf || xfs_iflags_test(ip, XFS_ISTALE));
 	ASSERT(atomic_read(&ip->i_pincount) > 0);
 	if (atomic_dec_and_test(&ip->i_pincount))
 		wake_up_bit(&ip->i_flags, __XFS_IPINNED_BIT);
@@ -647,10 +657,15 @@ xfs_inode_item_init(
  */
 void
 xfs_inode_item_destroy(
-	xfs_inode_t	*ip)
+	struct xfs_inode	*ip)
 {
-	kmem_free(ip->i_itemp->ili_item.li_lv_shadow);
-	kmem_cache_free(xfs_ili_zone, ip->i_itemp);
+	struct xfs_inode_log_item *iip = ip->i_itemp;
+
+	ASSERT(iip->ili_item.li_buf == NULL);
+
+	ip->i_itemp = NULL;
+	kmem_free(iip->ili_item.li_lv_shadow);
+	kmem_cache_free(xfs_ili_zone, iip);
 }
 
 
@@ -665,6 +680,13 @@ xfs_inode_item_destroy(
  * list for other inodes that will run this function. We remove them from the
  * buffer list so we can process all the inode IO completions in one AIL lock
  * traversal.
+ *
+ * Note: Now that we attach the log item to the buffer when we first log the
+ * inode in memory, we can have unflushed inodes on the buffer list here. These
+ * inodes will have a zero ili_last_fields, so skip over them here. We do
+ * this check -after- we've checked for stale inodes, because we're guaranteed
+ * to have XFS_ISTALE set in the case that dirty inodes are in the CIL and have
+ * not yet had their dirtying transactions committed to disk.
  */
 void
 xfs_iflush_done(
@@ -688,14 +710,16 @@ xfs_iflush_done(
 			continue;
 		}
 
+		if (!iip->ili_last_fields)
+			continue;
+
 		list_move_tail(&lip->li_bio_list, &tmp);
 
 		/* Do an unlocked check for needing the AIL lock. */
-		if (lip->li_lsn == iip->ili_flush_lsn ||
+		if (iip->ili_flush_lsn == lip->li_lsn ||
 		    test_bit(XFS_LI_FAILED, &lip->li_flags))
 			need_ail++;
 	}
-	ASSERT(list_empty(&bp->b_li_list));
 
 	/*
 	 * We only want to pull the item from the AIL if it is actually there
@@ -708,7 +732,8 @@ xfs_iflush_done(
 		/* this is an opencoded batch version of xfs_trans_ail_delete */
 		spin_lock(&ailp->ail_lock);
 		list_for_each_entry(lip, &tmp, li_bio_list) {
-			if (lip->li_lsn == INODE_ITEM(lip)->ili_flush_lsn) {
+			iip = INODE_ITEM(lip);
+			if (iip->ili_flush_lsn == lip->li_lsn) {
 				xfs_lsn_t lsn = xfs_ail_delete_one(ailp, lip);
 				if (!tail_lsn && lsn)
 					tail_lsn = lsn;
@@ -725,14 +750,29 @@ xfs_iflush_done(
 	 * them is safely on disk.
 	 */
 	list_for_each_entry_safe(lip, n, &tmp, li_bio_list) {
+		bool	drop_buffer = false;
+
 		list_del_init(&lip->li_bio_list);
 		iip = INODE_ITEM(lip);
 
 		spin_lock(&iip->ili_lock);
 		iip->ili_last_fields = 0;
-		spin_unlock(&iip->ili_lock);
+		iip->ili_flush_lsn = 0;
 
+		/*
+		 * Remove the reference to the cluster buffer if the inode is
+		 * clean in memory. Drop the buffer reference once we've dropped
+		 * the locks we hold.
+		 */
+		ASSERT(iip->ili_item.li_buf == bp);
+		if (!iip->ili_fields) {
+			iip->ili_item.li_buf = NULL;
+			drop_buffer = true;
+		}
+		spin_unlock(&iip->ili_lock);
 		xfs_ifunlock(iip->ili_inode);
+		if (drop_buffer)
+			xfs_buf_rele(bp);
 	}
 }
 
@@ -747,6 +787,7 @@ xfs_iflush_abort(
 	struct xfs_inode		*ip)
 {
 	struct xfs_inode_log_item	*iip = ip->i_itemp;
+	struct xfs_buf		*bp = NULL;
 
 	if (iip) {
 		xfs_trans_ail_delete(&iip->ili_item, 0);
@@ -758,12 +799,14 @@ xfs_iflush_abort(
 		iip->ili_last_fields = 0;
 		iip->ili_fields = 0;
 		iip->ili_fsync_fields = 0;
+		iip->ili_flush_lsn = 0;
+		bp = iip->ili_item.li_buf;
+		iip->ili_item.li_buf = NULL;
 		spin_unlock(&iip->ili_lock);
 	}
-	/*
-	 * Release the inode's flush lock since we're done with it.
-	 */
 	xfs_ifunlock(ip);
+	if (bp)
+		xfs_buf_rele(bp);
 }
 
 /*
diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
index 3004aeac91102..21ffc6dfcd13e 100644
--- a/fs/xfs/xfs_trans_priv.h
+++ b/fs/xfs/xfs_trans_priv.h
@@ -143,15 +143,10 @@ static inline void
 xfs_clear_li_failed(
 	struct xfs_log_item	*lip)
 {
-	struct xfs_buf	*bp = lip->li_buf;
-
 	ASSERT(test_bit(XFS_LI_IN_AIL, &lip->li_flags));
 	lockdep_assert_held(&lip->li_ailp->ail_lock);
 
-	if (test_and_clear_bit(XFS_LI_FAILED, &lip->li_flags)) {
-		lip->li_buf = NULL;
-		xfs_buf_rele(bp);
-	}
+	clear_bit(XFS_LI_FAILED, &lip->li_flags);
 }
 
 static inline void
@@ -161,10 +156,7 @@ xfs_set_li_failed(
 {
 	lockdep_assert_held(&lip->li_ailp->ail_lock);
 
-	if (!test_and_set_bit(XFS_LI_FAILED, &lip->li_flags)) {
-		xfs_buf_hold(bp);
-		lip->li_buf = bp;
-	}
+	set_bit(XFS_LI_FAILED, &lip->li_flags);
 }
 
 #endif	/* __XFS_TRANS_PRIV_H__ */
-- 
2.26.2.761.g0e0b3e54be


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [PATCH 13/24] xfs: make inode reclaim almost non-blocking
  2020-05-22  3:50 [PATCH 00/24] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (11 preceding siblings ...)
  2020-05-22  3:50 ` [PATCH 12/24] xfs: pin inode backing buffer to the inode log item Dave Chinner
@ 2020-05-22  3:50 ` Dave Chinner
  2020-05-22 12:19   ` Amir Goldstein
  2020-05-22 22:48   ` Darrick J. Wong
  2020-05-22  3:50 ` [PATCH 14/24] xfs: remove IO submission from xfs_reclaim_inode() Dave Chinner
                   ` (12 subsequent siblings)
  25 siblings, 2 replies; 91+ messages in thread
From: Dave Chinner @ 2020-05-22  3:50 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Now that dirty inode writeback doesn't cause read-modify-write
cycles on the inode cluster buffer under memory pressure, the need
to throttle memory reclaim to the rate at which we can clean dirty
inodes goes away. That is due to the fact that we no longer thrash
inode cluster buffers under memory pressure to clean dirty inodes.

This means inode writeback no longer stalls on memory allocation
or read IO, and hence can be done asynchrnously without generating
memory pressure. As a result, blocking inode writeback in reclaim is
no longer necessary to prevent reclaim priority windup as cleaning
dirty inodes is no longer dependent on having memory reserves
available for the filesystem to make progress reclaiming inodes.

Hence we can convert inode reclaim to be non-blocking for shrinker
callouts, both for direct reclaim and kswapd.

On a vanilla kernel, running a 16-way fsmark create workload on a
4 node/16p/16GB RAM machine, I can reliably pin 14.75GB of RAM via
userspace mlock(). The OOM killer gets invoked at 15GB of
pinned RAM.

With this patch alone, pinning memory triggers premature OOM
killer invocation, sometimes with as much as 45% of RAM being free.
It's trivially easy to trigger the OOM killer when reclaim does not
block.

With pinning inode clusters in RAM adn then adding this patch, I can
reliably pin 14.5GB of RAM and still have the fsmark workload run to
completion. The OOM killer gets invoked 14.75GB of pinned RAM, which
is only a small amount of memory less than the vanilla kernel. It is
much more reliable than just with async reclaim alone.

simoops shows that allocation stalls go away when async reclaim is
used. Vanilla kernel:

Run time: 1924 seconds
Read latency (p50: 3,305,472) (p95: 3,723,264) (p99: 4,001,792)
Write latency (p50: 184,064) (p95: 553,984) (p99: 807,936)
Allocation latency (p50: 2,641,920) (p95: 3,911,680) (p99: 4,464,640)
work rate = 13.45/sec (avg 13.44/sec) (p50: 13.46) (p95: 13.58) (p99: 13.70)
alloc stall rate = 3.80/sec (avg: 2.59) (p50: 2.54) (p95: 2.96) (p99: 3.02)

With inode cluster pinning and async reclaim:

Run time: 1924 seconds
Read latency (p50: 3,305,472) (p95: 3,715,072) (p99: 3,977,216)
Write latency (p50: 187,648) (p95: 553,984) (p99: 789,504)
Allocation latency (p50: 2,748,416) (p95: 3,919,872) (p99: 4,448,256)
work rate = 13.28/sec (avg 13.32/sec) (p50: 13.26) (p95: 13.34) (p99: 13.34)
alloc stall rate = 0.02/sec (avg: 0.02) (p50: 0.01) (p95: 0.03) (p99: 0.03)

Latencies don't really change much, nor does the work rate. However,
allocation almost never stalls with these changes, whilst the
vanilla kernel is sometimes reporting 20 stalls/s over a 60s sample
period. This difference is due to inode reclaim being largely
non-blocking now.

IOWs, once we have pinned inode cluster buffers, we can make inode
reclaim non-blocking without a major risk of premature and/or
spurious OOM killer invocation, and without any changes to memory
reclaim infrastructure.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_icache.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index d806d3bfa8936..0f0f8fcd61b03 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1420,7 +1420,7 @@ xfs_reclaim_inodes_nr(
 	xfs_reclaim_work_queue(mp);
 	xfs_ail_push_all(mp->m_ail);
 
-	return xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK | SYNC_WAIT, &nr_to_scan);
+	return xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK, &nr_to_scan);
 }
 
 /*
-- 
2.26.2.761.g0e0b3e54be


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [PATCH 14/24] xfs: remove IO submission from xfs_reclaim_inode()
  2020-05-22  3:50 [PATCH 00/24] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (12 preceding siblings ...)
  2020-05-22  3:50 ` [PATCH 13/24] xfs: make inode reclaim almost non-blocking Dave Chinner
@ 2020-05-22  3:50 ` Dave Chinner
  2020-05-22 23:06   ` Darrick J. Wong
  2020-05-23  9:40   ` Christoph Hellwig
  2020-05-22  3:50 ` [PATCH 15/24] xfs: allow multiple reclaimers per AG Dave Chinner
                   ` (11 subsequent siblings)
  25 siblings, 2 replies; 91+ messages in thread
From: Dave Chinner @ 2020-05-22  3:50 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

We no longer need to issue IO from shrinker based inode reclaim to
prevent spurious OOM killer invocation. This leaves only the global
filesystem management operations such as unmount needing to
writeback dirty inodes and reclaim them.

Instead of using the reclaim pass to write dirty inodes before
reclaiming them, use the AIL to push all the dirty inodes before we
try to reclaim them. This allows us to remove all the conditional
SYNC_WAIT locking and the writeback code from xfs_reclaim_inode()
and greatly simplify the checks we need to do to reclaim an inode.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_icache.c | 116 +++++++++++---------------------------------
 1 file changed, 29 insertions(+), 87 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 0f0f8fcd61b03..ee9bc82a0dfbe 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1130,24 +1130,17 @@ xfs_reclaim_inode_grab(
  *	dirty, async	=> requeue
  *	dirty, sync	=> flush, wait and reclaim
  */
-STATIC int
+static bool
 xfs_reclaim_inode(
 	struct xfs_inode	*ip,
 	struct xfs_perag	*pag,
 	int			sync_mode)
 {
-	struct xfs_buf		*bp = NULL;
 	xfs_ino_t		ino = ip->i_ino; /* for radix_tree_delete */
-	int			error;
 
-restart:
-	error = 0;
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
-	if (!xfs_iflock_nowait(ip)) {
-		if (!(sync_mode & SYNC_WAIT))
-			goto out;
-		xfs_iflock(ip);
-	}
+	if (!xfs_iflock_nowait(ip))
+		goto out;
 
 	if (XFS_FORCED_SHUTDOWN(ip->i_mount)) {
 		xfs_iunpin_wait(ip);
@@ -1155,51 +1148,19 @@ xfs_reclaim_inode(
 		xfs_iflush_abort(ip);
 		goto reclaim;
 	}
-	if (xfs_ipincount(ip)) {
-		if (!(sync_mode & SYNC_WAIT))
-			goto out_ifunlock;
-		xfs_iunpin_wait(ip);
-	}
+	if (xfs_ipincount(ip))
+		goto out_ifunlock;
 	if (xfs_iflags_test(ip, XFS_ISTALE) || xfs_inode_clean(ip)) {
 		xfs_ifunlock(ip);
 		goto reclaim;
 	}
 
-	/*
-	 * Never flush out dirty data during non-blocking reclaim, as it would
-	 * just contend with AIL pushing trying to do the same job.
-	 */
-	if (!(sync_mode & SYNC_WAIT))
-		goto out_ifunlock;
-
-	/*
-	 * Now we have an inode that needs flushing.
-	 *
-	 * Note that xfs_iflush will never block on the inode buffer lock, as
-	 * xfs_ifree_cluster() can lock the inode buffer before it locks the
-	 * ip->i_lock, and we are doing the exact opposite here.  As a result,
-	 * doing a blocking xfs_imap_to_bp() to get the cluster buffer would
-	 * result in an ABBA deadlock with xfs_ifree_cluster().
-	 *
-	 * As xfs_ifree_cluser() must gather all inodes that are active in the
-	 * cache to mark them stale, if we hit this case we don't actually want
-	 * to do IO here - we want the inode marked stale so we can simply
-	 * reclaim it.  Hence if we get an EAGAIN error here,  just unlock the
-	 * inode, back off and try again.  Hopefully the next pass through will
-	 * see the stale flag set on the inode.
-	 */
-	error = xfs_iflush(ip, &bp);
-	if (error == -EAGAIN) {
-		xfs_iunlock(ip, XFS_ILOCK_EXCL);
-		/* backoff longer than in xfs_ifree_cluster */
-		delay(2);
-		goto restart;
-	}
-
-	if (!error) {
-		error = xfs_bwrite(bp);
-		xfs_buf_relse(bp);
-	}
+out_ifunlock:
+	xfs_ifunlock(ip);
+out:
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	xfs_iflags_clear(ip, XFS_IRECLAIM);
+	return false;
 
 reclaim:
 	ASSERT(!xfs_isiflocked(ip));
@@ -1249,21 +1210,7 @@ xfs_reclaim_inode(
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 
 	__xfs_inode_free(ip);
-	return error;
-
-out_ifunlock:
-	xfs_ifunlock(ip);
-out:
-	xfs_iflags_clear(ip, XFS_IRECLAIM);
-	xfs_iunlock(ip, XFS_ILOCK_EXCL);
-	/*
-	 * We could return -EAGAIN here to make reclaim rescan the inode tree in
-	 * a short while. However, this just burns CPU time scanning the tree
-	 * waiting for IO to complete and the reclaim work never goes back to
-	 * the idle state. Instead, return 0 to let the next scheduled
-	 * background reclaim attempt to reclaim the inode again.
-	 */
-	return 0;
+	return true;
 }
 
 /*
@@ -1272,20 +1219,17 @@ xfs_reclaim_inode(
  * then a shut down during filesystem unmount reclaim walk leak all the
  * unreclaimed inodes.
  */
-STATIC int
+static int
 xfs_reclaim_inodes_ag(
 	struct xfs_mount	*mp,
 	int			flags,
 	int			*nr_to_scan)
 {
 	struct xfs_perag	*pag;
-	int			error = 0;
-	int			last_error = 0;
 	xfs_agnumber_t		ag;
 	int			trylock = flags & SYNC_TRYLOCK;
 	int			skipped;
 
-restart:
 	ag = 0;
 	skipped = 0;
 	while ((pag = xfs_perag_get_tag(mp, ag, XFS_ICI_RECLAIM_TAG))) {
@@ -1359,9 +1303,8 @@ xfs_reclaim_inodes_ag(
 			for (i = 0; i < nr_found; i++) {
 				if (!batch[i])
 					continue;
-				error = xfs_reclaim_inode(batch[i], pag, flags);
-				if (error && last_error != -EFSCORRUPTED)
-					last_error = error;
+				if (!xfs_reclaim_inode(batch[i], pag, flags))
+					skipped++;
 			}
 
 			*nr_to_scan -= XFS_LOOKUP_BATCH;
@@ -1377,19 +1320,7 @@ xfs_reclaim_inodes_ag(
 		mutex_unlock(&pag->pag_ici_reclaim_lock);
 		xfs_perag_put(pag);
 	}
-
-	/*
-	 * if we skipped any AG, and we still have scan count remaining, do
-	 * another pass this time using blocking reclaim semantics (i.e
-	 * waiting on the reclaim locks and ignoring the reclaim cursors). This
-	 * ensure that when we get more reclaimers than AGs we block rather
-	 * than spin trying to execute reclaim.
-	 */
-	if (skipped && (flags & SYNC_WAIT) && *nr_to_scan > 0) {
-		trylock = 0;
-		goto restart;
-	}
-	return last_error;
+	return skipped;
 }
 
 int
@@ -1398,8 +1329,18 @@ xfs_reclaim_inodes(
 	int		mode)
 {
 	int		nr_to_scan = INT_MAX;
+	int		skipped;
 
-	return xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan);
+	skipped = xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan);
+	if (!(mode & SYNC_WAIT))
+		return 0;
+
+	do {
+		xfs_ail_push_all_sync(mp->m_ail);
+		skipped = xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan);
+	} while (skipped > 0);
+
+	return 0;
 }
 
 /*
@@ -1420,7 +1361,8 @@ xfs_reclaim_inodes_nr(
 	xfs_reclaim_work_queue(mp);
 	xfs_ail_push_all(mp->m_ail);
 
-	return xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK, &nr_to_scan);
+	xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK, &nr_to_scan);
+	return 0;
 }
 
 /*
-- 
2.26.2.761.g0e0b3e54be


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [PATCH 15/24] xfs: allow multiple reclaimers per AG
  2020-05-22  3:50 [PATCH 00/24] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (13 preceding siblings ...)
  2020-05-22  3:50 ` [PATCH 14/24] xfs: remove IO submission from xfs_reclaim_inode() Dave Chinner
@ 2020-05-22  3:50 ` Dave Chinner
  2020-05-22 23:10   ` Darrick J. Wong
  2020-05-22  3:50 ` [PATCH 16/24] xfs: don't block inode reclaim on the ILOCK Dave Chinner
                   ` (10 subsequent siblings)
  25 siblings, 1 reply; 91+ messages in thread
From: Dave Chinner @ 2020-05-22  3:50 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Inode reclaim will still throttle direct reclaim on the per-ag
reclaim locks. This is no longer necessary as reclaim can run
non-blocking now. Hence we can remove these locks so that we don't
arbitrarily block reclaimers just because there are more direct
reclaimers than there are AGs.

This can result in multiple reclaimers working on the same range of
an AG, but this doesn't cause any apparent issues. Optimising the
spread of concurrent reclaimers for best efficiency can be done in a
future patchset.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_icache.c | 28 +++++++---------------------
 fs/xfs/xfs_mount.c  |  4 ----
 fs/xfs/xfs_mount.h  |  1 -
 3 files changed, 7 insertions(+), 26 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index ee9bc82a0dfbe..f44493b2eae77 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1226,12 +1226,9 @@ xfs_reclaim_inodes_ag(
 	int			*nr_to_scan)
 {
 	struct xfs_perag	*pag;
-	xfs_agnumber_t		ag;
-	int			trylock = flags & SYNC_TRYLOCK;
-	int			skipped;
+	xfs_agnumber_t		ag = 0;
+	int			skipped = 0;
 
-	ag = 0;
-	skipped = 0;
 	while ((pag = xfs_perag_get_tag(mp, ag, XFS_ICI_RECLAIM_TAG))) {
 		unsigned long	first_index = 0;
 		int		done = 0;
@@ -1239,16 +1236,7 @@ xfs_reclaim_inodes_ag(
 
 		ag = pag->pag_agno + 1;
 
-		if (trylock) {
-			if (!mutex_trylock(&pag->pag_ici_reclaim_lock)) {
-				skipped++;
-				xfs_perag_put(pag);
-				continue;
-			}
-			first_index = pag->pag_ici_reclaim_cursor;
-		} else
-			mutex_lock(&pag->pag_ici_reclaim_lock);
-
+		first_index = READ_ONCE(pag->pag_ici_reclaim_cursor);
 		do {
 			struct xfs_inode *batch[XFS_LOOKUP_BATCH];
 			int	i;
@@ -1313,11 +1301,9 @@ xfs_reclaim_inodes_ag(
 
 		} while (nr_found && !done && *nr_to_scan > 0);
 
-		if (trylock && !done)
-			pag->pag_ici_reclaim_cursor = first_index;
-		else
-			pag->pag_ici_reclaim_cursor = 0;
-		mutex_unlock(&pag->pag_ici_reclaim_lock);
+		if (done)
+			first_index = 0;
+		WRITE_ONCE(pag->pag_ici_reclaim_cursor, first_index);
 		xfs_perag_put(pag);
 	}
 	return skipped;
@@ -1331,7 +1317,7 @@ xfs_reclaim_inodes(
 	int		nr_to_scan = INT_MAX;
 	int		skipped;
 
-	skipped = xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan);
+	xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan);
 	if (!(mode & SYNC_WAIT))
 		return 0;
 
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index d5dcf98698600..03158b42a1943 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -148,7 +148,6 @@ xfs_free_perag(
 		ASSERT(atomic_read(&pag->pag_ref) == 0);
 		xfs_iunlink_destroy(pag);
 		xfs_buf_hash_destroy(pag);
-		mutex_destroy(&pag->pag_ici_reclaim_lock);
 		call_rcu(&pag->rcu_head, __xfs_free_perag);
 	}
 }
@@ -200,7 +199,6 @@ xfs_initialize_perag(
 		pag->pag_agno = index;
 		pag->pag_mount = mp;
 		spin_lock_init(&pag->pag_ici_lock);
-		mutex_init(&pag->pag_ici_reclaim_lock);
 		INIT_RADIX_TREE(&pag->pag_ici_root, GFP_ATOMIC);
 		if (xfs_buf_hash_init(pag))
 			goto out_free_pag;
@@ -242,7 +240,6 @@ xfs_initialize_perag(
 out_hash_destroy:
 	xfs_buf_hash_destroy(pag);
 out_free_pag:
-	mutex_destroy(&pag->pag_ici_reclaim_lock);
 	kmem_free(pag);
 out_unwind_new_pags:
 	/* unwind any prior newly initialized pags */
@@ -252,7 +249,6 @@ xfs_initialize_perag(
 			break;
 		xfs_buf_hash_destroy(pag);
 		xfs_iunlink_destroy(pag);
-		mutex_destroy(&pag->pag_ici_reclaim_lock);
 		kmem_free(pag);
 	}
 	return error;
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 3725d25ad97e8..a72cfcaa4ad12 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -354,7 +354,6 @@ typedef struct xfs_perag {
 	spinlock_t	pag_ici_lock;	/* incore inode cache lock */
 	struct radix_tree_root pag_ici_root;	/* incore inode cache root */
 	int		pag_ici_reclaimable;	/* reclaimable inodes */
-	struct mutex	pag_ici_reclaim_lock;	/* serialisation point */
 	unsigned long	pag_ici_reclaim_cursor;	/* reclaim restart point */
 
 	/* buffer cache index */
-- 
2.26.2.761.g0e0b3e54be


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [PATCH 16/24] xfs: don't block inode reclaim on the ILOCK
  2020-05-22  3:50 [PATCH 00/24] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (14 preceding siblings ...)
  2020-05-22  3:50 ` [PATCH 15/24] xfs: allow multiple reclaimers per AG Dave Chinner
@ 2020-05-22  3:50 ` Dave Chinner
  2020-05-22 23:11   ` Darrick J. Wong
  2020-05-22  3:50 ` [PATCH 17/24] xfs: remove SYNC_TRYLOCK from inode reclaim Dave Chinner
                   ` (9 subsequent siblings)
  25 siblings, 1 reply; 91+ messages in thread
From: Dave Chinner @ 2020-05-22  3:50 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

When we attempt to reclaim an inode, the first thing we do it take
the inode lock. This is blocking right now, so if the inode being
accessed by something else (e.g. being flushed to the cluster
buffer) we will block here.

Change this to a trylock so that we do not block inode reclaim
unnecessarily here.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_icache.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index f44493b2eae77..c020d2379e12e 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1138,9 +1138,10 @@ xfs_reclaim_inode(
 {
 	xfs_ino_t		ino = ip->i_ino; /* for radix_tree_delete */
 
-	xfs_ilock(ip, XFS_ILOCK_EXCL);
-	if (!xfs_iflock_nowait(ip))
+	if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
 		goto out;
+	if (!xfs_iflock_nowait(ip))
+		goto out_iunlock;
 
 	if (XFS_FORCED_SHUTDOWN(ip->i_mount)) {
 		xfs_iunpin_wait(ip);
@@ -1157,8 +1158,9 @@ xfs_reclaim_inode(
 
 out_ifunlock:
 	xfs_ifunlock(ip);
-out:
+out_iunlock:
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+out:
 	xfs_iflags_clear(ip, XFS_IRECLAIM);
 	return false;
 
-- 
2.26.2.761.g0e0b3e54be


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [PATCH 17/24] xfs: remove SYNC_TRYLOCK from inode reclaim
  2020-05-22  3:50 [PATCH 00/24] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (15 preceding siblings ...)
  2020-05-22  3:50 ` [PATCH 16/24] xfs: don't block inode reclaim on the ILOCK Dave Chinner
@ 2020-05-22  3:50 ` Dave Chinner
  2020-05-22 23:14   ` Darrick J. Wong
  2020-05-22  3:50 ` [PATCH 18/24] xfs: clean up inode reclaim comments Dave Chinner
                   ` (8 subsequent siblings)
  25 siblings, 1 reply; 91+ messages in thread
From: Dave Chinner @ 2020-05-22  3:50 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

All background reclaim is SYNC_TRYLOCK already, and even blocking
reclaim (SYNC_WAIT) can use trylock mechanisms as
xfs_reclaim_inodes_ag() will keep cycling until there are no more
reclaimable inodes. Hence we can kill SYNC_TRYLOCK from inode
reclaim and make everything unconditionally non-blocking.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_icache.c | 27 +++++++++++----------------
 1 file changed, 11 insertions(+), 16 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index c020d2379e12e..8b366bc7b53c9 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1049,10 +1049,9 @@ xfs_inode_ag_iterator_tag(
  * Grab the inode for reclaim exclusively.
  * Return 0 if we grabbed it, non-zero otherwise.
  */
-STATIC int
+static int
 xfs_reclaim_inode_grab(
-	struct xfs_inode	*ip,
-	int			flags)
+	struct xfs_inode	*ip)
 {
 	ASSERT(rcu_read_lock_held());
 
@@ -1061,12 +1060,10 @@ xfs_reclaim_inode_grab(
 		return 1;
 
 	/*
-	 * If we are asked for non-blocking operation, do unlocked checks to
-	 * see if the inode already is being flushed or in reclaim to avoid
-	 * lock traffic.
+	 * Do unlocked checks to see if the inode already is being flushed or in
+	 * reclaim to avoid lock traffic.
 	 */
-	if ((flags & SYNC_TRYLOCK) &&
-	    __xfs_iflags_test(ip, XFS_IFLOCK | XFS_IRECLAIM))
+	if (__xfs_iflags_test(ip, XFS_IFLOCK | XFS_IRECLAIM))
 		return 1;
 
 	/*
@@ -1133,8 +1130,7 @@ xfs_reclaim_inode_grab(
 static bool
 xfs_reclaim_inode(
 	struct xfs_inode	*ip,
-	struct xfs_perag	*pag,
-	int			sync_mode)
+	struct xfs_perag	*pag)
 {
 	xfs_ino_t		ino = ip->i_ino; /* for radix_tree_delete */
 
@@ -1224,7 +1220,6 @@ xfs_reclaim_inode(
 static int
 xfs_reclaim_inodes_ag(
 	struct xfs_mount	*mp,
-	int			flags,
 	int			*nr_to_scan)
 {
 	struct xfs_perag	*pag;
@@ -1262,7 +1257,7 @@ xfs_reclaim_inodes_ag(
 			for (i = 0; i < nr_found; i++) {
 				struct xfs_inode *ip = batch[i];
 
-				if (done || xfs_reclaim_inode_grab(ip, flags))
+				if (done || xfs_reclaim_inode_grab(ip))
 					batch[i] = NULL;
 
 				/*
@@ -1293,7 +1288,7 @@ xfs_reclaim_inodes_ag(
 			for (i = 0; i < nr_found; i++) {
 				if (!batch[i])
 					continue;
-				if (!xfs_reclaim_inode(batch[i], pag, flags))
+				if (!xfs_reclaim_inode(batch[i], pag))
 					skipped++;
 			}
 
@@ -1319,13 +1314,13 @@ xfs_reclaim_inodes(
 	int		nr_to_scan = INT_MAX;
 	int		skipped;
 
-	xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan);
+	xfs_reclaim_inodes_ag(mp, &nr_to_scan);
 	if (!(mode & SYNC_WAIT))
 		return 0;
 
 	do {
 		xfs_ail_push_all_sync(mp->m_ail);
-		skipped = xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan);
+		skipped = xfs_reclaim_inodes_ag(mp, &nr_to_scan);
 	} while (skipped > 0);
 
 	return 0;
@@ -1349,7 +1344,7 @@ xfs_reclaim_inodes_nr(
 	xfs_reclaim_work_queue(mp);
 	xfs_ail_push_all(mp->m_ail);
 
-	xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK, &nr_to_scan);
+	xfs_reclaim_inodes_ag(mp, &nr_to_scan);
 	return 0;
 }
 
-- 
2.26.2.761.g0e0b3e54be


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [PATCH 18/24] xfs: clean up inode reclaim comments
  2020-05-22  3:50 [PATCH 00/24] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (16 preceding siblings ...)
  2020-05-22  3:50 ` [PATCH 17/24] xfs: remove SYNC_TRYLOCK from inode reclaim Dave Chinner
@ 2020-05-22  3:50 ` Dave Chinner
  2020-05-22 23:17   ` Darrick J. Wong
  2020-05-22  3:50 ` [PATCH 19/24] xfs: attach inodes to the cluster buffer when dirtied Dave Chinner
                   ` (7 subsequent siblings)
  25 siblings, 1 reply; 91+ messages in thread
From: Dave Chinner @ 2020-05-22  3:50 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Inode reclaim is quite different now to the way described in various
comments, so update all the comments explaining what it does and how
it works.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_icache.c | 128 ++++++++++++--------------------------------
 1 file changed, 35 insertions(+), 93 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 8b366bc7b53c9..81c23384d3310 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -141,11 +141,8 @@ xfs_inode_free(
 }
 
 /*
- * Queue a new inode reclaim pass if there are reclaimable inodes and there
- * isn't a reclaim pass already in progress. By default it runs every 5s based
- * on the xfs periodic sync default of 30s. Perhaps this should have it's own
- * tunable, but that can be done if this method proves to be ineffective or too
- * aggressive.
+ * Queue background inode reclaim work if there are reclaimable inodes and there
+ * isn't reclaim work already scheduled or in progress.
  */
 static void
 xfs_reclaim_work_queue(
@@ -164,8 +161,7 @@ xfs_reclaim_work_queue(
  * This is a fast pass over the inode cache to try to get reclaim moving on as
  * many inodes as possible in a short period of time. It kicks itself every few
  * seconds, as well as being kicked by the inode cache shrinker when memory
- * goes low. It scans as quickly as possible avoiding locked inodes or those
- * already being flushed, and once done schedules a future pass.
+ * goes low.
  */
 void
 xfs_reclaim_worker(
@@ -618,48 +614,31 @@ xfs_iget_cache_miss(
 }
 
 /*
- * Look up an inode by number in the given file system.
- * The inode is looked up in the cache held in each AG.
- * If the inode is found in the cache, initialise the vfs inode
- * if necessary.
+ * Look up an inode by number in the given file system.  The inode is looked up
+ * in the cache held in each AG.  If the inode is found in the cache, initialise
+ * the vfs inode if necessary.
  *
- * If it is not in core, read it in from the file system's device,
- * add it to the cache and initialise the vfs inode.
+ * If it is not in core, read it in from the file system's device, add it to the
+ * cache and initialise the vfs inode.
  *
  * The inode is locked according to the value of the lock_flags parameter.
- * This flag parameter indicates how and if the inode's IO lock and inode lock
- * should be taken.
- *
- * mp -- the mount point structure for the current file system.  It points
- *       to the inode hash table.
- * tp -- a pointer to the current transaction if there is one.  This is
- *       simply passed through to the xfs_iread() call.
- * ino -- the number of the inode desired.  This is the unique identifier
- *        within the file system for the inode being requested.
- * lock_flags -- flags indicating how to lock the inode.  See the comment
- *		 for xfs_ilock() for a list of valid values.
+ * Inode lookup is only done during metadata operations and not as part of the
+ * data IO path. Hence we only allow locking of the XFS_ILOCK during lookup.
  */
 int
 xfs_iget(
-	xfs_mount_t	*mp,
-	xfs_trans_t	*tp,
-	xfs_ino_t	ino,
-	uint		flags,
-	uint		lock_flags,
-	xfs_inode_t	**ipp)
+	struct xfs_mount	*mp,
+	struct xfs_trans	*tp,
+	xfs_ino_t		ino,
+	uint			flags,
+	uint			lock_flags,
+	struct xfs_inode	**ipp)
 {
-	xfs_inode_t	*ip;
-	int		error;
-	xfs_perag_t	*pag;
-	xfs_agino_t	agino;
+	struct xfs_inode	*ip;
+	struct xfs_perag	*pag;
+	xfs_agino_t		agino;
+	int			error;
 
-	/*
-	 * xfs_reclaim_inode() uses the ILOCK to ensure an inode
-	 * doesn't get freed while it's being referenced during a
-	 * radix tree traversal here.  It assumes this function
-	 * aqcuires only the ILOCK (and therefore it has no need to
-	 * involve the IOLOCK in this synchronization).
-	 */
 	ASSERT((lock_flags & (XFS_IOLOCK_EXCL | XFS_IOLOCK_SHARED)) == 0);
 
 	/* reject inode numbers outside existing AGs */
@@ -771,15 +750,7 @@ xfs_inode_ag_walk_grab(
 
 	ASSERT(rcu_read_lock_held());
 
-	/*
-	 * check for stale RCU freed inode
-	 *
-	 * If the inode has been reallocated, it doesn't matter if it's not in
-	 * the AG we are walking - we are walking for writeback, so if it
-	 * passes all the "valid inode" checks and is dirty, then we'll write
-	 * it back anyway.  If it has been reallocated and still being
-	 * initialised, the XFS_INEW check below will catch it.
-	 */
+	/* Check for stale RCU freed inode */
 	spin_lock(&ip->i_flags_lock);
 	if (!ip->i_ino)
 		goto out_unlock_noent;
@@ -1089,43 +1060,16 @@ xfs_reclaim_inode_grab(
 }
 
 /*
- * Inodes in different states need to be treated differently. The following
- * table lists the inode states and the reclaim actions necessary:
- *
- *	inode state	     iflush ret		required action
- *      ---------------      ----------         ---------------
- *	bad			-		reclaim
- *	shutdown		EIO		unpin and reclaim
- *	clean, unpinned		0		reclaim
- *	stale, unpinned		0		reclaim
- *	clean, pinned(*)	0		requeue
- *	stale, pinned		EAGAIN		requeue
- *	dirty, async		-		requeue
- *	dirty, sync		0		reclaim
+ * Inode reclaim is non-blocking, so the default action if progress cannot be
+ * made is to "requeue" the inode for reclaim by unlocking it and clearing the
+ * XFS_IRECLAIM flag.  If we are in a shutdown state, we don't care about
+ * blocking anymore and hence we can wait for the inode to be able to reclaim
+ * it.
  *
- * (*) dgc: I don't think the clean, pinned state is possible but it gets
- * handled anyway given the order of checks implemented.
- *
- * Also, because we get the flush lock first, we know that any inode that has
- * been flushed delwri has had the flush completed by the time we check that
- * the inode is clean.
- *
- * Note that because the inode is flushed delayed write by AIL pushing, the
- * flush lock may already be held here and waiting on it can result in very
- * long latencies.  Hence for sync reclaims, where we wait on the flush lock,
- * the caller should push the AIL first before trying to reclaim inodes to
- * minimise the amount of time spent waiting.  For background relaim, we only
- * bother to reclaim clean inodes anyway.
- *
- * Hence the order of actions after gaining the locks should be:
- *	bad		=> reclaim
- *	shutdown	=> unpin and reclaim
- *	pinned, async	=> requeue
- *	pinned, sync	=> unpin
- *	stale		=> reclaim
- *	clean		=> reclaim
- *	dirty, async	=> requeue
- *	dirty, sync	=> flush, wait and reclaim
+ * We do no IO here - if callers require indoes to be cleaned they must push the
+ * AIL first to trigger writeback of dirty inodes. The enbales both the
+ * writeback to be done in the background in a non-blocking manner, as well as
+ * allowing reclaim to make progress without blocking.
  */
 static bool
 xfs_reclaim_inode(
@@ -1327,13 +1271,11 @@ xfs_reclaim_inodes(
 }
 
 /*
- * Scan a certain number of inodes for reclaim.
- *
- * When called we make sure that there is a background (fast) inode reclaim in
- * progress, while we will throttle the speed of reclaim via doing synchronous
- * reclaim of inodes. That means if we come across dirty inodes, we wait for
- * them to be cleaned, which we hope will not be very long due to the
- * background walker having already kicked the IO off on those dirty inodes.
+ * The shrinker infrastructure determines how many inodes we should scan for
+ * reclaim. We want as many clean inodes ready to reclaim as possible, so we
+ * push the AIL here. We also want to proactively free up memory if we can to
+ * minimise the amount of work memory reclaim has to do so we kick the
+ * background reclaim if it isn't already scheduled.
  */
 long
 xfs_reclaim_inodes_nr(
-- 
2.26.2.761.g0e0b3e54be


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [PATCH 19/24] xfs: attach inodes to the cluster buffer when dirtied
  2020-05-22  3:50 [PATCH 00/24] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (17 preceding siblings ...)
  2020-05-22  3:50 ` [PATCH 18/24] xfs: clean up inode reclaim comments Dave Chinner
@ 2020-05-22  3:50 ` Dave Chinner
  2020-05-22 23:48   ` Darrick J. Wong
  2020-05-22  3:50 ` [PATCH 20/24] xfs: xfs_iflush() is no longer necessary Dave Chinner
                   ` (6 subsequent siblings)
  25 siblings, 1 reply; 91+ messages in thread
From: Dave Chinner @ 2020-05-22  3:50 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Rather than attach inodes to the cluster buffer just when we are
doing IO, attach the inodes to the cluster buffer when they are
dirtied. The means the buffer always carries a list of dirty inodes
that reference it, and we can use that list to make more fundamental
changes to inode writeback that aren't otherwise possible.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_trans_inode.c |   9 +-
 fs/xfs/xfs_inode.c              | 154 +++++++++++---------------------
 fs/xfs/xfs_inode_item.c         |  19 ++--
 3 files changed, 68 insertions(+), 114 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
index e130eb2994156..e5f58a4b13eee 100644
--- a/fs/xfs/libxfs/xfs_trans_inode.c
+++ b/fs/xfs/libxfs/xfs_trans_inode.c
@@ -162,13 +162,16 @@ xfs_trans_log_inode(
 		/*
 		 * We need an explicit buffer reference for the log item, We
 		 * don't want the buffer attached to the transaction, so we have
-		 * to release the transaction reference we just gained.
+		 * to release the transaction reference we just gained. But we
+		 * don't want to drop the buffer lock until we've attached the
+		 * inode item to the buffer log item list....
 		 */
 		xfs_buf_hold(bp);
-		xfs_trans_brelse(tp, bp);
-
 		spin_lock(&iip->ili_lock);
 		iip->ili_item.li_buf = bp;
+		bp->b_flags |= _XBF_INODES;
+		list_add_tail(&iip->ili_item.li_bio_list, &bp->b_li_list);
+		xfs_trans_brelse(tp, bp);
 	}
 
 	/*
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index c5529853f513c..961eef2ce8c4a 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2498,17 +2498,18 @@ xfs_iunlink_remove(
 }
 
 /*
- * Look up the inode number specified and mark it stale if it is found. If it is
- * dirty, return the inode so it can be attached to the cluster buffer so it can
- * be processed appropriately when the cluster free transaction completes.
+ * Look up the inode number specified and if it is not already marked XFS_ISTALE
+ * mark it stale. We should only find clean inodes in this lookup that aren't
+ * already stale.
  */
-static struct xfs_inode *
-xfs_ifree_get_one_inode(
+static void
+xfs_ifree_mark_inode_stale(
 	struct xfs_perag	*pag,
 	struct xfs_inode	*free_ip,
 	xfs_ino_t		inum)
 {
 	struct xfs_mount	*mp = pag->pag_mount;
+	struct xfs_inode_log_item *iip;
 	struct xfs_inode	*ip;
 
 retry:
@@ -2516,8 +2517,10 @@ xfs_ifree_get_one_inode(
 	ip = radix_tree_lookup(&pag->pag_ici_root, XFS_INO_TO_AGINO(mp, inum));
 
 	/* Inode not in memory, nothing to do */
-	if (!ip)
-		goto out_rcu_unlock;
+	if (!ip) {
+		rcu_read_unlock();
+		return;
+	}
 
 	/*
 	 * because this is an RCU protected lookup, we could find a recently
@@ -2528,9 +2531,9 @@ xfs_ifree_get_one_inode(
 	spin_lock(&ip->i_flags_lock);
 	if (ip->i_ino != inum || __xfs_iflags_test(ip, XFS_ISTALE)) {
 		spin_unlock(&ip->i_flags_lock);
-		goto out_rcu_unlock;
+		rcu_read_unlock();
+		return;
 	}
-	spin_unlock(&ip->i_flags_lock);
 
 	/*
 	 * Don't try to lock/unlock the current inode, but we _cannot_ skip the
@@ -2540,43 +2543,45 @@ xfs_ifree_get_one_inode(
 	 */
 	if (ip != free_ip) {
 		if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL)) {
+			spin_unlock(&ip->i_flags_lock);
 			rcu_read_unlock();
 			delay(1);
 			goto retry;
 		}
-
-		/*
-		 * Check the inode number again in case we're racing with
-		 * freeing in xfs_reclaim_inode().  See the comments in that
-		 * function for more information as to why the initial check is
-		 * not sufficient.
-		 */
-		if (ip->i_ino != inum) {
-			xfs_iunlock(ip, XFS_ILOCK_EXCL);
-			goto out_rcu_unlock;
-		}
 	}
+	ip->i_flags |= XFS_ISTALE;
+	spin_unlock(&ip->i_flags_lock);
 	rcu_read_unlock();
 
-	xfs_iflock(ip);
-	xfs_iflags_set(ip, XFS_ISTALE);
+	/* If we can't get the flush lock, the inode is already attached */
+	if (!xfs_iflock_nowait(ip))
+		goto out_iunlock;
 
 	/*
-	 * We don't need to attach clean inodes or those only with unlogged
-	 * changes (which we throw away, anyway).
+	 * Inodes not attached to the buffer can be released immediately.
+	 * Everything else has to go through xfs_iflush_abort() on journal
+	 * commit as the flock synchronises removal of the inode from the
+	 * cluster buffer against inode reclaim.
 	 */
-	if (!ip->i_itemp || xfs_inode_clean(ip)) {
-		ASSERT(ip != free_ip);
+	if (!ip->i_itemp || list_empty(&ip->i_itemp->ili_item.li_bio_list)) {
 		xfs_ifunlock(ip);
-		xfs_iunlock(ip, XFS_ILOCK_EXCL);
-		goto out_no_inode;
+		goto out_iunlock;
 	}
-	return ip;
 
-out_rcu_unlock:
-	rcu_read_unlock();
-out_no_inode:
-	return NULL;
+	/* we have a dirty inode in memory that has not yet been flushed. */
+	iip = ip->i_itemp;
+	ASSERT(!list_empty(&iip->ili_item.li_bio_list));
+
+	spin_lock(&iip->ili_lock);
+	iip->ili_last_fields = iip->ili_fields;
+	iip->ili_fields = 0;
+	iip->ili_fsync_fields = 0;
+	spin_unlock(&iip->ili_lock);
+	xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,
+				&iip->ili_item.li_lsn);
+out_iunlock:
+	if (ip != free_ip)
+		xfs_iunlock(ip, XFS_ILOCK_EXCL);
 }
 
 /*
@@ -2586,25 +2591,21 @@ xfs_ifree_get_one_inode(
  */
 STATIC int
 xfs_ifree_cluster(
-	xfs_inode_t		*free_ip,
-	xfs_trans_t		*tp,
+	struct xfs_inode	*free_ip,
+	struct xfs_trans	*tp,
 	struct xfs_icluster	*xic)
 {
-	xfs_mount_t		*mp = free_ip->i_mount;
+	struct xfs_mount	*mp = free_ip->i_mount;
+	struct xfs_ino_geometry	*igeo = M_IGEO(mp);
+	struct xfs_perag	*pag;
+	struct xfs_buf		*bp;
+	xfs_daddr_t		blkno;
+	xfs_ino_t		inum = xic->first_ino;
 	int			nbufs;
 	int			i, j;
 	int			ioffset;
-	xfs_daddr_t		blkno;
-	xfs_buf_t		*bp;
-	xfs_inode_t		*ip;
-	struct xfs_inode_log_item *iip;
-	struct xfs_log_item	*lip;
-	struct xfs_perag	*pag;
-	struct xfs_ino_geometry	*igeo = M_IGEO(mp);
-	xfs_ino_t		inum;
 	int			error;
 
-	inum = xic->first_ino;
 	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, inum));
 	nbufs = igeo->ialloc_blks / igeo->blocks_per_cluster;
 
@@ -2649,53 +2650,12 @@ xfs_ifree_cluster(
 		bp->b_ops = &xfs_inode_buf_ops;
 
 		/*
-		 * Walk the inodes already attached to the buffer and mark them
-		 * stale. These will all have the flush locks held, so an
-		 * in-memory inode walk can't lock them. By marking them all
-		 * stale first, we will not attempt to lock them in the loop
-		 * below as the XFS_ISTALE flag will be set.
-		 */
-		list_for_each_entry(lip, &bp->b_li_list, li_bio_list) {
-			if (lip->li_type == XFS_LI_INODE) {
-				iip = (struct xfs_inode_log_item *)lip;
-				xfs_trans_ail_copy_lsn(mp->m_ail,
-							&iip->ili_flush_lsn,
-							&iip->ili_item.li_lsn);
-				xfs_iflags_set(iip->ili_inode, XFS_ISTALE);
-			}
-		}
-
-
-		/*
-		 * For each inode in memory attempt to add it to the inode
-		 * buffer and set it up for being staled on buffer IO
-		 * completion.  This is safe as we've locked out tail pushing
-		 * and flushing by locking the buffer.
-		 *
-		 * We have already marked every inode that was part of a
-		 * transaction stale above, which means there is no point in
-		 * even trying to lock them.
+		 * Now we need to set all the cached clean inodes as XFS_ISTALE,
+		 * too. This requires lookups, and will skip inodes that we've
+		 * already marked XFS_ISTALE.
 		 */
-		for (i = 0; i < igeo->inodes_per_cluster; i++) {
-			ip = xfs_ifree_get_one_inode(pag, free_ip, inum + i);
-			if (!ip)
-				continue;
-
-			iip = ip->i_itemp;
-			spin_lock(&iip->ili_lock);
-			iip->ili_last_fields = iip->ili_fields;
-			iip->ili_fields = 0;
-			iip->ili_fsync_fields = 0;
-			spin_unlock(&iip->ili_lock);
-			xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,
-						&iip->ili_item.li_lsn);
-
-			list_add_tail(&iip->ili_item.li_bio_list,
-						&bp->b_li_list);
-
-			if (ip != free_ip)
-				xfs_iunlock(ip, XFS_ILOCK_EXCL);
-		}
+		for (i = 0; i < igeo->inodes_per_cluster; i++)
+			xfs_ifree_mark_inode_stale(pag, free_ip, inum + i);
 
 		xfs_trans_stale_inode_buf(tp, bp);
 		xfs_trans_binval(tp, bp);
@@ -3831,22 +3791,12 @@ xfs_iflush_int(
 	 * Store the current LSN of the inode so that we can tell whether the
 	 * item has moved in the AIL from xfs_iflush_done().
 	 */
+	ASSERT(iip->ili_item.li_lsn);
 	xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,
 				&iip->ili_item.li_lsn);
 
-	/*
-	 * Attach the inode item callback to the buffer whether the flush
-	 * succeeded or not. If not, the caller will shut down and fail I/O
-	 * completion on the buffer to remove the inode from the AIL and release
-	 * the flush lock.
-	 */
-	bp->b_flags |= _XBF_INODES;
-	list_add_tail(&iip->ili_item.li_bio_list, &bp->b_li_list);
-
 	/* generate the checksum. */
 	xfs_dinode_calc_crc(mp, dip);
-
-	ASSERT(!list_empty(&bp->b_li_list));
 	return error;
 }
 
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 86173a52526fe..7718cfc39eec8 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -683,10 +683,7 @@ xfs_inode_item_destroy(
  *
  * Note: Now that we attach the log item to the buffer when we first log the
  * inode in memory, we can have unflushed inodes on the buffer list here. These
- * inodes will have a zero ili_last_fields, so skip over them here. We do
- * this check -after- we've checked for stale inodes, because we're guaranteed
- * to have XFS_ISTALE set in the case that dirty inodes are in the CIL and have
- * not yet had their dirtying transactions committed to disk.
+ * inodes will have a zero ili_last_fields, so skip over them here.
  */
 void
 xfs_iflush_done(
@@ -704,15 +701,14 @@ xfs_iflush_done(
 	 */
 	list_for_each_entry_safe(lip, n, &bp->b_li_list, li_bio_list) {
 		iip = INODE_ITEM(lip);
+		if (!iip->ili_last_fields)
+			continue;
+
 		if (xfs_iflags_test(iip->ili_inode, XFS_ISTALE)) {
-			list_del_init(&lip->li_bio_list);
 			xfs_iflush_abort(iip->ili_inode);
 			continue;
 		}
 
-		if (!iip->ili_last_fields)
-			continue;
-
 		list_move_tail(&lip->li_bio_list, &tmp);
 
 		/* Do an unlocked check for needing the AIL lock. */
@@ -762,12 +758,16 @@ xfs_iflush_done(
 		/*
 		 * Remove the reference to the cluster buffer if the inode is
 		 * clean in memory. Drop the buffer reference once we've dropped
-		 * the locks we hold.
+		 * the locks we hold. If the inode is dirty in memory, we need
+		 * to put the inode item back on the buffer list for another
+		 * pass through the flush machinery.
 		 */
 		ASSERT(iip->ili_item.li_buf == bp);
 		if (!iip->ili_fields) {
 			iip->ili_item.li_buf = NULL;
 			drop_buffer = true;
+		} else {
+			list_add(&lip->li_bio_list, &bp->b_li_list);
 		}
 		spin_unlock(&iip->ili_lock);
 		xfs_ifunlock(iip->ili_inode);
@@ -802,6 +802,7 @@ xfs_iflush_abort(
 		iip->ili_flush_lsn = 0;
 		bp = iip->ili_item.li_buf;
 		iip->ili_item.li_buf = NULL;
+		list_del_init(&iip->ili_item.li_bio_list);
 		spin_unlock(&iip->ili_lock);
 	}
 	xfs_ifunlock(ip);
-- 
2.26.2.761.g0e0b3e54be


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [PATCH 20/24] xfs: xfs_iflush() is no longer necessary
  2020-05-22  3:50 [PATCH 00/24] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (18 preceding siblings ...)
  2020-05-22  3:50 ` [PATCH 19/24] xfs: attach inodes to the cluster buffer when dirtied Dave Chinner
@ 2020-05-22  3:50 ` Dave Chinner
  2020-05-22 23:54   ` Darrick J. Wong
  2020-05-22  3:50 ` [PATCH 21/24] xfs: rename xfs_iflush_int() Dave Chinner
                   ` (5 subsequent siblings)
  25 siblings, 1 reply; 91+ messages in thread
From: Dave Chinner @ 2020-05-22  3:50 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Now we have a cached buffer on inode log items, we don't need
to do buffer lookups when flushing inodes anymore - all we need
to do is lock the buffer and we are ready to go.

This largely gets rid of the need for xfs_iflush(), which is
essentially just a mechanism to look up the buffer and flush the
inode to it. Instead, we can just call xfs_iflush_cluster() with a
few modifications to ensure it also flushes the inode we already
hold locked.

This allows the AIL inode item pushing to be almost entirely
non-blocking in XFS - we won't block unless memory allocation
for the cluster inode lookup blocks or the block device queues are
full.

Writeback during inode reclaim becomes a little more complex because
we now have to lock the buffer ourselves, but otherwise this change
is largely a functional no-op that removes a whole lot of code.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_inode.c      | 106 ++++++----------------------------------
 fs/xfs/xfs_inode.h      |   2 +-
 fs/xfs/xfs_inode_item.c |  51 +++++++------------
 3 files changed, 32 insertions(+), 127 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 961eef2ce8c4a..a94528d26328b 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3429,7 +3429,18 @@ xfs_rename(
 	return error;
 }
 
-STATIC int
+/*
+ * Non-blocking flush of dirty inode metadata into the backing buffer.
+ *
+ * The caller must have a reference to the inode and hold the cluster buffer
+ * locked. The function will walk across all the inodes on the cluster buffer it
+ * can find and lock without blocking, and flush them to the cluster buffer.
+ *
+ * On success, the caller must write out the buffer returned in *bp and
+ * release it. On failure, the filesystem will be shut down, the buffer will
+ * have been unlocked and released, and EFSCORRUPTED will be returned.
+ */
+int
 xfs_iflush_cluster(
 	struct xfs_inode	*ip,
 	struct xfs_buf		*bp)
@@ -3464,8 +3475,6 @@ xfs_iflush_cluster(
 
 	for (i = 0; i < nr_found; i++) {
 		cip = cilist[i];
-		if (cip == ip)
-			continue;
 
 		/*
 		 * because this is an RCU protected lookup, we could find a
@@ -3556,99 +3565,11 @@ xfs_iflush_cluster(
 	kmem_free(cilist);
 out_put:
 	xfs_perag_put(pag);
-	return error;
-}
-
-/*
- * Flush dirty inode metadata into the backing buffer.
- *
- * The caller must have the inode lock and the inode flush lock held.  The
- * inode lock will still be held upon return to the caller, and the inode
- * flush lock will be released after the inode has reached the disk.
- *
- * The caller must write out the buffer returned in *bpp and release it.
- */
-int
-xfs_iflush(
-	struct xfs_inode	*ip,
-	struct xfs_buf		**bpp)
-{
-	struct xfs_mount	*mp = ip->i_mount;
-	struct xfs_buf		*bp = NULL;
-	struct xfs_dinode	*dip;
-	int			error;
-
-	XFS_STATS_INC(mp, xs_iflush_count);
-
-	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL|XFS_ILOCK_SHARED));
-	ASSERT(xfs_isiflocked(ip));
-	ASSERT(ip->i_df.if_format != XFS_DINODE_FMT_BTREE ||
-	       ip->i_df.if_nextents > XFS_IFORK_MAXEXT(ip, XFS_DATA_FORK));
-
-	*bpp = NULL;
-
-	xfs_iunpin_wait(ip);
-
-	/*
-	 * For stale inodes we cannot rely on the backing buffer remaining
-	 * stale in cache for the remaining life of the stale inode and so
-	 * xfs_imap_to_bp() below may give us a buffer that no longer contains
-	 * inodes below. We have to check this after ensuring the inode is
-	 * unpinned so that it is safe to reclaim the stale inode after the
-	 * flush call.
-	 */
-	if (xfs_iflags_test(ip, XFS_ISTALE)) {
-		xfs_ifunlock(ip);
-		return 0;
-	}
-
-	/*
-	 * Get the buffer containing the on-disk inode. We are doing a try-lock
-	 * operation here, so we may get an EAGAIN error. In that case, return
-	 * leaving the inode dirty.
-	 *
-	 * If we get any other error, we effectively have a corruption situation
-	 * and we cannot flush the inode. Abort the flush and shut down.
-	 */
-	error = xfs_imap_to_bp(mp, NULL, &ip->i_imap, &dip, &bp, XBF_TRYLOCK);
-	if (error == -EAGAIN) {
-		xfs_ifunlock(ip);
-		return error;
-	}
-	if (error)
-		goto abort;
-
-	/*
-	 * If the buffer is pinned then push on the log now so we won't
-	 * get stuck waiting in the write for too long.
-	 */
-	if (xfs_buf_ispinned(bp))
-		xfs_log_force(mp, 0);
-
-	/*
-	 * Flush the provided inode then attempt to gather others from the
-	 * cluster into the write.
-	 *
-	 * Note: Once we attempt to flush an inode, we must run buffer
-	 * completion callbacks on any failure. If this fails, simulate an I/O
-	 * failure on the buffer and shut down.
-	 */
-	error = xfs_iflush_int(ip, bp);
-	if (!error)
-		error = xfs_iflush_cluster(ip, bp);
 	if (error) {
 		bp->b_flags |= XBF_ASYNC;
 		xfs_buf_ioend_fail(bp);
-		goto shutdown;
+		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
 	}
-
-	*bpp = bp;
-	return 0;
-
-abort:
-	xfs_iflush_abort(ip);
-shutdown:
-	xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
 	return error;
 }
 
@@ -3667,6 +3588,7 @@ xfs_iflush_int(
 	ASSERT(ip->i_df.if_format != XFS_DINODE_FMT_BTREE ||
 	       ip->i_df.if_nextents > XFS_IFORK_MAXEXT(ip, XFS_DATA_FORK));
 	ASSERT(iip != NULL && iip->ili_fields != 0);
+	ASSERT(iip->ili_item.li_buf == bp);
 
 	dip = xfs_buf_offset(bp, ip->i_imap.im_boffset);
 
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index dadcf19458960..d1109eb13ba2e 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -427,7 +427,7 @@ int		xfs_log_force_inode(struct xfs_inode *ip);
 void		xfs_iunpin_wait(xfs_inode_t *);
 #define xfs_ipincount(ip)	((unsigned int) atomic_read(&ip->i_pincount))
 
-int		xfs_iflush(struct xfs_inode *, struct xfs_buf **);
+int		xfs_iflush_cluster(struct xfs_inode *, struct xfs_buf *);
 void		xfs_lock_two_inodes(struct xfs_inode *ip0, uint ip0_mode,
 				struct xfs_inode *ip1, uint ip1_mode);
 
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 7718cfc39eec8..42b4b5fe1e2a9 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -504,53 +504,35 @@ xfs_inode_item_push(
 	uint			rval = XFS_ITEM_SUCCESS;
 	int			error;
 
-	if (xfs_ipincount(ip) > 0)
-		return XFS_ITEM_PINNED;
+	ASSERT(iip->ili_item.li_buf);
 
-	if (!xfs_ilock_nowait(ip, XFS_ILOCK_SHARED))
-		return XFS_ITEM_LOCKED;
+	if (xfs_ipincount(ip) > 0 || xfs_buf_ispinned(bp) ||
+	    (ip->i_flags & XFS_ISTALE))
+		return XFS_ITEM_PINNED;
 
-	/*
-	 * Re-check the pincount now that we stabilized the value by
-	 * taking the ilock.
-	 */
-	if (xfs_ipincount(ip) > 0) {
-		rval = XFS_ITEM_PINNED;
-		goto out_unlock;
-	}
+	/* If the inode is already flush locked, we're already flushing. */
+	if (xfs_isiflocked(ip))
+		return XFS_ITEM_FLUSHING;
 
-	/*
-	 * Stale inode items should force out the iclog.
-	 */
-	if (ip->i_flags & XFS_ISTALE) {
-		rval = XFS_ITEM_PINNED;
-		goto out_unlock;
-	}
+	if (!xfs_buf_trylock(bp))
+		return XFS_ITEM_LOCKED;
 
-	/*
-	 * Someone else is already flushing the inode.  Nothing we can do
-	 * here but wait for the flush to finish and remove the item from
-	 * the AIL.
-	 */
-	if (!xfs_iflock_nowait(ip)) {
-		rval = XFS_ITEM_FLUSHING;
-		goto out_unlock;
+	if (bp->b_flags & _XBF_DELWRI_Q) {
+		xfs_buf_unlock(bp);
+		return XFS_ITEM_FLUSHING;
 	}
-
-	ASSERT(iip->ili_fields != 0 || XFS_FORCED_SHUTDOWN(ip->i_mount));
 	spin_unlock(&lip->li_ailp->ail_lock);
 
-	error = xfs_iflush(ip, &bp);
+	error = xfs_iflush_cluster(ip, bp);
 	if (!error) {
 		if (!xfs_buf_delwri_queue(bp, buffer_list))
 			rval = XFS_ITEM_FLUSHING;
-		xfs_buf_relse(bp);
-	} else if (error == -EAGAIN)
+		xfs_buf_unlock(bp);
+	} else {
 		rval = XFS_ITEM_LOCKED;
+	}
 
 	spin_lock(&lip->li_ailp->ail_lock);
-out_unlock:
-	xfs_iunlock(ip, XFS_ILOCK_SHARED);
 	return rval;
 }
 
@@ -567,6 +549,7 @@ xfs_inode_item_release(
 
 	ASSERT(ip->i_itemp != NULL);
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
+	ASSERT(lip->li_buf || !test_bit(XFS_LI_DIRTY, &lip->li_flags));
 
 	lock_flags = iip->ili_lock_flags;
 	iip->ili_lock_flags = 0;
-- 
2.26.2.761.g0e0b3e54be


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [PATCH 21/24] xfs: rename xfs_iflush_int()
  2020-05-22  3:50 [PATCH 00/24] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (19 preceding siblings ...)
  2020-05-22  3:50 ` [PATCH 20/24] xfs: xfs_iflush() is no longer necessary Dave Chinner
@ 2020-05-22  3:50 ` Dave Chinner
  2020-05-22 12:33   ` Amir Goldstein
  2020-05-22 23:57   ` Darrick J. Wong
  2020-05-22  3:50 ` [PATCH 22/24] xfs: rework xfs_iflush_cluster() dirty inode iteration Dave Chinner
                   ` (4 subsequent siblings)
  25 siblings, 2 replies; 91+ messages in thread
From: Dave Chinner @ 2020-05-22  3:50 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

with xfs_iflush() gone, we can rename xfs_iflush_int() back to
xfs_iflush(). Also move it up above xfs_iflush_cluster() so we don't
need the forward definition any more.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_inode.c | 293 ++++++++++++++++++++++-----------------------
 1 file changed, 146 insertions(+), 147 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index a94528d26328b..cbf8edf62d102 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -44,7 +44,6 @@ kmem_zone_t *xfs_inode_zone;
  */
 #define	XFS_ITRUNC_MAX_EXTENTS	2
 
-STATIC int xfs_iflush_int(struct xfs_inode *, struct xfs_buf *);
 STATIC int xfs_iunlink(struct xfs_trans *, struct xfs_inode *);
 STATIC int xfs_iunlink_remove(struct xfs_trans *, struct xfs_inode *);
 
@@ -3429,152 +3428,8 @@ xfs_rename(
 	return error;
 }
 
-/*
- * Non-blocking flush of dirty inode metadata into the backing buffer.
- *
- * The caller must have a reference to the inode and hold the cluster buffer
- * locked. The function will walk across all the inodes on the cluster buffer it
- * can find and lock without blocking, and flush them to the cluster buffer.
- *
- * On success, the caller must write out the buffer returned in *bp and
- * release it. On failure, the filesystem will be shut down, the buffer will
- * have been unlocked and released, and EFSCORRUPTED will be returned.
- */
-int
-xfs_iflush_cluster(
-	struct xfs_inode	*ip,
-	struct xfs_buf		*bp)
-{
-	struct xfs_mount	*mp = ip->i_mount;
-	struct xfs_perag	*pag;
-	unsigned long		first_index, mask;
-	int			cilist_size;
-	struct xfs_inode	**cilist;
-	struct xfs_inode	*cip;
-	struct xfs_ino_geometry	*igeo = M_IGEO(mp);
-	int			error = 0;
-	int			nr_found;
-	int			clcount = 0;
-	int			i;
-
-	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
-
-	cilist_size = igeo->inodes_per_cluster * sizeof(struct xfs_inode *);
-	cilist = kmem_alloc(cilist_size, KM_MAYFAIL|KM_NOFS);
-	if (!cilist)
-		goto out_put;
-
-	mask = ~(igeo->inodes_per_cluster - 1);
-	first_index = XFS_INO_TO_AGINO(mp, ip->i_ino) & mask;
-	rcu_read_lock();
-	/* really need a gang lookup range call here */
-	nr_found = radix_tree_gang_lookup(&pag->pag_ici_root, (void**)cilist,
-					first_index, igeo->inodes_per_cluster);
-	if (nr_found == 0)
-		goto out_free;
-
-	for (i = 0; i < nr_found; i++) {
-		cip = cilist[i];
-
-		/*
-		 * because this is an RCU protected lookup, we could find a
-		 * recently freed or even reallocated inode during the lookup.
-		 * We need to check under the i_flags_lock for a valid inode
-		 * here. Skip it if it is not valid or the wrong inode.
-		 */
-		spin_lock(&cip->i_flags_lock);
-		if (!cip->i_ino ||
-		    __xfs_iflags_test(cip, XFS_ISTALE)) {
-			spin_unlock(&cip->i_flags_lock);
-			continue;
-		}
-
-		/*
-		 * Once we fall off the end of the cluster, no point checking
-		 * any more inodes in the list because they will also all be
-		 * outside the cluster.
-		 */
-		if ((XFS_INO_TO_AGINO(mp, cip->i_ino) & mask) != first_index) {
-			spin_unlock(&cip->i_flags_lock);
-			break;
-		}
-		spin_unlock(&cip->i_flags_lock);
-
-		/*
-		 * Do an un-protected check to see if the inode is dirty and
-		 * is a candidate for flushing.  These checks will be repeated
-		 * later after the appropriate locks are acquired.
-		 */
-		if (xfs_inode_clean(cip) && xfs_ipincount(cip) == 0)
-			continue;
-
-		/*
-		 * Try to get locks.  If any are unavailable or it is pinned,
-		 * then this inode cannot be flushed and is skipped.
-		 */
-
-		if (!xfs_ilock_nowait(cip, XFS_ILOCK_SHARED))
-			continue;
-		if (!xfs_iflock_nowait(cip)) {
-			xfs_iunlock(cip, XFS_ILOCK_SHARED);
-			continue;
-		}
-		if (xfs_ipincount(cip)) {
-			xfs_ifunlock(cip);
-			xfs_iunlock(cip, XFS_ILOCK_SHARED);
-			continue;
-		}
-
-
-		/*
-		 * Check the inode number again, just to be certain we are not
-		 * racing with freeing in xfs_reclaim_inode(). See the comments
-		 * in that function for more information as to why the initial
-		 * check is not sufficient.
-		 */
-		if (!cip->i_ino) {
-			xfs_ifunlock(cip);
-			xfs_iunlock(cip, XFS_ILOCK_SHARED);
-			continue;
-		}
-
-		/*
-		 * arriving here means that this inode can be flushed.  First
-		 * re-check that it's dirty before flushing.
-		 */
-		if (!xfs_inode_clean(cip)) {
-			error = xfs_iflush_int(cip, bp);
-			if (error) {
-				xfs_iunlock(cip, XFS_ILOCK_SHARED);
-				goto out_free;
-			}
-			clcount++;
-		} else {
-			xfs_ifunlock(cip);
-		}
-		xfs_iunlock(cip, XFS_ILOCK_SHARED);
-	}
-
-	if (clcount) {
-		XFS_STATS_INC(mp, xs_icluster_flushcnt);
-		XFS_STATS_ADD(mp, xs_icluster_flushinode, clcount);
-	}
-
-out_free:
-	rcu_read_unlock();
-	kmem_free(cilist);
-out_put:
-	xfs_perag_put(pag);
-	if (error) {
-		bp->b_flags |= XBF_ASYNC;
-		xfs_buf_ioend_fail(bp);
-		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
-	}
-	return error;
-}
-
-STATIC int
-xfs_iflush_int(
+static int
+xfs_iflush(
 	struct xfs_inode	*ip,
 	struct xfs_buf		*bp)
 {
@@ -3722,6 +3577,150 @@ xfs_iflush_int(
 	return error;
 }
 
+/*
+ * Non-blocking flush of dirty inode metadata into the backing buffer.
+ *
+ * The caller must have a reference to the inode and hold the cluster buffer
+ * locked. The function will walk across all the inodes on the cluster buffer it
+ * can find and lock without blocking, and flush them to the cluster buffer.
+ *
+ * On success, the caller must write out the buffer returned in *bp and
+ * release it. On failure, the filesystem will be shut down, the buffer will
+ * have been unlocked and released, and EFSCORRUPTED will be returned.
+ */
+int
+xfs_iflush_cluster(
+	struct xfs_inode	*ip,
+	struct xfs_buf		*bp)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_perag	*pag;
+	unsigned long		first_index, mask;
+	int			cilist_size;
+	struct xfs_inode	**cilist;
+	struct xfs_inode	*cip;
+	struct xfs_ino_geometry	*igeo = M_IGEO(mp);
+	int			error = 0;
+	int			nr_found;
+	int			clcount = 0;
+	int			i;
+
+	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
+
+	cilist_size = igeo->inodes_per_cluster * sizeof(struct xfs_inode *);
+	cilist = kmem_alloc(cilist_size, KM_MAYFAIL|KM_NOFS);
+	if (!cilist)
+		goto out_put;
+
+	mask = ~(igeo->inodes_per_cluster - 1);
+	first_index = XFS_INO_TO_AGINO(mp, ip->i_ino) & mask;
+	rcu_read_lock();
+	/* really need a gang lookup range call here */
+	nr_found = radix_tree_gang_lookup(&pag->pag_ici_root, (void**)cilist,
+					first_index, igeo->inodes_per_cluster);
+	if (nr_found == 0)
+		goto out_free;
+
+	for (i = 0; i < nr_found; i++) {
+		cip = cilist[i];
+
+		/*
+		 * because this is an RCU protected lookup, we could find a
+		 * recently freed or even reallocated inode during the lookup.
+		 * We need to check under the i_flags_lock for a valid inode
+		 * here. Skip it if it is not valid or the wrong inode.
+		 */
+		spin_lock(&cip->i_flags_lock);
+		if (!cip->i_ino ||
+		    __xfs_iflags_test(cip, XFS_ISTALE)) {
+			spin_unlock(&cip->i_flags_lock);
+			continue;
+		}
+
+		/*
+		 * Once we fall off the end of the cluster, no point checking
+		 * any more inodes in the list because they will also all be
+		 * outside the cluster.
+		 */
+		if ((XFS_INO_TO_AGINO(mp, cip->i_ino) & mask) != first_index) {
+			spin_unlock(&cip->i_flags_lock);
+			break;
+		}
+		spin_unlock(&cip->i_flags_lock);
+
+		/*
+		 * Do an un-protected check to see if the inode is dirty and
+		 * is a candidate for flushing.  These checks will be repeated
+		 * later after the appropriate locks are acquired.
+		 */
+		if (xfs_inode_clean(cip) && xfs_ipincount(cip) == 0)
+			continue;
+
+		/*
+		 * Try to get locks.  If any are unavailable or it is pinned,
+		 * then this inode cannot be flushed and is skipped.
+		 */
+
+		if (!xfs_ilock_nowait(cip, XFS_ILOCK_SHARED))
+			continue;
+		if (!xfs_iflock_nowait(cip)) {
+			xfs_iunlock(cip, XFS_ILOCK_SHARED);
+			continue;
+		}
+		if (xfs_ipincount(cip)) {
+			xfs_ifunlock(cip);
+			xfs_iunlock(cip, XFS_ILOCK_SHARED);
+			continue;
+		}
+
+
+		/*
+		 * Check the inode number again, just to be certain we are not
+		 * racing with freeing in xfs_reclaim_inode(). See the comments
+		 * in that function for more information as to why the initial
+		 * check is not sufficient.
+		 */
+		if (!cip->i_ino) {
+			xfs_ifunlock(cip);
+			xfs_iunlock(cip, XFS_ILOCK_SHARED);
+			continue;
+		}
+
+		/*
+		 * arriving here means that this inode can be flushed.  First
+		 * re-check that it's dirty before flushing.
+		 */
+		if (!xfs_inode_clean(cip)) {
+			error = xfs_iflush(cip, bp);
+			if (error) {
+				xfs_iunlock(cip, XFS_ILOCK_SHARED);
+				goto out_free;
+			}
+			clcount++;
+		} else {
+			xfs_ifunlock(cip);
+		}
+		xfs_iunlock(cip, XFS_ILOCK_SHARED);
+	}
+
+	if (clcount) {
+		XFS_STATS_INC(mp, xs_icluster_flushcnt);
+		XFS_STATS_ADD(mp, xs_icluster_flushinode, clcount);
+	}
+
+out_free:
+	rcu_read_unlock();
+	kmem_free(cilist);
+out_put:
+	xfs_perag_put(pag);
+	if (error) {
+		bp->b_flags |= XBF_ASYNC;
+		xfs_buf_ioend_fail(bp);
+		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
+	}
+	return error;
+}
+
 /* Release an inode. */
 void
 xfs_irele(
-- 
2.26.2.761.g0e0b3e54be


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [PATCH 22/24] xfs: rework xfs_iflush_cluster() dirty inode iteration
  2020-05-22  3:50 [PATCH 00/24] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (20 preceding siblings ...)
  2020-05-22  3:50 ` [PATCH 21/24] xfs: rename xfs_iflush_int() Dave Chinner
@ 2020-05-22  3:50 ` Dave Chinner
  2020-05-23  0:13   ` Darrick J. Wong
                     ` (2 more replies)
  2020-05-22  3:50 ` [PATCH 23/24] xfs: factor xfs_iflush_done Dave Chinner
                   ` (3 subsequent siblings)
  25 siblings, 3 replies; 91+ messages in thread
From: Dave Chinner @ 2020-05-22  3:50 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Now that we have all the dirty inodes attached to the cluster
buffer, we don't actually have to do radix tree lookups to find
them. Sure, the radix tree is efficient, but walking a linked list
of just the dirty inodes attached to the buffer is much better.

We are also no longer dependent on having a locked inode passed into
the function to determine where to start the lookup. This means we
can drop it from the function call and treat all inodes the same.

We also make xfs_iflush_cluster skip inodes marked with
XFS_IRECLAIM. This we avoid races with inodes that reclaim is
actively referencing or are being re-initialised by inode lookup. If
they are actually dirty, they'll get written by a future cluster
flush....

We also add a shutdown check after obtaining the flush lock so that
we catch inodes that are dirty in memory and may have inconsistent
state due to the shutdown in progress. We abort these inodes
directly and so they remove themselves directly from the buffer list
and the AIL rather than having to wait for the buffer to be failed
and callbacks run to be processed correctly.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_inode.c      | 150 ++++++++++++++++------------------------
 fs/xfs/xfs_inode.h      |   2 +-
 fs/xfs/xfs_inode_item.c |  15 +++-
 3 files changed, 74 insertions(+), 93 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index cbf8edf62d102..7db0f97e537e3 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3428,7 +3428,7 @@ xfs_rename(
 	return error;
 }
 
-static int
+int
 xfs_iflush(
 	struct xfs_inode	*ip,
 	struct xfs_buf		*bp)
@@ -3590,117 +3590,94 @@ xfs_iflush(
  */
 int
 xfs_iflush_cluster(
-	struct xfs_inode	*ip,
 	struct xfs_buf		*bp)
 {
-	struct xfs_mount	*mp = ip->i_mount;
-	struct xfs_perag	*pag;
-	unsigned long		first_index, mask;
-	int			cilist_size;
-	struct xfs_inode	**cilist;
-	struct xfs_inode	*cip;
-	struct xfs_ino_geometry	*igeo = M_IGEO(mp);
-	int			error = 0;
-	int			nr_found;
+	struct xfs_mount	*mp = bp->b_mount;
+	struct xfs_log_item	*lip, *n;
+	struct xfs_inode	*ip;
+	struct xfs_inode_log_item *iip;
 	int			clcount = 0;
-	int			i;
-
-	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
-
-	cilist_size = igeo->inodes_per_cluster * sizeof(struct xfs_inode *);
-	cilist = kmem_alloc(cilist_size, KM_MAYFAIL|KM_NOFS);
-	if (!cilist)
-		goto out_put;
-
-	mask = ~(igeo->inodes_per_cluster - 1);
-	first_index = XFS_INO_TO_AGINO(mp, ip->i_ino) & mask;
-	rcu_read_lock();
-	/* really need a gang lookup range call here */
-	nr_found = radix_tree_gang_lookup(&pag->pag_ici_root, (void**)cilist,
-					first_index, igeo->inodes_per_cluster);
-	if (nr_found == 0)
-		goto out_free;
+	int			error = 0;
 
-	for (i = 0; i < nr_found; i++) {
-		cip = cilist[i];
+	/*
+	 * We must use the safe variant here as on shutdown xfs_iflush_abort()
+	 * can remove itself from the list.
+	 */
+	list_for_each_entry_safe(lip, n, &bp->b_li_list, li_bio_list) {
+		iip = (struct xfs_inode_log_item *)lip;
+		ip = iip->ili_inode;
 
 		/*
-		 * because this is an RCU protected lookup, we could find a
-		 * recently freed or even reallocated inode during the lookup.
-		 * We need to check under the i_flags_lock for a valid inode
-		 * here. Skip it if it is not valid or the wrong inode.
+		 * Quick and dirty check to avoid locks if possible.
 		 */
-		spin_lock(&cip->i_flags_lock);
-		if (!cip->i_ino ||
-		    __xfs_iflags_test(cip, XFS_ISTALE)) {
-			spin_unlock(&cip->i_flags_lock);
+		if (__xfs_iflags_test(ip, XFS_IRECLAIM | XFS_IFLOCK))
+			continue;
+		if (xfs_ipincount(ip))
 			continue;
-		}
 
 		/*
-		 * Once we fall off the end of the cluster, no point checking
-		 * any more inodes in the list because they will also all be
-		 * outside the cluster.
+		 * The inode is still attached to the buffer, which means it is
+		 * dirty but reclaim might try to grab it. Check carefully for
+		 * that, and grab the ilock while still holding the i_flags_lock
+		 * to guarantee reclaim will not be able to reclaim this inode
+		 * once we drop the i_flags_lock.
 		 */
-		if ((XFS_INO_TO_AGINO(mp, cip->i_ino) & mask) != first_index) {
-			spin_unlock(&cip->i_flags_lock);
-			break;
+		spin_lock(&ip->i_flags_lock);
+		ASSERT(!__xfs_iflags_test(ip, XFS_ISTALE));
+		if (__xfs_iflags_test(ip, XFS_IRECLAIM | XFS_IFLOCK)) {
+			spin_unlock(&ip->i_flags_lock);
+			continue;
 		}
-		spin_unlock(&cip->i_flags_lock);
 
 		/*
-		 * Do an un-protected check to see if the inode is dirty and
-		 * is a candidate for flushing.  These checks will be repeated
-		 * later after the appropriate locks are acquired.
+		 * ILOCK will pin the inode against reclaim and prevent
+		 * concurrent transactions modifying the inode while we are
+		 * flushing the inode.
 		 */
-		if (xfs_inode_clean(cip) && xfs_ipincount(cip) == 0)
+		if (!xfs_ilock_nowait(ip, XFS_ILOCK_SHARED)) {
+			spin_unlock(&ip->i_flags_lock);
 			continue;
+		}
+		spin_unlock(&ip->i_flags_lock);
 
 		/*
-		 * Try to get locks.  If any are unavailable or it is pinned,
-		 * then this inode cannot be flushed and is skipped.
+		 * Skip inodes that are already flush locked as they have
+		 * already been written to the buffer.
 		 */
-
-		if (!xfs_ilock_nowait(cip, XFS_ILOCK_SHARED))
-			continue;
-		if (!xfs_iflock_nowait(cip)) {
-			xfs_iunlock(cip, XFS_ILOCK_SHARED);
+		if (!xfs_iflock_nowait(ip)) {
+			xfs_iunlock(ip, XFS_ILOCK_SHARED);
 			continue;
 		}
-		if (xfs_ipincount(cip)) {
-			xfs_ifunlock(cip);
-			xfs_iunlock(cip, XFS_ILOCK_SHARED);
-			continue;
-		}
-
 
 		/*
-		 * Check the inode number again, just to be certain we are not
-		 * racing with freeing in xfs_reclaim_inode(). See the comments
-		 * in that function for more information as to why the initial
-		 * check is not sufficient.
+		 * If we are shut down, unpin and abort the inode now as there
+		 * is no point in flushing it to the buffer just to get an IO
+		 * completion to abort the buffer and remove it from the AIL.
 		 */
-		if (!cip->i_ino) {
-			xfs_ifunlock(cip);
-			xfs_iunlock(cip, XFS_ILOCK_SHARED);
+		if (XFS_FORCED_SHUTDOWN(mp)) {
+			xfs_iunpin_wait(ip);
+			/* xfs_iflush_abort() drops the flush lock */
+			xfs_iflush_abort(ip);
+			xfs_iunlock(ip, XFS_ILOCK_SHARED);
+			error = EIO;
 			continue;
 		}
 
-		/*
-		 * arriving here means that this inode can be flushed.  First
-		 * re-check that it's dirty before flushing.
-		 */
-		if (!xfs_inode_clean(cip)) {
-			error = xfs_iflush(cip, bp);
-			if (error) {
-				xfs_iunlock(cip, XFS_ILOCK_SHARED);
-				goto out_free;
-			}
-			clcount++;
-		} else {
-			xfs_ifunlock(cip);
+		/* don't block waiting on a log force to unpin dirty inodes */
+		if (xfs_ipincount(ip)) {
+			xfs_ifunlock(ip);
+			xfs_iunlock(ip, XFS_ILOCK_SHARED);
+			continue;
 		}
-		xfs_iunlock(cip, XFS_ILOCK_SHARED);
+
+		if (!xfs_inode_clean(ip))
+			error = xfs_iflush(ip, bp);
+		else
+			xfs_ifunlock(ip);
+		xfs_iunlock(ip, XFS_ILOCK_SHARED);
+		if (error)
+			break;
+		clcount++;
 	}
 
 	if (clcount) {
@@ -3708,11 +3685,6 @@ xfs_iflush_cluster(
 		XFS_STATS_ADD(mp, xs_icluster_flushinode, clcount);
 	}
 
-out_free:
-	rcu_read_unlock();
-	kmem_free(cilist);
-out_put:
-	xfs_perag_put(pag);
 	if (error) {
 		bp->b_flags |= XBF_ASYNC;
 		xfs_buf_ioend_fail(bp);
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index d1109eb13ba2e..b93cf9076df8a 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -427,7 +427,7 @@ int		xfs_log_force_inode(struct xfs_inode *ip);
 void		xfs_iunpin_wait(xfs_inode_t *);
 #define xfs_ipincount(ip)	((unsigned int) atomic_read(&ip->i_pincount))
 
-int		xfs_iflush_cluster(struct xfs_inode *, struct xfs_buf *);
+int		xfs_iflush_cluster(struct xfs_buf *);
 void		xfs_lock_two_inodes(struct xfs_inode *ip0, uint ip0_mode,
 				struct xfs_inode *ip1, uint ip1_mode);
 
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 42b4b5fe1e2a9..af4764f97a339 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -523,11 +523,20 @@ xfs_inode_item_push(
 	}
 	spin_unlock(&lip->li_ailp->ail_lock);
 
-	error = xfs_iflush_cluster(ip, bp);
+	/*
+	 * We need to hold a reference for flushing the cluster buffer as it may
+	 * fail the buffer without IO submission. In which case, we better have
+	 * a reference for that completion as otherwise we don't get a reference
+	 * for IO until we queue it for delwri submission.
+	 */
+	xfs_buf_hold(bp);
+	error = xfs_iflush_cluster(bp);
 	if (!error) {
-		if (!xfs_buf_delwri_queue(bp, buffer_list))
+		if (!xfs_buf_delwri_queue(bp, buffer_list)) {
+			ASSERT(0);
 			rval = XFS_ITEM_FLUSHING;
-		xfs_buf_unlock(bp);
+		}
+		xfs_buf_relse(bp);
 	} else {
 		rval = XFS_ITEM_LOCKED;
 	}
-- 
2.26.2.761.g0e0b3e54be


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [PATCH 23/24] xfs: factor xfs_iflush_done
  2020-05-22  3:50 [PATCH 00/24] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (21 preceding siblings ...)
  2020-05-22  3:50 ` [PATCH 22/24] xfs: rework xfs_iflush_cluster() dirty inode iteration Dave Chinner
@ 2020-05-22  3:50 ` Dave Chinner
  2020-05-23  0:20   ` Darrick J. Wong
  2020-05-23 11:35   ` Christoph Hellwig
  2020-05-22  3:50 ` [PATCH 24/24] xfs: remove xfs_inobp_check() Dave Chinner
                   ` (2 subsequent siblings)
  25 siblings, 2 replies; 91+ messages in thread
From: Dave Chinner @ 2020-05-22  3:50 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

xfs_iflush_done() does 3 distinct operations to the inodes attached
to the buffer. Separate these operations out into functions so that
it is easier to modify these operations independently in future.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_inode_item.c | 156 +++++++++++++++++++++-------------------
 1 file changed, 82 insertions(+), 74 deletions(-)

diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index af4764f97a339..4dd4f45dcc46e 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -662,104 +662,67 @@ xfs_inode_item_destroy(
 
 
 /*
- * This is the inode flushing I/O completion routine.  It is called
- * from interrupt level when the buffer containing the inode is
- * flushed to disk.  It is responsible for removing the inode item
- * from the AIL if it has not been re-logged, and unlocking the inode's
- * flush lock.
- *
- * To reduce AIL lock traffic as much as possible, we scan the buffer log item
- * list for other inodes that will run this function. We remove them from the
- * buffer list so we can process all the inode IO completions in one AIL lock
- * traversal.
- *
- * Note: Now that we attach the log item to the buffer when we first log the
- * inode in memory, we can have unflushed inodes on the buffer list here. These
- * inodes will have a zero ili_last_fields, so skip over them here.
+ * We only want to pull the item from the AIL if it is actually there
+ * and its location in the log has not changed since we started the
+ * flush.  Thus, we only bother if the inode's lsn has not changed.
  */
 void
-xfs_iflush_done(
-	struct xfs_buf		*bp)
+xfs_iflush_ail_updates(
+	struct xfs_ail		*ailp,
+	struct list_head	*list)
 {
-	struct xfs_inode_log_item *iip;
-	struct xfs_log_item	*lip, *n;
-	struct xfs_ail		*ailp = bp->b_mount->m_ail;
-	int			need_ail = 0;
-	LIST_HEAD(tmp);
+	struct xfs_log_item	*lip;
+	xfs_lsn_t		tail_lsn = 0;
 
-	/*
-	 * Pull the attached inodes from the buffer one at a time and take the
-	 * appropriate action on them.
-	 */
-	list_for_each_entry_safe(lip, n, &bp->b_li_list, li_bio_list) {
-		iip = INODE_ITEM(lip);
-		if (!iip->ili_last_fields)
-			continue;
+	/* this is an opencoded batch version of xfs_trans_ail_delete */
+	spin_lock(&ailp->ail_lock);
+	list_for_each_entry(lip, list, li_bio_list) {
+		struct xfs_inode_log_item *iip = INODE_ITEM(lip);
+		xfs_lsn_t	lsn;
 
-		if (xfs_iflags_test(iip->ili_inode, XFS_ISTALE)) {
-			xfs_iflush_abort(iip->ili_inode);
+		if (iip->ili_flush_lsn != lip->li_lsn) {
+			xfs_clear_li_failed(lip);
 			continue;
 		}
 
-		list_move_tail(&lip->li_bio_list, &tmp);
-
-		/* Do an unlocked check for needing the AIL lock. */
-		if (iip->ili_flush_lsn == lip->li_lsn ||
-		    test_bit(XFS_LI_FAILED, &lip->li_flags))
-			need_ail++;
+		lsn = xfs_ail_delete_one(ailp, lip);
+		if (!tail_lsn && lsn)
+			tail_lsn = lsn;
 	}
+	xfs_ail_update_finish(ailp, tail_lsn);
+}
 
-	/*
-	 * We only want to pull the item from the AIL if it is actually there
-	 * and its location in the log has not changed since we started the
-	 * flush.  Thus, we only bother if the inode's lsn has not changed.
-	 */
-	if (need_ail) {
-		xfs_lsn_t	tail_lsn = 0;
-
-		/* this is an opencoded batch version of xfs_trans_ail_delete */
-		spin_lock(&ailp->ail_lock);
-		list_for_each_entry(lip, &tmp, li_bio_list) {
-			iip = INODE_ITEM(lip);
-			if (iip->ili_flush_lsn == lip->li_lsn) {
-				xfs_lsn_t lsn = xfs_ail_delete_one(ailp, lip);
-				if (!tail_lsn && lsn)
-					tail_lsn = lsn;
-			} else {
-				xfs_clear_li_failed(lip);
-			}
-		}
-		xfs_ail_update_finish(ailp, tail_lsn);
-	}
+/*
+ * Walk the list of inodes that have completed their IOs. If they are clean
+ * remove them from the list and dissociate them from the buffer. Buffers that
+ * are still dirty remain linked to the buffer and on the list. Caller must
+ * handle them appropriately.
+ */
+void
+xfs_iflush_finish(
+	struct xfs_buf		*bp,
+	struct list_head	*list)
+{
+	struct xfs_log_item	*lip, *n;
 
-	/*
-	 * Clean up and unlock the flush lock now we are done. We can clear the
-	 * ili_last_fields bits now that we know that the data corresponding to
-	 * them is safely on disk.
-	 */
-	list_for_each_entry_safe(lip, n, &tmp, li_bio_list) {
+	list_for_each_entry_safe(lip, n, list, li_bio_list) {
+		struct xfs_inode_log_item *iip = INODE_ITEM(lip);
 		bool	drop_buffer = false;
 
-		list_del_init(&lip->li_bio_list);
-		iip = INODE_ITEM(lip);
-
 		spin_lock(&iip->ili_lock);
 		iip->ili_last_fields = 0;
 		iip->ili_flush_lsn = 0;
 
 		/*
 		 * Remove the reference to the cluster buffer if the inode is
-		 * clean in memory. Drop the buffer reference once we've dropped
-		 * the locks we hold. If the inode is dirty in memory, we need
-		 * to put the inode item back on the buffer list for another
-		 * pass through the flush machinery.
+		 * clean in memory and drop the buffer reference once we've
+		 * dropped the locks we hold.
 		 */
 		ASSERT(iip->ili_item.li_buf == bp);
 		if (!iip->ili_fields) {
 			iip->ili_item.li_buf = NULL;
+			list_del_init(&lip->li_bio_list);
 			drop_buffer = true;
-		} else {
-			list_add(&lip->li_bio_list, &bp->b_li_list);
 		}
 		spin_unlock(&iip->ili_lock);
 		xfs_ifunlock(iip->ili_inode);
@@ -768,6 +731,51 @@ xfs_iflush_done(
 	}
 }
 
+/*
+ * Inode buffer IO completion routine.  It is responsible for removing inodes
+ * attached to the buffer from the AIL if they have not been re-logged, as well
+ * as completing the flush and unlocking the inode.
+ */
+void
+xfs_iflush_done(
+	struct xfs_buf		*bp)
+{
+	struct xfs_log_item	*lip, *n;
+	LIST_HEAD(flushed_inodes);
+	LIST_HEAD(ail_updates);
+
+	/*
+	 * Pull the attached inodes from the buffer one at a time and take the
+	 * appropriate action on them.
+	 */
+	list_for_each_entry_safe(lip, n, &bp->b_li_list, li_bio_list) {
+		struct xfs_inode_log_item *iip = INODE_ITEM(lip);
+		if (!iip->ili_last_fields)
+			continue;
+
+		if (xfs_iflags_test(iip->ili_inode, XFS_ISTALE)) {
+			xfs_iflush_abort(iip->ili_inode);
+			continue;
+		}
+
+		/* Do an unlocked check for needing the AIL lock. */
+		if (iip->ili_flush_lsn == lip->li_lsn ||
+		    test_bit(XFS_LI_FAILED, &lip->li_flags))
+			list_move_tail(&lip->li_bio_list, &ail_updates);
+		else
+			list_move_tail(&lip->li_bio_list, &flushed_inodes);
+	}
+
+	if (!list_empty(&ail_updates)) {
+		xfs_iflush_ail_updates(bp->b_mount->m_ail, &ail_updates);
+		list_splice_tail(&ail_updates, &flushed_inodes);
+	}
+
+	xfs_iflush_finish(bp, &flushed_inodes);
+	if (!list_empty(&flushed_inodes))
+		list_splice_tail(&flushed_inodes, &bp->b_li_list);
+}
+
 /*
  * This is the inode flushing abort routine.  It is called from xfs_iflush when
  * the filesystem is shutting down to clean up the inode state.  It is
-- 
2.26.2.761.g0e0b3e54be


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [PATCH 24/24] xfs: remove xfs_inobp_check()
  2020-05-22  3:50 [PATCH 00/24] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (22 preceding siblings ...)
  2020-05-22  3:50 ` [PATCH 23/24] xfs: factor xfs_iflush_done Dave Chinner
@ 2020-05-22  3:50 ` Dave Chinner
  2020-05-23  0:16   ` Darrick J. Wong
  2020-05-23 11:36   ` Christoph Hellwig
  2020-05-22  4:04 ` [PATCH 00/24] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
  2020-05-22  6:18 ` Amir Goldstein
  25 siblings, 2 replies; 91+ messages in thread
From: Dave Chinner @ 2020-05-22  3:50 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

This debug code is called on every xfs_iflush() call, which then
checks every inode in the buffer for non-zero unlinked list field.
Hence it checks every inode in the cluster buffer every time a
single inode on that cluster it flushed. This is resulting in:

-   38.91%     5.33%  [kernel]  [k] xfs_iflush                                                                                                              ▒
   - 17.70% xfs_iflush                                                                                                                                      ▒
      - 9.93% xfs_inobp_check                                                                                                                                   ▒
           4.36% xfs_buf_offset                                                                                                                                 ▒

10% of the CPU time spent flushing inodes is repeatedly checking
unlinked fields in the buffer. We don't need to do this.

The other place we call xfs_inobp_check() is
xfs_iunlink_update_dinode(), and this is after we've done this
assert for the agino we are about to write into that inode:

	ASSERT(xfs_verify_agino_or_null(mp, agno, next_agino));

which means we've already checked that the agino we are about to
write is not 0 on debug kernels. The inode buffer verifiers do
everything else we need, so let's just remove this debug code.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_inode_buf.c | 24 ------------------------
 fs/xfs/libxfs/xfs_inode_buf.h |  6 ------
 fs/xfs/xfs_inode.c            |  2 --
 3 files changed, 32 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index 1af97235785c8..6b6f67595bf4e 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -20,30 +20,6 @@
 
 #include <linux/iversion.h>
 
-/*
- * Check that none of the inode's in the buffer have a next
- * unlinked field of 0.
- */
-#if defined(DEBUG)
-void
-xfs_inobp_check(
-	xfs_mount_t	*mp,
-	xfs_buf_t	*bp)
-{
-	int		i;
-	xfs_dinode_t	*dip;
-
-	for (i = 0; i < M_IGEO(mp)->inodes_per_cluster; i++) {
-		dip = xfs_buf_offset(bp, i * mp->m_sb.sb_inodesize);
-		if (!dip->di_next_unlinked)  {
-			xfs_alert(mp,
-	"Detected bogus zero next_unlinked field in inode %d buffer 0x%llx.",
-				i, (long long)bp->b_bn);
-		}
-	}
-}
-#endif
-
 /*
  * If we are doing readahead on an inode buffer, we might be in log recovery
  * reading an inode allocation buffer that hasn't yet been replayed, and hence
diff --git a/fs/xfs/libxfs/xfs_inode_buf.h b/fs/xfs/libxfs/xfs_inode_buf.h
index 865ac493c72a2..6b08b9d060c2e 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.h
+++ b/fs/xfs/libxfs/xfs_inode_buf.h
@@ -52,12 +52,6 @@ int	xfs_inode_from_disk(struct xfs_inode *ip, struct xfs_dinode *from);
 void	xfs_log_dinode_to_disk(struct xfs_log_dinode *from,
 			       struct xfs_dinode *to);
 
-#if defined(DEBUG)
-void	xfs_inobp_check(struct xfs_mount *, struct xfs_buf *);
-#else
-#define	xfs_inobp_check(mp, bp)
-#endif /* DEBUG */
-
 xfs_failaddr_t xfs_dinode_verify(struct xfs_mount *mp, xfs_ino_t ino,
 			   struct xfs_dinode *dip);
 xfs_failaddr_t xfs_inode_validate_extsize(struct xfs_mount *mp,
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 7db0f97e537e3..98a494e42aa6a 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2146,7 +2146,6 @@ xfs_iunlink_update_dinode(
 	xfs_dinode_calc_crc(mp, dip);
 	xfs_trans_inode_buf(tp, ibp);
 	xfs_trans_log_buf(tp, ibp, offset, offset + sizeof(xfs_agino_t) - 1);
-	xfs_inobp_check(mp, ibp);
 }
 
 /* Set an in-core inode's unlinked pointer and return the old value. */
@@ -3538,7 +3537,6 @@ xfs_iflush(
 	xfs_iflush_fork(ip, dip, iip, XFS_DATA_FORK);
 	if (XFS_IFORK_Q(ip))
 		xfs_iflush_fork(ip, dip, iip, XFS_ATTR_FORK);
-	xfs_inobp_check(mp, bp);
 
 	/*
 	 * We've recorded everything logged in the inode, so we'd like to clear
-- 
2.26.2.761.g0e0b3e54be


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 00/24] xfs: rework inode flushing to make inode reclaim fully asynchronous
  2020-05-22  3:50 [PATCH 00/24] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (23 preceding siblings ...)
  2020-05-22  3:50 ` [PATCH 24/24] xfs: remove xfs_inobp_check() Dave Chinner
@ 2020-05-22  4:04 ` Dave Chinner
  2020-05-23 16:18   ` Darrick J. Wong
  2020-05-22  6:18 ` Amir Goldstein
  25 siblings, 1 reply; 91+ messages in thread
From: Dave Chinner @ 2020-05-22  4:04 UTC (permalink / raw)
  To: linux-xfs


FWIW, I forgot to put it in the original description - the series
can be pulled from my git tree here:

git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git xfs-async-inode-reclaim

Cheers,

Dave.

On Fri, May 22, 2020 at 01:50:05PM +1000, Dave Chinner wrote:
> Hi folks,
> 
> Inode flushing requires that we first lock an inode, then check it,
> then lock the underlying buffer, flush the inode to the buffer and
> finally add the inode to the buffer to be unlocked on IO completion.
> We then walk all the other cached inodes in the buffer range and
> optimistically lock and flush them to the buffer without blocking.
> 
> This cluster write effectively repeats the same code we do with the
> initial inode, except now it has to special case that initial inode
> that is already locked. Hence we have multiple copies of very
> similar code, and it is a result of inode cluster flushing being
> based on a specific inode rather than grabbing the buffer and
> flushing all available inodes to it.
> 
> The problem with this at the moment is that we we can't look up the
> buffer until we have guaranteed that an inode is held exclusively
> and it's not going away while we get the buffer through an imap
> lookup. Hence we are kinda stuck locking an inode before we can look
> up the buffer.
> 
> This is also a result of inodes being detached from the cluster
> buffer except when IO is being done. This has the further problem
> that the cluster buffer can be reclaimed from memory and then the
> inode can be dirtied. At this point cleaning the inode requires a
> read-modify-write cycle on the cluster buffer. If we then are put
> under memory pressure, cleaning that dirty inode to reclaim it
> requires allocating memory for the cluster buffer and this leads to
> all sorts of problems.
> 
> We used synchronous inode writeback in reclaim as a throttle that
> provided a forwards progress mechanism when RMW cycles were required
> to clean inodes. Async writeback of inodes (e.g. via the AIL) would
> immediately exhaust remaining memory reserves trying to allocate
> inode cluster after inode cluster. The synchronous writeback of an
> inode cluster allowed reclaim to release the inode cluster and have
> it freed almost immediately which could then be used to allocate the
> next inode cluster buffer. Hence the IO based throttling mechanism
> largely guaranteed forwards progress in inode reclaim. By removing
> the requirement for require memory allocation for inode writeback
> filesystem level, we can issue writeback asynchrnously and not have
> to worry about the memory exhaustion anymore.
> 
> Another issue is that if we have slow disks, we can build up dirty
> inodes in memory that can then take hours for an operation like
> unmount to flush. A RMW cycle per inode on a slow RAID6 device can
> mean we only clean 50 inodes a second, and when there are hundreds
> of thousands of dirty inodes that need to be cleaned this can take a
> long time. PInning the cluster buffers will greatly speed up inode
> writeback on slow storage systems like this.
> 
> These limitations all stem from the same source: inode writeback is
> inode centric, And they are largely solved by the same architectural
> change: make inode writeback cluster buffer centric.  This series is
> makes that architectural change.
> 
> Firstly, we start by pinning the inode backing buffer in memory
> when an inode is marked dirty (i.e. when it is logged). By tracking
> the number of dirty inodes on a buffer as a counter rather than a
> flag, we avoid the problem of overlapping inode dirtying and buffer
> flushing racing to set/clear the dirty flag. Hence as long as there
> is a dirty inode in memory, the buffer will not be able to be
> reclaimed. We can safely do this inode cluster buffer lookup when we
> dirty an inode as we do not hold the buffer locked - we merely take
> a reference to it and then release it - and hence we don't cause any
> new lock order issues.
> 
> When the inode is finally cleaned, the reference to the buffer can
> be removed from the inode log item and the buffer released. This is
> done from the inode completion callbacks that are attached to the
> buffer when the inode is flushed.
> 
> Pinning the cluster buffer in this way immediately avoids the RMW
> problem in inode writeback and reclaim contexts by moving the memory
> allocation and the blocking buffer read into the transaction context
> that dirties the inode.  This inverts our dirty inode throttling
> mechanism - we now throttle the rate at which we can dirty inodes to
> rate at which we can allocate memory and read inode cluster buffers
> into memory rather than via throttling reclaim to rate at which we
> can clean dirty inodes.
> 
> Hence if we are under memory pressure, we'll block on memory
> allocation when trying to dirty the referenced inode, rather than in
> the memory reclaim path where we are trying to clean unreferenced
> inodes to free memory.  Hence we no longer have to guarantee
> forwards progress in inode reclaim as we aren't doing memory
> allocation, and that means we can remove inode writeback from the
> XFS inode shrinker completely without changing the system tolerance
> for low memory operation.
> 
> Tracking the buffers via the inode log item also allows us to
> completely rework the inode flushing mechanism. While the inode log
> item is in the AIL, it is safe for the AIL to access any member of
> the log item. Hence the AIL push mechanisms can access the buffer
> attached to the inode without first having to lock the inode.
> 
> This means we can essentially lock the buffer directly and then
> call xfs_iflush_cluster() without first going through xfs_iflush()
> to find the buffer. Hence we can remove xfs_iflush() altogether,
> because the two places that call it - the inode item push code and
> inode reclaim - no longer need to flush inodes directly.
> 
> This can be further optimised by attaching the inode to the cluster
> buffer when the inode is dirtied. i.e. when we add the buffer
> reference to the inode log item, we also attach the inode to the
> buffer for IO processing. This leads to the dirty inodes always
> being attached to the buffer and hence we no longer need to add them
> when we flush the inode and remove them when IO completes. Instead
> the inodes are attached when the node log item is dirtied, and
> removed when the inode log item is cleaned.
> 
> With this structure in place, we no longer need to do
> lookups to find the dirty inodes in the cache to attach to the
> buffer in xfs_iflush_cluster() - they are already attached to the
> buffer. Hence when the AIL pushes an inode, we just grab the buffer
> from the log item, and then walk the buffer log item list to lock
> and flush the dirty inodes attached to the buffer.
> 
> This greatly simplifies inode writeback, and removes another memory
> allocation from the inode writeback path (the array used for the
> radix tree gang lookup). And while the radix tree lookups are fast,
> walking the linked list of dirty inodes is faster.
> 
> There is followup work I am doing that uses the inode cluster buffer
> as a replacement in the AIL for tracking dirty inodes. This part of
> the series is not ready yet as it has some intricate locking
> requirements. That is an optimisation, so I've left that out because
> solving the inode reclaim blocking problems is the important part of
> this work.
> 
> In short, this series simplifies inode writeback and fixes the long
> standing inode reclaim blocking issues without requiring any changes
> to the memory reclaim infrastructure.
> 
> Note: dquots should probably be converted to cluster flushing in a
> similar way, as they have many of the same issues as inode flushing.
> 
> Thoughts, comments and improvemnts welcome.
> 
> -Dave.
> 
> 
> 
> Dave Chinner (24):
>   xfs: remove logged flag from inode log item
>   xfs: add an inode item lock
>   xfs: mark inode buffers in cache
>   xfs: mark dquot buffers in cache
>   xfs: mark log recovery buffers for completion
>   xfs: call xfs_buf_iodone directly
>   xfs: clean up whacky buffer log item list reinit
>   xfs: fold xfs_istale_done into xfs_iflush_done
>   xfs: use direct calls for dquot IO completion
>   xfs: clean up the buffer iodone callback functions
>   xfs: get rid of log item callbacks
>   xfs: pin inode backing buffer to the inode log item
>   xfs: make inode reclaim almost non-blocking
>   xfs: remove IO submission from xfs_reclaim_inode()
>   xfs: allow multiple reclaimers per AG
>   xfs: don't block inode reclaim on the ILOCK
>   xfs: remove SYNC_TRYLOCK from inode reclaim
>   xfs: clean up inode reclaim comments
>   xfs: attach inodes to the cluster buffer when dirtied
>   xfs: xfs_iflush() is no longer necessary
>   xfs: rename xfs_iflush_int()
>   xfs: rework xfs_iflush_cluster() dirty inode iteration
>   xfs: factor xfs_iflush_done
>   xfs: remove xfs_inobp_check()
> 
>  fs/xfs/libxfs/xfs_inode_buf.c   |  27 +-
>  fs/xfs/libxfs/xfs_inode_buf.h   |   6 -
>  fs/xfs/libxfs/xfs_trans_inode.c | 108 +++++--
>  fs/xfs/xfs_buf.c                |  44 ++-
>  fs/xfs/xfs_buf.h                |  49 +--
>  fs/xfs/xfs_buf_item.c           | 205 +++++--------
>  fs/xfs/xfs_buf_item.h           |   8 +-
>  fs/xfs/xfs_buf_item_recover.c   |   5 +-
>  fs/xfs/xfs_dquot.c              |  32 +-
>  fs/xfs/xfs_dquot.h              |   1 +
>  fs/xfs/xfs_dquot_item_recover.c |   2 +-
>  fs/xfs/xfs_file.c               |   9 +-
>  fs/xfs/xfs_icache.c             | 293 +++++-------------
>  fs/xfs/xfs_inode.c              | 515 +++++++++++---------------------
>  fs/xfs/xfs_inode.h              |   2 +-
>  fs/xfs/xfs_inode_item.c         | 281 ++++++++---------
>  fs/xfs/xfs_inode_item.h         |   9 +-
>  fs/xfs/xfs_inode_item_recover.c |   2 +-
>  fs/xfs/xfs_log_recover.c        |   5 +-
>  fs/xfs/xfs_mount.c              |   4 -
>  fs/xfs/xfs_mount.h              |   1 -
>  fs/xfs/xfs_trans.h              |   3 -
>  fs/xfs/xfs_trans_buf.c          |  15 +-
>  fs/xfs/xfs_trans_priv.h         |  12 +-
>  24 files changed, 680 insertions(+), 958 deletions(-)
> 
> -- 
> 2.26.2.761.g0e0b3e54be
> 
> 

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 00/24] xfs: rework inode flushing to make inode reclaim fully asynchronous
  2020-05-22  3:50 [PATCH 00/24] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (24 preceding siblings ...)
  2020-05-22  4:04 ` [PATCH 00/24] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
@ 2020-05-22  6:18 ` Amir Goldstein
  2020-05-22 12:01   ` Amir Goldstein
  25 siblings, 1 reply; 91+ messages in thread
From: Amir Goldstein @ 2020-05-22  6:18 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, May 22, 2020 at 6:51 AM Dave Chinner <david@fromorbit.com> wrote:
>
> Hi folks,
>
> Inode flushing requires that we first lock an inode, then check it,
> then lock the underlying buffer, flush the inode to the buffer and
> finally add the inode to the buffer to be unlocked on IO completion.
> We then walk all the other cached inodes in the buffer range and
> optimistically lock and flush them to the buffer without blocking.
>
> This cluster write effectively repeats the same code we do with the
> initial inode, except now it has to special case that initial inode
> that is already locked. Hence we have multiple copies of very
> similar code, and it is a result of inode cluster flushing being
> based on a specific inode rather than grabbing the buffer and
> flushing all available inodes to it.
>
> The problem with this at the moment is that we we can't look up the
> buffer until we have guaranteed that an inode is held exclusively
> and it's not going away while we get the buffer through an imap
> lookup. Hence we are kinda stuck locking an inode before we can look
> up the buffer.
>
> This is also a result of inodes being detached from the cluster
> buffer except when IO is being done. This has the further problem
> that the cluster buffer can be reclaimed from memory and then the
> inode can be dirtied. At this point cleaning the inode requires a
> read-modify-write cycle on the cluster buffer. If we then are put
> under memory pressure, cleaning that dirty inode to reclaim it
> requires allocating memory for the cluster buffer and this leads to
> all sorts of problems.
>
> We used synchronous inode writeback in reclaim as a throttle that
> provided a forwards progress mechanism when RMW cycles were required
> to clean inodes. Async writeback of inodes (e.g. via the AIL) would
> immediately exhaust remaining memory reserves trying to allocate
> inode cluster after inode cluster. The synchronous writeback of an
> inode cluster allowed reclaim to release the inode cluster and have
> it freed almost immediately which could then be used to allocate the
> next inode cluster buffer. Hence the IO based throttling mechanism
> largely guaranteed forwards progress in inode reclaim. By removing
> the requirement for require memory allocation for inode writeback
> filesystem level, we can issue writeback asynchrnously and not have
> to worry about the memory exhaustion anymore.
>
> Another issue is that if we have slow disks, we can build up dirty
> inodes in memory that can then take hours for an operation like
> unmount to flush. A RMW cycle per inode on a slow RAID6 device can
> mean we only clean 50 inodes a second, and when there are hundreds
> of thousands of dirty inodes that need to be cleaned this can take a
> long time. PInning the cluster buffers will greatly speed up inode
> writeback on slow storage systems like this.
>
> These limitations all stem from the same source: inode writeback is
> inode centric, And they are largely solved by the same architectural
> change: make inode writeback cluster buffer centric.  This series is
> makes that architectural change.
>
> Firstly, we start by pinning the inode backing buffer in memory
> when an inode is marked dirty (i.e. when it is logged). By tracking
> the number of dirty inodes on a buffer as a counter rather than a
> flag, we avoid the problem of overlapping inode dirtying and buffer
> flushing racing to set/clear the dirty flag. Hence as long as there
> is a dirty inode in memory, the buffer will not be able to be
> reclaimed. We can safely do this inode cluster buffer lookup when we
> dirty an inode as we do not hold the buffer locked - we merely take
> a reference to it and then release it - and hence we don't cause any
> new lock order issues.
>
> When the inode is finally cleaned, the reference to the buffer can
> be removed from the inode log item and the buffer released. This is
> done from the inode completion callbacks that are attached to the
> buffer when the inode is flushed.
>
> Pinning the cluster buffer in this way immediately avoids the RMW
> problem in inode writeback and reclaim contexts by moving the memory
> allocation and the blocking buffer read into the transaction context
> that dirties the inode.  This inverts our dirty inode throttling
> mechanism - we now throttle the rate at which we can dirty inodes to
> rate at which we can allocate memory and read inode cluster buffers
> into memory rather than via throttling reclaim to rate at which we
> can clean dirty inodes.
>
> Hence if we are under memory pressure, we'll block on memory
> allocation when trying to dirty the referenced inode, rather than in
> the memory reclaim path where we are trying to clean unreferenced
> inodes to free memory.  Hence we no longer have to guarantee
> forwards progress in inode reclaim as we aren't doing memory
> allocation, and that means we can remove inode writeback from the
> XFS inode shrinker completely without changing the system tolerance
> for low memory operation.
>
> Tracking the buffers via the inode log item also allows us to
> completely rework the inode flushing mechanism. While the inode log
> item is in the AIL, it is safe for the AIL to access any member of
> the log item. Hence the AIL push mechanisms can access the buffer
> attached to the inode without first having to lock the inode.
>
> This means we can essentially lock the buffer directly and then
> call xfs_iflush_cluster() without first going through xfs_iflush()
> to find the buffer. Hence we can remove xfs_iflush() altogether,
> because the two places that call it - the inode item push code and
> inode reclaim - no longer need to flush inodes directly.
>
> This can be further optimised by attaching the inode to the cluster
> buffer when the inode is dirtied. i.e. when we add the buffer
> reference to the inode log item, we also attach the inode to the
> buffer for IO processing. This leads to the dirty inodes always
> being attached to the buffer and hence we no longer need to add them
> when we flush the inode and remove them when IO completes. Instead
> the inodes are attached when the node log item is dirtied, and
> removed when the inode log item is cleaned.
>
> With this structure in place, we no longer need to do
> lookups to find the dirty inodes in the cache to attach to the
> buffer in xfs_iflush_cluster() - they are already attached to the
> buffer. Hence when the AIL pushes an inode, we just grab the buffer
> from the log item, and then walk the buffer log item list to lock
> and flush the dirty inodes attached to the buffer.
>
> This greatly simplifies inode writeback, and removes another memory
> allocation from the inode writeback path (the array used for the
> radix tree gang lookup). And while the radix tree lookups are fast,
> walking the linked list of dirty inodes is faster.
>
> There is followup work I am doing that uses the inode cluster buffer
> as a replacement in the AIL for tracking dirty inodes. This part of
> the series is not ready yet as it has some intricate locking
> requirements. That is an optimisation, so I've left that out because
> solving the inode reclaim blocking problems is the important part of
> this work.
>
> In short, this series simplifies inode writeback and fixes the long
> standing inode reclaim blocking issues without requiring any changes
> to the memory reclaim infrastructure.
>
> Note: dquots should probably be converted to cluster flushing in a
> similar way, as they have many of the same issues as inode flushing.
>
> Thoughts,

Thank you for this wonderful write up!
Can't wait to deploy this on production machines :)

> comments

Since this is a "long standing issue" I am sure you have tests.
Can you please share the tests you used (simoop?) with some
encouraging performance numbers - flattening the stall curve ;-).

> and improvements

I wonder if the cluster buffer inode counter can be mixed as another
metric for choosing the best inodes to reclaim. If my math is correct
and depending on the specific system page/struct inode size,
reclaiming a cluster buffer with single dirty inode reclaims at least
1 page + 1 xfs_inode with low probability to free a slab page.
Reclaiming a cluster buffer with 8 dirty inodes reclaims at least
1 page + 8 xfs_inode with much higher probability to freeing one or two
slab pages (in the best case where adjacent inodes where allocated
consequently).

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 02/24] xfs: add an inode item lock
  2020-05-22  3:50 ` [PATCH 02/24] xfs: add an inode item lock Dave Chinner
@ 2020-05-22  6:45   ` Amir Goldstein
  2020-05-22 21:24   ` Darrick J. Wong
  2020-05-23  8:45   ` Christoph Hellwig
  2 siblings, 0 replies; 91+ messages in thread
From: Amir Goldstein @ 2020-05-22  6:45 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, May 22, 2020 at 6:51 AM Dave Chinner <david@fromorbit.com> wrote:
>
> From: Dave Chinner <dchinner@redhat.com>
>
> The inode log item is kind of special in that it can be aggregating
> new changes in memory at the same time time existing changes are
> being written back to disk. This means there are fields in the log
> item that are accessed concurrently from contexts that don't share
> any locking at all.
>
> e.g. updating ili_last_fields occurs at flush time under the
> ILOCK_EXCL and flush lock at flush time, under the flush lock at IO
> completion time, and is read under the ILOCK_EXCL when the inode is
> logged.  Hence there is no actual serialisation between reading the
> field during logging of the inode in transactions vs clearing the
> field in IO completion.
>
> We currently get away with this by the fact that we are only
> clearing fields in IO completion, and nothing bad happens if we
> accidentally log more of the inode than we actually modify. Worst
> case is we consume a tiny bit more memory and log bandwidth.
>
> However, if we want to do more complex state manipulations on the
> log item that requires updates at all three of these potential
> locations, we need to have some mechanism of serialising those
> operations. To do this, introduce a spinlock into the log item to
> serialise internal state.
>
> This could be done via the xfs_inode i_flags_lock, but this then
> leads to potential lock inversion issues where inode flag updates
> need to occur inside locks that best nest inside the inode log item
> locks (e.g. marking inodes stale during inode cluster freeing).
> Using a separate spinlock avoids these sorts of problems and
> simplifies future code.
>
> This does not touch the use of ili_fields in the item formatting
> code - that is entirely protected by the ILOCK_EXCL at this point in
> time, so it remains untouched.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_trans_inode.c | 44 ++++++++++++++++++---------------
>  fs/xfs/xfs_file.c               |  9 ++++---
>  fs/xfs/xfs_inode.c              | 20 +++++++++------
>  fs/xfs/xfs_inode_item.c         |  7 ++++++
>  fs/xfs/xfs_inode_item.h         |  3 ++-
>  5 files changed, 51 insertions(+), 32 deletions(-)
>
> diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
> index b5dfb66548422..510b996008221 100644
> --- a/fs/xfs/libxfs/xfs_trans_inode.c
> +++ b/fs/xfs/libxfs/xfs_trans_inode.c
> @@ -81,15 +81,19 @@ xfs_trans_ichgtime(
>   */
>  void
>  xfs_trans_log_inode(
> -       xfs_trans_t     *tp,
> -       xfs_inode_t     *ip,
> -       uint            flags)
> +       struct xfs_trans        *tp,
> +       struct xfs_inode        *ip,
> +       uint                    flags)
>  {
> -       struct inode    *inode = VFS_I(ip);
> +       struct xfs_inode_log_item *iip = ip->i_itemp;
> +       struct inode            *inode = VFS_I(ip);
> +       uint                    iversion_flags = 0;
>
>         ASSERT(ip->i_itemp != NULL);
>         ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
>
> +       tp->t_flags |= XFS_TRANS_DIRTY;
> +
>         /*
>          * Don't bother with i_lock for the I_DIRTY_TIME check here, as races
>          * don't matter - we either will need an extra transaction in 24 hours
> @@ -102,15 +106,6 @@ xfs_trans_log_inode(
>                 spin_unlock(&inode->i_lock);
>         }
>
> -       /*
> -        * Record the specific change for fdatasync optimisation. This
> -        * allows fdatasync to skip log forces for inodes that are only
> -        * timestamp dirty. We do this before the change count so that
> -        * the core being logged in this case does not impact on fdatasync
> -        * behaviour.
> -        */
> -       ip->i_itemp->ili_fsync_fields |= flags;
> -
>         /*
>          * First time we log the inode in a transaction, bump the inode change
>          * counter if it is configured for this to occur. While we have the
> @@ -120,13 +115,21 @@ xfs_trans_log_inode(
>          * set however, then go ahead and bump the i_version counter
>          * unconditionally.
>          */
> -       if (!test_and_set_bit(XFS_LI_DIRTY, &ip->i_itemp->ili_item.li_flags) &&
> -           IS_I_VERSION(VFS_I(ip))) {
> -               if (inode_maybe_inc_iversion(VFS_I(ip), flags & XFS_ILOG_CORE))
> -                       flags |= XFS_ILOG_CORE;
> +       if (!test_and_set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags) &&
> +           IS_I_VERSION(inode)) {
> +               if (inode_maybe_inc_iversion(inode, flags & XFS_ILOG_CORE))
> +                       iversion_flags = XFS_ILOG_CORE;
>         }
>
> -       tp->t_flags |= XFS_TRANS_DIRTY;
> +       /*
> +        * Record the specific change for fdatasync optimisation. This
> +        * allows fdatasync to skip log forces for inodes that are only
> +        * timestamp dirty. We do this before the change count so that
> +        * the core being logged in this case does not impact on fdatasync
> +        * behaviour.
> +        */
> +       spin_lock(&iip->ili_lock);
> +       iip->ili_fsync_fields |= flags;
>
>         /*
>          * Always OR in the bits from the ili_last_fields field.
> @@ -135,8 +138,9 @@ xfs_trans_log_inode(
>          * See the big comment in xfs_iflush() for an explanation of
>          * this coordination mechanism.
>          */
> -       flags |= ip->i_itemp->ili_last_fields;
> -       ip->i_itemp->ili_fields |= flags;
> +       flags |= iip->ili_last_fields | iversion_flags;
> +       iip->ili_fields |= flags;
> +       spin_unlock(&iip->ili_lock);
>  }
>
>  int
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 403c90309a8ff..0abf770b77498 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -94,6 +94,7 @@ xfs_file_fsync(
>  {
>         struct inode            *inode = file->f_mapping->host;
>         struct xfs_inode        *ip = XFS_I(inode);
> +       struct xfs_inode_log_item *iip = ip->i_itemp;
>         struct xfs_mount        *mp = ip->i_mount;
>         int                     error = 0;
>         int                     log_flushed = 0;
> @@ -137,13 +138,15 @@ xfs_file_fsync(
>         xfs_ilock(ip, XFS_ILOCK_SHARED);
>         if (xfs_ipincount(ip)) {
>                 if (!datasync ||
> -                   (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
> -                       lsn = ip->i_itemp->ili_last_lsn;
> +                   (iip->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
> +                       lsn = iip->ili_last_lsn;
>         }
>
>         if (lsn) {
>                 error = xfs_log_force_lsn(mp, lsn, XFS_LOG_SYNC, &log_flushed);
> -               ip->i_itemp->ili_fsync_fields = 0;
> +               spin_lock(&iip->ili_lock);
> +               iip->ili_fsync_fields = 0;
> +               spin_unlock(&iip->ili_lock);
>         }
>         xfs_iunlock(ip, XFS_ILOCK_SHARED);
>
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index ca9f2688b745d..57781c0dbbec5 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -2683,9 +2683,11 @@ xfs_ifree_cluster(
>                                 continue;
>
>                         iip = ip->i_itemp;
> +                       spin_lock(&iip->ili_lock);
>                         iip->ili_last_fields = iip->ili_fields;
>                         iip->ili_fields = 0;
>                         iip->ili_fsync_fields = 0;
> +                       spin_unlock(&iip->ili_lock);
>                         xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,
>                                                 &iip->ili_item.li_lsn);
>
> @@ -2721,6 +2723,7 @@ xfs_ifree(
>  {
>         int                     error;
>         struct xfs_icluster     xic = { 0 };
> +       struct xfs_inode_log_item *iip = ip->i_itemp;
>
>         ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
>         ASSERT(VFS_I(ip)->i_nlink == 0);
> @@ -2758,7 +2761,9 @@ xfs_ifree(
>         ip->i_df.if_format = XFS_DINODE_FMT_EXTENTS;
>
>         /* Don't attempt to replay owner changes for a deleted inode */
> -       ip->i_itemp->ili_fields &= ~(XFS_ILOG_AOWNER|XFS_ILOG_DOWNER);
> +       spin_lock(&iip->ili_lock);
> +       iip->ili_fields &= ~(XFS_ILOG_AOWNER|XFS_ILOG_DOWNER);
> +       spin_unlock(&iip->ili_lock);
>
>         /*
>          * Bump the generation count so no one will be confused
> @@ -3814,20 +3819,19 @@ xfs_iflush_int(
>          * know that the information those bits represent is permanently on
>          * disk.  As long as the flush completes before the inode is logged
>          * again, then both ili_fields and ili_last_fields will be cleared.
> -        *
> -        * We can play with the ili_fields bits here, because the inode lock
> -        * must be held exclusively in order to set bits there and the flush
> -        * lock protects the ili_last_fields bits.  Store the current LSN of the
> -        * inode so that we can tell whether the item has moved in the AIL from
> -        * xfs_iflush_done().  In order to read the lsn we need the AIL lock,
> -        * because it is a 64 bit value that cannot be read atomically.
>          */
>         error = 0;
>  flush_out:
> +       spin_lock(&iip->ili_lock);
>         iip->ili_last_fields = iip->ili_fields;
>         iip->ili_fields = 0;
>         iip->ili_fsync_fields = 0;
> +       spin_unlock(&iip->ili_lock);
>
> +       /*
> +        * Store the current LSN of the inode so that we can tell whether the
> +        * item has moved in the AIL from xfs_iflush_done().
> +        */
>         xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,
>                                 &iip->ili_item.li_lsn);
>
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index b17384aa8df40..6ef9cbcfc94a7 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -637,6 +637,7 @@ xfs_inode_item_init(
>         iip = ip->i_itemp = kmem_zone_zalloc(xfs_ili_zone, 0);
>
>         iip->ili_inode = ip;
> +       spin_lock_init(&iip->ili_lock);
>         xfs_log_item_init(mp, &iip->ili_item, XFS_LI_INODE,
>                                                 &xfs_inode_item_ops);
>  }
> @@ -738,7 +739,11 @@ xfs_iflush_done(
>         list_for_each_entry_safe(blip, n, &tmp, li_bio_list) {
>                 list_del_init(&blip->li_bio_list);
>                 iip = INODE_ITEM(blip);
> +
> +               spin_lock(&iip->ili_lock);
>                 iip->ili_last_fields = 0;
> +               spin_unlock(&iip->ili_lock);
> +
>                 xfs_ifunlock(iip->ili_inode);
>         }
>         list_del(&tmp);
> @@ -762,9 +767,11 @@ xfs_iflush_abort(
>                  * Clear the inode logging fields so no more flushes are
>                  * attempted.
>                  */
> +               spin_lock(&iip->ili_lock);
>                 iip->ili_last_fields = 0;
>                 iip->ili_fields = 0;
>                 iip->ili_fsync_fields = 0;
> +               spin_unlock(&iip->ili_lock);
>         }
>         /*
>          * Release the inode's flush lock since we're done with it.
> diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h
> index 4de5070e07655..1234e8cd3726d 100644
> --- a/fs/xfs/xfs_inode_item.h
> +++ b/fs/xfs/xfs_inode_item.h
> @@ -18,7 +18,8 @@ struct xfs_inode_log_item {
>         struct xfs_inode        *ili_inode;        /* inode ptr */
>         xfs_lsn_t               ili_flush_lsn;     /* lsn at last flush */
>         xfs_lsn_t               ili_last_lsn;      /* lsn at last transaction */
> -       unsigned short          ili_lock_flags;    /* lock flags */
> +       spinlock_t              ili_lock;          /* internal state lock */

"internal state" is a fluid term.
It would be more useful to document "Protects ..." as in i_lock/f_lock.

For verifying unchanged logic from code re-organization and locking
balance:

Reviewed-by: Amir Goldstein <amir73il@gmail.com>

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 01/24] xfs: remove logged flag from inode log item
  2020-05-22  3:50 ` [PATCH 01/24] xfs: remove logged flag from inode log item Dave Chinner
@ 2020-05-22  7:25   ` Christoph Hellwig
  2020-05-22 21:13   ` Darrick J. Wong
  1 sibling, 0 replies; 91+ messages in thread
From: Christoph Hellwig @ 2020-05-22  7:25 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, May 22, 2020 at 01:50:06PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> This was used to track if the item had logged fields being flushed
> to disk. We log everything in the inode these days, so this logic is
> no longer needed. Remove it.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 05/24] xfs: mark log recovery buffers for completion
  2020-05-22  3:50 ` [PATCH 05/24] xfs: mark log recovery buffers for completion Dave Chinner
@ 2020-05-22  7:41   ` Amir Goldstein
  2020-05-24 23:54     ` Dave Chinner
  2020-05-22 21:41   ` Darrick J. Wong
  1 sibling, 1 reply; 91+ messages in thread
From: Amir Goldstein @ 2020-05-22  7:41 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, May 22, 2020 at 6:51 AM Dave Chinner <david@fromorbit.com> wrote:
>
> From: Dave Chinner <dchinner@redhat.com>
>
> Log recovery has it's own buffer write completion handler for
> buffers that it directly recovers. Convert these to direct calls by
> flagging these buffers as being log recovery buffers. The flag will
> get cleared by the log recovery IO completion routine, so it will
> never leak out of log recovery.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_buf.c                | 10 ++++++++++
>  fs/xfs/xfs_buf.h                |  2 ++
>  fs/xfs/xfs_buf_item_recover.c   |  5 ++---
>  fs/xfs/xfs_dquot_item_recover.c |  2 +-
>  fs/xfs/xfs_inode_item_recover.c |  2 +-
>  fs/xfs/xfs_log_recover.c        |  5 ++---
>  6 files changed, 18 insertions(+), 8 deletions(-)
>
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 77d40eb4a11db..b89685ce8519d 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -14,6 +14,7 @@
>  #include "xfs_mount.h"
>  #include "xfs_trace.h"
>  #include "xfs_log.h"
> +#include "xfs_log_recover.h"
>  #include "xfs_trans.h"
>  #include "xfs_buf_item.h"
>  #include "xfs_errortag.h"
> @@ -1207,6 +1208,15 @@ xfs_buf_ioend(
>         if (read)
>                 goto out_finish;
>
> +       /*
> +        * If this is a log recovery buffer, we aren't doing transactional IO
> +        * yet so we need to let it handle IO completions.
> +        */
> +       if (bp->b_flags & _XBF_LOGRCVY) {
> +               xlog_recover_iodone(bp);
> +               return;
> +       }
> +
>         /* inodes always have a callback on write */
>         if (bp->b_flags & _XBF_INODES) {
>                 xfs_buf_inode_iodone(bp);

This turns out to be a "static calls" pattern.
I think it would look nicer as a switch statement on
(bp->b_flags & _XBF_BUFFER_TYPE_MASK)
It would be also nicer to document near flag definition
that the type flags are mutually exclusive.

> diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
> index cbde44ecb3963..c5fe4c48c9080 100644
> --- a/fs/xfs/xfs_buf.h
> +++ b/fs/xfs/xfs_buf.h
> @@ -33,6 +33,7 @@
>  /* buffer type flags for write callbacks */
>  #define _XBF_INODES     (1 << 16)/* inode buffer */
>  #define _XBF_DQUOTS     (1 << 17)/* dquot buffer */
> +#define _XBF_LOGRCVY    (1 << 18)/* log recovery buffer */
>
>  /* flags used only internally */
>  #define _XBF_PAGES      (1 << 20)/* backed by refcounted pages */
> @@ -57,6 +58,7 @@ typedef unsigned int xfs_buf_flags_t;
>         { XBF_WRITE_FAIL,       "WRITE_FAIL" }, \
>         { _XBF_INODES,          "INODES" }, \
>         { _XBF_DQUOTS,          "DQUOTS" }, \
> +       { _XBF_LOGRCVY,         "LOG_RECOVERY" }, \
>         { _XBF_PAGES,           "PAGES" }, \
>         { _XBF_KMEM,            "KMEM" }, \
>         { _XBF_DELWRI_Q,        "DELWRI_Q" }, \
> diff --git a/fs/xfs/xfs_buf_item_recover.c b/fs/xfs/xfs_buf_item_recover.c
> index 04faa7310c4f0..bfd50daa16606 100644
> --- a/fs/xfs/xfs_buf_item_recover.c
> +++ b/fs/xfs/xfs_buf_item_recover.c
> @@ -419,8 +419,7 @@ xlog_recover_validate_buf_type(
>         if (bp->b_ops) {
>                 struct xfs_buf_log_item *bip;
>
> -               ASSERT(!bp->b_iodone || bp->b_iodone == xlog_recover_iodone);
> -               bp->b_iodone = xlog_recover_iodone;
> +               bp->b_flags |= _XBF_LOGRCVY;
>                 xfs_buf_item_init(bp, mp);
>                 bip = bp->b_log_item;
>                 bip->bli_item.li_lsn = current_lsn;
> @@ -963,7 +962,7 @@ xlog_recover_buf_commit_pass2(
>                 error = xfs_bwrite(bp);
>         } else {
>                 ASSERT(bp->b_mount == mp);
> -               bp->b_iodone = xlog_recover_iodone;
> +               bp->b_flags |= _XBF_LOGRCVY;
>                 xfs_buf_delwri_queue(bp, buffer_list);
>         }
>
> diff --git a/fs/xfs/xfs_dquot_item_recover.c b/fs/xfs/xfs_dquot_item_recover.c
> index 3400be4c88f08..a0a4b089e0cdd 100644
> --- a/fs/xfs/xfs_dquot_item_recover.c
> +++ b/fs/xfs/xfs_dquot_item_recover.c
> @@ -153,7 +153,7 @@ xlog_recover_dquot_commit_pass2(
>
>         ASSERT(dq_f->qlf_size == 2);
>         ASSERT(bp->b_mount == mp);
> -       bp->b_iodone = xlog_recover_iodone;
> +       bp->b_flags |= _XBF_LOGRCVY;
>         xfs_buf_delwri_queue(bp, buffer_list);
>
>  out_release:
> diff --git a/fs/xfs/xfs_inode_item_recover.c b/fs/xfs/xfs_inode_item_recover.c
> index dc3e26ff16c90..b67f1b7c5b65f 100644
> --- a/fs/xfs/xfs_inode_item_recover.c
> +++ b/fs/xfs/xfs_inode_item_recover.c
> @@ -376,7 +376,7 @@ xlog_recover_inode_commit_pass2(
>         xfs_dinode_calc_crc(log->l_mp, dip);
>
>         ASSERT(bp->b_mount == mp);
> -       bp->b_iodone = xlog_recover_iodone;
> +       bp->b_flags |= _XBF_LOGRCVY;
>         xfs_buf_delwri_queue(bp, buffer_list);
>
>  out_release:
> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> index ec015df55b77a..0aa823aeafca9 100644
> --- a/fs/xfs/xfs_log_recover.c
> +++ b/fs/xfs/xfs_log_recover.c
> @@ -287,9 +287,8 @@ xlog_recover_iodone(
>         if (bp->b_log_item)
>                 xfs_buf_item_relse(bp);
>         ASSERT(bp->b_log_item == NULL);
> -
> -       bp->b_iodone = NULL;
> -       xfs_buf_ioend(bp);
> +       bp->b_flags &= ~_XBF_LOGRCVY;
> +       xfs_buf_ioend_finish(bp);


For someone like me who does not know all the assumptions
about buffers, why is this fag leak prevention needed for log recovery
buffers and not for inode/dquote buffers?

Wouldn't it be better to have:
       bp->b_flags &= ~_XBF_BUFFER_TYPE_MASK;

inside xfs_buf_ioend_finish()?

For not changing logic by rearranging code:

Reviewed-by: Amir Goldstein <amir73il@gmail.com>

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 03/24] xfs: mark inode buffers in cache
  2020-05-22  3:50 ` [PATCH 03/24] xfs: mark inode buffers in cache Dave Chinner
@ 2020-05-22  7:45   ` Amir Goldstein
  2020-05-22 21:35   ` Darrick J. Wong
  2020-05-23  8:48   ` Christoph Hellwig
  2 siblings, 0 replies; 91+ messages in thread
From: Amir Goldstein @ 2020-05-22  7:45 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, May 22, 2020 at 6:51 AM Dave Chinner <david@fromorbit.com> wrote:
>
> From: Dave Chinner <dchinner@redhat.com>
>
> Inode buffers always have write IO callbacks, so by marking them
> directly we can avoid needing to attach ->b_iodone functions to
> them. This avoids an indirect call, and makes future modifications
> much simpler.
>
> This is largely a rearrangement of the code at this point - no IO
> completion functionality changes at this point, just how the
> code is run is modified.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_buf.c       | 18 +++++++++++++-----
>  fs/xfs/xfs_buf.h       | 39 ++++++++++++++++++++++++++-------------
>  fs/xfs/xfs_buf_item.c  | 42 +++++++++++++++++++++++++++++++-----------
>  fs/xfs/xfs_buf_item.h  |  1 +
>  fs/xfs/xfs_inode.c     |  2 +-
>  fs/xfs/xfs_trans_buf.c |  3 +++
>  6 files changed, 75 insertions(+), 30 deletions(-)
>
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 9c2fbb6bbf89d..6105b97028d6a 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -14,6 +14,8 @@
>  #include "xfs_mount.h"
>  #include "xfs_trace.h"
>  #include "xfs_log.h"
> +#include "xfs_trans.h"
> +#include "xfs_buf_item.h"
>  #include "xfs_errortag.h"
>  #include "xfs_error.h"
>
> @@ -1202,12 +1204,18 @@ xfs_buf_ioend(
>                 bp->b_flags |= XBF_DONE;
>         }
>
> -       if (bp->b_iodone)
> +       /* inodes always have a callback on write */
> +       if (!read && (bp->b_flags & _XBF_INODES)) {
> +               xfs_buf_inode_iodone(bp);
> +               return;
> +       }
> +
> +       if (bp->b_iodone) {
>                 (*(bp->b_iodone))(bp);
> -       else if (bp->b_flags & XBF_ASYNC)
> -               xfs_buf_relse(bp);
> -       else
> -               complete(&bp->b_iowait);
> +               return;
> +       }
> +
> +       xfs_buf_ioend_finish(bp);
>  }
>
>  static void
> diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
> index 050c53b739e24..b3e5d653d09f1 100644
> --- a/fs/xfs/xfs_buf.h
> +++ b/fs/xfs/xfs_buf.h
> @@ -30,15 +30,19 @@
>  #define XBF_STALE       (1 << 6) /* buffer has been staled, do not find it */
>  #define XBF_WRITE_FAIL  (1 << 7) /* async writes have failed on this buffer */
>
> -/* flags used only as arguments to access routines */
> -#define XBF_TRYLOCK     (1 << 16)/* lock requested, but do not wait */
> -#define XBF_UNMAPPED    (1 << 17)/* do not map the buffer */
> +/* buffer type flags for write callbacks */
> +#define _XBF_INODES     (1 << 16)/* inode buffer */

As I wrote on review of another type flag, best add a definition
of XBF_BUFFER_TYPE_MASK and document that buffer type
flags are mutually exclusive, maybe even ASSERT it in some
places.

For not changing logic by rearranging code:

Reviewed-by: Amir Goldstein <amir73il@gmail.com>

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 04/24] xfs: mark dquot buffers in cache
  2020-05-22  3:50 ` [PATCH 04/24] xfs: mark dquot " Dave Chinner
@ 2020-05-22  7:46   ` Amir Goldstein
  2020-05-22 21:38   ` Darrick J. Wong
  1 sibling, 0 replies; 91+ messages in thread
From: Amir Goldstein @ 2020-05-22  7:46 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, May 22, 2020 at 6:51 AM Dave Chinner <david@fromorbit.com> wrote:
>
> dquot buffers always have write IO callbacks, so by marking them
> directly we can avoid needing to attach ->b_iodone functions to
> them. This avoids an indirect call, and makes future modifications
> much simpler.
>
> This is largely a rearrangement of the code at this point - no IO
> completion functionality changes at this point, just how the
> code is run is modified.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_buf.c       | 12 +++++++++++-
>  fs/xfs/xfs_buf.h       |  2 ++
>  fs/xfs/xfs_buf_item.c  | 10 ++++++++++
>  fs/xfs/xfs_buf_item.h  |  1 +
>  fs/xfs/xfs_dquot.c     |  1 +
>  fs/xfs/xfs_trans_buf.c |  1 +
>  6 files changed, 26 insertions(+), 1 deletion(-)
>
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 6105b97028d6a..77d40eb4a11db 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -1204,17 +1204,27 @@ xfs_buf_ioend(
>                 bp->b_flags |= XBF_DONE;
>         }
>
> +       if (read)
> +               goto out_finish;
> +
>         /* inodes always have a callback on write */
> -       if (!read && (bp->b_flags & _XBF_INODES)) {
> +       if (bp->b_flags & _XBF_INODES) {
>                 xfs_buf_inode_iodone(bp);
>                 return;
>         }
>
> +       /* dquots always have a callback on write */
> +       if (bp->b_flags & _XBF_DQUOTS) {
> +               xfs_buf_dquot_iodone(bp);
> +               return;
> +       }
> +

As commented on another patch, this would look better as
a switch statement.

For not changing logic by rearranging code:

Reviewed-by: Amir Goldstein <amir73il@gmail.com>

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 06/24] xfs: call xfs_buf_iodone directly
  2020-05-22  3:50 ` [PATCH 06/24] xfs: call xfs_buf_iodone directly Dave Chinner
@ 2020-05-22  7:56   ` Amir Goldstein
  2020-05-22 21:53   ` Darrick J. Wong
  1 sibling, 0 replies; 91+ messages in thread
From: Amir Goldstein @ 2020-05-22  7:56 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, May 22, 2020 at 6:51 AM Dave Chinner <david@fromorbit.com> wrote:
>
> From: Dave Chinner <dchinner@redhat.com>
>
> All unmarked dirty buffers should be in the AIL and have log items
> attached to them. Hence when they are written, we will run a
> callback to remove the item from the AIL if appropriate. Now that
> we've handled inode and dquot buffers, all remaining calls are to
> xfs_buf_iodone() and so we can hard code this rather than use an
> indirect call.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
[...]

>  /*
> - * Inode buffer iodone callback function.
> + * Dquot buffer iodone callback function.
>   */
>  void
> -xfs_buf_inode_iodone(
> +xfs_buf_dquot_iodone(
>         struct xfs_buf          *bp)
>  {
>         xfs_buf_run_callbacks(bp);
> @@ -1211,10 +1191,10 @@ xfs_buf_inode_iodone(
>  }
>
>  /*
> - * Dquot buffer iodone callback function.
> + * Dirty buffer iodone callback function.
>   */
>  void
> -xfs_buf_dquot_iodone(
> +xfs_buf_dirty_iodone(

Nice cleanup!
Minor nit - if you added the new helper at the location where old helper
was removed, it would have avoided this strange looking diff.

For not changing logic by rearranging code:

Reviewed-by: Amir Goldstein <amir73il@gmail.com>

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 00/24] xfs: rework inode flushing to make inode reclaim fully asynchronous
  2020-05-22  6:18 ` Amir Goldstein
@ 2020-05-22 12:01   ` Amir Goldstein
  0 siblings, 0 replies; 91+ messages in thread
From: Amir Goldstein @ 2020-05-22 12:01 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

>
> Since this is a "long standing issue" I am sure you have tests.
> Can you please share the tests you used (simoop?) with some
> encouraging performance numbers - flattening the stall curve ;-).
>

Found the numbers and OOM story in patch 13.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 13/24] xfs: make inode reclaim almost non-blocking
  2020-05-22  3:50 ` [PATCH 13/24] xfs: make inode reclaim almost non-blocking Dave Chinner
@ 2020-05-22 12:19   ` Amir Goldstein
  2020-05-22 22:48   ` Darrick J. Wong
  1 sibling, 0 replies; 91+ messages in thread
From: Amir Goldstein @ 2020-05-22 12:19 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, May 22, 2020 at 6:51 AM Dave Chinner <david@fromorbit.com> wrote:
>
> From: Dave Chinner <dchinner@redhat.com>
>
> Now that dirty inode writeback doesn't cause read-modify-write
> cycles on the inode cluster buffer under memory pressure, the need
> to throttle memory reclaim to the rate at which we can clean dirty
> inodes goes away. That is due to the fact that we no longer thrash
> inode cluster buffers under memory pressure to clean dirty inodes.
>
> This means inode writeback no longer stalls on memory allocation
> or read IO, and hence can be done asynchrnously without generating
> memory pressure. As a result, blocking inode writeback in reclaim is
> no longer necessary to prevent reclaim priority windup as cleaning
> dirty inodes is no longer dependent on having memory reserves
> available for the filesystem to make progress reclaiming inodes.
>
> Hence we can convert inode reclaim to be non-blocking for shrinker
> callouts, both for direct reclaim and kswapd.
>
> On a vanilla kernel, running a 16-way fsmark create workload on a
> 4 node/16p/16GB RAM machine, I can reliably pin 14.75GB of RAM via
> userspace mlock(). The OOM killer gets invoked at 15GB of
> pinned RAM.
>
> With this patch alone, pinning memory triggers premature OOM
> killer invocation, sometimes with as much as 45% of RAM being free.
> It's trivially easy to trigger the OOM killer when reclaim does not
> block.
>
> With pinning inode clusters in RAM adn then adding this patch, I can

typo: adn

> reliably pin 14.5GB of RAM and still have the fsmark workload run to
> completion. The OOM killer gets invoked 14.75GB of pinned RAM, which
> is only a small amount of memory less than the vanilla kernel. It is
> much more reliable than just with async reclaim alone.
>
> simoops shows that allocation stalls go away when async reclaim is
> used. Vanilla kernel:
>
> Run time: 1924 seconds
> Read latency (p50: 3,305,472) (p95: 3,723,264) (p99: 4,001,792)
> Write latency (p50: 184,064) (p95: 553,984) (p99: 807,936)
> Allocation latency (p50: 2,641,920) (p95: 3,911,680) (p99: 4,464,640)
> work rate = 13.45/sec (avg 13.44/sec) (p50: 13.46) (p95: 13.58) (p99: 13.70)
> alloc stall rate = 3.80/sec (avg: 2.59) (p50: 2.54) (p95: 2.96) (p99: 3.02)
>
> With inode cluster pinning and async reclaim:
>
> Run time: 1924 seconds
> Read latency (p50: 3,305,472) (p95: 3,715,072) (p99: 3,977,216)
> Write latency (p50: 187,648) (p95: 553,984) (p99: 789,504)
> Allocation latency (p50: 2,748,416) (p95: 3,919,872) (p99: 4,448,256)
> work rate = 13.28/sec (avg 13.32/sec) (p50: 13.26) (p95: 13.34) (p99: 13.34)
> alloc stall rate = 0.02/sec (avg: 0.02) (p50: 0.01) (p95: 0.03) (p99: 0.03)
>
> Latencies don't really change much, nor does the work rate. However,
> allocation almost never stalls with these changes, whilst the
> vanilla kernel is sometimes reporting 20 stalls/s over a 60s sample
> period. This difference is due to inode reclaim being largely
> non-blocking now.
>
> IOWs, once we have pinned inode cluster buffers, we can make inode
> reclaim non-blocking without a major risk of premature and/or
> spurious OOM killer invocation, and without any changes to memory
> reclaim infrastructure.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

I can confirm observing the stalls on production servers and that
this patch alone fixes the stalls.

Reviewed-by: Amir Goldstein <amir73il@gmail.com>

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 21/24] xfs: rename xfs_iflush_int()
  2020-05-22  3:50 ` [PATCH 21/24] xfs: rename xfs_iflush_int() Dave Chinner
@ 2020-05-22 12:33   ` Amir Goldstein
  2020-05-22 23:57   ` Darrick J. Wong
  1 sibling, 0 replies; 91+ messages in thread
From: Amir Goldstein @ 2020-05-22 12:33 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, May 22, 2020 at 6:52 AM Dave Chinner <david@fromorbit.com> wrote:
>
> From: Dave Chinner <dchinner@redhat.com>
>
> with xfs_iflush() gone, we can rename xfs_iflush_int() back to
> xfs_iflush(). Also move it up above xfs_iflush_cluster() so we don't
> need the forward definition any more.
>

If it were up to me, I would avoid the code churn involved with moving code
around for the sole purpose of getting rid of forward definition.

BTW, next patch removed the static from xfs_iflush(), but it looks like
a mistake or part of your follow up series. If you are going to make
xfs_iflush() non static it is really pointless to move it around.

But for correctness:

Reviewed-by: Amir Goldstein <amir73il@gmail.com>

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 01/24] xfs: remove logged flag from inode log item
  2020-05-22  3:50 ` [PATCH 01/24] xfs: remove logged flag from inode log item Dave Chinner
  2020-05-22  7:25   ` Christoph Hellwig
@ 2020-05-22 21:13   ` Darrick J. Wong
  1 sibling, 0 replies; 91+ messages in thread
From: Darrick J. Wong @ 2020-05-22 21:13 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, May 22, 2020 at 01:50:06PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> This was used to track if the item had logged fields being flushed
> to disk. We log everything in the inode these days, so this logic is
> no longer needed. Remove it.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

AFAICT the only dinode field that isn't tracked through the inode item
is di_next_unlinked, which has always been applied separately (via the
inode cluster buffer).

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/xfs/xfs_inode.c      | 13 ++++---------
>  fs/xfs/xfs_inode_item.c | 35 ++++++++++-------------------------
>  fs/xfs/xfs_inode_item.h |  1 -
>  3 files changed, 14 insertions(+), 35 deletions(-)
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 64f5f9a440aed..ca9f2688b745d 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -2658,7 +2658,6 @@ xfs_ifree_cluster(
>  		list_for_each_entry(lip, &bp->b_li_list, li_bio_list) {
>  			if (lip->li_type == XFS_LI_INODE) {
>  				iip = (struct xfs_inode_log_item *)lip;
> -				ASSERT(iip->ili_logged == 1);
>  				lip->li_cb = xfs_istale_done;
>  				xfs_trans_ail_copy_lsn(mp->m_ail,
>  							&iip->ili_flush_lsn,
> @@ -2687,7 +2686,6 @@ xfs_ifree_cluster(
>  			iip->ili_last_fields = iip->ili_fields;
>  			iip->ili_fields = 0;
>  			iip->ili_fsync_fields = 0;
> -			iip->ili_logged = 1;
>  			xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,
>  						&iip->ili_item.li_lsn);
>  
> @@ -3819,19 +3817,16 @@ xfs_iflush_int(
>  	 *
>  	 * We can play with the ili_fields bits here, because the inode lock
>  	 * must be held exclusively in order to set bits there and the flush
> -	 * lock protects the ili_last_fields bits.  Set ili_logged so the flush
> -	 * done routine can tell whether or not to look in the AIL.  Also, store
> -	 * the current LSN of the inode so that we can tell whether the item has
> -	 * moved in the AIL from xfs_iflush_done().  In order to read the lsn we
> -	 * need the AIL lock, because it is a 64 bit value that cannot be read
> -	 * atomically.
> +	 * lock protects the ili_last_fields bits.  Store the current LSN of the
> +	 * inode so that we can tell whether the item has moved in the AIL from
> +	 * xfs_iflush_done().  In order to read the lsn we need the AIL lock,
> +	 * because it is a 64 bit value that cannot be read atomically.
>  	 */
>  	error = 0;
>  flush_out:
>  	iip->ili_last_fields = iip->ili_fields;
>  	iip->ili_fields = 0;
>  	iip->ili_fsync_fields = 0;
> -	iip->ili_logged = 1;
>  
>  	xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,
>  				&iip->ili_item.li_lsn);
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index ba47bf65b772b..b17384aa8df40 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -528,8 +528,6 @@ xfs_inode_item_push(
>  	}
>  
>  	ASSERT(iip->ili_fields != 0 || XFS_FORCED_SHUTDOWN(ip->i_mount));
> -	ASSERT(iip->ili_logged == 0 || XFS_FORCED_SHUTDOWN(ip->i_mount));
> -
>  	spin_unlock(&lip->li_ailp->ail_lock);
>  
>  	error = xfs_iflush(ip, &bp);
> @@ -690,30 +688,24 @@ xfs_iflush_done(
>  			continue;
>  
>  		list_move_tail(&blip->li_bio_list, &tmp);
> -		/*
> -		 * while we have the item, do the unlocked check for needing
> -		 * the AIL lock.
> -		 */
> +
> +		/* Do an unlocked check for needing the AIL lock. */
>  		iip = INODE_ITEM(blip);
> -		if ((iip->ili_logged && blip->li_lsn == iip->ili_flush_lsn) ||
> +		if (blip->li_lsn == iip->ili_flush_lsn ||
>  		    test_bit(XFS_LI_FAILED, &blip->li_flags))
>  			need_ail++;
>  	}
>  
>  	/* make sure we capture the state of the initial inode. */
>  	iip = INODE_ITEM(lip);
> -	if ((iip->ili_logged && lip->li_lsn == iip->ili_flush_lsn) ||
> +	if (lip->li_lsn == iip->ili_flush_lsn ||
>  	    test_bit(XFS_LI_FAILED, &lip->li_flags))
>  		need_ail++;
>  
>  	/*
> -	 * We only want to pull the item from the AIL if it is
> -	 * actually there and its location in the log has not
> -	 * changed since we started the flush.  Thus, we only bother
> -	 * if the ili_logged flag is set and the inode's lsn has not
> -	 * changed.  First we check the lsn outside
> -	 * the lock since it's cheaper, and then we recheck while
> -	 * holding the lock before removing the inode from the AIL.
> +	 * We only want to pull the item from the AIL if it is actually there
> +	 * and its location in the log has not changed since we started the
> +	 * flush.  Thus, we only bother if the inode's lsn has not changed.
>  	 */
>  	if (need_ail) {
>  		xfs_lsn_t	tail_lsn = 0;
> @@ -721,8 +713,7 @@ xfs_iflush_done(
>  		/* this is an opencoded batch version of xfs_trans_ail_delete */
>  		spin_lock(&ailp->ail_lock);
>  		list_for_each_entry(blip, &tmp, li_bio_list) {
> -			if (INODE_ITEM(blip)->ili_logged &&
> -			    blip->li_lsn == INODE_ITEM(blip)->ili_flush_lsn) {
> +			if (blip->li_lsn == INODE_ITEM(blip)->ili_flush_lsn) {
>  				/*
>  				 * xfs_ail_update_finish() only cares about the
>  				 * lsn of the first tail item removed, any
> @@ -740,14 +731,13 @@ xfs_iflush_done(
>  	}
>  
>  	/*
> -	 * clean up and unlock the flush lock now we are done. We can clear the
> +	 * Clean up and unlock the flush lock now we are done. We can clear the
>  	 * ili_last_fields bits now that we know that the data corresponding to
>  	 * them is safely on disk.
>  	 */
>  	list_for_each_entry_safe(blip, n, &tmp, li_bio_list) {
>  		list_del_init(&blip->li_bio_list);
>  		iip = INODE_ITEM(blip);
> -		iip->ili_logged = 0;
>  		iip->ili_last_fields = 0;
>  		xfs_ifunlock(iip->ili_inode);
>  	}
> @@ -768,16 +758,11 @@ xfs_iflush_abort(
>  
>  	if (iip) {
>  		xfs_trans_ail_delete(&iip->ili_item, 0);
> -		iip->ili_logged = 0;
> -		/*
> -		 * Clear the ili_last_fields bits now that we know that the
> -		 * data corresponding to them is safely on disk.
> -		 */
> -		iip->ili_last_fields = 0;
>  		/*
>  		 * Clear the inode logging fields so no more flushes are
>  		 * attempted.
>  		 */
> +		iip->ili_last_fields = 0;
>  		iip->ili_fields = 0;
>  		iip->ili_fsync_fields = 0;
>  	}
> diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h
> index 60b34bb66e8ed..4de5070e07655 100644
> --- a/fs/xfs/xfs_inode_item.h
> +++ b/fs/xfs/xfs_inode_item.h
> @@ -19,7 +19,6 @@ struct xfs_inode_log_item {
>  	xfs_lsn_t		ili_flush_lsn;	   /* lsn at last flush */
>  	xfs_lsn_t		ili_last_lsn;	   /* lsn at last transaction */
>  	unsigned short		ili_lock_flags;	   /* lock flags */
> -	unsigned short		ili_logged;	   /* flushed logged data */
>  	unsigned int		ili_last_fields;   /* fields when flushed */
>  	unsigned int		ili_fields;	   /* fields to be logged */
>  	unsigned int		ili_fsync_fields;  /* logged since last fsync */
> -- 
> 2.26.2.761.g0e0b3e54be
> 

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 02/24] xfs: add an inode item lock
  2020-05-22  3:50 ` [PATCH 02/24] xfs: add an inode item lock Dave Chinner
  2020-05-22  6:45   ` Amir Goldstein
@ 2020-05-22 21:24   ` Darrick J. Wong
  2020-05-23  8:45   ` Christoph Hellwig
  2 siblings, 0 replies; 91+ messages in thread
From: Darrick J. Wong @ 2020-05-22 21:24 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, May 22, 2020 at 01:50:07PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The inode log item is kind of special in that it can be aggregating
> new changes in memory at the same time time existing changes are
> being written back to disk. This means there are fields in the log
> item that are accessed concurrently from contexts that don't share
> any locking at all.
> 
> e.g. updating ili_last_fields occurs at flush time under the
> ILOCK_EXCL and flush lock at flush time, under the flush lock at IO
> completion time, and is read under the ILOCK_EXCL when the inode is
> logged.  Hence there is no actual serialisation between reading the
> field during logging of the inode in transactions vs clearing the
> field in IO completion.
> 
> We currently get away with this by the fact that we are only
> clearing fields in IO completion, and nothing bad happens if we
> accidentally log more of the inode than we actually modify. Worst
> case is we consume a tiny bit more memory and log bandwidth.
> 
> However, if we want to do more complex state manipulations on the
> log item that requires updates at all three of these potential
> locations, we need to have some mechanism of serialising those
> operations. To do this, introduce a spinlock into the log item to
> serialise internal state.
> 
> This could be done via the xfs_inode i_flags_lock, but this then
> leads to potential lock inversion issues where inode flag updates
> need to occur inside locks that best nest inside the inode log item
> locks (e.g. marking inodes stale during inode cluster freeing).
> Using a separate spinlock avoids these sorts of problems and
> simplifies future code.
> 
> This does not touch the use of ili_fields in the item formatting
> code - that is entirely protected by the ILOCK_EXCL at this point in
> time, so it remains untouched.

...at first lgance I thought it was odd that this patch then takes
ili_lock before clearing ili_fields, but then I realized that yes indeed
we also hold the ILOCK any time we do that.

> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_trans_inode.c | 44 ++++++++++++++++++---------------
>  fs/xfs/xfs_file.c               |  9 ++++---
>  fs/xfs/xfs_inode.c              | 20 +++++++++------
>  fs/xfs/xfs_inode_item.c         |  7 ++++++
>  fs/xfs/xfs_inode_item.h         |  3 ++-
>  5 files changed, 51 insertions(+), 32 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
> index b5dfb66548422..510b996008221 100644
> --- a/fs/xfs/libxfs/xfs_trans_inode.c
> +++ b/fs/xfs/libxfs/xfs_trans_inode.c
> @@ -81,15 +81,19 @@ xfs_trans_ichgtime(
>   */
>  void
>  xfs_trans_log_inode(
> -	xfs_trans_t	*tp,
> -	xfs_inode_t	*ip,
> -	uint		flags)
> +	struct xfs_trans	*tp,
> +	struct xfs_inode	*ip,
> +	uint			flags)
>  {
> -	struct inode	*inode = VFS_I(ip);
> +	struct xfs_inode_log_item *iip = ip->i_itemp;
> +	struct inode		*inode = VFS_I(ip);
> +	uint			iversion_flags = 0;
>  
>  	ASSERT(ip->i_itemp != NULL);
>  	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
>  
> +	tp->t_flags |= XFS_TRANS_DIRTY;
> +
>  	/*
>  	 * Don't bother with i_lock for the I_DIRTY_TIME check here, as races
>  	 * don't matter - we either will need an extra transaction in 24 hours
> @@ -102,15 +106,6 @@ xfs_trans_log_inode(
>  		spin_unlock(&inode->i_lock);
>  	}
>  
> -	/*
> -	 * Record the specific change for fdatasync optimisation. This
> -	 * allows fdatasync to skip log forces for inodes that are only
> -	 * timestamp dirty. We do this before the change count so that
> -	 * the core being logged in this case does not impact on fdatasync
> -	 * behaviour.
> -	 */
> -	ip->i_itemp->ili_fsync_fields |= flags;
> -
>  	/*
>  	 * First time we log the inode in a transaction, bump the inode change
>  	 * counter if it is configured for this to occur. While we have the
> @@ -120,13 +115,21 @@ xfs_trans_log_inode(
>  	 * set however, then go ahead and bump the i_version counter
>  	 * unconditionally.
>  	 */
> -	if (!test_and_set_bit(XFS_LI_DIRTY, &ip->i_itemp->ili_item.li_flags) &&
> -	    IS_I_VERSION(VFS_I(ip))) {
> -		if (inode_maybe_inc_iversion(VFS_I(ip), flags & XFS_ILOG_CORE))
> -			flags |= XFS_ILOG_CORE;
> +	if (!test_and_set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags) &&
> +	    IS_I_VERSION(inode)) {
> +		if (inode_maybe_inc_iversion(inode, flags & XFS_ILOG_CORE))
> +			iversion_flags = XFS_ILOG_CORE;
>  	}
>  
> -	tp->t_flags |= XFS_TRANS_DIRTY;
> +	/*
> +	 * Record the specific change for fdatasync optimisation. This
> +	 * allows fdatasync to skip log forces for inodes that are only
> +	 * timestamp dirty. We do this before the change count so that
> +	 * the core being logged in this case does not impact on fdatasync
> +	 * behaviour.
> +	 */
> +	spin_lock(&iip->ili_lock);
> +	iip->ili_fsync_fields |= flags;
>  
>  	/*
>  	 * Always OR in the bits from the ili_last_fields field.
> @@ -135,8 +138,9 @@ xfs_trans_log_inode(
>  	 * See the big comment in xfs_iflush() for an explanation of
>  	 * this coordination mechanism.
>  	 */
> -	flags |= ip->i_itemp->ili_last_fields;
> -	ip->i_itemp->ili_fields |= flags;
> +	flags |= iip->ili_last_fields | iversion_flags;
> +	iip->ili_fields |= flags;
> +	spin_unlock(&iip->ili_lock);
>  }
>  
>  int
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 403c90309a8ff..0abf770b77498 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -94,6 +94,7 @@ xfs_file_fsync(
>  {
>  	struct inode		*inode = file->f_mapping->host;
>  	struct xfs_inode	*ip = XFS_I(inode);
> +	struct xfs_inode_log_item *iip = ip->i_itemp;
>  	struct xfs_mount	*mp = ip->i_mount;
>  	int			error = 0;
>  	int			log_flushed = 0;
> @@ -137,13 +138,15 @@ xfs_file_fsync(
>  	xfs_ilock(ip, XFS_ILOCK_SHARED);
>  	if (xfs_ipincount(ip)) {
>  		if (!datasync ||
> -		    (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
> -			lsn = ip->i_itemp->ili_last_lsn;
> +		    (iip->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
> +			lsn = iip->ili_last_lsn;
>  	}
>  
>  	if (lsn) {
>  		error = xfs_log_force_lsn(mp, lsn, XFS_LOG_SYNC, &log_flushed);
> -		ip->i_itemp->ili_fsync_fields = 0;
> +		spin_lock(&iip->ili_lock);
> +		iip->ili_fsync_fields = 0;
> +		spin_unlock(&iip->ili_lock);
>  	}
>  	xfs_iunlock(ip, XFS_ILOCK_SHARED);
>  
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index ca9f2688b745d..57781c0dbbec5 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -2683,9 +2683,11 @@ xfs_ifree_cluster(
>  				continue;
>  
>  			iip = ip->i_itemp;
> +			spin_lock(&iip->ili_lock);
>  			iip->ili_last_fields = iip->ili_fields;
>  			iip->ili_fields = 0;
>  			iip->ili_fsync_fields = 0;
> +			spin_unlock(&iip->ili_lock);
>  			xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,
>  						&iip->ili_item.li_lsn);
>  
> @@ -2721,6 +2723,7 @@ xfs_ifree(
>  {
>  	int			error;
>  	struct xfs_icluster	xic = { 0 };
> +	struct xfs_inode_log_item *iip = ip->i_itemp;
>  
>  	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
>  	ASSERT(VFS_I(ip)->i_nlink == 0);
> @@ -2758,7 +2761,9 @@ xfs_ifree(
>  	ip->i_df.if_format = XFS_DINODE_FMT_EXTENTS;
>  
>  	/* Don't attempt to replay owner changes for a deleted inode */
> -	ip->i_itemp->ili_fields &= ~(XFS_ILOG_AOWNER|XFS_ILOG_DOWNER);
> +	spin_lock(&iip->ili_lock);
> +	iip->ili_fields &= ~(XFS_ILOG_AOWNER|XFS_ILOG_DOWNER);

Need spaces around the '|'...

> +	spin_unlock(&iip->ili_lock);
>  
>  	/*
>  	 * Bump the generation count so no one will be confused
> @@ -3814,20 +3819,19 @@ xfs_iflush_int(
>  	 * know that the information those bits represent is permanently on
>  	 * disk.  As long as the flush completes before the inode is logged
>  	 * again, then both ili_fields and ili_last_fields will be cleared.
> -	 *
> -	 * We can play with the ili_fields bits here, because the inode lock
> -	 * must be held exclusively in order to set bits there and the flush
> -	 * lock protects the ili_last_fields bits.  Store the current LSN of the
> -	 * inode so that we can tell whether the item has moved in the AIL from
> -	 * xfs_iflush_done().  In order to read the lsn we need the AIL lock,
> -	 * because it is a 64 bit value that cannot be read atomically.
>  	 */
>  	error = 0;
>  flush_out:
> +	spin_lock(&iip->ili_lock);
>  	iip->ili_last_fields = iip->ili_fields;
>  	iip->ili_fields = 0;
>  	iip->ili_fsync_fields = 0;
> +	spin_unlock(&iip->ili_lock);
>  
> +	/*
> +	 * Store the current LSN of the inode so that we can tell whether the
> +	 * item has moved in the AIL from xfs_iflush_done().
> +	 */
>  	xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,
>  				&iip->ili_item.li_lsn);
>  
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index b17384aa8df40..6ef9cbcfc94a7 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -637,6 +637,7 @@ xfs_inode_item_init(
>  	iip = ip->i_itemp = kmem_zone_zalloc(xfs_ili_zone, 0);
>  
>  	iip->ili_inode = ip;
> +	spin_lock_init(&iip->ili_lock);
>  	xfs_log_item_init(mp, &iip->ili_item, XFS_LI_INODE,
>  						&xfs_inode_item_ops);
>  }
> @@ -738,7 +739,11 @@ xfs_iflush_done(
>  	list_for_each_entry_safe(blip, n, &tmp, li_bio_list) {
>  		list_del_init(&blip->li_bio_list);
>  		iip = INODE_ITEM(blip);
> +
> +		spin_lock(&iip->ili_lock);
>  		iip->ili_last_fields = 0;
> +		spin_unlock(&iip->ili_lock);
> +
>  		xfs_ifunlock(iip->ili_inode);
>  	}
>  	list_del(&tmp);
> @@ -762,9 +767,11 @@ xfs_iflush_abort(
>  		 * Clear the inode logging fields so no more flushes are
>  		 * attempted.
>  		 */
> +		spin_lock(&iip->ili_lock);
>  		iip->ili_last_fields = 0;
>  		iip->ili_fields = 0;
>  		iip->ili_fsync_fields = 0;
> +		spin_unlock(&iip->ili_lock);
>  	}
>  	/*
>  	 * Release the inode's flush lock since we're done with it.
> diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h
> index 4de5070e07655..1234e8cd3726d 100644
> --- a/fs/xfs/xfs_inode_item.h
> +++ b/fs/xfs/xfs_inode_item.h
> @@ -18,7 +18,8 @@ struct xfs_inode_log_item {
>  	struct xfs_inode	*ili_inode;	   /* inode ptr */
>  	xfs_lsn_t		ili_flush_lsn;	   /* lsn at last flush */
>  	xfs_lsn_t		ili_last_lsn;	   /* lsn at last transaction */
> -	unsigned short		ili_lock_flags;	   /* lock flags */
> +	spinlock_t		ili_lock;	   /* internal state lock */

"internal state lock" is pretty vague to me.  I interpret the comment to
mean that ili_lock protects everything in this structure, when in fact
it doesn't cover ili_fields despite us holding ili_lock for *some* of
the ili_fields updates.

The comment for this lock ought to point out which fields it protects,
since it doesn't cover the entire structure.

--D

> +	unsigned short		ili_lock_flags;	   /* inode lock flags */
>  	unsigned int		ili_last_fields;   /* fields when flushed */
>  	unsigned int		ili_fields;	   /* fields to be logged */
>  	unsigned int		ili_fsync_fields;  /* logged since last fsync */
> -- 
> 2.26.2.761.g0e0b3e54be
> 

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 03/24] xfs: mark inode buffers in cache
  2020-05-22  3:50 ` [PATCH 03/24] xfs: mark inode buffers in cache Dave Chinner
  2020-05-22  7:45   ` Amir Goldstein
@ 2020-05-22 21:35   ` Darrick J. Wong
  2020-05-24 23:41     ` Dave Chinner
  2020-05-23  8:48   ` Christoph Hellwig
  2 siblings, 1 reply; 91+ messages in thread
From: Darrick J. Wong @ 2020-05-22 21:35 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, May 22, 2020 at 01:50:08PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Inode buffers always have write IO callbacks, so by marking them
> directly we can avoid needing to attach ->b_iodone functions to
> them. This avoids an indirect call, and makes future modifications
> much simpler.
> 
> This is largely a rearrangement of the code at this point - no IO
> completion functionality changes at this point, just how the
> code is run is modified.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_buf.c       | 18 +++++++++++++-----
>  fs/xfs/xfs_buf.h       | 39 ++++++++++++++++++++++++++-------------
>  fs/xfs/xfs_buf_item.c  | 42 +++++++++++++++++++++++++++++++-----------
>  fs/xfs/xfs_buf_item.h  |  1 +
>  fs/xfs/xfs_inode.c     |  2 +-
>  fs/xfs/xfs_trans_buf.c |  3 +++
>  6 files changed, 75 insertions(+), 30 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 9c2fbb6bbf89d..6105b97028d6a 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -14,6 +14,8 @@
>  #include "xfs_mount.h"
>  #include "xfs_trace.h"
>  #include "xfs_log.h"
> +#include "xfs_trans.h"
> +#include "xfs_buf_item.h"
>  #include "xfs_errortag.h"
>  #include "xfs_error.h"
>  
> @@ -1202,12 +1204,18 @@ xfs_buf_ioend(
>  		bp->b_flags |= XBF_DONE;
>  	}
>  
> -	if (bp->b_iodone)
> +	/* inodes always have a callback on write */
> +	if (!read && (bp->b_flags & _XBF_INODES)) {

I think this changes in the next patch.

> +		xfs_buf_inode_iodone(bp);
> +		return;
> +	}
> +
> +	if (bp->b_iodone) {
>  		(*(bp->b_iodone))(bp);
> -	else if (bp->b_flags & XBF_ASYNC)
> -		xfs_buf_relse(bp);
> -	else
> -		complete(&bp->b_iowait);
> +		return;
> +	}
> +
> +	xfs_buf_ioend_finish(bp);
>  }
>  
>  static void
> diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
> index 050c53b739e24..b3e5d653d09f1 100644
> --- a/fs/xfs/xfs_buf.h
> +++ b/fs/xfs/xfs_buf.h
> @@ -30,15 +30,19 @@
>  #define XBF_STALE	 (1 << 6) /* buffer has been staled, do not find it */
>  #define XBF_WRITE_FAIL	 (1 << 7) /* async writes have failed on this buffer */
>  
> -/* flags used only as arguments to access routines */
> -#define XBF_TRYLOCK	 (1 << 16)/* lock requested, but do not wait */
> -#define XBF_UNMAPPED	 (1 << 17)/* do not map the buffer */
> +/* buffer type flags for write callbacks */
> +#define _XBF_INODES	 (1 << 16)/* inode buffer */
>  
>  /* flags used only internally */
>  #define _XBF_PAGES	 (1 << 20)/* backed by refcounted pages */
>  #define _XBF_KMEM	 (1 << 21)/* backed by heap memory */
>  #define _XBF_DELWRI_Q	 (1 << 22)/* buffer on a delwri queue */
>  
> +/* flags used only as arguments to access routines */
> +#define XBF_TRYLOCK	 (1 << 30)/* lock requested, but do not wait */
> +#define XBF_UNMAPPED	 (1 << 31)/* do not map the buffer */
> +
> +

Double newline?

>  typedef unsigned int xfs_buf_flags_t;
>  
>  #define XFS_BUF_FLAGS \
> @@ -50,12 +54,13 @@ typedef unsigned int xfs_buf_flags_t;
>  	{ XBF_DONE,		"DONE" }, \
>  	{ XBF_STALE,		"STALE" }, \
>  	{ XBF_WRITE_FAIL,	"WRITE_FAIL" }, \
> -	{ XBF_TRYLOCK,		"TRYLOCK" },	/* should never be set */\
> -	{ XBF_UNMAPPED,		"UNMAPPED" },	/* ditto */\
> +	{ _XBF_INODES,		"INODES" }, \

This a toughie.  On the one hand if you're going to go introducing what
amounts to two-bit buffer io completion type in the middle of b_flags
then (like Amir says) this ideally would have a mask and switch
statements and whatnot.

I also wonder if we could tell the buffer type given all the
xfs_trans_buf_set_type calls, but I think the answer is that not every
buffer is guaranteed to have a buffer log item attached and a type code
set correctly?

OTOH there's only three states, so who cares, maybe this is fine...

--D

>  	{ _XBF_PAGES,		"PAGES" }, \
>  	{ _XBF_KMEM,		"KMEM" }, \
> -	{ _XBF_DELWRI_Q,	"DELWRI_Q" }
> -
> +	{ _XBF_DELWRI_Q,	"DELWRI_Q" }, \
> +	/* The following interface flags should never be set */ \
> +	{ XBF_TRYLOCK,		"TRYLOCK" }, \
> +	{ XBF_UNMAPPED,		"UNMAPPED" }
>  
>  /*
>   * Internal state flags.
> @@ -257,9 +262,23 @@ extern void xfs_buf_unlock(xfs_buf_t *);
>  #define xfs_buf_islocked(bp) \
>  	((bp)->b_sema.count <= 0)
>  
> +static inline void xfs_buf_relse(xfs_buf_t *bp)
> +{
> +	xfs_buf_unlock(bp);
> +	xfs_buf_rele(bp);
> +}
> +
>  /* Buffer Read and Write Routines */
>  extern int xfs_bwrite(struct xfs_buf *bp);
>  extern void xfs_buf_ioend(struct xfs_buf *bp);
> +static inline void xfs_buf_ioend_finish(struct xfs_buf *bp)
> +{
> +	if (bp->b_flags & XBF_ASYNC)
> +		xfs_buf_relse(bp);
> +	else
> +		complete(&bp->b_iowait);
> +}
> +
>  extern void __xfs_buf_ioerror(struct xfs_buf *bp, int error,
>  		xfs_failaddr_t failaddr);
>  #define xfs_buf_ioerror(bp, err) __xfs_buf_ioerror((bp), (err), __this_address)
> @@ -324,12 +343,6 @@ static inline int xfs_buf_ispinned(struct xfs_buf *bp)
>  	return atomic_read(&bp->b_pin_count);
>  }
>  
> -static inline void xfs_buf_relse(xfs_buf_t *bp)
> -{
> -	xfs_buf_unlock(bp);
> -	xfs_buf_rele(bp);
> -}
> -
>  static inline int
>  xfs_buf_verify_cksum(struct xfs_buf *bp, unsigned long cksum_offset)
>  {
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index 9e75e8d6042ec..8659cf4282a64 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
> @@ -1158,20 +1158,15 @@ xfs_buf_iodone_callback_error(
>  	return false;
>  }
>  
> -/*
> - * This is the iodone() function for buffers which have had callbacks attached
> - * to them by xfs_buf_attach_iodone(). We need to iterate the items on the
> - * callback list, mark the buffer as having no more callbacks and then push the
> - * buffer through IO completion processing.
> - */
> -void
> -xfs_buf_iodone_callbacks(
> +static void
> +xfs_buf_run_callbacks(
>  	struct xfs_buf		*bp)
>  {
> +
>  	/*
> -	 * If there is an error, process it. Some errors require us
> -	 * to run callbacks after failure processing is done so we
> -	 * detect that and take appropriate action.
> +	 * If there is an error, process it. Some errors require us to run
> +	 * callbacks after failure processing is done so we detect that and take
> +	 * appropriate action.
>  	 */
>  	if (bp->b_error && xfs_buf_iodone_callback_error(bp))
>  		return;
> @@ -1188,9 +1183,34 @@ xfs_buf_iodone_callbacks(
>  	bp->b_log_item = NULL;
>  	list_del_init(&bp->b_li_list);
>  	bp->b_iodone = NULL;
> +}
> +
> +/*
> + * This is the iodone() function for buffers which have had callbacks attached
> + * to them by xfs_buf_attach_iodone(). We need to iterate the items on the
> + * callback list, mark the buffer as having no more callbacks and then push the
> + * buffer through IO completion processing.
> + */
> +void
> +xfs_buf_iodone_callbacks(
> +	struct xfs_buf		*bp)
> +{
> +	xfs_buf_run_callbacks(bp);
>  	xfs_buf_ioend(bp);
>  }
>  
> +/*
> + * Inode buffer iodone callback function.
> + */
> +void
> +xfs_buf_inode_iodone(
> +	struct xfs_buf		*bp)
> +{
> +	xfs_buf_run_callbacks(bp);
> +	xfs_buf_ioend_finish(bp);
> +}
> +
> +
>  /*
>   * This is the iodone() function for buffers which have been
>   * logged.  It is called when they are eventually flushed out.
> diff --git a/fs/xfs/xfs_buf_item.h b/fs/xfs/xfs_buf_item.h
> index c9c57e2da9327..a342933ad9b8d 100644
> --- a/fs/xfs/xfs_buf_item.h
> +++ b/fs/xfs/xfs_buf_item.h
> @@ -59,6 +59,7 @@ void	xfs_buf_attach_iodone(struct xfs_buf *,
>  			      struct xfs_log_item *);
>  void	xfs_buf_iodone_callbacks(struct xfs_buf *);
>  void	xfs_buf_iodone(struct xfs_buf *, struct xfs_log_item *);
> +void	xfs_buf_inode_iodone(struct xfs_buf *);
>  bool	xfs_buf_log_check_iovec(struct xfs_log_iovec *iovec);
>  
>  extern kmem_zone_t	*xfs_buf_item_zone;
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 57781c0dbbec5..607c9d9bb2b40 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -3841,13 +3841,13 @@ xfs_iflush_int(
>  	 * completion on the buffer to remove the inode from the AIL and release
>  	 * the flush lock.
>  	 */
> +	bp->b_flags |= _XBF_INODES;
>  	xfs_buf_attach_iodone(bp, xfs_iflush_done, &iip->ili_item);
>  
>  	/* generate the checksum. */
>  	xfs_dinode_calc_crc(mp, dip);
>  
>  	ASSERT(!list_empty(&bp->b_li_list));
> -	ASSERT(bp->b_iodone != NULL);
>  	return error;
>  }
>  
> diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c
> index 08174ffa21189..552d0869aa0fe 100644
> --- a/fs/xfs/xfs_trans_buf.c
> +++ b/fs/xfs/xfs_trans_buf.c
> @@ -626,6 +626,7 @@ xfs_trans_inode_buf(
>  	ASSERT(atomic_read(&bip->bli_refcount) > 0);
>  
>  	bip->bli_flags |= XFS_BLI_INODE_BUF;
> +	bp->b_flags |= _XBF_INODES;
>  	xfs_trans_buf_set_type(tp, bp, XFS_BLFT_DINO_BUF);
>  }
>  
> @@ -651,6 +652,7 @@ xfs_trans_stale_inode_buf(
>  
>  	bip->bli_flags |= XFS_BLI_STALE_INODE;
>  	bip->bli_item.li_cb = xfs_buf_iodone;
> +	bp->b_flags |= _XBF_INODES;
>  	xfs_trans_buf_set_type(tp, bp, XFS_BLFT_DINO_BUF);
>  }
>  
> @@ -675,6 +677,7 @@ xfs_trans_inode_alloc_buf(
>  	ASSERT(atomic_read(&bip->bli_refcount) > 0);
>  
>  	bip->bli_flags |= XFS_BLI_INODE_ALLOC_BUF;
> +	bp->b_flags |= _XBF_INODES;
>  	xfs_trans_buf_set_type(tp, bp, XFS_BLFT_DINO_BUF);
>  }
>  
> -- 
> 2.26.2.761.g0e0b3e54be
> 

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 04/24] xfs: mark dquot buffers in cache
  2020-05-22  3:50 ` [PATCH 04/24] xfs: mark dquot " Dave Chinner
  2020-05-22  7:46   ` Amir Goldstein
@ 2020-05-22 21:38   ` Darrick J. Wong
  1 sibling, 0 replies; 91+ messages in thread
From: Darrick J. Wong @ 2020-05-22 21:38 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, May 22, 2020 at 01:50:09PM +1000, Dave Chinner wrote:
> dquot buffers always have write IO callbacks, so by marking them
> directly we can avoid needing to attach ->b_iodone functions to
> them. This avoids an indirect call, and makes future modifications
> much simpler.
> 
> This is largely a rearrangement of the code at this point - no IO
> completion functionality changes at this point, just how the
> code is run is modified.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_buf.c       | 12 +++++++++++-
>  fs/xfs/xfs_buf.h       |  2 ++
>  fs/xfs/xfs_buf_item.c  | 10 ++++++++++
>  fs/xfs/xfs_buf_item.h  |  1 +
>  fs/xfs/xfs_dquot.c     |  1 +
>  fs/xfs/xfs_trans_buf.c |  1 +
>  6 files changed, 26 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 6105b97028d6a..77d40eb4a11db 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -1204,17 +1204,27 @@ xfs_buf_ioend(
>  		bp->b_flags |= XBF_DONE;
>  	}
>  
> +	if (read)
> +		goto out_finish;
> +
>  	/* inodes always have a callback on write */
> -	if (!read && (bp->b_flags & _XBF_INODES)) {
> +	if (bp->b_flags & _XBF_INODES) {

Heh, yes, this does seem like it belongs in the previous patch.

With that moved,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

>  		xfs_buf_inode_iodone(bp);
>  		return;
>  	}
>  
> +	/* dquots always have a callback on write */
> +	if (bp->b_flags & _XBF_DQUOTS) {
> +		xfs_buf_dquot_iodone(bp);
> +		return;
> +	}
> +
>  	if (bp->b_iodone) {
>  		(*(bp->b_iodone))(bp);
>  		return;
>  	}
>  
> +out_finish:
>  	xfs_buf_ioend_finish(bp);
>  }
>  
> diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
> index b3e5d653d09f1..cbde44ecb3963 100644
> --- a/fs/xfs/xfs_buf.h
> +++ b/fs/xfs/xfs_buf.h
> @@ -32,6 +32,7 @@
>  
>  /* buffer type flags for write callbacks */
>  #define _XBF_INODES	 (1 << 16)/* inode buffer */
> +#define _XBF_DQUOTS	 (1 << 17)/* dquot buffer */
>  
>  /* flags used only internally */
>  #define _XBF_PAGES	 (1 << 20)/* backed by refcounted pages */
> @@ -55,6 +56,7 @@ typedef unsigned int xfs_buf_flags_t;
>  	{ XBF_STALE,		"STALE" }, \
>  	{ XBF_WRITE_FAIL,	"WRITE_FAIL" }, \
>  	{ _XBF_INODES,		"INODES" }, \
> +	{ _XBF_DQUOTS,		"DQUOTS" }, \
>  	{ _XBF_PAGES,		"PAGES" }, \
>  	{ _XBF_KMEM,		"KMEM" }, \
>  	{ _XBF_DELWRI_Q,	"DELWRI_Q" }, \
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index 8659cf4282a64..a42cdf9ccc47d 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
> @@ -1210,6 +1210,16 @@ xfs_buf_inode_iodone(
>  	xfs_buf_ioend_finish(bp);
>  }
>  
> +/*
> + * Dquot buffer iodone callback function.
> + */
> +void
> +xfs_buf_dquot_iodone(
> +	struct xfs_buf		*bp)
> +{
> +	xfs_buf_run_callbacks(bp);
> +	xfs_buf_ioend_finish(bp);
> +}
>  
>  /*
>   * This is the iodone() function for buffers which have been
> diff --git a/fs/xfs/xfs_buf_item.h b/fs/xfs/xfs_buf_item.h
> index a342933ad9b8d..27d13d29b5bbb 100644
> --- a/fs/xfs/xfs_buf_item.h
> +++ b/fs/xfs/xfs_buf_item.h
> @@ -60,6 +60,7 @@ void	xfs_buf_attach_iodone(struct xfs_buf *,
>  void	xfs_buf_iodone_callbacks(struct xfs_buf *);
>  void	xfs_buf_iodone(struct xfs_buf *, struct xfs_log_item *);
>  void	xfs_buf_inode_iodone(struct xfs_buf *);
> +void	xfs_buf_dquot_iodone(struct xfs_buf *);
>  bool	xfs_buf_log_check_iovec(struct xfs_log_iovec *iovec);
>  
>  extern kmem_zone_t	*xfs_buf_item_zone;
> diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
> index 55b95d45303b8..25592b701db40 100644
> --- a/fs/xfs/xfs_dquot.c
> +++ b/fs/xfs/xfs_dquot.c
> @@ -1174,6 +1174,7 @@ xfs_qm_dqflush(
>  	 * Attach an iodone routine so that we can remove this dquot from the
>  	 * AIL and release the flush lock once the dquot is synced to disk.
>  	 */
> +	bp->b_flags |= _XBF_DQUOTS;
>  	xfs_buf_attach_iodone(bp, xfs_qm_dqflush_done,
>  				  &dqp->q_logitem.qli_item);
>  
> diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c
> index 552d0869aa0fe..93d62cb864c15 100644
> --- a/fs/xfs/xfs_trans_buf.c
> +++ b/fs/xfs/xfs_trans_buf.c
> @@ -788,5 +788,6 @@ xfs_trans_dquot_buf(
>  		break;
>  	}
>  
> +	bp->b_flags |= _XBF_DQUOTS;
>  	xfs_trans_buf_set_type(tp, bp, type);
>  }
> -- 
> 2.26.2.761.g0e0b3e54be
> 

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 05/24] xfs: mark log recovery buffers for completion
  2020-05-22  3:50 ` [PATCH 05/24] xfs: mark log recovery buffers for completion Dave Chinner
  2020-05-22  7:41   ` Amir Goldstein
@ 2020-05-22 21:41   ` Darrick J. Wong
  1 sibling, 0 replies; 91+ messages in thread
From: Darrick J. Wong @ 2020-05-22 21:41 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, May 22, 2020 at 01:50:10PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Log recovery has it's own buffer write completion handler for
> buffers that it directly recovers. Convert these to direct calls by
> flagging these buffers as being log recovery buffers. The flag will
> get cleared by the log recovery IO completion routine, so it will
> never leak out of log recovery.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_buf.c                | 10 ++++++++++
>  fs/xfs/xfs_buf.h                |  2 ++
>  fs/xfs/xfs_buf_item_recover.c   |  5 ++---
>  fs/xfs/xfs_dquot_item_recover.c |  2 +-
>  fs/xfs/xfs_inode_item_recover.c |  2 +-
>  fs/xfs/xfs_log_recover.c        |  5 ++---
>  6 files changed, 18 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 77d40eb4a11db..b89685ce8519d 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -14,6 +14,7 @@
>  #include "xfs_mount.h"
>  #include "xfs_trace.h"
>  #include "xfs_log.h"
> +#include "xfs_log_recover.h"
>  #include "xfs_trans.h"
>  #include "xfs_buf_item.h"
>  #include "xfs_errortag.h"
> @@ -1207,6 +1208,15 @@ xfs_buf_ioend(
>  	if (read)
>  		goto out_finish;
>  
> +	/*
> +	 * If this is a log recovery buffer, we aren't doing transactional IO
> +	 * yet so we need to let it handle IO completions.
> +	 */
> +	if (bp->b_flags & _XBF_LOGRCVY) {
> +		xlog_recover_iodone(bp);
> +		return;
> +	}
> +
>  	/* inodes always have a callback on write */
>  	if (bp->b_flags & _XBF_INODES) {
>  		xfs_buf_inode_iodone(bp);
> diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
> index cbde44ecb3963..c5fe4c48c9080 100644
> --- a/fs/xfs/xfs_buf.h
> +++ b/fs/xfs/xfs_buf.h
> @@ -33,6 +33,7 @@
>  /* buffer type flags for write callbacks */
>  #define _XBF_INODES	 (1 << 16)/* inode buffer */
>  #define _XBF_DQUOTS	 (1 << 17)/* dquot buffer */
> +#define _XBF_LOGRCVY	 (1 << 18)/* log recovery buffer */
>  
>  /* flags used only internally */
>  #define _XBF_PAGES	 (1 << 20)/* backed by refcounted pages */
> @@ -57,6 +58,7 @@ typedef unsigned int xfs_buf_flags_t;
>  	{ XBF_WRITE_FAIL,	"WRITE_FAIL" }, \
>  	{ _XBF_INODES,		"INODES" }, \
>  	{ _XBF_DQUOTS,		"DQUOTS" }, \
> +	{ _XBF_LOGRCVY,		"LOG_RECOVERY" }, \

Pat Sajak would be very displeased with you for cheating him out of
the purchase of three vowels. :P

_XBF_LOGRECOVERY, please...

Otherwise this looks like a pretty straightforward change.

Ahah, I guess I was wrong, there are *four* states.

--D

>  	{ _XBF_PAGES,		"PAGES" }, \
>  	{ _XBF_KMEM,		"KMEM" }, \
>  	{ _XBF_DELWRI_Q,	"DELWRI_Q" }, \
> diff --git a/fs/xfs/xfs_buf_item_recover.c b/fs/xfs/xfs_buf_item_recover.c
> index 04faa7310c4f0..bfd50daa16606 100644
> --- a/fs/xfs/xfs_buf_item_recover.c
> +++ b/fs/xfs/xfs_buf_item_recover.c
> @@ -419,8 +419,7 @@ xlog_recover_validate_buf_type(
>  	if (bp->b_ops) {
>  		struct xfs_buf_log_item	*bip;
>  
> -		ASSERT(!bp->b_iodone || bp->b_iodone == xlog_recover_iodone);
> -		bp->b_iodone = xlog_recover_iodone;
> +		bp->b_flags |= _XBF_LOGRCVY;
>  		xfs_buf_item_init(bp, mp);
>  		bip = bp->b_log_item;
>  		bip->bli_item.li_lsn = current_lsn;
> @@ -963,7 +962,7 @@ xlog_recover_buf_commit_pass2(
>  		error = xfs_bwrite(bp);
>  	} else {
>  		ASSERT(bp->b_mount == mp);
> -		bp->b_iodone = xlog_recover_iodone;
> +		bp->b_flags |= _XBF_LOGRCVY;
>  		xfs_buf_delwri_queue(bp, buffer_list);
>  	}
>  
> diff --git a/fs/xfs/xfs_dquot_item_recover.c b/fs/xfs/xfs_dquot_item_recover.c
> index 3400be4c88f08..a0a4b089e0cdd 100644
> --- a/fs/xfs/xfs_dquot_item_recover.c
> +++ b/fs/xfs/xfs_dquot_item_recover.c
> @@ -153,7 +153,7 @@ xlog_recover_dquot_commit_pass2(
>  
>  	ASSERT(dq_f->qlf_size == 2);
>  	ASSERT(bp->b_mount == mp);
> -	bp->b_iodone = xlog_recover_iodone;
> +	bp->b_flags |= _XBF_LOGRCVY;
>  	xfs_buf_delwri_queue(bp, buffer_list);
>  
>  out_release:
> diff --git a/fs/xfs/xfs_inode_item_recover.c b/fs/xfs/xfs_inode_item_recover.c
> index dc3e26ff16c90..b67f1b7c5b65f 100644
> --- a/fs/xfs/xfs_inode_item_recover.c
> +++ b/fs/xfs/xfs_inode_item_recover.c
> @@ -376,7 +376,7 @@ xlog_recover_inode_commit_pass2(
>  	xfs_dinode_calc_crc(log->l_mp, dip);
>  
>  	ASSERT(bp->b_mount == mp);
> -	bp->b_iodone = xlog_recover_iodone;
> +	bp->b_flags |= _XBF_LOGRCVY;
>  	xfs_buf_delwri_queue(bp, buffer_list);
>  
>  out_release:
> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> index ec015df55b77a..0aa823aeafca9 100644
> --- a/fs/xfs/xfs_log_recover.c
> +++ b/fs/xfs/xfs_log_recover.c
> @@ -287,9 +287,8 @@ xlog_recover_iodone(
>  	if (bp->b_log_item)
>  		xfs_buf_item_relse(bp);
>  	ASSERT(bp->b_log_item == NULL);
> -
> -	bp->b_iodone = NULL;
> -	xfs_buf_ioend(bp);
> +	bp->b_flags &= ~_XBF_LOGRCVY;
> +	xfs_buf_ioend_finish(bp);
>  }
>  
>  /*
> -- 
> 2.26.2.761.g0e0b3e54be
> 

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 06/24] xfs: call xfs_buf_iodone directly
  2020-05-22  3:50 ` [PATCH 06/24] xfs: call xfs_buf_iodone directly Dave Chinner
  2020-05-22  7:56   ` Amir Goldstein
@ 2020-05-22 21:53   ` Darrick J. Wong
  1 sibling, 0 replies; 91+ messages in thread
From: Darrick J. Wong @ 2020-05-22 21:53 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, May 22, 2020 at 01:50:11PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> All unmarked dirty buffers should be in the AIL and have log items
> attached to them. Hence when they are written, we will run a
> callback to remove the item from the AIL if appropriate. Now that
> we've handled inode and dquot buffers, all remaining calls are to
> xfs_buf_iodone() and so we can hard code this rather than use an
> indirect call.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_buf.c       | 24 +++++++++---------------
>  fs/xfs/xfs_buf.h       |  6 +-----
>  fs/xfs/xfs_buf_item.c  | 38 +++++++++-----------------------------
>  fs/xfs/xfs_buf_item.h  |  2 +-
>  fs/xfs/xfs_trans_buf.c |  9 +--------
>  5 files changed, 21 insertions(+), 58 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index b89685ce8519d..5fc83637f7aff 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -658,7 +658,6 @@ xfs_buf_find(
>  	 */
>  	if (bp->b_flags & XBF_STALE) {
>  		ASSERT((bp->b_flags & _XBF_DELWRI_Q) == 0);
> -		ASSERT(bp->b_iodone == NULL);
>  		bp->b_flags &= _XBF_KMEM | _XBF_PAGES;
>  		bp->b_ops = NULL;
>  	}
> @@ -1194,10 +1193,13 @@ xfs_buf_ioend(
>  	if (!bp->b_error && bp->b_io_error)
>  		xfs_buf_ioerror(bp, bp->b_io_error);
>  
> -	/* Only validate buffers that were read without errors */
> -	if (read && !bp->b_error && bp->b_ops) {
> -		ASSERT(!bp->b_iodone);
> -		bp->b_ops->verify_read(bp);
> +	if (read) {
> +		if (!bp->b_error && bp->b_ops)
> +			bp->b_ops->verify_read(bp);
> +		if (!bp->b_error)
> +			bp->b_flags |= XBF_DONE;
> +		xfs_buf_ioend_finish(bp);
> +		return;
>  	}
>  
>  	if (!bp->b_error) {
> @@ -1205,9 +1207,6 @@ xfs_buf_ioend(
>  		bp->b_flags |= XBF_DONE;
>  	}
>  
> -	if (read)
> -		goto out_finish;
> -
>  	/*
>  	 * If this is a log recovery buffer, we aren't doing transactional IO
>  	 * yet so we need to let it handle IO completions.
> @@ -1229,13 +1228,8 @@ xfs_buf_ioend(
>  		return;
>  	}
>  
> -	if (bp->b_iodone) {
> -		(*(bp->b_iodone))(bp);
> -		return;
> -	}
> -
> -out_finish:
> -	xfs_buf_ioend_finish(bp);
> +	/* dirty buffers always need to be removed from the AIL */
> +	xfs_buf_dirty_iodone(bp);
>  }
>  
>  static void
> diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
> index c5fe4c48c9080..a127d56d5eff0 100644
> --- a/fs/xfs/xfs_buf.h
> +++ b/fs/xfs/xfs_buf.h
> @@ -18,6 +18,7 @@
>  /*
>   *	Base types
>   */
> +struct xfs_buf;
>  
>  #define XFS_BUF_DADDR_NULL	((xfs_daddr_t) (-1LL))
>  
> @@ -103,10 +104,6 @@ typedef struct xfs_buftarg {
>  	struct ratelimit_state	bt_ioerror_rl;
>  } xfs_buftarg_t;
>  
> -struct xfs_buf;
> -typedef void (*xfs_buf_iodone_t)(struct xfs_buf *);
> -
> -
>  #define XB_PAGES	2
>  
>  struct xfs_buf_map {
> @@ -159,7 +156,6 @@ typedef struct xfs_buf {
>  	xfs_buftarg_t		*b_target;	/* buffer target (device) */
>  	void			*b_addr;	/* virtual address of buffer */
>  	struct work_struct	b_ioend_work;
> -	xfs_buf_iodone_t	b_iodone;	/* I/O completion function */
>  	struct completion	b_iowait;	/* queue for I/O waiters */
>  	struct xfs_buf_log_item	*b_log_item;
>  	struct list_head	b_li_list;	/* Log items list head */
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index a42cdf9ccc47d..c2e7d14e35c66 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
> @@ -460,7 +460,6 @@ xfs_buf_item_unpin(
>  			xfs_buf_do_callbacks(bp);
>  			bp->b_log_item = NULL;
>  			list_del_init(&bp->b_li_list);
> -			bp->b_iodone = NULL;
>  		} else {
>  			xfs_trans_ail_delete(lip, SHUTDOWN_LOG_IO_ERROR);
>  			xfs_buf_item_relse(bp);
> @@ -936,11 +935,7 @@ xfs_buf_item_free(
>  }
>  
>  /*
> - * This is called when the buf log item is no longer needed.  It should
> - * free the buf log item associated with the given buffer and clear
> - * the buffer's pointer to the buf log item.  If there are no more
> - * items in the list, clear the b_iodone field of the buffer (see
> - * xfs_buf_attach_iodone() below).
> + * xfs_buf_item_relse() is called when the buf log item is no longer needed.
>   */
>  void
>  xfs_buf_item_relse(
> @@ -952,9 +947,6 @@ xfs_buf_item_relse(
>  	ASSERT(!test_bit(XFS_LI_IN_AIL, &bip->bli_item.li_flags));
>  
>  	bp->b_log_item = NULL;
> -	if (list_empty(&bp->b_li_list))
> -		bp->b_iodone = NULL;
> -
>  	xfs_buf_rele(bp);
>  	xfs_buf_item_free(bip);
>  }
> @@ -962,10 +954,7 @@ xfs_buf_item_relse(
>  
>  /*
>   * Add the given log item with its callback to the list of callbacks
> - * to be called when the buffer's I/O completes.  If it is not set
> - * already, set the buffer's b_iodone() routine to be
> - * xfs_buf_iodone_callbacks() and link the log item into the list of
> - * items rooted at b_li_list.
> + * to be called when the buffer's I/O completes.
>   */
>  void
>  xfs_buf_attach_iodone(
> @@ -977,10 +966,6 @@ xfs_buf_attach_iodone(
>  
>  	lip->li_cb = cb;
>  	list_add_tail(&lip->li_bio_list, &bp->b_li_list);
> -
> -	ASSERT(bp->b_iodone == NULL ||
> -	       bp->b_iodone == xfs_buf_iodone_callbacks);
> -	bp->b_iodone = xfs_buf_iodone_callbacks;
>  }
>  
>  /*
> @@ -1096,7 +1081,6 @@ xfs_buf_iodone_callback_error(
>  		goto out_stale;
>  
>  	trace_xfs_buf_item_iodone_async(bp, _RET_IP_);
> -	ASSERT(bp->b_iodone != NULL);
>  
>  	cfg = xfs_error_get_cfg(mp, XFS_ERR_METADATA, bp->b_error);
>  
> @@ -1182,28 +1166,24 @@ xfs_buf_run_callbacks(
>  	xfs_buf_do_callbacks(bp);
>  	bp->b_log_item = NULL;
>  	list_del_init(&bp->b_li_list);
> -	bp->b_iodone = NULL;
>  }
>  
>  /*
> - * This is the iodone() function for buffers which have had callbacks attached
> - * to them by xfs_buf_attach_iodone(). We need to iterate the items on the
> - * callback list, mark the buffer as having no more callbacks and then push the
> - * buffer through IO completion processing.
> + * Inode buffer iodone callback function.
>   */
>  void
> -xfs_buf_iodone_callbacks(
> +xfs_buf_inode_iodone(
>  	struct xfs_buf		*bp)
>  {
>  	xfs_buf_run_callbacks(bp);
> -	xfs_buf_ioend(bp);
> +	xfs_buf_ioend_finish(bp);
>  }
>  
>  /*
> - * Inode buffer iodone callback function.
> + * Dquot buffer iodone callback function.
>   */
>  void
> -xfs_buf_inode_iodone(
> +xfs_buf_dquot_iodone(
>  	struct xfs_buf		*bp)
>  {
>  	xfs_buf_run_callbacks(bp);
> @@ -1211,10 +1191,10 @@ xfs_buf_inode_iodone(
>  }
>  
>  /*
> - * Dquot buffer iodone callback function.
> + * Dirty buffer iodone callback function.
>   */
>  void
> -xfs_buf_dquot_iodone(
> +xfs_buf_dirty_iodone(

git diff --patience would make the relatively little change going on
here much clearer:

@@ -1182,21 +1166,6 @@ xfs_buf_run_callbacks(
        xfs_buf_do_callbacks(bp);
        bp->b_log_item = NULL;
        list_del_init(&bp->b_li_list);
-       bp->b_iodone = NULL;
-}
-
-/*
- * This is the iodone() function for buffers which have had callbacks attached
- * to them by xfs_buf_attach_iodone(). We need to iterate the items on the
- * callback list, mark the buffer as having no more callbacks and then push the
- * buffer through IO completion processing.
- */
-void
-xfs_buf_iodone_callbacks(
-       struct xfs_buf          *bp)
-{
-       xfs_buf_run_callbacks(bp);
-       xfs_buf_ioend(bp);
 }
 
 /*
@@ -1221,6 +1190,17 @@ xfs_buf_dquot_iodone(
        xfs_buf_ioend_finish(bp);
 }
 
+/*
+ * Dirty buffer iodone callback function.
+ */
+void
+xfs_buf_dirty_iodone(
+       struct xfs_buf          *bp)
+{
+       xfs_buf_run_callbacks(bp);
+       xfs_buf_ioend_finish(bp);
+}
+
 /*
  * This is the iodone() function for buffers which have been
  * logged.  It is called when they are eventually flushed out.

OTOH all the searching I had to do to figure out what was really going
on here persuaded me to go look at what this looked like at the end, so
I guess I'm not that worried.

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D


>  	struct xfs_buf		*bp)
>  {
>  	xfs_buf_run_callbacks(bp);
> diff --git a/fs/xfs/xfs_buf_item.h b/fs/xfs/xfs_buf_item.h
> index 27d13d29b5bbb..96f994ec90915 100644
> --- a/fs/xfs/xfs_buf_item.h
> +++ b/fs/xfs/xfs_buf_item.h
> @@ -57,10 +57,10 @@ bool	xfs_buf_item_dirty_format(struct xfs_buf_log_item *);
>  void	xfs_buf_attach_iodone(struct xfs_buf *,
>  			      void(*)(struct xfs_buf *, struct xfs_log_item *),
>  			      struct xfs_log_item *);
> -void	xfs_buf_iodone_callbacks(struct xfs_buf *);
>  void	xfs_buf_iodone(struct xfs_buf *, struct xfs_log_item *);
>  void	xfs_buf_inode_iodone(struct xfs_buf *);
>  void	xfs_buf_dquot_iodone(struct xfs_buf *);
> +void	xfs_buf_dirty_iodone(struct xfs_buf *);
>  bool	xfs_buf_log_check_iovec(struct xfs_log_iovec *iovec);
>  
>  extern kmem_zone_t	*xfs_buf_item_zone;
> diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c
> index 93d62cb864c15..69e0ebe94a915 100644
> --- a/fs/xfs/xfs_trans_buf.c
> +++ b/fs/xfs/xfs_trans_buf.c
> @@ -465,23 +465,16 @@ xfs_trans_dirty_buf(
>  
>  	ASSERT(bp->b_transp == tp);
>  	ASSERT(bip != NULL);
> -	ASSERT(bp->b_iodone == NULL ||
> -	       bp->b_iodone == xfs_buf_iodone_callbacks);
>  
>  	/*
>  	 * Mark the buffer as needing to be written out eventually,
>  	 * and set its iodone function to remove the buffer's buf log
>  	 * item from the AIL and free it when the buffer is flushed
> -	 * to disk.  See xfs_buf_attach_iodone() for more details
> -	 * on li_cb and xfs_buf_iodone_callbacks().
> -	 * If we end up aborting this transaction, we trap this buffer
> -	 * inside the b_bdstrat callback so that this won't get written to
> -	 * disk.
> +	 * to disk.
>  	 */
>  	bp->b_flags |= XBF_DONE;
>  
>  	ASSERT(atomic_read(&bip->bli_refcount) > 0);
> -	bp->b_iodone = xfs_buf_iodone_callbacks;
>  	bip->bli_item.li_cb = xfs_buf_iodone;
>  
>  	/*
> -- 
> 2.26.2.761.g0e0b3e54be
> 

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 07/24] xfs: clean up whacky buffer log item list reinit
  2020-05-22  3:50 ` [PATCH 07/24] xfs: clean up whacky buffer log item list reinit Dave Chinner
@ 2020-05-22 22:01   ` Darrick J. Wong
  2020-05-23  8:50   ` Christoph Hellwig
  1 sibling, 0 replies; 91+ messages in thread
From: Darrick J. Wong @ 2020-05-22 22:01 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, May 22, 2020 at 01:50:12PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When we've emptied the buffer log item list, it does a list_del_init
> on itself to reset it's pointers to itself. This is unnecessary as
> the list is already empty at this point. I'm guessing this was a
> bandaid for a iodone item leak or list corruption at some point in
> the past, and we've carried it ever since. Get rid of it.

It's not even a bandaid, it's merely an unnecessary wart from the
conversion to list_head... that I RVBd.  Sigh.

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D


> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_buf_item.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index c2e7d14e35c66..b7ffb117e141e 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
> @@ -459,7 +459,6 @@ xfs_buf_item_unpin(
>  		if (bip->bli_flags & XFS_BLI_STALE_INODE) {
>  			xfs_buf_do_callbacks(bp);
>  			bp->b_log_item = NULL;
> -			list_del_init(&bp->b_li_list);
>  		} else {
>  			xfs_trans_ail_delete(lip, SHUTDOWN_LOG_IO_ERROR);
>  			xfs_buf_item_relse(bp);
> @@ -1165,7 +1164,6 @@ xfs_buf_run_callbacks(
>  
>  	xfs_buf_do_callbacks(bp);
>  	bp->b_log_item = NULL;
> -	list_del_init(&bp->b_li_list);
>  }
>  
>  /*
> -- 
> 2.26.2.761.g0e0b3e54be
> 

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 08/24] xfs: fold xfs_istale_done into xfs_iflush_done
  2020-05-22  3:50 ` [PATCH 08/24] xfs: fold xfs_istale_done into xfs_iflush_done Dave Chinner
@ 2020-05-22 22:10   ` Darrick J. Wong
  2020-05-23  9:12   ` Christoph Hellwig
  1 sibling, 0 replies; 91+ messages in thread
From: Darrick J. Wong @ 2020-05-22 22:10 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, May 22, 2020 at 01:50:13PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Having different io completion callbacks for different inode states
> makes things complex. We can detect if the inode is stale via the
> XFS_ISTALE flag in IO completion, so we don't need a special
> callback just for this.
> 
> This means inodes only have a single iodone callback, and inode IO
> completion is entirely buffer centric at this point. Hence we no
> longer need to use a log item callback at all as we can just call
> xfs_iflush_done() directly from the buffer completions and walk the
> buffer log item list to complete the all inodes under IO.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Looking through the iflush machinery always makes my eyes hurt, this
cleans things up nicely.

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/xfs/xfs_buf_item.c   | 35 ++++++++++++++++++----
>  fs/xfs/xfs_inode.c      |  6 ++--
>  fs/xfs/xfs_inode_item.c | 65 ++++++++++++++---------------------------
>  fs/xfs/xfs_inode_item.h |  5 ++--
>  4 files changed, 56 insertions(+), 55 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index b7ffb117e141e..e376f778bf57c 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
> @@ -13,6 +13,8 @@
>  #include "xfs_mount.h"
>  #include "xfs_trans.h"
>  #include "xfs_buf_item.h"
> +#include "xfs_inode.h"
> +#include "xfs_inode_item.h"
>  #include "xfs_trans_priv.h"
>  #include "xfs_trace.h"
>  #include "xfs_log.h"
> @@ -457,7 +459,8 @@ xfs_buf_item_unpin(
>  		 * the AIL lock.
>  		 */
>  		if (bip->bli_flags & XFS_BLI_STALE_INODE) {
> -			xfs_buf_do_callbacks(bp);
> +			lip->li_cb(bp, lip);
> +			xfs_iflush_done(bp);
>  			bp->b_log_item = NULL;
>  		} else {
>  			xfs_trans_ail_delete(lip, SHUTDOWN_LOG_IO_ERROR);
> @@ -1141,8 +1144,8 @@ xfs_buf_iodone_callback_error(
>  	return false;
>  }
>  
> -static void
> -xfs_buf_run_callbacks(
> +static inline bool
> +xfs_buf_had_callback_errors(
>  	struct xfs_buf		*bp)
>  {
>  
> @@ -1152,7 +1155,7 @@ xfs_buf_run_callbacks(
>  	 * appropriate action.
>  	 */
>  	if (bp->b_error && xfs_buf_iodone_callback_error(bp))
> -		return;
> +		return true;
>  
>  	/*
>  	 * Successful IO or permanent error. Either way, we can clear the
> @@ -1161,7 +1164,16 @@ xfs_buf_run_callbacks(
>  	bp->b_last_error = 0;
>  	bp->b_retries = 0;
>  	bp->b_first_retry_time = 0;
> +	return false;
> +}
>  
> +static void
> +xfs_buf_run_callbacks(
> +	struct xfs_buf		*bp)
> +{
> +
> +	if (xfs_buf_had_callback_errors(bp))
> +		return;
>  	xfs_buf_do_callbacks(bp);
>  	bp->b_log_item = NULL;
>  }
> @@ -1173,7 +1185,20 @@ void
>  xfs_buf_inode_iodone(
>  	struct xfs_buf		*bp)
>  {
> -	xfs_buf_run_callbacks(bp);
> +	struct xfs_buf_log_item *blip = bp->b_log_item;
> +	struct xfs_log_item	*lip;
> +
> +	if (xfs_buf_had_callback_errors(bp))
> +		return;
> +
> +	/* If there is a buf_log_item attached, run its callback */
> +	if (blip) {
> +		lip = &blip->bli_item;
> +		lip->li_cb(bp, lip);
> +		bp->b_log_item = NULL;
> +	}
> +
> +	xfs_iflush_done(bp);
>  	xfs_buf_ioend_finish(bp);
>  }
>  
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 607c9d9bb2b40..c75d625de7945 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -2658,7 +2658,6 @@ xfs_ifree_cluster(
>  		list_for_each_entry(lip, &bp->b_li_list, li_bio_list) {
>  			if (lip->li_type == XFS_LI_INODE) {
>  				iip = (struct xfs_inode_log_item *)lip;
> -				lip->li_cb = xfs_istale_done;
>  				xfs_trans_ail_copy_lsn(mp->m_ail,
>  							&iip->ili_flush_lsn,
>  							&iip->ili_item.li_lsn);
> @@ -2691,8 +2690,7 @@ xfs_ifree_cluster(
>  			xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,
>  						&iip->ili_item.li_lsn);
>  
> -			xfs_buf_attach_iodone(bp, xfs_istale_done,
> -						  &iip->ili_item);
> +			xfs_buf_attach_iodone(bp, NULL, &iip->ili_item);
>  
>  			if (ip != free_ip)
>  				xfs_iunlock(ip, XFS_ILOCK_EXCL);
> @@ -3842,7 +3840,7 @@ xfs_iflush_int(
>  	 * the flush lock.
>  	 */
>  	bp->b_flags |= _XBF_INODES;
> -	xfs_buf_attach_iodone(bp, xfs_iflush_done, &iip->ili_item);
> +	xfs_buf_attach_iodone(bp, NULL, &iip->ili_item);
>  
>  	/* generate the checksum. */
>  	xfs_dinode_calc_crc(mp, dip);
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index 6ef9cbcfc94a7..7049f2ae8d186 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -668,40 +668,34 @@ xfs_inode_item_destroy(
>   */
>  void
>  xfs_iflush_done(
> -	struct xfs_buf		*bp,
> -	struct xfs_log_item	*lip)
> +	struct xfs_buf		*bp)
>  {
>  	struct xfs_inode_log_item *iip;
> -	struct xfs_log_item	*blip, *n;
> -	struct xfs_ail		*ailp = lip->li_ailp;
> +	struct xfs_log_item	*lip, *n;
> +	struct xfs_ail		*ailp = bp->b_mount->m_ail;
>  	int			need_ail = 0;
>  	LIST_HEAD(tmp);
>  
>  	/*
> -	 * Scan the buffer IO completions for other inodes being completed and
> -	 * attach them to the current inode log item.
> +	 * Pull the attached inodes from the buffer one at a time and take the
> +	 * appropriate action on them.
>  	 */
> -
> -	list_add_tail(&lip->li_bio_list, &tmp);
> -
> -	list_for_each_entry_safe(blip, n, &bp->b_li_list, li_bio_list) {
> -		if (lip->li_cb != xfs_iflush_done)
> +	list_for_each_entry_safe(lip, n, &bp->b_li_list, li_bio_list) {
> +		iip = INODE_ITEM(lip);
> +		if (xfs_iflags_test(iip->ili_inode, XFS_ISTALE)) {
> +			list_del_init(&lip->li_bio_list);
> +			xfs_iflush_abort(iip->ili_inode);
>  			continue;
> +		}
>  
> -		list_move_tail(&blip->li_bio_list, &tmp);
> +		list_move_tail(&lip->li_bio_list, &tmp);
>  
>  		/* Do an unlocked check for needing the AIL lock. */
> -		iip = INODE_ITEM(blip);
> -		if (blip->li_lsn == iip->ili_flush_lsn ||
> -		    test_bit(XFS_LI_FAILED, &blip->li_flags))
> +		if (lip->li_lsn == iip->ili_flush_lsn ||
> +		    test_bit(XFS_LI_FAILED, &lip->li_flags))
>  			need_ail++;
>  	}
> -
> -	/* make sure we capture the state of the initial inode. */
> -	iip = INODE_ITEM(lip);
> -	if (lip->li_lsn == iip->ili_flush_lsn ||
> -	    test_bit(XFS_LI_FAILED, &lip->li_flags))
> -		need_ail++;
> +	ASSERT(list_empty(&bp->b_li_list));
>  
>  	/*
>  	 * We only want to pull the item from the AIL if it is actually there
> @@ -713,19 +707,13 @@ xfs_iflush_done(
>  
>  		/* this is an opencoded batch version of xfs_trans_ail_delete */
>  		spin_lock(&ailp->ail_lock);
> -		list_for_each_entry(blip, &tmp, li_bio_list) {
> -			if (blip->li_lsn == INODE_ITEM(blip)->ili_flush_lsn) {
> -				/*
> -				 * xfs_ail_update_finish() only cares about the
> -				 * lsn of the first tail item removed, any
> -				 * others will be at the same or higher lsn so
> -				 * we just ignore them.
> -				 */
> -				xfs_lsn_t lsn = xfs_ail_delete_one(ailp, blip);
> +		list_for_each_entry(lip, &tmp, li_bio_list) {
> +			if (lip->li_lsn == INODE_ITEM(lip)->ili_flush_lsn) {
> +				xfs_lsn_t lsn = xfs_ail_delete_one(ailp, lip);
>  				if (!tail_lsn && lsn)
>  					tail_lsn = lsn;
>  			} else {
> -				xfs_clear_li_failed(blip);
> +				xfs_clear_li_failed(lip);
>  			}
>  		}
>  		xfs_ail_update_finish(ailp, tail_lsn);
> @@ -736,9 +724,9 @@ xfs_iflush_done(
>  	 * ili_last_fields bits now that we know that the data corresponding to
>  	 * them is safely on disk.
>  	 */
> -	list_for_each_entry_safe(blip, n, &tmp, li_bio_list) {
> -		list_del_init(&blip->li_bio_list);
> -		iip = INODE_ITEM(blip);
> +	list_for_each_entry_safe(lip, n, &tmp, li_bio_list) {
> +		list_del_init(&lip->li_bio_list);
> +		iip = INODE_ITEM(lip);
>  
>  		spin_lock(&iip->ili_lock);
>  		iip->ili_last_fields = 0;
> @@ -746,7 +734,6 @@ xfs_iflush_done(
>  
>  		xfs_ifunlock(iip->ili_inode);
>  	}
> -	list_del(&tmp);
>  }
>  
>  /*
> @@ -779,14 +766,6 @@ xfs_iflush_abort(
>  	xfs_ifunlock(ip);
>  }
>  
> -void
> -xfs_istale_done(
> -	struct xfs_buf		*bp,
> -	struct xfs_log_item	*lip)
> -{
> -	xfs_iflush_abort(INODE_ITEM(lip)->ili_inode);
> -}
> -
>  /*
>   * convert an xfs_inode_log_format struct from the old 32 bit version
>   * (which can have different field alignments) to the native 64 bit version
> diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h
> index 1234e8cd3726d..ef427cd03ffc3 100644
> --- a/fs/xfs/xfs_inode_item.h
> +++ b/fs/xfs/xfs_inode_item.h
> @@ -25,15 +25,14 @@ struct xfs_inode_log_item {
>  	unsigned int		ili_fsync_fields;  /* logged since last fsync */
>  };
>  
> -static inline int xfs_inode_clean(xfs_inode_t *ip)
> +static inline int xfs_inode_clean(struct xfs_inode *ip)
>  {
>  	return !ip->i_itemp || !(ip->i_itemp->ili_fields & XFS_ILOG_ALL);
>  }
>  
>  extern void xfs_inode_item_init(struct xfs_inode *, struct xfs_mount *);
>  extern void xfs_inode_item_destroy(struct xfs_inode *);
> -extern void xfs_iflush_done(struct xfs_buf *, struct xfs_log_item *);
> -extern void xfs_istale_done(struct xfs_buf *, struct xfs_log_item *);
> +extern void xfs_iflush_done(struct xfs_buf *);
>  extern void xfs_iflush_abort(struct xfs_inode *);
>  extern int xfs_inode_item_format_convert(xfs_log_iovec_t *,
>  					 struct xfs_inode_log_format *);
> -- 
> 2.26.2.761.g0e0b3e54be
> 

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 09/24] xfs: use direct calls for dquot IO completion
  2020-05-22  3:50 ` [PATCH 09/24] xfs: use direct calls for dquot IO completion Dave Chinner
@ 2020-05-22 22:13   ` Darrick J. Wong
  2020-05-23  9:16   ` Christoph Hellwig
  1 sibling, 0 replies; 91+ messages in thread
From: Darrick J. Wong @ 2020-05-22 22:13 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, May 22, 2020 at 01:50:14PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Similar to inodes, we can call the dquot IO completion functions
> directly from the buffer completion code, removing another user of
> log item callbacks for IO completion processing.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Looks pretty straightforward.  I bet li_cb goes away soon?

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/xfs/xfs_buf_item.c | 18 +++++++++++++++++-
>  fs/xfs/xfs_dquot.c    | 27 +++++++++++++++++++++++----
>  fs/xfs/xfs_dquot.h    |  1 +
>  3 files changed, 41 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index e376f778bf57c..57e5d37f6852e 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
> @@ -15,6 +15,9 @@
>  #include "xfs_buf_item.h"
>  #include "xfs_inode.h"
>  #include "xfs_inode_item.h"
> +#include "xfs_quota.h"
> +#include "xfs_dquot_item.h"
> +#include "xfs_dquot.h"
>  #include "xfs_trans_priv.h"
>  #include "xfs_trace.h"
>  #include "xfs_log.h"
> @@ -1209,7 +1212,20 @@ void
>  xfs_buf_dquot_iodone(
>  	struct xfs_buf		*bp)
>  {
> -	xfs_buf_run_callbacks(bp);
> +	struct xfs_buf_log_item *blip = bp->b_log_item;
> +	struct xfs_log_item	*lip;
> +
> +	if (xfs_buf_had_callback_errors(bp))
> +		return;
> +
> +	/* a newly allocated dquot buffer might have a log item attached */
> +	if (blip) {
> +		lip = &blip->bli_item;
> +		lip->li_cb(bp, lip);
> +		bp->b_log_item = NULL;
> +	}
> +
> +	xfs_dquot_done(bp);
>  	xfs_buf_ioend_finish(bp);
>  }
>  
> diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
> index 25592b701db40..1d7f34a9bc989 100644
> --- a/fs/xfs/xfs_dquot.c
> +++ b/fs/xfs/xfs_dquot.c
> @@ -1043,9 +1043,8 @@ xfs_qm_dqrele(
>   * from the AIL if it has not been re-logged, and unlocking the dquot's
>   * flush lock. This behavior is very similar to that of inodes..
>   */
> -STATIC void
> +static void
>  xfs_qm_dqflush_done(
> -	struct xfs_buf		*bp,
>  	struct xfs_log_item	*lip)
>  {
>  	struct xfs_dq_logitem	*qip = (struct xfs_dq_logitem *)lip;
> @@ -1086,6 +1085,27 @@ xfs_qm_dqflush_done(
>  	xfs_dqfunlock(dqp);
>  }
>  
> +void
> +xfs_dquot_done(
> +	struct xfs_buf		*bp)
> +{
> +	struct xfs_log_item	*lip;
> +
> +	while (!list_empty(&bp->b_li_list)) {
> +		lip = list_first_entry(&bp->b_li_list, struct xfs_log_item,
> +				       li_bio_list);
> +
> +		/*
> +		 * Remove the item from the list, so we don't have any
> +		 * confusion if the item is added to another buf.
> +		 * Don't touch the log item after calling its
> +		 * callback, because it could have freed itself.
> +		 */
> +		list_del_init(&lip->li_bio_list);
> +		xfs_qm_dqflush_done(lip);
> +	}
> +}
> +
>  /*
>   * Write a modified dquot to disk.
>   * The dquot must be locked and the flush lock too taken by caller.
> @@ -1175,8 +1195,7 @@ xfs_qm_dqflush(
>  	 * AIL and release the flush lock once the dquot is synced to disk.
>  	 */
>  	bp->b_flags |= _XBF_DQUOTS;
> -	xfs_buf_attach_iodone(bp, xfs_qm_dqflush_done,
> -				  &dqp->q_logitem.qli_item);
> +	xfs_buf_attach_iodone(bp, NULL, &dqp->q_logitem.qli_item);
>  
>  	/*
>  	 * If the buffer is pinned then push on the log so we won't
> diff --git a/fs/xfs/xfs_dquot.h b/fs/xfs/xfs_dquot.h
> index fe3e46df604b4..f1924288986d3 100644
> --- a/fs/xfs/xfs_dquot.h
> +++ b/fs/xfs/xfs_dquot.h
> @@ -174,6 +174,7 @@ void		xfs_qm_dqput(struct xfs_dquot *dqp);
>  void		xfs_dqlock2(struct xfs_dquot *, struct xfs_dquot *);
>  
>  void		xfs_dquot_set_prealloc_limits(struct xfs_dquot *);
> +void		xfs_dquot_done(struct xfs_buf *);
>  
>  static inline struct xfs_dquot *xfs_qm_dqhold(struct xfs_dquot *dqp)
>  {
> -- 
> 2.26.2.761.g0e0b3e54be
> 

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 10/24] xfs: clean up the buffer iodone callback functions
  2020-05-22  3:50 ` [PATCH 10/24] xfs: clean up the buffer iodone callback functions Dave Chinner
@ 2020-05-22 22:26   ` Darrick J. Wong
  2020-05-25  0:37     ` Dave Chinner
  2020-05-23  9:19   ` Christoph Hellwig
  1 sibling, 1 reply; 91+ messages in thread
From: Darrick J. Wong @ 2020-05-22 22:26 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, May 22, 2020 at 01:50:15PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Now that we've sorted inode and dquot buffers, we can apply the same
> cleanups to dirty buffers with buffer log items. They only have one
> callback, too, so we don't need the log item callback. Collapse the
> iodone functions and remove all the now unnecessary infrastructure
> around callback processing.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_buf_item.c  | 139 +++++++++--------------------------------
>  fs/xfs/xfs_buf_item.h  |   1 -
>  fs/xfs/xfs_trans_buf.c |   2 -
>  3 files changed, 29 insertions(+), 113 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index 57e5d37f6852e..d44b3e3f46613 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
> @@ -30,7 +30,7 @@ static inline struct xfs_buf_log_item *BUF_ITEM(struct xfs_log_item *lip)
>  	return container_of(lip, struct xfs_buf_log_item, bli_item);
>  }
>  
> -STATIC void	xfs_buf_do_callbacks(struct xfs_buf *bp);
> +static void xfs_buf_item_done(struct xfs_buf *bp);
>  
>  /* Is this log iovec plausibly large enough to contain the buffer log format? */
>  bool
> @@ -462,9 +462,8 @@ xfs_buf_item_unpin(
>  		 * the AIL lock.
>  		 */
>  		if (bip->bli_flags & XFS_BLI_STALE_INODE) {
> -			lip->li_cb(bp, lip);
> +			xfs_buf_item_done(bp);
>  			xfs_iflush_done(bp);
> -			bp->b_log_item = NULL;
>  		} else {
>  			xfs_trans_ail_delete(lip, SHUTDOWN_LOG_IO_ERROR);
>  			xfs_buf_item_relse(bp);
> @@ -973,46 +972,6 @@ xfs_buf_attach_iodone(
>  	list_add_tail(&lip->li_bio_list, &bp->b_li_list);
>  }
>  
> -/*
> - * We can have many callbacks on a buffer. Running the callbacks individually
> - * can cause a lot of contention on the AIL lock, so we allow for a single
> - * callback to be able to scan the remaining items in bp->b_li_list for other
> - * items of the same type and callback to be processed in the first call.
> - *
> - * As a result, the loop walking the callback list below will also modify the
> - * list. it removes the first item from the list and then runs the callback.
> - * The loop then restarts from the new first item int the list. This allows the
> - * callback to scan and modify the list attached to the buffer and we don't
> - * have to care about maintaining a next item pointer.
> - */
> -STATIC void
> -xfs_buf_do_callbacks(
> -	struct xfs_buf		*bp)
> -{
> -	struct xfs_buf_log_item *blip = bp->b_log_item;
> -	struct xfs_log_item	*lip;
> -
> -	/* If there is a buf_log_item attached, run its callback */
> -	if (blip) {
> -		lip = &blip->bli_item;
> -		lip->li_cb(bp, lip);
> -	}
> -
> -	while (!list_empty(&bp->b_li_list)) {
> -		lip = list_first_entry(&bp->b_li_list, struct xfs_log_item,
> -				       li_bio_list);
> -
> -		/*
> -		 * Remove the item from the list, so we don't have any
> -		 * confusion if the item is added to another buf.
> -		 * Don't touch the log item after calling its
> -		 * callback, because it could have freed itself.
> -		 */
> -		list_del_init(&lip->li_bio_list);
> -		lip->li_cb(bp, lip);
> -	}
> -}
> -
>  /*
>   * Invoke the error state callback for each log item affected by the failed I/O.
>   *
> @@ -1025,8 +984,8 @@ STATIC void
>  xfs_buf_do_callbacks_fail(
>  	struct xfs_buf		*bp)
>  {
> +	struct xfs_ail		*ailp = bp->b_mount->m_ail;
>  	struct xfs_log_item	*lip;
> -	struct xfs_ail		*ailp;
>  
>  	/*
>  	 * Buffer log item errors are handled directly by xfs_buf_item_push()
> @@ -1036,9 +995,6 @@ xfs_buf_do_callbacks_fail(
>  	if (list_empty(&bp->b_li_list))
>  		return;
>  
> -	lip = list_first_entry(&bp->b_li_list, struct xfs_log_item,
> -			li_bio_list);
> -	ailp = lip->li_ailp;
>  	spin_lock(&ailp->ail_lock);
>  	list_for_each_entry(lip, &bp->b_li_list, li_bio_list) {
>  		if (lip->li_ops->iop_error)
> @@ -1051,22 +1007,11 @@ static bool
>  xfs_buf_iodone_callback_error(
>  	struct xfs_buf		*bp)
>  {
> -	struct xfs_buf_log_item	*bip = bp->b_log_item;
> -	struct xfs_log_item	*lip;
> -	struct xfs_mount	*mp;
> +	struct xfs_mount	*mp = bp->b_mount;
>  	static ulong		lasttime;
>  	static xfs_buftarg_t	*lasttarg;
>  	struct xfs_error_cfg	*cfg;
>  
> -	/*
> -	 * The failed buffer might not have a buf_log_item attached or the
> -	 * log_item list might be empty. Get the mp from the available
> -	 * xfs_log_item
> -	 */
> -	lip = list_first_entry_or_null(&bp->b_li_list, struct xfs_log_item,
> -				       li_bio_list);
> -	mp = lip ? lip->li_mountp : bip->bli_item.li_mountp;
> -
>  	/*
>  	 * If we've already decided to shutdown the filesystem because of
>  	 * I/O errors, there's no point in giving this a retry.
> @@ -1171,14 +1116,27 @@ xfs_buf_had_callback_errors(
>  }
>  
>  static void
> -xfs_buf_run_callbacks(
> +xfs_buf_item_done(
>  	struct xfs_buf		*bp)
>  {
> +	struct xfs_buf_log_item	*bip = bp->b_log_item;
>  
> -	if (xfs_buf_had_callback_errors(bp))
> +	if (!bip)
>  		return;
> -	xfs_buf_do_callbacks(bp);
> +
> +	/*
> +	 * If we are forcibly shutting down, this may well be off the AIL
> +	 * already. That's because we simulate the log-committed callbacks to
> +	 * unpin these buffers. Or we may never have put this item on AIL
> +	 * because of the transaction was aborted forcibly.
> +	 * xfs_trans_ail_delete() takes care of these.
> +	 *
> +	 * Either way, AIL is useless if we're forcing a shutdown.
> +	 */
> +	xfs_trans_ail_delete(&bip->bli_item, SHUTDOWN_CORRUPT_INCORE);
>  	bp->b_log_item = NULL;
> +	xfs_buf_item_free(bip);
> +	xfs_buf_rele(bp);
>  }
>  
>  /*
> @@ -1188,19 +1146,10 @@ void
>  xfs_buf_inode_iodone(
>  	struct xfs_buf		*bp)
>  {
> -	struct xfs_buf_log_item *blip = bp->b_log_item;
> -	struct xfs_log_item	*lip;
> -
>  	if (xfs_buf_had_callback_errors(bp))
>  		return;
>  
> -	/* If there is a buf_log_item attached, run its callback */
> -	if (blip) {
> -		lip = &blip->bli_item;
> -		lip->li_cb(bp, lip);
> -		bp->b_log_item = NULL;
> -	}
> -
> +	xfs_buf_item_done(bp);
>  	xfs_iflush_done(bp);

Just out of curiosity, we still have a reference to bp here even if
xfs_buf_item_done calls xfs_buf_rele, right?  I think the answer is that yes
we do still have the reference because the inodes themselves hold references
to the cluster buffer, right?

If so,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

>  	xfs_buf_ioend_finish(bp);
>  }
> @@ -1212,59 +1161,29 @@ void
>  xfs_buf_dquot_iodone(
>  	struct xfs_buf		*bp)
>  {
> -	struct xfs_buf_log_item *blip = bp->b_log_item;
> -	struct xfs_log_item	*lip;
> -
>  	if (xfs_buf_had_callback_errors(bp))
>  		return;
>  
>  	/* a newly allocated dquot buffer might have a log item attached */
> -	if (blip) {
> -		lip = &blip->bli_item;
> -		lip->li_cb(bp, lip);
> -		bp->b_log_item = NULL;
> -	}
> -
> +	xfs_buf_item_done(bp);
>  	xfs_dquot_done(bp);
>  	xfs_buf_ioend_finish(bp);
>  }
>  
>  /*
>   * Dirty buffer iodone callback function.
> + *
> + * Note that for things like remote attribute buffers, there may not be a buffer
> + * log item here, so processing the buffer log item must remain be optional.
>   */
>  void
>  xfs_buf_dirty_iodone(
>  	struct xfs_buf		*bp)
>  {
> -	xfs_buf_run_callbacks(bp);
> +	if (xfs_buf_had_callback_errors(bp))
> +		return;
> +
> +	xfs_buf_item_done(bp);
>  	xfs_buf_ioend_finish(bp);
>  }
>  
> -/*
> - * This is the iodone() function for buffers which have been
> - * logged.  It is called when they are eventually flushed out.
> - * It should remove the buf item from the AIL, and free the buf item.
> - * It is called by xfs_buf_iodone_callbacks() above which will take
> - * care of cleaning up the buffer itself.
> - */
> -void
> -xfs_buf_iodone(
> -	struct xfs_buf		*bp,
> -	struct xfs_log_item	*lip)
> -{
> -	ASSERT(BUF_ITEM(lip)->bli_buf == bp);
> -
> -	xfs_buf_rele(bp);
> -
> -	/*
> -	 * If we are forcibly shutting down, this may well be off the AIL
> -	 * already. That's because we simulate the log-committed callbacks to
> -	 * unpin these buffers. Or we may never have put this item on AIL
> -	 * because of the transaction was aborted forcibly.
> -	 * xfs_trans_ail_delete() takes care of these.
> -	 *
> -	 * Either way, AIL is useless if we're forcing a shutdown.
> -	 */
> -	xfs_trans_ail_delete(lip, SHUTDOWN_CORRUPT_INCORE);
> -	xfs_buf_item_free(BUF_ITEM(lip));
> -}
> diff --git a/fs/xfs/xfs_buf_item.h b/fs/xfs/xfs_buf_item.h
> index 96f994ec90915..3f436efb0b67a 100644
> --- a/fs/xfs/xfs_buf_item.h
> +++ b/fs/xfs/xfs_buf_item.h
> @@ -57,7 +57,6 @@ bool	xfs_buf_item_dirty_format(struct xfs_buf_log_item *);
>  void	xfs_buf_attach_iodone(struct xfs_buf *,
>  			      void(*)(struct xfs_buf *, struct xfs_log_item *),
>  			      struct xfs_log_item *);
> -void	xfs_buf_iodone(struct xfs_buf *, struct xfs_log_item *);
>  void	xfs_buf_inode_iodone(struct xfs_buf *);
>  void	xfs_buf_dquot_iodone(struct xfs_buf *);
>  void	xfs_buf_dirty_iodone(struct xfs_buf *);
> diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c
> index 69e0ebe94a915..11cd666cd99a6 100644
> --- a/fs/xfs/xfs_trans_buf.c
> +++ b/fs/xfs/xfs_trans_buf.c
> @@ -475,7 +475,6 @@ xfs_trans_dirty_buf(
>  	bp->b_flags |= XBF_DONE;
>  
>  	ASSERT(atomic_read(&bip->bli_refcount) > 0);
> -	bip->bli_item.li_cb = xfs_buf_iodone;
>  
>  	/*
>  	 * If we invalidated the buffer within this transaction, then
> @@ -644,7 +643,6 @@ xfs_trans_stale_inode_buf(
>  	ASSERT(atomic_read(&bip->bli_refcount) > 0);
>  
>  	bip->bli_flags |= XFS_BLI_STALE_INODE;
> -	bip->bli_item.li_cb = xfs_buf_iodone;
>  	bp->b_flags |= _XBF_INODES;
>  	xfs_trans_buf_set_type(tp, bp, XFS_BLFT_DINO_BUF);
>  }
> -- 
> 2.26.2.761.g0e0b3e54be
> 

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 11/24] xfs: get rid of log item callbacks
  2020-05-22  3:50 ` [PATCH 11/24] xfs: get rid of log item callbacks Dave Chinner
@ 2020-05-22 22:27   ` Darrick J. Wong
  2020-05-23  9:19   ` Christoph Hellwig
  1 sibling, 0 replies; 91+ messages in thread
From: Darrick J. Wong @ 2020-05-22 22:27 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, May 22, 2020 at 01:50:16PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> They are not used anymore, so remove them from the log item and the
> buffer iodone attachment interfaces.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_buf_item.c | 17 -----------------
>  fs/xfs/xfs_buf_item.h |  3 ---
>  fs/xfs/xfs_dquot.c    |  6 +++---
>  fs/xfs/xfs_inode.c    |  5 +++--
>  fs/xfs/xfs_trans.h    |  3 ---
>  5 files changed, 6 insertions(+), 28 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index d44b3e3f46613..d855a2b7486c5 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
> @@ -955,23 +955,6 @@ xfs_buf_item_relse(
>  	xfs_buf_item_free(bip);
>  }
>  
> -
> -/*
> - * Add the given log item with its callback to the list of callbacks
> - * to be called when the buffer's I/O completes.
> - */
> -void
> -xfs_buf_attach_iodone(
> -	struct xfs_buf		*bp,
> -	void			(*cb)(struct xfs_buf *, struct xfs_log_item *),
> -	struct xfs_log_item	*lip)
> -{
> -	ASSERT(xfs_buf_islocked(bp));
> -
> -	lip->li_cb = cb;
> -	list_add_tail(&lip->li_bio_list, &bp->b_li_list);
> -}
> -
>  /*
>   * Invoke the error state callback for each log item affected by the failed I/O.
>   *
> diff --git a/fs/xfs/xfs_buf_item.h b/fs/xfs/xfs_buf_item.h
> index 3f436efb0b67a..ecfad1915a86b 100644
> --- a/fs/xfs/xfs_buf_item.h
> +++ b/fs/xfs/xfs_buf_item.h
> @@ -54,9 +54,6 @@ void	xfs_buf_item_relse(struct xfs_buf *);
>  bool	xfs_buf_item_put(struct xfs_buf_log_item *);
>  void	xfs_buf_item_log(struct xfs_buf_log_item *, uint, uint);
>  bool	xfs_buf_item_dirty_format(struct xfs_buf_log_item *);
> -void	xfs_buf_attach_iodone(struct xfs_buf *,
> -			      void(*)(struct xfs_buf *, struct xfs_log_item *),
> -			      struct xfs_log_item *);
>  void	xfs_buf_inode_iodone(struct xfs_buf *);
>  void	xfs_buf_dquot_iodone(struct xfs_buf *);
>  void	xfs_buf_dirty_iodone(struct xfs_buf *);
> diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
> index 1d7f34a9bc989..34fc1bcb1eefd 100644
> --- a/fs/xfs/xfs_dquot.c
> +++ b/fs/xfs/xfs_dquot.c
> @@ -1191,11 +1191,11 @@ xfs_qm_dqflush(
>  	}
>  
>  	/*
> -	 * Attach an iodone routine so that we can remove this dquot from the
> -	 * AIL and release the flush lock once the dquot is synced to disk.
> +	 * Attach the dquot to the buffer so that we can remove this dquot from
> +	 * the AIL and release the flush lock once the dquot is synced to disk.
>  	 */
>  	bp->b_flags |= _XBF_DQUOTS;
> -	xfs_buf_attach_iodone(bp, NULL, &dqp->q_logitem.qli_item);
> +	list_add_tail(&dqp->q_logitem.qli_item.li_bio_list, &bp->b_li_list);
>  
>  	/*
>  	 * If the buffer is pinned then push on the log so we won't
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index c75d625de7945..c5529853f513c 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -2690,7 +2690,8 @@ xfs_ifree_cluster(
>  			xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,
>  						&iip->ili_item.li_lsn);
>  
> -			xfs_buf_attach_iodone(bp, NULL, &iip->ili_item);
> +			list_add_tail(&iip->ili_item.li_bio_list,
> +						&bp->b_li_list);
>  
>  			if (ip != free_ip)
>  				xfs_iunlock(ip, XFS_ILOCK_EXCL);
> @@ -3840,7 +3841,7 @@ xfs_iflush_int(
>  	 * the flush lock.
>  	 */
>  	bp->b_flags |= _XBF_INODES;
> -	xfs_buf_attach_iodone(bp, NULL, &iip->ili_item);
> +	list_add_tail(&iip->ili_item.li_bio_list, &bp->b_li_list);
>  
>  	/* generate the checksum. */
>  	xfs_dinode_calc_crc(mp, dip);
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index 8308bf6d7e404..d27429e3a82a6 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -37,9 +37,6 @@ struct xfs_log_item {
>  	unsigned long			li_flags;	/* misc flags */
>  	struct xfs_buf			*li_buf;	/* real buffer pointer */
>  	struct list_head		li_bio_list;	/* buffer item list */
> -	void				(*li_cb)(struct xfs_buf *,
> -						 struct xfs_log_item *);
> -							/* buffer item iodone */
>  							/* callback func */

You could get rid of this comment too.

With that fixed,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

>  	const struct xfs_item_ops	*li_ops;	/* function list */
>  
> -- 
> 2.26.2.761.g0e0b3e54be
> 

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 12/24] xfs: pin inode backing buffer to the inode log item
  2020-05-22  3:50 ` [PATCH 12/24] xfs: pin inode backing buffer to the inode log item Dave Chinner
@ 2020-05-22 22:39   ` Darrick J. Wong
  2020-05-23  9:34   ` Christoph Hellwig
  1 sibling, 0 replies; 91+ messages in thread
From: Darrick J. Wong @ 2020-05-22 22:39 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, May 22, 2020 at 01:50:17PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When we dirty an inode, we are going to have to write it disk at
> some point in the near future. This requires the inode cluster
> backing buffer to be present in memory. Unfortunately, under severe
> memory pressure we can reclaim the inode backing buffer while the
> inode is dirty in memory, resulting in stalling the AIL pushing
> because it has to do a read-modify-write cycle on the cluster
> buffer.

RMW, boooo....

> When we have no memory available, the read of the cluster buffer
> blocks the AIL pushing process, and this causes all sorts of issues
> for memory reclaim as it requires inode writeback to make forwards
> progress. Allocating a cluster buffer causes more memory pressure,
> and results in more cluster buffers to be reclaimed, resulting in
> more RMW cycles to be done in the AIL context and everything then
> backs up on AIL progress. Only the synchronous inode cluster
> writeback in the the inode reclaim code provides some level of
> forwards progress guarantees that prevent OOM-killer rampages in
> this situation.
> 
> Fix this by pinning the inode backing buffer to the inode log item
> when the inode is first dirtied (i.e. in xfs_trans_log_inode()).

I'm guessing this is where the "dquots should be converted to use
a similar cluster flushing strategy" in the cover letter applies?

> This may mean the first modification of an inode that has been held
> in cache for a long time may block on a cluster buffer read, but
> we can do that in transaction context and block safely until the
> buffer has been allocated and read.
> 
> Once we have the cluster buffer, the inode log item takes a
> reference to it, pinning it in memory, and attaches it to the log
> item for future reference. This means we can always grab the cluster
> buffer from the inode log item when we need it.
> 
> When the inode is finally cleaned and removed from the AIL, we can
> drop the reference the inode log item holds on the cluster buffer.
> Once all inodes on the cluster buffer are clean, the cluster buffer
> will be unpinned and it will be available for memory reclaim to
> reclaim again.
> 
> This avoids the issues with needing to do RMW cycles in the AIL
> pushing context, and hence allows complete non-blocking inode
> flushing to be performed by the AIL pushing context.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_inode_buf.c   |  3 +-
>  fs/xfs/libxfs/xfs_trans_inode.c | 71 +++++++++++++++++++++++++--------
>  fs/xfs/xfs_inode_item.c         | 63 ++++++++++++++++++++++++-----
>  fs/xfs/xfs_trans_priv.h         | 12 +-----
>  4 files changed, 112 insertions(+), 37 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> index 6f84ea85fdd83..1af97235785c8 100644
> --- a/fs/xfs/libxfs/xfs_inode_buf.c
> +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> @@ -176,7 +176,8 @@ xfs_imap_to_bp(
>  	}
>  
>  	*bpp = bp;
> -	*dipp = xfs_buf_offset(bp, imap->im_boffset);
> +	if (dipp)
> +		*dipp = xfs_buf_offset(bp, imap->im_boffset);
>  	return 0;
>  }
>  
> diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
> index 510b996008221..e130eb2994156 100644
> --- a/fs/xfs/libxfs/xfs_trans_inode.c
> +++ b/fs/xfs/libxfs/xfs_trans_inode.c
> @@ -8,6 +8,8 @@
>  #include "xfs_shared.h"
>  #include "xfs_format.h"
>  #include "xfs_log_format.h"
> +#include "xfs_trans_resv.h"
> +#include "xfs_mount.h"
>  #include "xfs_inode.h"
>  #include "xfs_trans.h"
>  #include "xfs_trans_priv.h"
> @@ -71,13 +73,19 @@ xfs_trans_ichgtime(
>  }
>  
>  /*
> - * This is called to mark the fields indicated in fieldmask as needing
> - * to be logged when the transaction is committed.  The inode must
> - * already be associated with the given transaction.
> + * This is called to mark the fields indicated in fieldmask as needing to be
> + * logged when the transaction is committed.  The inode must already be
> + * associated with the given transaction.
>   *
> - * The values for fieldmask are defined in xfs_inode_item.h.  We always
> - * log all of the core inode if any of it has changed, and we always log
> - * all of the inline data/extents/b-tree root if any of them has changed.
> + * The values for fieldmask are defined in xfs_inode_item.h.  We always log all
> + * of the core inode if any of it has changed, and we always log all of the
> + * inline data/extents/b-tree root if any of them has changed.
> + *
> + * Grab and pin the cluster buffer associated with this inode to avoid RMW
> + * cycles at inode writeback time. Avoid the need to add error handling to every
> + * xfs_trans_log_inode() call by shutting down on read error.  This will cause
> + * transactions to fail and everything to error out, just like if we return a
> + * read error in a dirty transaction and cancel it.
>   */
>  void
>  xfs_trans_log_inode(
> @@ -122,21 +130,52 @@ xfs_trans_log_inode(
>  	}
>  
>  	/*
> -	 * Record the specific change for fdatasync optimisation. This
> -	 * allows fdatasync to skip log forces for inodes that are only
> -	 * timestamp dirty. We do this before the change count so that
> -	 * the core being logged in this case does not impact on fdatasync
> -	 * behaviour.
> +	 * Record the specific change for fdatasync optimisation. This allows
> +	 * fdatasync to skip log forces for inodes that are only timestamp
> +	 * dirty. We do this before the change count so that the core being
> +	 * logged in this case does not impact on fdatasync behaviour.
>  	 */
>  	spin_lock(&iip->ili_lock);
>  	iip->ili_fsync_fields |= flags;
>  
> +	if (!iip->ili_item.li_buf) {
> +		struct xfs_buf	*bp;
> +		int		error;
> +
> +		/*
> +		 * We hold the ILOCK here, so this inode is not going to be
> +		 * flushed while we are here. Further, because there is no
> +		 * buffer attached to the item, we know that there is no IO in
> +		 * progress, so nothing will clear the ili_fields while we read
> +		 * in the buffer. Hence we can safely drop the spin lock and
> +		 * read the buffer knowing that the state will not change from
> +		 * here.
> +		 */
> +		spin_unlock(&iip->ili_lock);
> +		error = xfs_imap_to_bp(ip->i_mount, tp, &ip->i_imap, NULL,
> +					&bp, 0);
> +		if (error) {
> +			xfs_force_shutdown(ip->i_mount, SHUTDOWN_META_IO_ERROR);
> +			return;
> +		}
> +
> +		/*
> +		 * We need an explicit buffer reference for the log item, We
> +		 * don't want the buffer attached to the transaction, so we have
> +		 * to release the transaction reference we just gained.

"We need an explicit buffer reference for the log item but don't want
the buffer to remain attached to the transaction.  Hold the buffer but
release the transaction reference." ?

> +		 */
> +		xfs_buf_hold(bp);
> +		xfs_trans_brelse(tp, bp);
> +
> +		spin_lock(&iip->ili_lock);
> +		iip->ili_item.li_buf = bp;
> +	}
> +
>  	/*
> -	 * Always OR in the bits from the ili_last_fields field.
> -	 * This is to coordinate with the xfs_iflush() and xfs_iflush_done()
> -	 * routines in the eventual clearing of the ili_fields bits.
> -	 * See the big comment in xfs_iflush() for an explanation of
> -	 * this coordination mechanism.
> +	 * Always OR in the bits from the ili_last_fields field.  This is to
> +	 * coordinate with the xfs_iflush() and xfs_iflush_done() routines in
> +	 * the eventual clearing of the ili_fields bits.  See the big comment in
> +	 * xfs_iflush() for an explanation of this coordination mechanism.
>  	 */
>  	flags |= iip->ili_last_fields | iversion_flags;
>  	iip->ili_fields |= flags;
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index 7049f2ae8d186..86173a52526fe 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -130,6 +130,8 @@ xfs_inode_item_size(
>  	xfs_inode_item_data_fork_size(iip, nvecs, nbytes);
>  	if (XFS_IFORK_Q(ip))
>  		xfs_inode_item_attr_fork_size(iip, nvecs, nbytes);
> +
> +	ASSERT(iip->ili_item.li_buf);
>  }
>  
>  STATIC void
> @@ -439,6 +441,7 @@ xfs_inode_item_pin(
>  	struct xfs_inode	*ip = INODE_ITEM(lip)->ili_inode;
>  
>  	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
> +	ASSERT(lip->li_buf);
>  
>  	trace_xfs_inode_pin(ip, _RET_IP_);
>  	atomic_inc(&ip->i_pincount);
> @@ -450,6 +453,12 @@ xfs_inode_item_pin(
>   * item which was previously pinned with a call to xfs_inode_item_pin().
>   *
>   * Also wake up anyone in xfs_iunpin_wait() if the count goes to 0.
> + *
> + * Note that unpin can race with inode cluster buffer freeing marking the buffer
> + * stale. In that case, flush completions are run from the buffer unpin call,
> + * which may happen before the inode is unpinned. If we lose the race, there
> + * will be no buffer attached to the log item, but the inode will be marked
> + * XFS_ISTALE.
>   */
>  STATIC void
>  xfs_inode_item_unpin(
> @@ -459,6 +468,7 @@ xfs_inode_item_unpin(
>  	struct xfs_inode	*ip = INODE_ITEM(lip)->ili_inode;
>  
>  	trace_xfs_inode_unpin(ip, _RET_IP_);
> +	ASSERT(lip->li_buf || xfs_iflags_test(ip, XFS_ISTALE));
>  	ASSERT(atomic_read(&ip->i_pincount) > 0);
>  	if (atomic_dec_and_test(&ip->i_pincount))
>  		wake_up_bit(&ip->i_flags, __XFS_IPINNED_BIT);
> @@ -647,10 +657,15 @@ xfs_inode_item_init(
>   */
>  void
>  xfs_inode_item_destroy(
> -	xfs_inode_t	*ip)
> +	struct xfs_inode	*ip)
>  {
> -	kmem_free(ip->i_itemp->ili_item.li_lv_shadow);
> -	kmem_cache_free(xfs_ili_zone, ip->i_itemp);
> +	struct xfs_inode_log_item *iip = ip->i_itemp;
> +
> +	ASSERT(iip->ili_item.li_buf == NULL);
> +
> +	ip->i_itemp = NULL;
> +	kmem_free(iip->ili_item.li_lv_shadow);
> +	kmem_cache_free(xfs_ili_zone, iip);
>  }
>  
>  
> @@ -665,6 +680,13 @@ xfs_inode_item_destroy(
>   * list for other inodes that will run this function. We remove them from the
>   * buffer list so we can process all the inode IO completions in one AIL lock
>   * traversal.
> + *
> + * Note: Now that we attach the log item to the buffer when we first log the
> + * inode in memory, we can have unflushed inodes on the buffer list here. These
> + * inodes will have a zero ili_last_fields, so skip over them here. We do
> + * this check -after- we've checked for stale inodes, because we're guaranteed
> + * to have XFS_ISTALE set in the case that dirty inodes are in the CIL and have
> + * not yet had their dirtying transactions committed to disk.
>   */
>  void
>  xfs_iflush_done(
> @@ -688,14 +710,16 @@ xfs_iflush_done(
>  			continue;
>  		}
>  
> +		if (!iip->ili_last_fields)
> +			continue;
> +
>  		list_move_tail(&lip->li_bio_list, &tmp);
>  
>  		/* Do an unlocked check for needing the AIL lock. */
> -		if (lip->li_lsn == iip->ili_flush_lsn ||
> +		if (iip->ili_flush_lsn == lip->li_lsn ||
>  		    test_bit(XFS_LI_FAILED, &lip->li_flags))
>  			need_ail++;
>  	}
> -	ASSERT(list_empty(&bp->b_li_list));
>  
>  	/*
>  	 * We only want to pull the item from the AIL if it is actually there
> @@ -708,7 +732,8 @@ xfs_iflush_done(
>  		/* this is an opencoded batch version of xfs_trans_ail_delete */
>  		spin_lock(&ailp->ail_lock);
>  		list_for_each_entry(lip, &tmp, li_bio_list) {
> -			if (lip->li_lsn == INODE_ITEM(lip)->ili_flush_lsn) {
> +			iip = INODE_ITEM(lip);
> +			if (iip->ili_flush_lsn == lip->li_lsn) {
>  				xfs_lsn_t lsn = xfs_ail_delete_one(ailp, lip);
>  				if (!tail_lsn && lsn)
>  					tail_lsn = lsn;
> @@ -725,14 +750,29 @@ xfs_iflush_done(
>  	 * them is safely on disk.
>  	 */
>  	list_for_each_entry_safe(lip, n, &tmp, li_bio_list) {
> +		bool	drop_buffer = false;
> +
>  		list_del_init(&lip->li_bio_list);
>  		iip = INODE_ITEM(lip);
>  
>  		spin_lock(&iip->ili_lock);
>  		iip->ili_last_fields = 0;
> -		spin_unlock(&iip->ili_lock);
> +		iip->ili_flush_lsn = 0;
>  
> +		/*
> +		 * Remove the reference to the cluster buffer if the inode is
> +		 * clean in memory. Drop the buffer reference once we've dropped
> +		 * the locks we hold.
> +		 */
> +		ASSERT(iip->ili_item.li_buf == bp);
> +		if (!iip->ili_fields) {
> +			iip->ili_item.li_buf = NULL;
> +			drop_buffer = true;
> +		}
> +		spin_unlock(&iip->ili_lock);
>  		xfs_ifunlock(iip->ili_inode);
> +		if (drop_buffer)
> +			xfs_buf_rele(bp);
>  	}
>  }
>  
> @@ -747,6 +787,7 @@ xfs_iflush_abort(
>  	struct xfs_inode		*ip)
>  {
>  	struct xfs_inode_log_item	*iip = ip->i_itemp;
> +	struct xfs_buf		*bp = NULL;

Indentation seems inconsistent here.

Other than those two things this looks pretty good to me.

--D

>  
>  	if (iip) {
>  		xfs_trans_ail_delete(&iip->ili_item, 0);
> @@ -758,12 +799,14 @@ xfs_iflush_abort(
>  		iip->ili_last_fields = 0;
>  		iip->ili_fields = 0;
>  		iip->ili_fsync_fields = 0;
> +		iip->ili_flush_lsn = 0;
> +		bp = iip->ili_item.li_buf;
> +		iip->ili_item.li_buf = NULL;
>  		spin_unlock(&iip->ili_lock);
>  	}
> -	/*
> -	 * Release the inode's flush lock since we're done with it.
> -	 */
>  	xfs_ifunlock(ip);
> +	if (bp)
> +		xfs_buf_rele(bp);
>  }
>  
>  /*
> diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
> index 3004aeac91102..21ffc6dfcd13e 100644
> --- a/fs/xfs/xfs_trans_priv.h
> +++ b/fs/xfs/xfs_trans_priv.h
> @@ -143,15 +143,10 @@ static inline void
>  xfs_clear_li_failed(
>  	struct xfs_log_item	*lip)
>  {
> -	struct xfs_buf	*bp = lip->li_buf;
> -
>  	ASSERT(test_bit(XFS_LI_IN_AIL, &lip->li_flags));
>  	lockdep_assert_held(&lip->li_ailp->ail_lock);
>  
> -	if (test_and_clear_bit(XFS_LI_FAILED, &lip->li_flags)) {
> -		lip->li_buf = NULL;
> -		xfs_buf_rele(bp);
> -	}
> +	clear_bit(XFS_LI_FAILED, &lip->li_flags);
>  }
>  
>  static inline void
> @@ -161,10 +156,7 @@ xfs_set_li_failed(
>  {
>  	lockdep_assert_held(&lip->li_ailp->ail_lock);
>  
> -	if (!test_and_set_bit(XFS_LI_FAILED, &lip->li_flags)) {
> -		xfs_buf_hold(bp);
> -		lip->li_buf = bp;
> -	}
> +	set_bit(XFS_LI_FAILED, &lip->li_flags);
>  }
>  
>  #endif	/* __XFS_TRANS_PRIV_H__ */
> -- 
> 2.26.2.761.g0e0b3e54be
> 

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 13/24] xfs: make inode reclaim almost non-blocking
  2020-05-22  3:50 ` [PATCH 13/24] xfs: make inode reclaim almost non-blocking Dave Chinner
  2020-05-22 12:19   ` Amir Goldstein
@ 2020-05-22 22:48   ` Darrick J. Wong
  2020-05-23 22:29     ` Dave Chinner
  1 sibling, 1 reply; 91+ messages in thread
From: Darrick J. Wong @ 2020-05-22 22:48 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, May 22, 2020 at 01:50:18PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Now that dirty inode writeback doesn't cause read-modify-write
> cycles on the inode cluster buffer under memory pressure, the need
> to throttle memory reclaim to the rate at which we can clean dirty
> inodes goes away. That is due to the fact that we no longer thrash
> inode cluster buffers under memory pressure to clean dirty inodes.
> 
> This means inode writeback no longer stalls on memory allocation
> or read IO, and hence can be done asynchrnously without generating

"...asynchronously..."

> memory pressure. As a result, blocking inode writeback in reclaim is
> no longer necessary to prevent reclaim priority windup as cleaning
> dirty inodes is no longer dependent on having memory reserves
> available for the filesystem to make progress reclaiming inodes.
> 
> Hence we can convert inode reclaim to be non-blocking for shrinker
> callouts, both for direct reclaim and kswapd.
> 
> On a vanilla kernel, running a 16-way fsmark create workload on a
> 4 node/16p/16GB RAM machine, I can reliably pin 14.75GB of RAM via
> userspace mlock(). The OOM killer gets invoked at 15GB of
> pinned RAM.
> 
> With this patch alone, pinning memory triggers premature OOM
> killer invocation, sometimes with as much as 45% of RAM being free.
> It's trivially easy to trigger the OOM killer when reclaim does not
> block.
> 
> With pinning inode clusters in RAM adn then adding this patch, I can
> reliably pin 14.5GB of RAM and still have the fsmark workload run to
> completion. The OOM killer gets invoked 14.75GB of pinned RAM, which
> is only a small amount of memory less than the vanilla kernel. It is
> much more reliable than just with async reclaim alone.

So the lack of OOM kills is the result of not having to do RMW and
ratcheting up the reclaim priority, right?

And, {con|per}versely, can I run fstests with 400MB of RAM now? :D

> simoops shows that allocation stalls go away when async reclaim is
> used. Vanilla kernel:
> 
> Run time: 1924 seconds
> Read latency (p50: 3,305,472) (p95: 3,723,264) (p99: 4,001,792)
> Write latency (p50: 184,064) (p95: 553,984) (p99: 807,936)
> Allocation latency (p50: 2,641,920) (p95: 3,911,680) (p99: 4,464,640)
> work rate = 13.45/sec (avg 13.44/sec) (p50: 13.46) (p95: 13.58) (p99: 13.70)
> alloc stall rate = 3.80/sec (avg: 2.59) (p50: 2.54) (p95: 2.96) (p99: 3.02)
> 
> With inode cluster pinning and async reclaim:
> 
> Run time: 1924 seconds
> Read latency (p50: 3,305,472) (p95: 3,715,072) (p99: 3,977,216)
> Write latency (p50: 187,648) (p95: 553,984) (p99: 789,504)
> Allocation latency (p50: 2,748,416) (p95: 3,919,872) (p99: 4,448,256)

I'm not familiar with simoops, and ElGoog is not helpful.  What are the
units here?

> work rate = 13.28/sec (avg 13.32/sec) (p50: 13.26) (p95: 13.34) (p99: 13.34)
> alloc stall rate = 0.02/sec (avg: 0.02) (p50: 0.01) (p95: 0.03) (p99: 0.03)
> 
> Latencies don't really change much, nor does the work rate. However,
> allocation almost never stalls with these changes, whilst the
> vanilla kernel is sometimes reporting 20 stalls/s over a 60s sample
> period. This difference is due to inode reclaim being largely
> non-blocking now.

<nod>

> IOWs, once we have pinned inode cluster buffers, we can make inode
> reclaim non-blocking without a major risk of premature and/or
> spurious OOM killer invocation, and without any changes to memory
> reclaim infrastructure.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Looks ok to me, provided my suppositions are correct...
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/xfs/xfs_icache.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index d806d3bfa8936..0f0f8fcd61b03 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -1420,7 +1420,7 @@ xfs_reclaim_inodes_nr(
>  	xfs_reclaim_work_queue(mp);
>  	xfs_ail_push_all(mp->m_ail);
>  
> -	return xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK | SYNC_WAIT, &nr_to_scan);
> +	return xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK, &nr_to_scan);
>  }
>  
>  /*
> -- 
> 2.26.2.761.g0e0b3e54be
> 

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 14/24] xfs: remove IO submission from xfs_reclaim_inode()
  2020-05-22  3:50 ` [PATCH 14/24] xfs: remove IO submission from xfs_reclaim_inode() Dave Chinner
@ 2020-05-22 23:06   ` Darrick J. Wong
  2020-05-25  3:49     ` Dave Chinner
  2020-05-23  9:40   ` Christoph Hellwig
  1 sibling, 1 reply; 91+ messages in thread
From: Darrick J. Wong @ 2020-05-22 23:06 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, May 22, 2020 at 01:50:19PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> We no longer need to issue IO from shrinker based inode reclaim to
> prevent spurious OOM killer invocation. This leaves only the global
> filesystem management operations such as unmount needing to
> writeback dirty inodes and reclaim them.
> 
> Instead of using the reclaim pass to write dirty inodes before
> reclaiming them, use the AIL to push all the dirty inodes before we
> try to reclaim them. This allows us to remove all the conditional
> SYNC_WAIT locking and the writeback code from xfs_reclaim_inode()
> and greatly simplify the checks we need to do to reclaim an inode.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_icache.c | 116 +++++++++++---------------------------------
>  1 file changed, 29 insertions(+), 87 deletions(-)
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 0f0f8fcd61b03..ee9bc82a0dfbe 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -1130,24 +1130,17 @@ xfs_reclaim_inode_grab(
>   *	dirty, async	=> requeue
>   *	dirty, sync	=> flush, wait and reclaim
>   */

The function comment probably ought to describe what the two return
values mean.  true when the inode was freed and false if we need to try
again, right?

> -STATIC int
> +static bool
>  xfs_reclaim_inode(
>  	struct xfs_inode	*ip,
>  	struct xfs_perag	*pag,
>  	int			sync_mode)
>  {
> -	struct xfs_buf		*bp = NULL;
>  	xfs_ino_t		ino = ip->i_ino; /* for radix_tree_delete */
> -	int			error;
>  
> -restart:
> -	error = 0;
>  	xfs_ilock(ip, XFS_ILOCK_EXCL);
> -	if (!xfs_iflock_nowait(ip)) {
> -		if (!(sync_mode & SYNC_WAIT))
> -			goto out;
> -		xfs_iflock(ip);
> -	}
> +	if (!xfs_iflock_nowait(ip))
> +		goto out;
>  
>  	if (XFS_FORCED_SHUTDOWN(ip->i_mount)) {
>  		xfs_iunpin_wait(ip);
> @@ -1155,51 +1148,19 @@ xfs_reclaim_inode(
>  		xfs_iflush_abort(ip);
>  		goto reclaim;
>  	}
> -	if (xfs_ipincount(ip)) {
> -		if (!(sync_mode & SYNC_WAIT))
> -			goto out_ifunlock;
> -		xfs_iunpin_wait(ip);
> -	}
> +	if (xfs_ipincount(ip))
> +		goto out_ifunlock;
>  	if (xfs_iflags_test(ip, XFS_ISTALE) || xfs_inode_clean(ip)) {
>  		xfs_ifunlock(ip);
>  		goto reclaim;
>  	}
>  
> -	/*
> -	 * Never flush out dirty data during non-blocking reclaim, as it would
> -	 * just contend with AIL pushing trying to do the same job.
> -	 */
> -	if (!(sync_mode & SYNC_WAIT))
> -		goto out_ifunlock;
> -
> -	/*
> -	 * Now we have an inode that needs flushing.
> -	 *
> -	 * Note that xfs_iflush will never block on the inode buffer lock, as
> -	 * xfs_ifree_cluster() can lock the inode buffer before it locks the
> -	 * ip->i_lock, and we are doing the exact opposite here.  As a result,
> -	 * doing a blocking xfs_imap_to_bp() to get the cluster buffer would
> -	 * result in an ABBA deadlock with xfs_ifree_cluster().
> -	 *
> -	 * As xfs_ifree_cluser() must gather all inodes that are active in the
> -	 * cache to mark them stale, if we hit this case we don't actually want
> -	 * to do IO here - we want the inode marked stale so we can simply
> -	 * reclaim it.  Hence if we get an EAGAIN error here,  just unlock the
> -	 * inode, back off and try again.  Hopefully the next pass through will
> -	 * see the stale flag set on the inode.
> -	 */
> -	error = xfs_iflush(ip, &bp);
> -	if (error == -EAGAIN) {
> -		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> -		/* backoff longer than in xfs_ifree_cluster */
> -		delay(2);
> -		goto restart;
> -	}
> -
> -	if (!error) {
> -		error = xfs_bwrite(bp);
> -		xfs_buf_relse(bp);
> -	}
> +out_ifunlock:
> +	xfs_ifunlock(ip);
> +out:
> +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +	xfs_iflags_clear(ip, XFS_IRECLAIM);
> +	return false;
>  
>  reclaim:
>  	ASSERT(!xfs_isiflocked(ip));
> @@ -1249,21 +1210,7 @@ xfs_reclaim_inode(
>  	xfs_iunlock(ip, XFS_ILOCK_EXCL);
>  
>  	__xfs_inode_free(ip);
> -	return error;
> -
> -out_ifunlock:
> -	xfs_ifunlock(ip);
> -out:
> -	xfs_iflags_clear(ip, XFS_IRECLAIM);
> -	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> -	/*
> -	 * We could return -EAGAIN here to make reclaim rescan the inode tree in
> -	 * a short while. However, this just burns CPU time scanning the tree
> -	 * waiting for IO to complete and the reclaim work never goes back to
> -	 * the idle state. Instead, return 0 to let the next scheduled
> -	 * background reclaim attempt to reclaim the inode again.
> -	 */
> -	return 0;
> +	return true;
>  }
>  
>  /*
> @@ -1272,20 +1219,17 @@ xfs_reclaim_inode(
>   * then a shut down during filesystem unmount reclaim walk leak all the
>   * unreclaimed inodes.
>   */
> -STATIC int
> +static int

The function comment /really/ needs to note that the return value
here is number of inodes that were skipped, and not just some negative
error code.

>  xfs_reclaim_inodes_ag(
>  	struct xfs_mount	*mp,
>  	int			flags,
>  	int			*nr_to_scan)
>  {
>  	struct xfs_perag	*pag;
> -	int			error = 0;
> -	int			last_error = 0;
>  	xfs_agnumber_t		ag;
>  	int			trylock = flags & SYNC_TRYLOCK;
>  	int			skipped;
>  
> -restart:
>  	ag = 0;
>  	skipped = 0;
>  	while ((pag = xfs_perag_get_tag(mp, ag, XFS_ICI_RECLAIM_TAG))) {
> @@ -1359,9 +1303,8 @@ xfs_reclaim_inodes_ag(
>  			for (i = 0; i < nr_found; i++) {
>  				if (!batch[i])
>  					continue;
> -				error = xfs_reclaim_inode(batch[i], pag, flags);
> -				if (error && last_error != -EFSCORRUPTED)
> -					last_error = error;
> +				if (!xfs_reclaim_inode(batch[i], pag, flags))
> +					skipped++;
>  			}
>  
>  			*nr_to_scan -= XFS_LOOKUP_BATCH;
> @@ -1377,19 +1320,7 @@ xfs_reclaim_inodes_ag(
>  		mutex_unlock(&pag->pag_ici_reclaim_lock);
>  		xfs_perag_put(pag);
>  	}
> -
> -	/*
> -	 * if we skipped any AG, and we still have scan count remaining, do
> -	 * another pass this time using blocking reclaim semantics (i.e
> -	 * waiting on the reclaim locks and ignoring the reclaim cursors). This
> -	 * ensure that when we get more reclaimers than AGs we block rather
> -	 * than spin trying to execute reclaim.
> -	 */
> -	if (skipped && (flags & SYNC_WAIT) && *nr_to_scan > 0) {
> -		trylock = 0;
> -		goto restart;
> -	}
> -	return last_error;
> +	return skipped;
>  }
>  
>  int
> @@ -1398,8 +1329,18 @@ xfs_reclaim_inodes(
>  	int		mode)
>  {
>  	int		nr_to_scan = INT_MAX;
> +	int		skipped;
>  
> -	return xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan);
> +	skipped = xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan);
> +	if (!(mode & SYNC_WAIT))
> +		return 0;
> +
> +	do {
> +		xfs_ail_push_all_sync(mp->m_ail);
> +		skipped = xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan);
> +	} while (skipped > 0);
> +
> +	return 0;

Might as well kill the return value here since none of the callers care.

>  }
>  
>  /*
> @@ -1420,7 +1361,8 @@ xfs_reclaim_inodes_nr(
>  	xfs_reclaim_work_queue(mp);
>  	xfs_ail_push_all(mp->m_ail);
>  
> -	return xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK, &nr_to_scan);

So the old code was returning negative error codes here?  Given that the
only caller is free_cached_objects which adds it to the 'freed' count...
wow.

> +	xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK, &nr_to_scan);
> +	return 0;

Why do we return zero freed items here?  The VFS asked us to clear
shrink_control->nr_to_scan (passed in here as nr_to_scan) and we're
supposed to report what we did, right?

Or is there some odd subtlety here where we hate the shrinker and that's
why we return zero?

--D

>  }
>  
>  /*
> -- 
> 2.26.2.761.g0e0b3e54be
> 

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 15/24] xfs: allow multiple reclaimers per AG
  2020-05-22  3:50 ` [PATCH 15/24] xfs: allow multiple reclaimers per AG Dave Chinner
@ 2020-05-22 23:10   ` Darrick J. Wong
  2020-05-23 22:35     ` Dave Chinner
  0 siblings, 1 reply; 91+ messages in thread
From: Darrick J. Wong @ 2020-05-22 23:10 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, May 22, 2020 at 01:50:20PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Inode reclaim will still throttle direct reclaim on the per-ag
> reclaim locks. This is no longer necessary as reclaim can run
> non-blocking now. Hence we can remove these locks so that we don't
> arbitrarily block reclaimers just because there are more direct
> reclaimers than there are AGs.
> 
> This can result in multiple reclaimers working on the same range of
> an AG, but this doesn't cause any apparent issues. Optimising the
> spread of concurrent reclaimers for best efficiency can be done in a
> future patchset.

"Future patchset" as in "not in the 9 patches that I have left to read"?

> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_icache.c | 28 +++++++---------------------
>  fs/xfs/xfs_mount.c  |  4 ----
>  fs/xfs/xfs_mount.h  |  1 -
>  3 files changed, 7 insertions(+), 26 deletions(-)
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index ee9bc82a0dfbe..f44493b2eae77 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -1226,12 +1226,9 @@ xfs_reclaim_inodes_ag(
>  	int			*nr_to_scan)
>  {
>  	struct xfs_perag	*pag;
> -	xfs_agnumber_t		ag;
> -	int			trylock = flags & SYNC_TRYLOCK;
> -	int			skipped;
> +	xfs_agnumber_t		ag = 0;
> +	int			skipped = 0;
>  
> -	ag = 0;
> -	skipped = 0;
>  	while ((pag = xfs_perag_get_tag(mp, ag, XFS_ICI_RECLAIM_TAG))) {
>  		unsigned long	first_index = 0;
>  		int		done = 0;
> @@ -1239,16 +1236,7 @@ xfs_reclaim_inodes_ag(
>  
>  		ag = pag->pag_agno + 1;
>  
> -		if (trylock) {
> -			if (!mutex_trylock(&pag->pag_ici_reclaim_lock)) {
> -				skipped++;
> -				xfs_perag_put(pag);
> -				continue;
> -			}
> -			first_index = pag->pag_ici_reclaim_cursor;
> -		} else
> -			mutex_lock(&pag->pag_ici_reclaim_lock);
> -
> +		first_index = READ_ONCE(pag->pag_ici_reclaim_cursor);
>  		do {
>  			struct xfs_inode *batch[XFS_LOOKUP_BATCH];
>  			int	i;
> @@ -1313,11 +1301,9 @@ xfs_reclaim_inodes_ag(
>  
>  		} while (nr_found && !done && *nr_to_scan > 0);
>  
> -		if (trylock && !done)
> -			pag->pag_ici_reclaim_cursor = first_index;
> -		else
> -			pag->pag_ici_reclaim_cursor = 0;
> -		mutex_unlock(&pag->pag_ici_reclaim_lock);
> +		if (done)
> +			first_index = 0;
> +		WRITE_ONCE(pag->pag_ici_reclaim_cursor, first_index);
>  		xfs_perag_put(pag);
>  	}
>  	return skipped;
> @@ -1331,7 +1317,7 @@ xfs_reclaim_inodes(
>  	int		nr_to_scan = INT_MAX;
>  	int		skipped;
>  
> -	skipped = xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan);
> +	xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan);

This could go in the previous patch, right?

The rest of this patch looks reasonable, so with that fixed,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

>  	if (!(mode & SYNC_WAIT))
>  		return 0;
>  
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index d5dcf98698600..03158b42a1943 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -148,7 +148,6 @@ xfs_free_perag(
>  		ASSERT(atomic_read(&pag->pag_ref) == 0);
>  		xfs_iunlink_destroy(pag);
>  		xfs_buf_hash_destroy(pag);
> -		mutex_destroy(&pag->pag_ici_reclaim_lock);
>  		call_rcu(&pag->rcu_head, __xfs_free_perag);
>  	}
>  }
> @@ -200,7 +199,6 @@ xfs_initialize_perag(
>  		pag->pag_agno = index;
>  		pag->pag_mount = mp;
>  		spin_lock_init(&pag->pag_ici_lock);
> -		mutex_init(&pag->pag_ici_reclaim_lock);
>  		INIT_RADIX_TREE(&pag->pag_ici_root, GFP_ATOMIC);
>  		if (xfs_buf_hash_init(pag))
>  			goto out_free_pag;
> @@ -242,7 +240,6 @@ xfs_initialize_perag(
>  out_hash_destroy:
>  	xfs_buf_hash_destroy(pag);
>  out_free_pag:
> -	mutex_destroy(&pag->pag_ici_reclaim_lock);
>  	kmem_free(pag);
>  out_unwind_new_pags:
>  	/* unwind any prior newly initialized pags */
> @@ -252,7 +249,6 @@ xfs_initialize_perag(
>  			break;
>  		xfs_buf_hash_destroy(pag);
>  		xfs_iunlink_destroy(pag);
> -		mutex_destroy(&pag->pag_ici_reclaim_lock);
>  		kmem_free(pag);
>  	}
>  	return error;
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index 3725d25ad97e8..a72cfcaa4ad12 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -354,7 +354,6 @@ typedef struct xfs_perag {
>  	spinlock_t	pag_ici_lock;	/* incore inode cache lock */
>  	struct radix_tree_root pag_ici_root;	/* incore inode cache root */
>  	int		pag_ici_reclaimable;	/* reclaimable inodes */
> -	struct mutex	pag_ici_reclaim_lock;	/* serialisation point */
>  	unsigned long	pag_ici_reclaim_cursor;	/* reclaim restart point */
>  
>  	/* buffer cache index */
> -- 
> 2.26.2.761.g0e0b3e54be
> 

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 16/24] xfs: don't block inode reclaim on the ILOCK
  2020-05-22  3:50 ` [PATCH 16/24] xfs: don't block inode reclaim on the ILOCK Dave Chinner
@ 2020-05-22 23:11   ` Darrick J. Wong
  0 siblings, 0 replies; 91+ messages in thread
From: Darrick J. Wong @ 2020-05-22 23:11 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, May 22, 2020 at 01:50:21PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When we attempt to reclaim an inode, the first thing we do it take

"...we do is take the inode lock."

With that fixed,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> the inode lock. This is blocking right now, so if the inode being
> accessed by something else (e.g. being flushed to the cluster
> buffer) we will block here.
> 
> Change this to a trylock so that we do not block inode reclaim
> unnecessarily here.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_icache.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index f44493b2eae77..c020d2379e12e 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -1138,9 +1138,10 @@ xfs_reclaim_inode(
>  {
>  	xfs_ino_t		ino = ip->i_ino; /* for radix_tree_delete */
>  
> -	xfs_ilock(ip, XFS_ILOCK_EXCL);
> -	if (!xfs_iflock_nowait(ip))
> +	if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
>  		goto out;
> +	if (!xfs_iflock_nowait(ip))
> +		goto out_iunlock;
>  
>  	if (XFS_FORCED_SHUTDOWN(ip->i_mount)) {
>  		xfs_iunpin_wait(ip);
> @@ -1157,8 +1158,9 @@ xfs_reclaim_inode(
>  
>  out_ifunlock:
>  	xfs_ifunlock(ip);
> -out:
> +out_iunlock:
>  	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +out:
>  	xfs_iflags_clear(ip, XFS_IRECLAIM);
>  	return false;
>  
> -- 
> 2.26.2.761.g0e0b3e54be
> 

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 17/24] xfs: remove SYNC_TRYLOCK from inode reclaim
  2020-05-22  3:50 ` [PATCH 17/24] xfs: remove SYNC_TRYLOCK from inode reclaim Dave Chinner
@ 2020-05-22 23:14   ` Darrick J. Wong
  2020-05-23 22:42     ` Dave Chinner
  0 siblings, 1 reply; 91+ messages in thread
From: Darrick J. Wong @ 2020-05-22 23:14 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, May 22, 2020 at 01:50:22PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> All background reclaim is SYNC_TRYLOCK already, and even blocking
> reclaim (SYNC_WAIT) can use trylock mechanisms as
> xfs_reclaim_inodes_ag() will keep cycling until there are no more
> reclaimable inodes. Hence we can kill SYNC_TRYLOCK from inode
> reclaim and make everything unconditionally non-blocking.

Random question: Does xfs_quiesce_attr need to call xfs_reclaim_inodes
twice now, or does the second SYNC_WAIT call suffice now?

> Signed-off-by: Dave Chinner <dchinner@redhat.com>

This patch itself looks fine though.
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/xfs/xfs_icache.c | 27 +++++++++++----------------
>  1 file changed, 11 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index c020d2379e12e..8b366bc7b53c9 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -1049,10 +1049,9 @@ xfs_inode_ag_iterator_tag(
>   * Grab the inode for reclaim exclusively.
>   * Return 0 if we grabbed it, non-zero otherwise.
>   */
> -STATIC int
> +static int
>  xfs_reclaim_inode_grab(
> -	struct xfs_inode	*ip,
> -	int			flags)
> +	struct xfs_inode	*ip)
>  {
>  	ASSERT(rcu_read_lock_held());
>  
> @@ -1061,12 +1060,10 @@ xfs_reclaim_inode_grab(
>  		return 1;
>  
>  	/*
> -	 * If we are asked for non-blocking operation, do unlocked checks to
> -	 * see if the inode already is being flushed or in reclaim to avoid
> -	 * lock traffic.
> +	 * Do unlocked checks to see if the inode already is being flushed or in
> +	 * reclaim to avoid lock traffic.
>  	 */
> -	if ((flags & SYNC_TRYLOCK) &&
> -	    __xfs_iflags_test(ip, XFS_IFLOCK | XFS_IRECLAIM))
> +	if (__xfs_iflags_test(ip, XFS_IFLOCK | XFS_IRECLAIM))
>  		return 1;
>  
>  	/*
> @@ -1133,8 +1130,7 @@ xfs_reclaim_inode_grab(
>  static bool
>  xfs_reclaim_inode(
>  	struct xfs_inode	*ip,
> -	struct xfs_perag	*pag,
> -	int			sync_mode)
> +	struct xfs_perag	*pag)
>  {
>  	xfs_ino_t		ino = ip->i_ino; /* for radix_tree_delete */
>  
> @@ -1224,7 +1220,6 @@ xfs_reclaim_inode(
>  static int
>  xfs_reclaim_inodes_ag(
>  	struct xfs_mount	*mp,
> -	int			flags,
>  	int			*nr_to_scan)
>  {
>  	struct xfs_perag	*pag;
> @@ -1262,7 +1257,7 @@ xfs_reclaim_inodes_ag(
>  			for (i = 0; i < nr_found; i++) {
>  				struct xfs_inode *ip = batch[i];
>  
> -				if (done || xfs_reclaim_inode_grab(ip, flags))
> +				if (done || xfs_reclaim_inode_grab(ip))
>  					batch[i] = NULL;
>  
>  				/*
> @@ -1293,7 +1288,7 @@ xfs_reclaim_inodes_ag(
>  			for (i = 0; i < nr_found; i++) {
>  				if (!batch[i])
>  					continue;
> -				if (!xfs_reclaim_inode(batch[i], pag, flags))
> +				if (!xfs_reclaim_inode(batch[i], pag))
>  					skipped++;
>  			}
>  
> @@ -1319,13 +1314,13 @@ xfs_reclaim_inodes(
>  	int		nr_to_scan = INT_MAX;
>  	int		skipped;
>  
> -	xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan);
> +	xfs_reclaim_inodes_ag(mp, &nr_to_scan);
>  	if (!(mode & SYNC_WAIT))
>  		return 0;
>  
>  	do {
>  		xfs_ail_push_all_sync(mp->m_ail);
> -		skipped = xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan);
> +		skipped = xfs_reclaim_inodes_ag(mp, &nr_to_scan);
>  	} while (skipped > 0);
>  
>  	return 0;
> @@ -1349,7 +1344,7 @@ xfs_reclaim_inodes_nr(
>  	xfs_reclaim_work_queue(mp);
>  	xfs_ail_push_all(mp->m_ail);
>  
> -	xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK, &nr_to_scan);
> +	xfs_reclaim_inodes_ag(mp, &nr_to_scan);
>  	return 0;
>  }
>  
> -- 
> 2.26.2.761.g0e0b3e54be
> 

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 18/24] xfs: clean up inode reclaim comments
  2020-05-22  3:50 ` [PATCH 18/24] xfs: clean up inode reclaim comments Dave Chinner
@ 2020-05-22 23:17   ` Darrick J. Wong
  0 siblings, 0 replies; 91+ messages in thread
From: Darrick J. Wong @ 2020-05-22 23:17 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, May 22, 2020 at 01:50:23PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Inode reclaim is quite different now to the way described in various
> comments, so update all the comments explaining what it does and how
> it works.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_icache.c | 128 ++++++++++++--------------------------------
>  1 file changed, 35 insertions(+), 93 deletions(-)
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 8b366bc7b53c9..81c23384d3310 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -141,11 +141,8 @@ xfs_inode_free(
>  }
>  
>  /*
> - * Queue a new inode reclaim pass if there are reclaimable inodes and there
> - * isn't a reclaim pass already in progress. By default it runs every 5s based
> - * on the xfs periodic sync default of 30s. Perhaps this should have it's own
> - * tunable, but that can be done if this method proves to be ineffective or too
> - * aggressive.
> + * Queue background inode reclaim work if there are reclaimable inodes and there
> + * isn't reclaim work already scheduled or in progress.
>   */
>  static void
>  xfs_reclaim_work_queue(
> @@ -164,8 +161,7 @@ xfs_reclaim_work_queue(
>   * This is a fast pass over the inode cache to try to get reclaim moving on as
>   * many inodes as possible in a short period of time. It kicks itself every few
>   * seconds, as well as being kicked by the inode cache shrinker when memory
> - * goes low. It scans as quickly as possible avoiding locked inodes or those
> - * already being flushed, and once done schedules a future pass.
> + * goes low.
>   */
>  void
>  xfs_reclaim_worker(
> @@ -618,48 +614,31 @@ xfs_iget_cache_miss(
>  }
>  
>  /*
> - * Look up an inode by number in the given file system.
> - * The inode is looked up in the cache held in each AG.
> - * If the inode is found in the cache, initialise the vfs inode
> - * if necessary.
> + * Look up an inode by number in the given file system.  The inode is looked up
> + * in the cache held in each AG.  If the inode is found in the cache, initialise
> + * the vfs inode if necessary.
>   *
> - * If it is not in core, read it in from the file system's device,
> - * add it to the cache and initialise the vfs inode.
> + * If it is not in core, read it in from the file system's device, add it to the
> + * cache and initialise the vfs inode.
>   *
>   * The inode is locked according to the value of the lock_flags parameter.
> - * This flag parameter indicates how and if the inode's IO lock and inode lock
> - * should be taken.
> - *
> - * mp -- the mount point structure for the current file system.  It points
> - *       to the inode hash table.
> - * tp -- a pointer to the current transaction if there is one.  This is
> - *       simply passed through to the xfs_iread() call.
> - * ino -- the number of the inode desired.  This is the unique identifier
> - *        within the file system for the inode being requested.
> - * lock_flags -- flags indicating how to lock the inode.  See the comment
> - *		 for xfs_ilock() for a list of valid values.
> + * Inode lookup is only done during metadata operations and not as part of the
> + * data IO path. Hence we only allow locking of the XFS_ILOCK during lookup.
>   */
>  int
>  xfs_iget(
> -	xfs_mount_t	*mp,
> -	xfs_trans_t	*tp,
> -	xfs_ino_t	ino,
> -	uint		flags,
> -	uint		lock_flags,
> -	xfs_inode_t	**ipp)
> +	struct xfs_mount	*mp,
> +	struct xfs_trans	*tp,
> +	xfs_ino_t		ino,
> +	uint			flags,
> +	uint			lock_flags,
> +	struct xfs_inode	**ipp)
>  {
> -	xfs_inode_t	*ip;
> -	int		error;
> -	xfs_perag_t	*pag;
> -	xfs_agino_t	agino;
> +	struct xfs_inode	*ip;
> +	struct xfs_perag	*pag;
> +	xfs_agino_t		agino;
> +	int			error;
>  
> -	/*
> -	 * xfs_reclaim_inode() uses the ILOCK to ensure an inode
> -	 * doesn't get freed while it's being referenced during a
> -	 * radix tree traversal here.  It assumes this function
> -	 * aqcuires only the ILOCK (and therefore it has no need to
> -	 * involve the IOLOCK in this synchronization).
> -	 */
>  	ASSERT((lock_flags & (XFS_IOLOCK_EXCL | XFS_IOLOCK_SHARED)) == 0);
>  
>  	/* reject inode numbers outside existing AGs */
> @@ -771,15 +750,7 @@ xfs_inode_ag_walk_grab(
>  
>  	ASSERT(rcu_read_lock_held());
>  
> -	/*
> -	 * check for stale RCU freed inode
> -	 *
> -	 * If the inode has been reallocated, it doesn't matter if it's not in
> -	 * the AG we are walking - we are walking for writeback, so if it
> -	 * passes all the "valid inode" checks and is dirty, then we'll write
> -	 * it back anyway.  If it has been reallocated and still being
> -	 * initialised, the XFS_INEW check below will catch it.
> -	 */
> +	/* Check for stale RCU freed inode */
>  	spin_lock(&ip->i_flags_lock);
>  	if (!ip->i_ino)
>  		goto out_unlock_noent;
> @@ -1089,43 +1060,16 @@ xfs_reclaim_inode_grab(
>  }
>  
>  /*
> - * Inodes in different states need to be treated differently. The following
> - * table lists the inode states and the reclaim actions necessary:
> - *
> - *	inode state	     iflush ret		required action
> - *      ---------------      ----------         ---------------
> - *	bad			-		reclaim
> - *	shutdown		EIO		unpin and reclaim
> - *	clean, unpinned		0		reclaim
> - *	stale, unpinned		0		reclaim
> - *	clean, pinned(*)	0		requeue
> - *	stale, pinned		EAGAIN		requeue
> - *	dirty, async		-		requeue
> - *	dirty, sync		0		reclaim
> + * Inode reclaim is non-blocking, so the default action if progress cannot be
> + * made is to "requeue" the inode for reclaim by unlocking it and clearing the
> + * XFS_IRECLAIM flag.  If we are in a shutdown state, we don't care about
> + * blocking anymore and hence we can wait for the inode to be able to reclaim
> + * it.
>   *
> - * (*) dgc: I don't think the clean, pinned state is possible but it gets
> - * handled anyway given the order of checks implemented.
> - *
> - * Also, because we get the flush lock first, we know that any inode that has
> - * been flushed delwri has had the flush completed by the time we check that
> - * the inode is clean.
> - *
> - * Note that because the inode is flushed delayed write by AIL pushing, the
> - * flush lock may already be held here and waiting on it can result in very
> - * long latencies.  Hence for sync reclaims, where we wait on the flush lock,
> - * the caller should push the AIL first before trying to reclaim inodes to
> - * minimise the amount of time spent waiting.  For background relaim, we only
> - * bother to reclaim clean inodes anyway.
> - *
> - * Hence the order of actions after gaining the locks should be:
> - *	bad		=> reclaim
> - *	shutdown	=> unpin and reclaim
> - *	pinned, async	=> requeue
> - *	pinned, sync	=> unpin
> - *	stale		=> reclaim
> - *	clean		=> reclaim
> - *	dirty, async	=> requeue
> - *	dirty, sync	=> flush, wait and reclaim
> + * We do no IO here - if callers require indoes to be cleaned they must push the

"...require inodes to be cleaned..."

> + * AIL first to trigger writeback of dirty inodes. The enbales both the

"This enables writeback to be done in the background in a non-blocking
manner, and enables memory reclaim to make progress without blocking."

--D

> + * writeback to be done in the background in a non-blocking manner, as well as
> + * allowing reclaim to make progress without blocking.
>   */
>  static bool
>  xfs_reclaim_inode(
> @@ -1327,13 +1271,11 @@ xfs_reclaim_inodes(
>  }
>  
>  /*
> - * Scan a certain number of inodes for reclaim.
> - *
> - * When called we make sure that there is a background (fast) inode reclaim in
> - * progress, while we will throttle the speed of reclaim via doing synchronous
> - * reclaim of inodes. That means if we come across dirty inodes, we wait for
> - * them to be cleaned, which we hope will not be very long due to the
> - * background walker having already kicked the IO off on those dirty inodes.
> + * The shrinker infrastructure determines how many inodes we should scan for
> + * reclaim. We want as many clean inodes ready to reclaim as possible, so we
> + * push the AIL here. We also want to proactively free up memory if we can to
> + * minimise the amount of work memory reclaim has to do so we kick the
> + * background reclaim if it isn't already scheduled.
>   */
>  long
>  xfs_reclaim_inodes_nr(
> -- 
> 2.26.2.761.g0e0b3e54be
> 

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 19/24] xfs: attach inodes to the cluster buffer when dirtied
  2020-05-22  3:50 ` [PATCH 19/24] xfs: attach inodes to the cluster buffer when dirtied Dave Chinner
@ 2020-05-22 23:48   ` Darrick J. Wong
  2020-05-23 22:59     ` Dave Chinner
  0 siblings, 1 reply; 91+ messages in thread
From: Darrick J. Wong @ 2020-05-22 23:48 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, May 22, 2020 at 01:50:24PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Rather than attach inodes to the cluster buffer just when we are
> doing IO, attach the inodes to the cluster buffer when they are
> dirtied. The means the buffer always carries a list of dirty inodes
> that reference it, and we can use that list to make more fundamental
> changes to inode writeback that aren't otherwise possible.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_trans_inode.c |   9 +-
>  fs/xfs/xfs_inode.c              | 154 +++++++++++---------------------
>  fs/xfs/xfs_inode_item.c         |  19 ++--
>  3 files changed, 68 insertions(+), 114 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
> index e130eb2994156..e5f58a4b13eee 100644
> --- a/fs/xfs/libxfs/xfs_trans_inode.c
> +++ b/fs/xfs/libxfs/xfs_trans_inode.c
> @@ -162,13 +162,16 @@ xfs_trans_log_inode(
>  		/*
>  		 * We need an explicit buffer reference for the log item, We
>  		 * don't want the buffer attached to the transaction, so we have
> -		 * to release the transaction reference we just gained.
> +		 * to release the transaction reference we just gained. But we
> +		 * don't want to drop the buffer lock until we've attached the
> +		 * inode item to the buffer log item list....
>  		 */
>  		xfs_buf_hold(bp);
> -		xfs_trans_brelse(tp, bp);
> -
>  		spin_lock(&iip->ili_lock);
>  		iip->ili_item.li_buf = bp;
> +		bp->b_flags |= _XBF_INODES;
> +		list_add_tail(&iip->ili_item.li_bio_list, &bp->b_li_list);
> +		xfs_trans_brelse(tp, bp);
>  	}
>  
>  	/*
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index c5529853f513c..961eef2ce8c4a 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -2498,17 +2498,18 @@ xfs_iunlink_remove(
>  }
>  
>  /*
> - * Look up the inode number specified and mark it stale if it is found. If it is
> - * dirty, return the inode so it can be attached to the cluster buffer so it can
> - * be processed appropriately when the cluster free transaction completes.
> + * Look up the inode number specified and if it is not already marked XFS_ISTALE
> + * mark it stale. We should only find clean inodes in this lookup that aren't
> + * already stale.
>   */
> -static struct xfs_inode *
> -xfs_ifree_get_one_inode(
> +static void
> +xfs_ifree_mark_inode_stale(
>  	struct xfs_perag	*pag,
>  	struct xfs_inode	*free_ip,
>  	xfs_ino_t		inum)
>  {
>  	struct xfs_mount	*mp = pag->pag_mount;
> +	struct xfs_inode_log_item *iip;
>  	struct xfs_inode	*ip;
>  
>  retry:
> @@ -2516,8 +2517,10 @@ xfs_ifree_get_one_inode(
>  	ip = radix_tree_lookup(&pag->pag_ici_root, XFS_INO_TO_AGINO(mp, inum));
>  
>  	/* Inode not in memory, nothing to do */
> -	if (!ip)
> -		goto out_rcu_unlock;
> +	if (!ip) {
> +		rcu_read_unlock();
> +		return;
> +	}
>  
>  	/*
>  	 * because this is an RCU protected lookup, we could find a recently
> @@ -2528,9 +2531,9 @@ xfs_ifree_get_one_inode(
>  	spin_lock(&ip->i_flags_lock);
>  	if (ip->i_ino != inum || __xfs_iflags_test(ip, XFS_ISTALE)) {
>  		spin_unlock(&ip->i_flags_lock);
> -		goto out_rcu_unlock;
> +		rcu_read_unlock();
> +		return;
>  	}
> -	spin_unlock(&ip->i_flags_lock);
>  
>  	/*
>  	 * Don't try to lock/unlock the current inode, but we _cannot_ skip the
> @@ -2540,43 +2543,45 @@ xfs_ifree_get_one_inode(
>  	 */
>  	if (ip != free_ip) {
>  		if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL)) {
> +			spin_unlock(&ip->i_flags_lock);
>  			rcu_read_unlock();
>  			delay(1);
>  			goto retry;
>  		}
> -
> -		/*
> -		 * Check the inode number again in case we're racing with
> -		 * freeing in xfs_reclaim_inode().  See the comments in that
> -		 * function for more information as to why the initial check is
> -		 * not sufficient.
> -		 */
> -		if (ip->i_ino != inum) {
> -			xfs_iunlock(ip, XFS_ILOCK_EXCL);
> -			goto out_rcu_unlock;
> -		}
>  	}
> +	ip->i_flags |= XFS_ISTALE;
> +	spin_unlock(&ip->i_flags_lock);
>  	rcu_read_unlock();
>  
> -	xfs_iflock(ip);
> -	xfs_iflags_set(ip, XFS_ISTALE);
> +	/* If we can't get the flush lock, the inode is already attached */
> +	if (!xfs_iflock_nowait(ip))
> +		goto out_iunlock;
>  
>  	/*
> -	 * We don't need to attach clean inodes or those only with unlogged
> -	 * changes (which we throw away, anyway).
> +	 * Inodes not attached to the buffer can be released immediately.
> +	 * Everything else has to go through xfs_iflush_abort() on journal
> +	 * commit as the flock synchronises removal of the inode from the
> +	 * cluster buffer against inode reclaim.
>  	 */
> -	if (!ip->i_itemp || xfs_inode_clean(ip)) {
> -		ASSERT(ip != free_ip);
> +	if (!ip->i_itemp || list_empty(&ip->i_itemp->ili_item.li_bio_list)) {
>  		xfs_ifunlock(ip);
> -		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> -		goto out_no_inode;
> +		goto out_iunlock;
>  	}
> -	return ip;
>  
> -out_rcu_unlock:
> -	rcu_read_unlock();
> -out_no_inode:
> -	return NULL;
> +	/* we have a dirty inode in memory that has not yet been flushed. */
> +	iip = ip->i_itemp;
> +	ASSERT(!list_empty(&iip->ili_item.li_bio_list));
> +
> +	spin_lock(&iip->ili_lock);
> +	iip->ili_last_fields = iip->ili_fields;
> +	iip->ili_fields = 0;
> +	iip->ili_fsync_fields = 0;
> +	spin_unlock(&iip->ili_lock);
> +	xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,
> +				&iip->ili_item.li_lsn);
> +out_iunlock:
> +	if (ip != free_ip)
> +		xfs_iunlock(ip, XFS_ILOCK_EXCL);
>  }
>  
>  /*
> @@ -2586,25 +2591,21 @@ xfs_ifree_get_one_inode(
>   */
>  STATIC int
>  xfs_ifree_cluster(
> -	xfs_inode_t		*free_ip,
> -	xfs_trans_t		*tp,
> +	struct xfs_inode	*free_ip,
> +	struct xfs_trans	*tp,
>  	struct xfs_icluster	*xic)
>  {
> -	xfs_mount_t		*mp = free_ip->i_mount;
> +	struct xfs_mount	*mp = free_ip->i_mount;
> +	struct xfs_ino_geometry	*igeo = M_IGEO(mp);
> +	struct xfs_perag	*pag;
> +	struct xfs_buf		*bp;
> +	xfs_daddr_t		blkno;
> +	xfs_ino_t		inum = xic->first_ino;
>  	int			nbufs;
>  	int			i, j;
>  	int			ioffset;
> -	xfs_daddr_t		blkno;
> -	xfs_buf_t		*bp;
> -	xfs_inode_t		*ip;
> -	struct xfs_inode_log_item *iip;
> -	struct xfs_log_item	*lip;
> -	struct xfs_perag	*pag;
> -	struct xfs_ino_geometry	*igeo = M_IGEO(mp);
> -	xfs_ino_t		inum;
>  	int			error;
>  
> -	inum = xic->first_ino;
>  	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, inum));
>  	nbufs = igeo->ialloc_blks / igeo->blocks_per_cluster;
>  
> @@ -2649,53 +2650,12 @@ xfs_ifree_cluster(
>  		bp->b_ops = &xfs_inode_buf_ops;
>  
>  		/*
> -		 * Walk the inodes already attached to the buffer and mark them
> -		 * stale. These will all have the flush locks held, so an
> -		 * in-memory inode walk can't lock them. By marking them all
> -		 * stale first, we will not attempt to lock them in the loop
> -		 * below as the XFS_ISTALE flag will be set.
> -		 */
> -		list_for_each_entry(lip, &bp->b_li_list, li_bio_list) {
> -			if (lip->li_type == XFS_LI_INODE) {
> -				iip = (struct xfs_inode_log_item *)lip;
> -				xfs_trans_ail_copy_lsn(mp->m_ail,
> -							&iip->ili_flush_lsn,
> -							&iip->ili_item.li_lsn);
> -				xfs_iflags_set(iip->ili_inode, XFS_ISTALE);
> -			}
> -		}

Hm.  I think I'm a little confused here.  I think the consequence of
attaching inode items to the buffer whenever we dirty the inode is that
we no longer need to travel the inode list to set ISTALE because we know
that the lookup loop below is sufficient to catch all of the inodes that
are still hanging around in memory?

We don't call xfs_ifree_cluster if any of the inodes in it are
allocated, which means that all the on-disk inodes are either (a) not
represented in memory, in which case we don't have to stale them; or (b)
they're in memory and dirty (because they've recently been freed).  But
if that's true, then surely you could find them all via b_li_list?

On the other hand, it seems redundant to iterate the list /and/ do the
lookups and we no longer need to "set it up for being staled", so the
second loop survives?

OTOH this is patch 19 of 24, maybe my brain already went to beer.

> -
> -
> -		/*
> -		 * For each inode in memory attempt to add it to the inode
> -		 * buffer and set it up for being staled on buffer IO
> -		 * completion.  This is safe as we've locked out tail pushing
> -		 * and flushing by locking the buffer.
> -		 *
> -		 * We have already marked every inode that was part of a
> -		 * transaction stale above, which means there is no point in
> -		 * even trying to lock them.
> +		 * Now we need to set all the cached clean inodes as XFS_ISTALE,
> +		 * too. This requires lookups, and will skip inodes that we've
> +		 * already marked XFS_ISTALE.
>  		 */
> -		for (i = 0; i < igeo->inodes_per_cluster; i++) {
> -			ip = xfs_ifree_get_one_inode(pag, free_ip, inum + i);
> -			if (!ip)
> -				continue;
> -
> -			iip = ip->i_itemp;
> -			spin_lock(&iip->ili_lock);
> -			iip->ili_last_fields = iip->ili_fields;
> -			iip->ili_fields = 0;
> -			iip->ili_fsync_fields = 0;
> -			spin_unlock(&iip->ili_lock);
> -			xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,
> -						&iip->ili_item.li_lsn);
> -
> -			list_add_tail(&iip->ili_item.li_bio_list,
> -						&bp->b_li_list);
> -
> -			if (ip != free_ip)
> -				xfs_iunlock(ip, XFS_ILOCK_EXCL);
> -		}
> +		for (i = 0; i < igeo->inodes_per_cluster; i++)
> +			xfs_ifree_mark_inode_stale(pag, free_ip, inum + i);

I get that we're mostly just hoisting everything in the loop body to the
end of xfs_ifree_get_one_inode, but can that be part of a separate hoist
patch?

--D

>  
>  		xfs_trans_stale_inode_buf(tp, bp);
>  		xfs_trans_binval(tp, bp);
> @@ -3831,22 +3791,12 @@ xfs_iflush_int(
>  	 * Store the current LSN of the inode so that we can tell whether the
>  	 * item has moved in the AIL from xfs_iflush_done().
>  	 */
> +	ASSERT(iip->ili_item.li_lsn);
>  	xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,
>  				&iip->ili_item.li_lsn);
>  
> -	/*
> -	 * Attach the inode item callback to the buffer whether the flush
> -	 * succeeded or not. If not, the caller will shut down and fail I/O
> -	 * completion on the buffer to remove the inode from the AIL and release
> -	 * the flush lock.
> -	 */
> -	bp->b_flags |= _XBF_INODES;
> -	list_add_tail(&iip->ili_item.li_bio_list, &bp->b_li_list);
> -
>  	/* generate the checksum. */
>  	xfs_dinode_calc_crc(mp, dip);
> -
> -	ASSERT(!list_empty(&bp->b_li_list));
>  	return error;
>  }
>  
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index 86173a52526fe..7718cfc39eec8 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -683,10 +683,7 @@ xfs_inode_item_destroy(
>   *
>   * Note: Now that we attach the log item to the buffer when we first log the
>   * inode in memory, we can have unflushed inodes on the buffer list here. These
> - * inodes will have a zero ili_last_fields, so skip over them here. We do
> - * this check -after- we've checked for stale inodes, because we're guaranteed
> - * to have XFS_ISTALE set in the case that dirty inodes are in the CIL and have
> - * not yet had their dirtying transactions committed to disk.
> + * inodes will have a zero ili_last_fields, so skip over them here.
>   */
>  void
>  xfs_iflush_done(
> @@ -704,15 +701,14 @@ xfs_iflush_done(
>  	 */
>  	list_for_each_entry_safe(lip, n, &bp->b_li_list, li_bio_list) {
>  		iip = INODE_ITEM(lip);
> +		if (!iip->ili_last_fields)
> +			continue;
> +
>  		if (xfs_iflags_test(iip->ili_inode, XFS_ISTALE)) {
> -			list_del_init(&lip->li_bio_list);
>  			xfs_iflush_abort(iip->ili_inode);
>  			continue;
>  		}
>  
> -		if (!iip->ili_last_fields)
> -			continue;
> -
>  		list_move_tail(&lip->li_bio_list, &tmp);
>  
>  		/* Do an unlocked check for needing the AIL lock. */
> @@ -762,12 +758,16 @@ xfs_iflush_done(
>  		/*
>  		 * Remove the reference to the cluster buffer if the inode is
>  		 * clean in memory. Drop the buffer reference once we've dropped
> -		 * the locks we hold.
> +		 * the locks we hold. If the inode is dirty in memory, we need
> +		 * to put the inode item back on the buffer list for another
> +		 * pass through the flush machinery.
>  		 */
>  		ASSERT(iip->ili_item.li_buf == bp);
>  		if (!iip->ili_fields) {
>  			iip->ili_item.li_buf = NULL;
>  			drop_buffer = true;
> +		} else {
> +			list_add(&lip->li_bio_list, &bp->b_li_list);
>  		}
>  		spin_unlock(&iip->ili_lock);
>  		xfs_ifunlock(iip->ili_inode);
> @@ -802,6 +802,7 @@ xfs_iflush_abort(
>  		iip->ili_flush_lsn = 0;
>  		bp = iip->ili_item.li_buf;
>  		iip->ili_item.li_buf = NULL;
> +		list_del_init(&iip->ili_item.li_bio_list);
>  		spin_unlock(&iip->ili_lock);
>  	}
>  	xfs_ifunlock(ip);
> -- 
> 2.26.2.761.g0e0b3e54be
> 

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 20/24] xfs: xfs_iflush() is no longer necessary
  2020-05-22  3:50 ` [PATCH 20/24] xfs: xfs_iflush() is no longer necessary Dave Chinner
@ 2020-05-22 23:54   ` Darrick J. Wong
  0 siblings, 0 replies; 91+ messages in thread
From: Darrick J. Wong @ 2020-05-22 23:54 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, May 22, 2020 at 01:50:25PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Now we have a cached buffer on inode log items, we don't need
> to do buffer lookups when flushing inodes anymore - all we need
> to do is lock the buffer and we are ready to go.
> 
> This largely gets rid of the need for xfs_iflush(), which is
> essentially just a mechanism to look up the buffer and flush the
> inode to it. Instead, we can just call xfs_iflush_cluster() with a
> few modifications to ensure it also flushes the inode we already
> hold locked.
> 
> This allows the AIL inode item pushing to be almost entirely
> non-blocking in XFS - we won't block unless memory allocation
> for the cluster inode lookup blocks or the block device queues are
> full.
> 
> Writeback during inode reclaim becomes a little more complex because
> we now have to lock the buffer ourselves, but otherwise this change
> is largely a functional no-op that removes a whole lot of code.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

/This/ looks pretty straightforward.  I always wondered about the fact
that the inode item push function would extract a buffer pointer but
then imap_to_bp would replace our pointer...

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/xfs/xfs_inode.c      | 106 ++++++----------------------------------
>  fs/xfs/xfs_inode.h      |   2 +-
>  fs/xfs/xfs_inode_item.c |  51 +++++++------------
>  3 files changed, 32 insertions(+), 127 deletions(-)
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 961eef2ce8c4a..a94528d26328b 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -3429,7 +3429,18 @@ xfs_rename(
>  	return error;
>  }
>  
> -STATIC int
> +/*
> + * Non-blocking flush of dirty inode metadata into the backing buffer.
> + *
> + * The caller must have a reference to the inode and hold the cluster buffer
> + * locked. The function will walk across all the inodes on the cluster buffer it
> + * can find and lock without blocking, and flush them to the cluster buffer.
> + *
> + * On success, the caller must write out the buffer returned in *bp and
> + * release it. On failure, the filesystem will be shut down, the buffer will
> + * have been unlocked and released, and EFSCORRUPTED will be returned.
> + */
> +int
>  xfs_iflush_cluster(
>  	struct xfs_inode	*ip,
>  	struct xfs_buf		*bp)
> @@ -3464,8 +3475,6 @@ xfs_iflush_cluster(
>  
>  	for (i = 0; i < nr_found; i++) {
>  		cip = cilist[i];
> -		if (cip == ip)
> -			continue;
>  
>  		/*
>  		 * because this is an RCU protected lookup, we could find a
> @@ -3556,99 +3565,11 @@ xfs_iflush_cluster(
>  	kmem_free(cilist);
>  out_put:
>  	xfs_perag_put(pag);
> -	return error;
> -}
> -
> -/*
> - * Flush dirty inode metadata into the backing buffer.
> - *
> - * The caller must have the inode lock and the inode flush lock held.  The
> - * inode lock will still be held upon return to the caller, and the inode
> - * flush lock will be released after the inode has reached the disk.
> - *
> - * The caller must write out the buffer returned in *bpp and release it.
> - */
> -int
> -xfs_iflush(
> -	struct xfs_inode	*ip,
> -	struct xfs_buf		**bpp)
> -{
> -	struct xfs_mount	*mp = ip->i_mount;
> -	struct xfs_buf		*bp = NULL;
> -	struct xfs_dinode	*dip;
> -	int			error;
> -
> -	XFS_STATS_INC(mp, xs_iflush_count);
> -
> -	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL|XFS_ILOCK_SHARED));
> -	ASSERT(xfs_isiflocked(ip));
> -	ASSERT(ip->i_df.if_format != XFS_DINODE_FMT_BTREE ||
> -	       ip->i_df.if_nextents > XFS_IFORK_MAXEXT(ip, XFS_DATA_FORK));
> -
> -	*bpp = NULL;
> -
> -	xfs_iunpin_wait(ip);
> -
> -	/*
> -	 * For stale inodes we cannot rely on the backing buffer remaining
> -	 * stale in cache for the remaining life of the stale inode and so
> -	 * xfs_imap_to_bp() below may give us a buffer that no longer contains
> -	 * inodes below. We have to check this after ensuring the inode is
> -	 * unpinned so that it is safe to reclaim the stale inode after the
> -	 * flush call.
> -	 */
> -	if (xfs_iflags_test(ip, XFS_ISTALE)) {
> -		xfs_ifunlock(ip);
> -		return 0;
> -	}
> -
> -	/*
> -	 * Get the buffer containing the on-disk inode. We are doing a try-lock
> -	 * operation here, so we may get an EAGAIN error. In that case, return
> -	 * leaving the inode dirty.
> -	 *
> -	 * If we get any other error, we effectively have a corruption situation
> -	 * and we cannot flush the inode. Abort the flush and shut down.
> -	 */
> -	error = xfs_imap_to_bp(mp, NULL, &ip->i_imap, &dip, &bp, XBF_TRYLOCK);
> -	if (error == -EAGAIN) {
> -		xfs_ifunlock(ip);
> -		return error;
> -	}
> -	if (error)
> -		goto abort;
> -
> -	/*
> -	 * If the buffer is pinned then push on the log now so we won't
> -	 * get stuck waiting in the write for too long.
> -	 */
> -	if (xfs_buf_ispinned(bp))
> -		xfs_log_force(mp, 0);
> -
> -	/*
> -	 * Flush the provided inode then attempt to gather others from the
> -	 * cluster into the write.
> -	 *
> -	 * Note: Once we attempt to flush an inode, we must run buffer
> -	 * completion callbacks on any failure. If this fails, simulate an I/O
> -	 * failure on the buffer and shut down.
> -	 */
> -	error = xfs_iflush_int(ip, bp);
> -	if (!error)
> -		error = xfs_iflush_cluster(ip, bp);
>  	if (error) {
>  		bp->b_flags |= XBF_ASYNC;
>  		xfs_buf_ioend_fail(bp);
> -		goto shutdown;
> +		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
>  	}
> -
> -	*bpp = bp;
> -	return 0;
> -
> -abort:
> -	xfs_iflush_abort(ip);
> -shutdown:
> -	xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
>  	return error;
>  }
>  
> @@ -3667,6 +3588,7 @@ xfs_iflush_int(
>  	ASSERT(ip->i_df.if_format != XFS_DINODE_FMT_BTREE ||
>  	       ip->i_df.if_nextents > XFS_IFORK_MAXEXT(ip, XFS_DATA_FORK));
>  	ASSERT(iip != NULL && iip->ili_fields != 0);
> +	ASSERT(iip->ili_item.li_buf == bp);
>  
>  	dip = xfs_buf_offset(bp, ip->i_imap.im_boffset);
>  
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index dadcf19458960..d1109eb13ba2e 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -427,7 +427,7 @@ int		xfs_log_force_inode(struct xfs_inode *ip);
>  void		xfs_iunpin_wait(xfs_inode_t *);
>  #define xfs_ipincount(ip)	((unsigned int) atomic_read(&ip->i_pincount))
>  
> -int		xfs_iflush(struct xfs_inode *, struct xfs_buf **);
> +int		xfs_iflush_cluster(struct xfs_inode *, struct xfs_buf *);
>  void		xfs_lock_two_inodes(struct xfs_inode *ip0, uint ip0_mode,
>  				struct xfs_inode *ip1, uint ip1_mode);
>  
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index 7718cfc39eec8..42b4b5fe1e2a9 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -504,53 +504,35 @@ xfs_inode_item_push(
>  	uint			rval = XFS_ITEM_SUCCESS;
>  	int			error;
>  
> -	if (xfs_ipincount(ip) > 0)
> -		return XFS_ITEM_PINNED;
> +	ASSERT(iip->ili_item.li_buf);
>  
> -	if (!xfs_ilock_nowait(ip, XFS_ILOCK_SHARED))
> -		return XFS_ITEM_LOCKED;
> +	if (xfs_ipincount(ip) > 0 || xfs_buf_ispinned(bp) ||
> +	    (ip->i_flags & XFS_ISTALE))
> +		return XFS_ITEM_PINNED;
>  
> -	/*
> -	 * Re-check the pincount now that we stabilized the value by
> -	 * taking the ilock.
> -	 */
> -	if (xfs_ipincount(ip) > 0) {
> -		rval = XFS_ITEM_PINNED;
> -		goto out_unlock;
> -	}
> +	/* If the inode is already flush locked, we're already flushing. */
> +	if (xfs_isiflocked(ip))
> +		return XFS_ITEM_FLUSHING;
>  
> -	/*
> -	 * Stale inode items should force out the iclog.
> -	 */
> -	if (ip->i_flags & XFS_ISTALE) {
> -		rval = XFS_ITEM_PINNED;
> -		goto out_unlock;
> -	}
> +	if (!xfs_buf_trylock(bp))
> +		return XFS_ITEM_LOCKED;
>  
> -	/*
> -	 * Someone else is already flushing the inode.  Nothing we can do
> -	 * here but wait for the flush to finish and remove the item from
> -	 * the AIL.
> -	 */
> -	if (!xfs_iflock_nowait(ip)) {
> -		rval = XFS_ITEM_FLUSHING;
> -		goto out_unlock;
> +	if (bp->b_flags & _XBF_DELWRI_Q) {
> +		xfs_buf_unlock(bp);
> +		return XFS_ITEM_FLUSHING;
>  	}
> -
> -	ASSERT(iip->ili_fields != 0 || XFS_FORCED_SHUTDOWN(ip->i_mount));
>  	spin_unlock(&lip->li_ailp->ail_lock);
>  
> -	error = xfs_iflush(ip, &bp);
> +	error = xfs_iflush_cluster(ip, bp);
>  	if (!error) {
>  		if (!xfs_buf_delwri_queue(bp, buffer_list))
>  			rval = XFS_ITEM_FLUSHING;
> -		xfs_buf_relse(bp);
> -	} else if (error == -EAGAIN)
> +		xfs_buf_unlock(bp);
> +	} else {
>  		rval = XFS_ITEM_LOCKED;
> +	}
>  
>  	spin_lock(&lip->li_ailp->ail_lock);
> -out_unlock:
> -	xfs_iunlock(ip, XFS_ILOCK_SHARED);
>  	return rval;
>  }
>  
> @@ -567,6 +549,7 @@ xfs_inode_item_release(
>  
>  	ASSERT(ip->i_itemp != NULL);
>  	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
> +	ASSERT(lip->li_buf || !test_bit(XFS_LI_DIRTY, &lip->li_flags));
>  
>  	lock_flags = iip->ili_lock_flags;
>  	iip->ili_lock_flags = 0;
> -- 
> 2.26.2.761.g0e0b3e54be
> 

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 21/24] xfs: rename xfs_iflush_int()
  2020-05-22  3:50 ` [PATCH 21/24] xfs: rename xfs_iflush_int() Dave Chinner
  2020-05-22 12:33   ` Amir Goldstein
@ 2020-05-22 23:57   ` Darrick J. Wong
  1 sibling, 0 replies; 91+ messages in thread
From: Darrick J. Wong @ 2020-05-22 23:57 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, May 22, 2020 at 01:50:26PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> with xfs_iflush() gone, we can rename xfs_iflush_int() back to
> xfs_iflush(). Also move it up above xfs_iflush_cluster() so we don't
> need the forward definition any more.

So of course git moves xfs_iflush_cluster instead.  Why move 114 lines
when you could move 146? :P

> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Eh, whatever,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/xfs/xfs_inode.c | 293 ++++++++++++++++++++++-----------------------
>  1 file changed, 146 insertions(+), 147 deletions(-)
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index a94528d26328b..cbf8edf62d102 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -44,7 +44,6 @@ kmem_zone_t *xfs_inode_zone;
>   */
>  #define	XFS_ITRUNC_MAX_EXTENTS	2
>  
> -STATIC int xfs_iflush_int(struct xfs_inode *, struct xfs_buf *);
>  STATIC int xfs_iunlink(struct xfs_trans *, struct xfs_inode *);
>  STATIC int xfs_iunlink_remove(struct xfs_trans *, struct xfs_inode *);
>  
> @@ -3429,152 +3428,8 @@ xfs_rename(
>  	return error;
>  }
>  
> -/*
> - * Non-blocking flush of dirty inode metadata into the backing buffer.
> - *
> - * The caller must have a reference to the inode and hold the cluster buffer
> - * locked. The function will walk across all the inodes on the cluster buffer it
> - * can find and lock without blocking, and flush them to the cluster buffer.
> - *
> - * On success, the caller must write out the buffer returned in *bp and
> - * release it. On failure, the filesystem will be shut down, the buffer will
> - * have been unlocked and released, and EFSCORRUPTED will be returned.
> - */
> -int
> -xfs_iflush_cluster(
> -	struct xfs_inode	*ip,
> -	struct xfs_buf		*bp)
> -{
> -	struct xfs_mount	*mp = ip->i_mount;
> -	struct xfs_perag	*pag;
> -	unsigned long		first_index, mask;
> -	int			cilist_size;
> -	struct xfs_inode	**cilist;
> -	struct xfs_inode	*cip;
> -	struct xfs_ino_geometry	*igeo = M_IGEO(mp);
> -	int			error = 0;
> -	int			nr_found;
> -	int			clcount = 0;
> -	int			i;
> -
> -	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
> -
> -	cilist_size = igeo->inodes_per_cluster * sizeof(struct xfs_inode *);
> -	cilist = kmem_alloc(cilist_size, KM_MAYFAIL|KM_NOFS);
> -	if (!cilist)
> -		goto out_put;
> -
> -	mask = ~(igeo->inodes_per_cluster - 1);
> -	first_index = XFS_INO_TO_AGINO(mp, ip->i_ino) & mask;
> -	rcu_read_lock();
> -	/* really need a gang lookup range call here */
> -	nr_found = radix_tree_gang_lookup(&pag->pag_ici_root, (void**)cilist,
> -					first_index, igeo->inodes_per_cluster);
> -	if (nr_found == 0)
> -		goto out_free;
> -
> -	for (i = 0; i < nr_found; i++) {
> -		cip = cilist[i];
> -
> -		/*
> -		 * because this is an RCU protected lookup, we could find a
> -		 * recently freed or even reallocated inode during the lookup.
> -		 * We need to check under the i_flags_lock for a valid inode
> -		 * here. Skip it if it is not valid or the wrong inode.
> -		 */
> -		spin_lock(&cip->i_flags_lock);
> -		if (!cip->i_ino ||
> -		    __xfs_iflags_test(cip, XFS_ISTALE)) {
> -			spin_unlock(&cip->i_flags_lock);
> -			continue;
> -		}
> -
> -		/*
> -		 * Once we fall off the end of the cluster, no point checking
> -		 * any more inodes in the list because they will also all be
> -		 * outside the cluster.
> -		 */
> -		if ((XFS_INO_TO_AGINO(mp, cip->i_ino) & mask) != first_index) {
> -			spin_unlock(&cip->i_flags_lock);
> -			break;
> -		}
> -		spin_unlock(&cip->i_flags_lock);
> -
> -		/*
> -		 * Do an un-protected check to see if the inode is dirty and
> -		 * is a candidate for flushing.  These checks will be repeated
> -		 * later after the appropriate locks are acquired.
> -		 */
> -		if (xfs_inode_clean(cip) && xfs_ipincount(cip) == 0)
> -			continue;
> -
> -		/*
> -		 * Try to get locks.  If any are unavailable or it is pinned,
> -		 * then this inode cannot be flushed and is skipped.
> -		 */
> -
> -		if (!xfs_ilock_nowait(cip, XFS_ILOCK_SHARED))
> -			continue;
> -		if (!xfs_iflock_nowait(cip)) {
> -			xfs_iunlock(cip, XFS_ILOCK_SHARED);
> -			continue;
> -		}
> -		if (xfs_ipincount(cip)) {
> -			xfs_ifunlock(cip);
> -			xfs_iunlock(cip, XFS_ILOCK_SHARED);
> -			continue;
> -		}
> -
> -
> -		/*
> -		 * Check the inode number again, just to be certain we are not
> -		 * racing with freeing in xfs_reclaim_inode(). See the comments
> -		 * in that function for more information as to why the initial
> -		 * check is not sufficient.
> -		 */
> -		if (!cip->i_ino) {
> -			xfs_ifunlock(cip);
> -			xfs_iunlock(cip, XFS_ILOCK_SHARED);
> -			continue;
> -		}
> -
> -		/*
> -		 * arriving here means that this inode can be flushed.  First
> -		 * re-check that it's dirty before flushing.
> -		 */
> -		if (!xfs_inode_clean(cip)) {
> -			error = xfs_iflush_int(cip, bp);
> -			if (error) {
> -				xfs_iunlock(cip, XFS_ILOCK_SHARED);
> -				goto out_free;
> -			}
> -			clcount++;
> -		} else {
> -			xfs_ifunlock(cip);
> -		}
> -		xfs_iunlock(cip, XFS_ILOCK_SHARED);
> -	}
> -
> -	if (clcount) {
> -		XFS_STATS_INC(mp, xs_icluster_flushcnt);
> -		XFS_STATS_ADD(mp, xs_icluster_flushinode, clcount);
> -	}
> -
> -out_free:
> -	rcu_read_unlock();
> -	kmem_free(cilist);
> -out_put:
> -	xfs_perag_put(pag);
> -	if (error) {
> -		bp->b_flags |= XBF_ASYNC;
> -		xfs_buf_ioend_fail(bp);
> -		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> -	}
> -	return error;
> -}
> -
> -STATIC int
> -xfs_iflush_int(
> +static int
> +xfs_iflush(
>  	struct xfs_inode	*ip,
>  	struct xfs_buf		*bp)
>  {
> @@ -3722,6 +3577,150 @@ xfs_iflush_int(
>  	return error;
>  }
>  
> +/*
> + * Non-blocking flush of dirty inode metadata into the backing buffer.
> + *
> + * The caller must have a reference to the inode and hold the cluster buffer
> + * locked. The function will walk across all the inodes on the cluster buffer it
> + * can find and lock without blocking, and flush them to the cluster buffer.
> + *
> + * On success, the caller must write out the buffer returned in *bp and
> + * release it. On failure, the filesystem will be shut down, the buffer will
> + * have been unlocked and released, and EFSCORRUPTED will be returned.
> + */
> +int
> +xfs_iflush_cluster(
> +	struct xfs_inode	*ip,
> +	struct xfs_buf		*bp)
> +{
> +	struct xfs_mount	*mp = ip->i_mount;
> +	struct xfs_perag	*pag;
> +	unsigned long		first_index, mask;
> +	int			cilist_size;
> +	struct xfs_inode	**cilist;
> +	struct xfs_inode	*cip;
> +	struct xfs_ino_geometry	*igeo = M_IGEO(mp);
> +	int			error = 0;
> +	int			nr_found;
> +	int			clcount = 0;
> +	int			i;
> +
> +	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
> +
> +	cilist_size = igeo->inodes_per_cluster * sizeof(struct xfs_inode *);
> +	cilist = kmem_alloc(cilist_size, KM_MAYFAIL|KM_NOFS);
> +	if (!cilist)
> +		goto out_put;
> +
> +	mask = ~(igeo->inodes_per_cluster - 1);
> +	first_index = XFS_INO_TO_AGINO(mp, ip->i_ino) & mask;
> +	rcu_read_lock();
> +	/* really need a gang lookup range call here */
> +	nr_found = radix_tree_gang_lookup(&pag->pag_ici_root, (void**)cilist,
> +					first_index, igeo->inodes_per_cluster);
> +	if (nr_found == 0)
> +		goto out_free;
> +
> +	for (i = 0; i < nr_found; i++) {
> +		cip = cilist[i];
> +
> +		/*
> +		 * because this is an RCU protected lookup, we could find a
> +		 * recently freed or even reallocated inode during the lookup.
> +		 * We need to check under the i_flags_lock for a valid inode
> +		 * here. Skip it if it is not valid or the wrong inode.
> +		 */
> +		spin_lock(&cip->i_flags_lock);
> +		if (!cip->i_ino ||
> +		    __xfs_iflags_test(cip, XFS_ISTALE)) {
> +			spin_unlock(&cip->i_flags_lock);
> +			continue;
> +		}
> +
> +		/*
> +		 * Once we fall off the end of the cluster, no point checking
> +		 * any more inodes in the list because they will also all be
> +		 * outside the cluster.
> +		 */
> +		if ((XFS_INO_TO_AGINO(mp, cip->i_ino) & mask) != first_index) {
> +			spin_unlock(&cip->i_flags_lock);
> +			break;
> +		}
> +		spin_unlock(&cip->i_flags_lock);
> +
> +		/*
> +		 * Do an un-protected check to see if the inode is dirty and
> +		 * is a candidate for flushing.  These checks will be repeated
> +		 * later after the appropriate locks are acquired.
> +		 */
> +		if (xfs_inode_clean(cip) && xfs_ipincount(cip) == 0)
> +			continue;
> +
> +		/*
> +		 * Try to get locks.  If any are unavailable or it is pinned,
> +		 * then this inode cannot be flushed and is skipped.
> +		 */
> +
> +		if (!xfs_ilock_nowait(cip, XFS_ILOCK_SHARED))
> +			continue;
> +		if (!xfs_iflock_nowait(cip)) {
> +			xfs_iunlock(cip, XFS_ILOCK_SHARED);
> +			continue;
> +		}
> +		if (xfs_ipincount(cip)) {
> +			xfs_ifunlock(cip);
> +			xfs_iunlock(cip, XFS_ILOCK_SHARED);
> +			continue;
> +		}
> +
> +
> +		/*
> +		 * Check the inode number again, just to be certain we are not
> +		 * racing with freeing in xfs_reclaim_inode(). See the comments
> +		 * in that function for more information as to why the initial
> +		 * check is not sufficient.
> +		 */
> +		if (!cip->i_ino) {
> +			xfs_ifunlock(cip);
> +			xfs_iunlock(cip, XFS_ILOCK_SHARED);
> +			continue;
> +		}
> +
> +		/*
> +		 * arriving here means that this inode can be flushed.  First
> +		 * re-check that it's dirty before flushing.
> +		 */
> +		if (!xfs_inode_clean(cip)) {
> +			error = xfs_iflush(cip, bp);
> +			if (error) {
> +				xfs_iunlock(cip, XFS_ILOCK_SHARED);
> +				goto out_free;
> +			}
> +			clcount++;
> +		} else {
> +			xfs_ifunlock(cip);
> +		}
> +		xfs_iunlock(cip, XFS_ILOCK_SHARED);
> +	}
> +
> +	if (clcount) {
> +		XFS_STATS_INC(mp, xs_icluster_flushcnt);
> +		XFS_STATS_ADD(mp, xs_icluster_flushinode, clcount);
> +	}
> +
> +out_free:
> +	rcu_read_unlock();
> +	kmem_free(cilist);
> +out_put:
> +	xfs_perag_put(pag);
> +	if (error) {
> +		bp->b_flags |= XBF_ASYNC;
> +		xfs_buf_ioend_fail(bp);
> +		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> +	}
> +	return error;
> +}
> +
>  /* Release an inode. */
>  void
>  xfs_irele(
> -- 
> 2.26.2.761.g0e0b3e54be
> 

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 22/24] xfs: rework xfs_iflush_cluster() dirty inode iteration
  2020-05-22  3:50 ` [PATCH 22/24] xfs: rework xfs_iflush_cluster() dirty inode iteration Dave Chinner
@ 2020-05-23  0:13   ` Darrick J. Wong
  2020-05-23 23:14     ` Dave Chinner
  2020-05-23 11:31   ` Christoph Hellwig
  2020-05-23 11:39   ` Christoph Hellwig
  2 siblings, 1 reply; 91+ messages in thread
From: Darrick J. Wong @ 2020-05-23  0:13 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, May 22, 2020 at 01:50:27PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Now that we have all the dirty inodes attached to the cluster
> buffer, we don't actually have to do radix tree lookups to find
> them. Sure, the radix tree is efficient, but walking a linked list
> of just the dirty inodes attached to the buffer is much better.
> 
> We are also no longer dependent on having a locked inode passed into
> the function to determine where to start the lookup. This means we
> can drop it from the function call and treat all inodes the same.
> 
> We also make xfs_iflush_cluster skip inodes marked with
> XFS_IRECLAIM. This we avoid races with inodes that reclaim is
> actively referencing or are being re-initialised by inode lookup. If
> they are actually dirty, they'll get written by a future cluster
> flush....
> 
> We also add a shutdown check after obtaining the flush lock so that
> we catch inodes that are dirty in memory and may have inconsistent
> state due to the shutdown in progress. We abort these inodes
> directly and so they remove themselves directly from the buffer list
> and the AIL rather than having to wait for the buffer to be failed
> and callbacks run to be processed correctly.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_inode.c      | 150 ++++++++++++++++------------------------
>  fs/xfs/xfs_inode.h      |   2 +-
>  fs/xfs/xfs_inode_item.c |  15 +++-
>  3 files changed, 74 insertions(+), 93 deletions(-)
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index cbf8edf62d102..7db0f97e537e3 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -3428,7 +3428,7 @@ xfs_rename(
>  	return error;
>  }
>  
> -static int
> +int
>  xfs_iflush(

Not sure why this drops the static?

>  	struct xfs_inode	*ip,
>  	struct xfs_buf		*bp)
> @@ -3590,117 +3590,94 @@ xfs_iflush(
>   */
>  int
>  xfs_iflush_cluster(
> -	struct xfs_inode	*ip,
>  	struct xfs_buf		*bp)
>  {
> -	struct xfs_mount	*mp = ip->i_mount;
> -	struct xfs_perag	*pag;
> -	unsigned long		first_index, mask;
> -	int			cilist_size;
> -	struct xfs_inode	**cilist;
> -	struct xfs_inode	*cip;
> -	struct xfs_ino_geometry	*igeo = M_IGEO(mp);
> -	int			error = 0;
> -	int			nr_found;
> +	struct xfs_mount	*mp = bp->b_mount;
> +	struct xfs_log_item	*lip, *n;
> +	struct xfs_inode	*ip;
> +	struct xfs_inode_log_item *iip;
>  	int			clcount = 0;
> -	int			i;
> -
> -	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
> -
> -	cilist_size = igeo->inodes_per_cluster * sizeof(struct xfs_inode *);
> -	cilist = kmem_alloc(cilist_size, KM_MAYFAIL|KM_NOFS);
> -	if (!cilist)
> -		goto out_put;
> -
> -	mask = ~(igeo->inodes_per_cluster - 1);
> -	first_index = XFS_INO_TO_AGINO(mp, ip->i_ino) & mask;
> -	rcu_read_lock();
> -	/* really need a gang lookup range call here */
> -	nr_found = radix_tree_gang_lookup(&pag->pag_ici_root, (void**)cilist,
> -					first_index, igeo->inodes_per_cluster);
> -	if (nr_found == 0)
> -		goto out_free;
> +	int			error = 0;
>  
> -	for (i = 0; i < nr_found; i++) {
> -		cip = cilist[i];
> +	/*
> +	 * We must use the safe variant here as on shutdown xfs_iflush_abort()
> +	 * can remove itself from the list.
> +	 */
> +	list_for_each_entry_safe(lip, n, &bp->b_li_list, li_bio_list) {
> +		iip = (struct xfs_inode_log_item *)lip;
> +		ip = iip->ili_inode;
>  
>  		/*
> -		 * because this is an RCU protected lookup, we could find a
> -		 * recently freed or even reallocated inode during the lookup.
> -		 * We need to check under the i_flags_lock for a valid inode
> -		 * here. Skip it if it is not valid or the wrong inode.
> +		 * Quick and dirty check to avoid locks if possible.
>  		 */
> -		spin_lock(&cip->i_flags_lock);
> -		if (!cip->i_ino ||
> -		    __xfs_iflags_test(cip, XFS_ISTALE)) {
> -			spin_unlock(&cip->i_flags_lock);
> +		if (__xfs_iflags_test(ip, XFS_IRECLAIM | XFS_IFLOCK))
> +			continue;
> +		if (xfs_ipincount(ip))
>  			continue;
> -		}
>  
>  		/*
> -		 * Once we fall off the end of the cluster, no point checking
> -		 * any more inodes in the list because they will also all be
> -		 * outside the cluster.
> +		 * The inode is still attached to the buffer, which means it is
> +		 * dirty but reclaim might try to grab it. Check carefully for
> +		 * that, and grab the ilock while still holding the i_flags_lock
> +		 * to guarantee reclaim will not be able to reclaim this inode
> +		 * once we drop the i_flags_lock.
>  		 */
> -		if ((XFS_INO_TO_AGINO(mp, cip->i_ino) & mask) != first_index) {
> -			spin_unlock(&cip->i_flags_lock);
> -			break;
> +		spin_lock(&ip->i_flags_lock);
> +		ASSERT(!__xfs_iflags_test(ip, XFS_ISTALE));
> +		if (__xfs_iflags_test(ip, XFS_IRECLAIM | XFS_IFLOCK)) {
> +			spin_unlock(&ip->i_flags_lock);
> +			continue;
>  		}
> -		spin_unlock(&cip->i_flags_lock);
>  
>  		/*
> -		 * Do an un-protected check to see if the inode is dirty and
> -		 * is a candidate for flushing.  These checks will be repeated
> -		 * later after the appropriate locks are acquired.
> +		 * ILOCK will pin the inode against reclaim and prevent
> +		 * concurrent transactions modifying the inode while we are
> +		 * flushing the inode.
>  		 */
> -		if (xfs_inode_clean(cip) && xfs_ipincount(cip) == 0)
> +		if (!xfs_ilock_nowait(ip, XFS_ILOCK_SHARED)) {
> +			spin_unlock(&ip->i_flags_lock);
>  			continue;
> +		}
> +		spin_unlock(&ip->i_flags_lock);
>  
>  		/*
> -		 * Try to get locks.  If any are unavailable or it is pinned,
> -		 * then this inode cannot be flushed and is skipped.
> +		 * Skip inodes that are already flush locked as they have
> +		 * already been written to the buffer.
>  		 */
> -
> -		if (!xfs_ilock_nowait(cip, XFS_ILOCK_SHARED))
> -			continue;
> -		if (!xfs_iflock_nowait(cip)) {
> -			xfs_iunlock(cip, XFS_ILOCK_SHARED);
> +		if (!xfs_iflock_nowait(ip)) {
> +			xfs_iunlock(ip, XFS_ILOCK_SHARED);
>  			continue;
>  		}
> -		if (xfs_ipincount(cip)) {
> -			xfs_ifunlock(cip);
> -			xfs_iunlock(cip, XFS_ILOCK_SHARED);
> -			continue;
> -		}
> -
>  
>  		/*
> -		 * Check the inode number again, just to be certain we are not
> -		 * racing with freeing in xfs_reclaim_inode(). See the comments
> -		 * in that function for more information as to why the initial
> -		 * check is not sufficient.
> +		 * If we are shut down, unpin and abort the inode now as there
> +		 * is no point in flushing it to the buffer just to get an IO
> +		 * completion to abort the buffer and remove it from the AIL.
>  		 */
> -		if (!cip->i_ino) {
> -			xfs_ifunlock(cip);
> -			xfs_iunlock(cip, XFS_ILOCK_SHARED);
> +		if (XFS_FORCED_SHUTDOWN(mp)) {
> +			xfs_iunpin_wait(ip);
> +			/* xfs_iflush_abort() drops the flush lock */
> +			xfs_iflush_abort(ip);
> +			xfs_iunlock(ip, XFS_ILOCK_SHARED);
> +			error = EIO;

error = -EIO?

>  			continue;
>  		}
>  
> -		/*
> -		 * arriving here means that this inode can be flushed.  First
> -		 * re-check that it's dirty before flushing.
> -		 */
> -		if (!xfs_inode_clean(cip)) {
> -			error = xfs_iflush(cip, bp);
> -			if (error) {
> -				xfs_iunlock(cip, XFS_ILOCK_SHARED);
> -				goto out_free;
> -			}
> -			clcount++;
> -		} else {
> -			xfs_ifunlock(cip);
> +		/* don't block waiting on a log force to unpin dirty inodes */
> +		if (xfs_ipincount(ip)) {
> +			xfs_ifunlock(ip);
> +			xfs_iunlock(ip, XFS_ILOCK_SHARED);
> +			continue;
>  		}
> -		xfs_iunlock(cip, XFS_ILOCK_SHARED);
> +
> +		if (!xfs_inode_clean(ip))
> +			error = xfs_iflush(ip, bp);
> +		else
> +			xfs_ifunlock(ip);
> +		xfs_iunlock(ip, XFS_ILOCK_SHARED);
> +		if (error)
> +			break;
> +		clcount++;
>  	}
>  
>  	if (clcount) {
> @@ -3708,11 +3685,6 @@ xfs_iflush_cluster(
>  		XFS_STATS_ADD(mp, xs_icluster_flushinode, clcount);
>  	}
>  
> -out_free:
> -	rcu_read_unlock();
> -	kmem_free(cilist);
> -out_put:
> -	xfs_perag_put(pag);
>  	if (error) {
>  		bp->b_flags |= XBF_ASYNC;
>  		xfs_buf_ioend_fail(bp);
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index d1109eb13ba2e..b93cf9076df8a 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -427,7 +427,7 @@ int		xfs_log_force_inode(struct xfs_inode *ip);
>  void		xfs_iunpin_wait(xfs_inode_t *);
>  #define xfs_ipincount(ip)	((unsigned int) atomic_read(&ip->i_pincount))
>  
> -int		xfs_iflush_cluster(struct xfs_inode *, struct xfs_buf *);
> +int		xfs_iflush_cluster(struct xfs_buf *);
>  void		xfs_lock_two_inodes(struct xfs_inode *ip0, uint ip0_mode,
>  				struct xfs_inode *ip1, uint ip1_mode);
>  
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index 42b4b5fe1e2a9..af4764f97a339 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -523,11 +523,20 @@ xfs_inode_item_push(
>  	}
>  	spin_unlock(&lip->li_ailp->ail_lock);
>  
> -	error = xfs_iflush_cluster(ip, bp);
> +	/*
> +	 * We need to hold a reference for flushing the cluster buffer as it may
> +	 * fail the buffer without IO submission. In which case, we better have
> +	 * a reference for that completion as otherwise we don't get a reference
> +	 * for IO until we queue it for delwri submission.

<confused>

What completion are we talking about?  Does this refer to the fact that
xfs_iflush_cluster handles a flush failure by simulating an async write
failure which could result in us giving away the inode log item's
reference to the buffer?

--D

> +	 */
> +	xfs_buf_hold(bp);
> +	error = xfs_iflush_cluster(bp);
>  	if (!error) {
> -		if (!xfs_buf_delwri_queue(bp, buffer_list))
> +		if (!xfs_buf_delwri_queue(bp, buffer_list)) {
> +			ASSERT(0);
>  			rval = XFS_ITEM_FLUSHING;
> -		xfs_buf_unlock(bp);
> +		}
> +		xfs_buf_relse(bp);
>  	} else {
>  		rval = XFS_ITEM_LOCKED;
>  	}
> -- 
> 2.26.2.761.g0e0b3e54be
> 

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 24/24] xfs: remove xfs_inobp_check()
  2020-05-22  3:50 ` [PATCH 24/24] xfs: remove xfs_inobp_check() Dave Chinner
@ 2020-05-23  0:16   ` Darrick J. Wong
  2020-05-23 11:36   ` Christoph Hellwig
  1 sibling, 0 replies; 91+ messages in thread
From: Darrick J. Wong @ 2020-05-23  0:16 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, May 22, 2020 at 01:50:29PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> This debug code is called on every xfs_iflush() call, which then
> checks every inode in the buffer for non-zero unlinked list field.
> Hence it checks every inode in the cluster buffer every time a
> single inode on that cluster it flushed. This is resulting in:
> 
> -   38.91%     5.33%  [kernel]  [k] xfs_iflush                                                                                                              ▒
>    - 17.70% xfs_iflush                                                                                                                                      ▒
>       - 9.93% xfs_inobp_check                                                                                                                                   ▒
>            4.36% xfs_buf_offset                                                                                                                                 ▒

Overly long lines there ^^^^^.

> 
> 10% of the CPU time spent flushing inodes is repeatedly checking
> unlinked fields in the buffer. We don't need to do this.
> 
> The other place we call xfs_inobp_check() is
> xfs_iunlink_update_dinode(), and this is after we've done this
> assert for the agino we are about to write into that inode:
> 
> 	ASSERT(xfs_verify_agino_or_null(mp, agno, next_agino));
> 
> which means we've already checked that the agino we are about to
> write is not 0 on debug kernels. The inode buffer verifiers do
> everything else we need, so let's just remove this debug code.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

But with that fixed up,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/xfs/libxfs/xfs_inode_buf.c | 24 ------------------------
>  fs/xfs/libxfs/xfs_inode_buf.h |  6 ------
>  fs/xfs/xfs_inode.c            |  2 --
>  3 files changed, 32 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> index 1af97235785c8..6b6f67595bf4e 100644
> --- a/fs/xfs/libxfs/xfs_inode_buf.c
> +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> @@ -20,30 +20,6 @@
>  
>  #include <linux/iversion.h>
>  
> -/*
> - * Check that none of the inode's in the buffer have a next
> - * unlinked field of 0.
> - */
> -#if defined(DEBUG)
> -void
> -xfs_inobp_check(
> -	xfs_mount_t	*mp,
> -	xfs_buf_t	*bp)
> -{
> -	int		i;
> -	xfs_dinode_t	*dip;
> -
> -	for (i = 0; i < M_IGEO(mp)->inodes_per_cluster; i++) {
> -		dip = xfs_buf_offset(bp, i * mp->m_sb.sb_inodesize);
> -		if (!dip->di_next_unlinked)  {
> -			xfs_alert(mp,
> -	"Detected bogus zero next_unlinked field in inode %d buffer 0x%llx.",
> -				i, (long long)bp->b_bn);
> -		}
> -	}
> -}
> -#endif
> -
>  /*
>   * If we are doing readahead on an inode buffer, we might be in log recovery
>   * reading an inode allocation buffer that hasn't yet been replayed, and hence
> diff --git a/fs/xfs/libxfs/xfs_inode_buf.h b/fs/xfs/libxfs/xfs_inode_buf.h
> index 865ac493c72a2..6b08b9d060c2e 100644
> --- a/fs/xfs/libxfs/xfs_inode_buf.h
> +++ b/fs/xfs/libxfs/xfs_inode_buf.h
> @@ -52,12 +52,6 @@ int	xfs_inode_from_disk(struct xfs_inode *ip, struct xfs_dinode *from);
>  void	xfs_log_dinode_to_disk(struct xfs_log_dinode *from,
>  			       struct xfs_dinode *to);
>  
> -#if defined(DEBUG)
> -void	xfs_inobp_check(struct xfs_mount *, struct xfs_buf *);
> -#else
> -#define	xfs_inobp_check(mp, bp)
> -#endif /* DEBUG */
> -
>  xfs_failaddr_t xfs_dinode_verify(struct xfs_mount *mp, xfs_ino_t ino,
>  			   struct xfs_dinode *dip);
>  xfs_failaddr_t xfs_inode_validate_extsize(struct xfs_mount *mp,
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 7db0f97e537e3..98a494e42aa6a 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -2146,7 +2146,6 @@ xfs_iunlink_update_dinode(
>  	xfs_dinode_calc_crc(mp, dip);
>  	xfs_trans_inode_buf(tp, ibp);
>  	xfs_trans_log_buf(tp, ibp, offset, offset + sizeof(xfs_agino_t) - 1);
> -	xfs_inobp_check(mp, ibp);
>  }
>  
>  /* Set an in-core inode's unlinked pointer and return the old value. */
> @@ -3538,7 +3537,6 @@ xfs_iflush(
>  	xfs_iflush_fork(ip, dip, iip, XFS_DATA_FORK);
>  	if (XFS_IFORK_Q(ip))
>  		xfs_iflush_fork(ip, dip, iip, XFS_ATTR_FORK);
> -	xfs_inobp_check(mp, bp);
>  
>  	/*
>  	 * We've recorded everything logged in the inode, so we'd like to clear
> -- 
> 2.26.2.761.g0e0b3e54be
> 

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 23/24] xfs: factor xfs_iflush_done
  2020-05-22  3:50 ` [PATCH 23/24] xfs: factor xfs_iflush_done Dave Chinner
@ 2020-05-23  0:20   ` Darrick J. Wong
  2020-05-23 11:35   ` Christoph Hellwig
  1 sibling, 0 replies; 91+ messages in thread
From: Darrick J. Wong @ 2020-05-23  0:20 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, May 22, 2020 at 01:50:28PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> xfs_iflush_done() does 3 distinct operations to the inodes attached
> to the buffer. Separate these operations out into functions so that
> it is easier to modify these operations independently in future.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Seems like a fairly simple refactoring,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/xfs/xfs_inode_item.c | 156 +++++++++++++++++++++-------------------
>  1 file changed, 82 insertions(+), 74 deletions(-)
> 
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index af4764f97a339..4dd4f45dcc46e 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -662,104 +662,67 @@ xfs_inode_item_destroy(
>  
>  
>  /*
> - * This is the inode flushing I/O completion routine.  It is called
> - * from interrupt level when the buffer containing the inode is
> - * flushed to disk.  It is responsible for removing the inode item
> - * from the AIL if it has not been re-logged, and unlocking the inode's
> - * flush lock.
> - *
> - * To reduce AIL lock traffic as much as possible, we scan the buffer log item
> - * list for other inodes that will run this function. We remove them from the
> - * buffer list so we can process all the inode IO completions in one AIL lock
> - * traversal.
> - *
> - * Note: Now that we attach the log item to the buffer when we first log the
> - * inode in memory, we can have unflushed inodes on the buffer list here. These
> - * inodes will have a zero ili_last_fields, so skip over them here.
> + * We only want to pull the item from the AIL if it is actually there
> + * and its location in the log has not changed since we started the
> + * flush.  Thus, we only bother if the inode's lsn has not changed.
>   */
>  void
> -xfs_iflush_done(
> -	struct xfs_buf		*bp)
> +xfs_iflush_ail_updates(
> +	struct xfs_ail		*ailp,
> +	struct list_head	*list)
>  {
> -	struct xfs_inode_log_item *iip;
> -	struct xfs_log_item	*lip, *n;
> -	struct xfs_ail		*ailp = bp->b_mount->m_ail;
> -	int			need_ail = 0;
> -	LIST_HEAD(tmp);
> +	struct xfs_log_item	*lip;
> +	xfs_lsn_t		tail_lsn = 0;
>  
> -	/*
> -	 * Pull the attached inodes from the buffer one at a time and take the
> -	 * appropriate action on them.
> -	 */
> -	list_for_each_entry_safe(lip, n, &bp->b_li_list, li_bio_list) {
> -		iip = INODE_ITEM(lip);
> -		if (!iip->ili_last_fields)
> -			continue;
> +	/* this is an opencoded batch version of xfs_trans_ail_delete */
> +	spin_lock(&ailp->ail_lock);
> +	list_for_each_entry(lip, list, li_bio_list) {
> +		struct xfs_inode_log_item *iip = INODE_ITEM(lip);
> +		xfs_lsn_t	lsn;
>  
> -		if (xfs_iflags_test(iip->ili_inode, XFS_ISTALE)) {
> -			xfs_iflush_abort(iip->ili_inode);
> +		if (iip->ili_flush_lsn != lip->li_lsn) {
> +			xfs_clear_li_failed(lip);
>  			continue;
>  		}
>  
> -		list_move_tail(&lip->li_bio_list, &tmp);
> -
> -		/* Do an unlocked check for needing the AIL lock. */
> -		if (iip->ili_flush_lsn == lip->li_lsn ||
> -		    test_bit(XFS_LI_FAILED, &lip->li_flags))
> -			need_ail++;
> +		lsn = xfs_ail_delete_one(ailp, lip);
> +		if (!tail_lsn && lsn)
> +			tail_lsn = lsn;
>  	}
> +	xfs_ail_update_finish(ailp, tail_lsn);
> +}
>  
> -	/*
> -	 * We only want to pull the item from the AIL if it is actually there
> -	 * and its location in the log has not changed since we started the
> -	 * flush.  Thus, we only bother if the inode's lsn has not changed.
> -	 */
> -	if (need_ail) {
> -		xfs_lsn_t	tail_lsn = 0;
> -
> -		/* this is an opencoded batch version of xfs_trans_ail_delete */
> -		spin_lock(&ailp->ail_lock);
> -		list_for_each_entry(lip, &tmp, li_bio_list) {
> -			iip = INODE_ITEM(lip);
> -			if (iip->ili_flush_lsn == lip->li_lsn) {
> -				xfs_lsn_t lsn = xfs_ail_delete_one(ailp, lip);
> -				if (!tail_lsn && lsn)
> -					tail_lsn = lsn;
> -			} else {
> -				xfs_clear_li_failed(lip);
> -			}
> -		}
> -		xfs_ail_update_finish(ailp, tail_lsn);
> -	}
> +/*
> + * Walk the list of inodes that have completed their IOs. If they are clean
> + * remove them from the list and dissociate them from the buffer. Buffers that
> + * are still dirty remain linked to the buffer and on the list. Caller must
> + * handle them appropriately.
> + */
> +void
> +xfs_iflush_finish(
> +	struct xfs_buf		*bp,
> +	struct list_head	*list)
> +{
> +	struct xfs_log_item	*lip, *n;
>  
> -	/*
> -	 * Clean up and unlock the flush lock now we are done. We can clear the
> -	 * ili_last_fields bits now that we know that the data corresponding to
> -	 * them is safely on disk.
> -	 */
> -	list_for_each_entry_safe(lip, n, &tmp, li_bio_list) {
> +	list_for_each_entry_safe(lip, n, list, li_bio_list) {
> +		struct xfs_inode_log_item *iip = INODE_ITEM(lip);
>  		bool	drop_buffer = false;
>  
> -		list_del_init(&lip->li_bio_list);
> -		iip = INODE_ITEM(lip);
> -
>  		spin_lock(&iip->ili_lock);
>  		iip->ili_last_fields = 0;
>  		iip->ili_flush_lsn = 0;
>  
>  		/*
>  		 * Remove the reference to the cluster buffer if the inode is
> -		 * clean in memory. Drop the buffer reference once we've dropped
> -		 * the locks we hold. If the inode is dirty in memory, we need
> -		 * to put the inode item back on the buffer list for another
> -		 * pass through the flush machinery.
> +		 * clean in memory and drop the buffer reference once we've
> +		 * dropped the locks we hold.
>  		 */
>  		ASSERT(iip->ili_item.li_buf == bp);
>  		if (!iip->ili_fields) {
>  			iip->ili_item.li_buf = NULL;
> +			list_del_init(&lip->li_bio_list);
>  			drop_buffer = true;
> -		} else {
> -			list_add(&lip->li_bio_list, &bp->b_li_list);
>  		}
>  		spin_unlock(&iip->ili_lock);
>  		xfs_ifunlock(iip->ili_inode);
> @@ -768,6 +731,51 @@ xfs_iflush_done(
>  	}
>  }
>  
> +/*
> + * Inode buffer IO completion routine.  It is responsible for removing inodes
> + * attached to the buffer from the AIL if they have not been re-logged, as well
> + * as completing the flush and unlocking the inode.
> + */
> +void
> +xfs_iflush_done(
> +	struct xfs_buf		*bp)
> +{
> +	struct xfs_log_item	*lip, *n;
> +	LIST_HEAD(flushed_inodes);
> +	LIST_HEAD(ail_updates);
> +
> +	/*
> +	 * Pull the attached inodes from the buffer one at a time and take the
> +	 * appropriate action on them.
> +	 */
> +	list_for_each_entry_safe(lip, n, &bp->b_li_list, li_bio_list) {
> +		struct xfs_inode_log_item *iip = INODE_ITEM(lip);
> +		if (!iip->ili_last_fields)
> +			continue;
> +
> +		if (xfs_iflags_test(iip->ili_inode, XFS_ISTALE)) {
> +			xfs_iflush_abort(iip->ili_inode);
> +			continue;
> +		}
> +
> +		/* Do an unlocked check for needing the AIL lock. */
> +		if (iip->ili_flush_lsn == lip->li_lsn ||
> +		    test_bit(XFS_LI_FAILED, &lip->li_flags))
> +			list_move_tail(&lip->li_bio_list, &ail_updates);
> +		else
> +			list_move_tail(&lip->li_bio_list, &flushed_inodes);
> +	}
> +
> +	if (!list_empty(&ail_updates)) {
> +		xfs_iflush_ail_updates(bp->b_mount->m_ail, &ail_updates);
> +		list_splice_tail(&ail_updates, &flushed_inodes);
> +	}
> +
> +	xfs_iflush_finish(bp, &flushed_inodes);
> +	if (!list_empty(&flushed_inodes))
> +		list_splice_tail(&flushed_inodes, &bp->b_li_list);
> +}
> +
>  /*
>   * This is the inode flushing abort routine.  It is called from xfs_iflush when
>   * the filesystem is shutting down to clean up the inode state.  It is
> -- 
> 2.26.2.761.g0e0b3e54be
> 

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 02/24] xfs: add an inode item lock
  2020-05-22  3:50 ` [PATCH 02/24] xfs: add an inode item lock Dave Chinner
  2020-05-22  6:45   ` Amir Goldstein
  2020-05-22 21:24   ` Darrick J. Wong
@ 2020-05-23  8:45   ` Christoph Hellwig
  2 siblings, 0 replies; 91+ messages in thread
From: Christoph Hellwig @ 2020-05-23  8:45 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

just a few minor style nipicks below:

> -	if (!test_and_set_bit(XFS_LI_DIRTY, &ip->i_itemp->ili_item.li_flags) &&
> -	    IS_I_VERSION(VFS_I(ip))) {
> -		if (inode_maybe_inc_iversion(VFS_I(ip), flags & XFS_ILOG_CORE))
> -			flags |= XFS_ILOG_CORE;
> +	if (!test_and_set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags) &&
> +	    IS_I_VERSION(inode)) {
> +		if (inode_maybe_inc_iversion(inode, flags & XFS_ILOG_CORE))
> +			iversion_flags = XFS_ILOG_CORE;

I find the ordering of the conditionals here weird (and yes I know this
comes from the old code), given that test_and_set_bit is an action
applicable independend of IS_I_VERSION.  Maybe reshuffle this to:

	if (!test_and_set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags) {
		if (IS_I_VERSION(inode) &&
		    inode_maybe_inc_iversion(inode, flags & XFS_ILOG_CORE))
			iversion_flags = XFS_ILOG_CORE;
 	}

> -	tp->t_flags |= XFS_TRANS_DIRTY;
> +	/*
> +	 * Record the specific change for fdatasync optimisation. This
> +	 * allows fdatasync to skip log forces for inodes that are only
> +	 * timestamp dirty. We do this before the change count so that
> +	 * the core being logged in this case does not impact on fdatasync
> +	 * behaviour.
> +	 */

This comment could save a precious line by using up all 80 characters :)

> +	flags |= iip->ili_last_fields | iversion_flags;
> +	iip->ili_fields |= flags;
> +	spin_unlock(&iip->ili_lock);
>  }

Maybe something like this:

	iip->ili_fields |= (flags | iip->ili_last_fields | iversion_flags);

would make more sense as the flags isn't used anywhere below.

> +			spin_lock(&iip->ili_lock);
>  			iip->ili_last_fields = iip->ili_fields;
>  			iip->ili_fields = 0;
>  			iip->ili_fsync_fields = 0;
> +			spin_unlock(&iip->ili_lock);
>  			xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,
>  						&iip->ili_item.li_lsn);

We have to copies of the exact above code sequence.  Maybe it makes
sense to be factored into a little helper?

> +	spin_lock(&iip->ili_lock);
> +	iip->ili_fields &= ~(XFS_ILOG_AOWNER|XFS_ILOG_DOWNER);

Please add whitespaces around the "|".

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 03/24] xfs: mark inode buffers in cache
  2020-05-22  3:50 ` [PATCH 03/24] xfs: mark inode buffers in cache Dave Chinner
  2020-05-22  7:45   ` Amir Goldstein
  2020-05-22 21:35   ` Darrick J. Wong
@ 2020-05-23  8:48   ` Christoph Hellwig
  2020-05-25  0:06     ` Dave Chinner
  2 siblings, 1 reply; 91+ messages in thread
From: Christoph Hellwig @ 2020-05-23  8:48 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, May 22, 2020 at 01:50:08PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Inode buffers always have write IO callbacks, so by marking them
> directly we can avoid needing to attach ->b_iodone functions to
> them. This avoids an indirect call, and makes future modifications
> much simpler.
> 
> This is largely a rearrangement of the code at this point - no IO
> completion functionality changes at this point, just how the
> code is run is modified.

So I've looked over the whole series where this leads, and I really like
killing off li_cb and sorting out the mess around b_iodone.  But I think
the series ends up moving too much out of xfs_buf.c.  I'd rather keep
a minimal b_write_done callback rather than the checking of the inode
and dquote flags, and keep most of the logic in xfs_buf.c.  This is the
patch I cooked up on top of your series to show what I mean - it removes
the need for both flags and kills about 100 lines of code:


diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
index e5f58a4b13eee..4220fee1c9ddb 100644
--- a/fs/xfs/libxfs/xfs_trans_inode.c
+++ b/fs/xfs/libxfs/xfs_trans_inode.c
@@ -169,7 +169,7 @@ xfs_trans_log_inode(
 		xfs_buf_hold(bp);
 		spin_lock(&iip->ili_lock);
 		iip->ili_item.li_buf = bp;
-		bp->b_flags |= _XBF_INODES;
+		bp->b_write_done = xfs_iflush_done;
 		list_add_tail(&iip->ili_item.li_bio_list, &bp->b_li_list);
 		xfs_trans_brelse(tp, bp);
 	}
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 5fc83637f7aff..b1c3d8076d52d 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1172,11 +1172,7 @@ xfs_buf_wait_unpin(
 	set_current_state(TASK_RUNNING);
 }
 
-/*
- *	Buffer Utility Routines
- */
-
-void
+static void
 xfs_buf_ioend(
 	struct xfs_buf	*bp)
 {
@@ -1198,38 +1194,41 @@ xfs_buf_ioend(
 			bp->b_ops->verify_read(bp);
 		if (!bp->b_error)
 			bp->b_flags |= XBF_DONE;
-		xfs_buf_ioend_finish(bp);
-		return;
-	}
-
-	if (!bp->b_error) {
-		bp->b_flags &= ~XBF_WRITE_FAIL;
-		bp->b_flags |= XBF_DONE;
-	}
-
-	/*
-	 * If this is a log recovery buffer, we aren't doing transactional IO
-	 * yet so we need to let it handle IO completions.
-	 */
-	if (bp->b_flags & _XBF_LOGRCVY) {
+	} else if (unlikely(bp->b_flags & _XBF_LOGRCVY)) {
 		xlog_recover_iodone(bp);
-		return;
-	}
-
-	/* inodes always have a callback on write */
-	if (bp->b_flags & _XBF_INODES) {
-		xfs_buf_inode_iodone(bp);
-		return;
-	}
+	} else {
+		/*
+		 * If there is an error, process it. Some errors require us to
+		 * run callbacks after failure processing is done so we detect
+		 * that and take appropriate action.
+		 */
+		if (bp->b_error) {
+			if (xfs_buf_iodone_callback_error(bp))
+				return;
+		} else {
+			bp->b_flags &= ~XBF_WRITE_FAIL;
+			bp->b_flags |= XBF_DONE;
+		}
 
-	/* dquots always have a callback on write */
-	if (bp->b_flags & _XBF_DQUOTS) {
-		xfs_buf_dquot_iodone(bp);
-		return;
+		/*
+		 * Successful IO or permanent error. Either way, we can clear
+		 * the retry state here in preparation for the next error that
+		 * may occur.
+		 */
+		bp->b_last_error = 0;
+		bp->b_retries = 0;
+		bp->b_first_retry_time = 0;
+
+		if (bp->b_write_done)
+			bp->b_write_done(bp);
+		if (bp->b_log_item)
+			xfs_buf_item_done(bp);
 	}
 
-	/* dirty buffers always need to be removed from the AIL */
-	xfs_buf_dirty_iodone(bp);
+	if (bp->b_flags & XBF_ASYNC)
+		xfs_buf_relse(bp);
+	else
+		complete(&bp->b_iowait);
 }
 
 static void
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index a127d56d5eff0..a060d8bfeb515 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -32,9 +32,7 @@ struct xfs_buf;
 #define XBF_WRITE_FAIL	 (1 << 7) /* async writes have failed on this buffer */
 
 /* buffer type flags for write callbacks */
-#define _XBF_INODES	 (1 << 16)/* inode buffer */
-#define _XBF_DQUOTS	 (1 << 17)/* dquot buffer */
-#define _XBF_LOGRCVY	 (1 << 18)/* log recovery buffer */
+#define _XBF_LOGRCVY	(1 << 18)/* log recovery buffer */
 
 /* flags used only internally */
 #define _XBF_PAGES	 (1 << 20)/* backed by refcounted pages */
@@ -57,8 +55,6 @@ typedef unsigned int xfs_buf_flags_t;
 	{ XBF_DONE,		"DONE" }, \
 	{ XBF_STALE,		"STALE" }, \
 	{ XBF_WRITE_FAIL,	"WRITE_FAIL" }, \
-	{ _XBF_INODES,		"INODES" }, \
-	{ _XBF_DQUOTS,		"DQUOTS" }, \
 	{ _XBF_LOGRCVY,		"LOG_RECOVERY" }, \
 	{ _XBF_PAGES,		"PAGES" }, \
 	{ _XBF_KMEM,		"KMEM" }, \
@@ -156,6 +152,7 @@ typedef struct xfs_buf {
 	xfs_buftarg_t		*b_target;	/* buffer target (device) */
 	void			*b_addr;	/* virtual address of buffer */
 	struct work_struct	b_ioend_work;
+	void 			(*b_write_done)(struct xfs_buf *bp);
 	struct completion	b_iowait;	/* queue for I/O waiters */
 	struct xfs_buf_log_item	*b_log_item;
 	struct list_head	b_li_list;	/* Log items list head */
@@ -262,23 +259,8 @@ extern void xfs_buf_unlock(xfs_buf_t *);
 #define xfs_buf_islocked(bp) \
 	((bp)->b_sema.count <= 0)
 
-static inline void xfs_buf_relse(xfs_buf_t *bp)
-{
-	xfs_buf_unlock(bp);
-	xfs_buf_rele(bp);
-}
-
 /* Buffer Read and Write Routines */
 extern int xfs_bwrite(struct xfs_buf *bp);
-extern void xfs_buf_ioend(struct xfs_buf *bp);
-static inline void xfs_buf_ioend_finish(struct xfs_buf *bp)
-{
-	if (bp->b_flags & XBF_ASYNC)
-		xfs_buf_relse(bp);
-	else
-		complete(&bp->b_iowait);
-}
-
 extern void __xfs_buf_ioerror(struct xfs_buf *bp, int error,
 		xfs_failaddr_t failaddr);
 #define xfs_buf_ioerror(bp, err) __xfs_buf_ioerror((bp), (err), __this_address)
@@ -343,6 +325,12 @@ static inline int xfs_buf_ispinned(struct xfs_buf *bp)
 	return atomic_read(&bp->b_pin_count);
 }
 
+static inline void xfs_buf_relse(xfs_buf_t *bp)
+{
+	xfs_buf_unlock(bp);
+	xfs_buf_rele(bp);
+}
+
 static inline int
 xfs_buf_verify_cksum(struct xfs_buf *bp, unsigned long cksum_offset)
 {
diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index d855a2b7486c5..21b0d2f3f1af9 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -15,9 +15,6 @@
 #include "xfs_buf_item.h"
 #include "xfs_inode.h"
 #include "xfs_inode_item.h"
-#include "xfs_quota.h"
-#include "xfs_dquot_item.h"
-#include "xfs_dquot.h"
 #include "xfs_trans_priv.h"
 #include "xfs_trace.h"
 #include "xfs_log.h"
@@ -30,8 +27,6 @@ static inline struct xfs_buf_log_item *BUF_ITEM(struct xfs_log_item *lip)
 	return container_of(lip, struct xfs_buf_log_item, bli_item);
 }
 
-static void xfs_buf_item_done(struct xfs_buf *bp);
-
 /* Is this log iovec plausibly large enough to contain the buffer log format? */
 bool
 xfs_buf_log_check_iovec(
@@ -986,7 +981,7 @@ xfs_buf_do_callbacks_fail(
 	spin_unlock(&ailp->ail_lock);
 }
 
-static bool
+bool
 xfs_buf_iodone_callback_error(
 	struct xfs_buf		*bp)
 {
@@ -1075,38 +1070,12 @@ xfs_buf_iodone_callback_error(
 	return false;
 }
 
-static inline bool
-xfs_buf_had_callback_errors(
-	struct xfs_buf		*bp)
-{
-
-	/*
-	 * If there is an error, process it. Some errors require us to run
-	 * callbacks after failure processing is done so we detect that and take
-	 * appropriate action.
-	 */
-	if (bp->b_error && xfs_buf_iodone_callback_error(bp))
-		return true;
-
-	/*
-	 * Successful IO or permanent error. Either way, we can clear the
-	 * retry state here in preparation for the next error that may occur.
-	 */
-	bp->b_last_error = 0;
-	bp->b_retries = 0;
-	bp->b_first_retry_time = 0;
-	return false;
-}
-
-static void
+void
 xfs_buf_item_done(
 	struct xfs_buf		*bp)
 {
 	struct xfs_buf_log_item	*bip = bp->b_log_item;
 
-	if (!bip)
-		return;
-
 	/*
 	 * If we are forcibly shutting down, this may well be off the AIL
 	 * already. That's because we simulate the log-committed callbacks to
@@ -1121,52 +1090,3 @@ xfs_buf_item_done(
 	xfs_buf_item_free(bip);
 	xfs_buf_rele(bp);
 }
-
-/*
- * Inode buffer iodone callback function.
- */
-void
-xfs_buf_inode_iodone(
-	struct xfs_buf		*bp)
-{
-	if (xfs_buf_had_callback_errors(bp))
-		return;
-
-	xfs_buf_item_done(bp);
-	xfs_iflush_done(bp);
-	xfs_buf_ioend_finish(bp);
-}
-
-/*
- * Dquot buffer iodone callback function.
- */
-void
-xfs_buf_dquot_iodone(
-	struct xfs_buf		*bp)
-{
-	if (xfs_buf_had_callback_errors(bp))
-		return;
-
-	/* a newly allocated dquot buffer might have a log item attached */
-	xfs_buf_item_done(bp);
-	xfs_dquot_done(bp);
-	xfs_buf_ioend_finish(bp);
-}
-
-/*
- * Dirty buffer iodone callback function.
- *
- * Note that for things like remote attribute buffers, there may not be a buffer
- * log item here, so processing the buffer log item must remain be optional.
- */
-void
-xfs_buf_dirty_iodone(
-	struct xfs_buf		*bp)
-{
-	if (xfs_buf_had_callback_errors(bp))
-		return;
-
-	xfs_buf_item_done(bp);
-	xfs_buf_ioend_finish(bp);
-}
-
diff --git a/fs/xfs/xfs_buf_item.h b/fs/xfs/xfs_buf_item.h
index ecfad1915a86b..4ffeabd3ad967 100644
--- a/fs/xfs/xfs_buf_item.h
+++ b/fs/xfs/xfs_buf_item.h
@@ -54,11 +54,11 @@ void	xfs_buf_item_relse(struct xfs_buf *);
 bool	xfs_buf_item_put(struct xfs_buf_log_item *);
 void	xfs_buf_item_log(struct xfs_buf_log_item *, uint, uint);
 bool	xfs_buf_item_dirty_format(struct xfs_buf_log_item *);
-void	xfs_buf_inode_iodone(struct xfs_buf *);
-void	xfs_buf_dquot_iodone(struct xfs_buf *);
-void	xfs_buf_dirty_iodone(struct xfs_buf *);
 bool	xfs_buf_log_check_iovec(struct xfs_log_iovec *iovec);
 
+bool	xfs_buf_iodone_callback_error(struct xfs_buf *bp);
+void	xfs_buf_item_done(struct xfs_buf *bp);
+
 extern kmem_zone_t	*xfs_buf_item_zone;
 
 #endif	/* __XFS_BUF_ITEM_H__ */
diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
index 34fc1bcb1eefd..55f34c60340c0 100644
--- a/fs/xfs/xfs_dquot.c
+++ b/fs/xfs/xfs_dquot.c
@@ -1085,7 +1085,7 @@ xfs_qm_dqflush_done(
 	xfs_dqfunlock(dqp);
 }
 
-void
+static void
 xfs_dquot_done(
 	struct xfs_buf		*bp)
 {
@@ -1194,7 +1194,7 @@ xfs_qm_dqflush(
 	 * Attach the dquot to the buffer so that we can remove this dquot from
 	 * the AIL and release the flush lock once the dquot is synced to disk.
 	 */
-	bp->b_flags |= _XBF_DQUOTS;
+	bp->b_write_done = xfs_dquot_done;
 	list_add_tail(&dqp->q_logitem.qli_item.li_bio_list, &bp->b_li_list);
 
 	/*
diff --git a/fs/xfs/xfs_dquot.h b/fs/xfs/xfs_dquot.h
index f1924288986d3..fe3e46df604b4 100644
--- a/fs/xfs/xfs_dquot.h
+++ b/fs/xfs/xfs_dquot.h
@@ -174,7 +174,6 @@ void		xfs_qm_dqput(struct xfs_dquot *dqp);
 void		xfs_dqlock2(struct xfs_dquot *, struct xfs_dquot *);
 
 void		xfs_dquot_set_prealloc_limits(struct xfs_dquot *);
-void		xfs_dquot_done(struct xfs_buf *);
 
 static inline struct xfs_dquot *xfs_qm_dqhold(struct xfs_dquot *dqp)
 {
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 0aa823aeafca9..5a1dbcde6e9f0 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -269,15 +269,15 @@ void
 xlog_recover_iodone(
 	struct xfs_buf	*bp)
 {
-	if (bp->b_error) {
-		/*
-		 * We're not going to bother about retrying
-		 * this during recovery. One strike!
-		 */
-		if (!XFS_FORCED_SHUTDOWN(bp->b_mount)) {
-			xfs_buf_ioerror_alert(bp, __this_address);
-			xfs_force_shutdown(bp->b_mount, SHUTDOWN_META_IO_ERROR);
-		}
+	/*
+	 * We're not going to bother about retrying this during recovery.
+	 * One strike!
+	 */
+	if (!bp->b_error) {
+		bp->b_flags |= XBF_DONE;
+	} else if (!XFS_FORCED_SHUTDOWN(bp->b_mount)) {
+		xfs_buf_ioerror_alert(bp, __this_address);
+		xfs_force_shutdown(bp->b_mount, SHUTDOWN_META_IO_ERROR);
 	}
 
 	/*
@@ -288,7 +288,6 @@ xlog_recover_iodone(
 		xfs_buf_item_relse(bp);
 	ASSERT(bp->b_log_item == NULL);
 	bp->b_flags &= ~_XBF_LOGRCVY;
-	xfs_buf_ioend_finish(bp);
 }
 
 /*
diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c
index 11cd666cd99a6..3a2fe19832d89 100644
--- a/fs/xfs/xfs_trans_buf.c
+++ b/fs/xfs/xfs_trans_buf.c
@@ -618,7 +618,6 @@ xfs_trans_inode_buf(
 	ASSERT(atomic_read(&bip->bli_refcount) > 0);
 
 	bip->bli_flags |= XFS_BLI_INODE_BUF;
-	bp->b_flags |= _XBF_INODES;
 	xfs_trans_buf_set_type(tp, bp, XFS_BLFT_DINO_BUF);
 }
 
@@ -643,7 +642,6 @@ xfs_trans_stale_inode_buf(
 	ASSERT(atomic_read(&bip->bli_refcount) > 0);
 
 	bip->bli_flags |= XFS_BLI_STALE_INODE;
-	bp->b_flags |= _XBF_INODES;
 	xfs_trans_buf_set_type(tp, bp, XFS_BLFT_DINO_BUF);
 }
 
@@ -668,7 +666,6 @@ xfs_trans_inode_alloc_buf(
 	ASSERT(atomic_read(&bip->bli_refcount) > 0);
 
 	bip->bli_flags |= XFS_BLI_INODE_ALLOC_BUF;
-	bp->b_flags |= _XBF_INODES;
 	xfs_trans_buf_set_type(tp, bp, XFS_BLFT_DINO_BUF);
 }
 
@@ -779,6 +776,5 @@ xfs_trans_dquot_buf(
 		break;
 	}
 
-	bp->b_flags |= _XBF_DQUOTS;
 	xfs_trans_buf_set_type(tp, bp, type);
 }

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 07/24] xfs: clean up whacky buffer log item list reinit
  2020-05-22  3:50 ` [PATCH 07/24] xfs: clean up whacky buffer log item list reinit Dave Chinner
  2020-05-22 22:01   ` Darrick J. Wong
@ 2020-05-23  8:50   ` Christoph Hellwig
  1 sibling, 0 replies; 91+ messages in thread
From: Christoph Hellwig @ 2020-05-23  8:50 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 08/24] xfs: fold xfs_istale_done into xfs_iflush_done
  2020-05-22  3:50 ` [PATCH 08/24] xfs: fold xfs_istale_done into xfs_iflush_done Dave Chinner
  2020-05-22 22:10   ` Darrick J. Wong
@ 2020-05-23  9:12   ` Christoph Hellwig
  1 sibling, 0 replies; 91+ messages in thread
From: Christoph Hellwig @ 2020-05-23  9:12 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, May 22, 2020 at 01:50:13PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Having different io completion callbacks for different inode states
> makes things complex. We can detect if the inode is stale via the
> XFS_ISTALE flag in IO completion, so we don't need a special
> callback just for this.
> 
> This means inodes only have a single iodone callback, and inode IO
> completion is entirely buffer centric at this point. Hence we no
> longer need to use a log item callback at all as we can just call
> xfs_iflush_done() directly from the buffer completions and walk the
> buffer log item list to complete the all inodes under IO.

I find the subject a bit confuing.  Yes, this merges xfs_istale_done into
xfs_iflush_done, but that ѕeems to be a side effect of the other
changes.

The main change is that the inode I/O completion is run directly
instead of through b_iodone and li_cb.  Maybe pick a subject that
better reflects this?

Otherwise this looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

One minor nitpick below:

> -static void
> -xfs_buf_run_callbacks(
> +static inline bool
> +xfs_buf_had_callback_errors(
>  	struct xfs_buf		*bp)
>  {
>  
> @@ -1152,7 +1155,7 @@ xfs_buf_run_callbacks(
>  	 * appropriate action.
>  	 */
>  	if (bp->b_error && xfs_buf_iodone_callback_error(bp))
> -		return;
> +		return true;
>  
>  	/*
>  	 * Successful IO or permanent error. Either way, we can clear the
> @@ -1161,7 +1164,16 @@ xfs_buf_run_callbacks(
>  	bp->b_last_error = 0;
>  	bp->b_retries = 0;
>  	bp->b_first_retry_time = 0;
> +	return false;
> +}
>  
> +static void
> +xfs_buf_run_callbacks(
> +	struct xfs_buf		*bp)
> +{
> +
> +	if (xfs_buf_had_callback_errors(bp))
> +		return;
>  	xfs_buf_do_callbacks(bp);
>  	bp->b_log_item = NULL;
>  }

Maybe its worth splitting this refactoring into a separate prep patch?


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 09/24] xfs: use direct calls for dquot IO completion
  2020-05-22  3:50 ` [PATCH 09/24] xfs: use direct calls for dquot IO completion Dave Chinner
  2020-05-22 22:13   ` Darrick J. Wong
@ 2020-05-23  9:16   ` Christoph Hellwig
  1 sibling, 0 replies; 91+ messages in thread
From: Christoph Hellwig @ 2020-05-23  9:16 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

> +void
> +xfs_dquot_done(
> +	struct xfs_buf		*bp)
> +{
> +	struct xfs_log_item	*lip;
> +
> +	while (!list_empty(&bp->b_li_list)) {
> +		lip = list_first_entry(&bp->b_li_list, struct xfs_log_item,
> +				       li_bio_list);
> +
> +		/*
> +		 * Remove the item from the list, so we don't have any
> +		 * confusion if the item is added to another buf.
> +		 * Don't touch the log item after calling its
> +		 * callback, because it could have freed itself.
> +		 */
> +		list_del_init(&lip->li_bio_list);
> +		xfs_qm_dqflush_done(lip);

I know this was just moved, but I find the comment horrible confusing
and not actually adding any value.  Can we just remove it?  For extra
clarity this could also be switched to list_for_each_entry_safe:

	list_for_each_entry_safe(lip, n, &bp->b_li_list, li_bio_list) {
		list_del_init(&lip->li_bio_list);
		xfs_qm_dqflush_done(lip);
	}

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 10/24] xfs: clean up the buffer iodone callback functions
  2020-05-22  3:50 ` [PATCH 10/24] xfs: clean up the buffer iodone callback functions Dave Chinner
  2020-05-22 22:26   ` Darrick J. Wong
@ 2020-05-23  9:19   ` Christoph Hellwig
  1 sibling, 0 replies; 91+ messages in thread
From: Christoph Hellwig @ 2020-05-23  9:19 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, May 22, 2020 at 01:50:15PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Now that we've sorted inode and dquot buffers, we can apply the same
> cleanups to dirty buffers with buffer log items. They only have one

Btw, I find the "dirty" buffers terminology strange, as we don't really
use that anywhere else in xfs_buf.[ch].

The patch itself looks fine:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 11/24] xfs: get rid of log item callbacks
  2020-05-22  3:50 ` [PATCH 11/24] xfs: get rid of log item callbacks Dave Chinner
  2020-05-22 22:27   ` Darrick J. Wong
@ 2020-05-23  9:19   ` Christoph Hellwig
  1 sibling, 0 replies; 91+ messages in thread
From: Christoph Hellwig @ 2020-05-23  9:19 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 12/24] xfs: pin inode backing buffer to the inode log item
  2020-05-22  3:50 ` [PATCH 12/24] xfs: pin inode backing buffer to the inode log item Dave Chinner
  2020-05-22 22:39   ` Darrick J. Wong
@ 2020-05-23  9:34   ` Christoph Hellwig
  2020-05-23 21:43     ` Dave Chinner
  1 sibling, 1 reply; 91+ messages in thread
From: Christoph Hellwig @ 2020-05-23  9:34 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

> --- a/fs/xfs/xfs_trans_priv.h
> +++ b/fs/xfs/xfs_trans_priv.h
> @@ -143,15 +143,10 @@ static inline void
>  xfs_clear_li_failed(
>  	struct xfs_log_item	*lip)
>  {
> -	struct xfs_buf	*bp = lip->li_buf;
> -
>  	ASSERT(test_bit(XFS_LI_IN_AIL, &lip->li_flags));
>  	lockdep_assert_held(&lip->li_ailp->ail_lock);
>  
> -	if (test_and_clear_bit(XFS_LI_FAILED, &lip->li_flags)) {
> -		lip->li_buf = NULL;
> -		xfs_buf_rele(bp);
> -	}
> +	clear_bit(XFS_LI_FAILED, &lip->li_flags);
>  }
>  
>  static inline void
> @@ -161,10 +156,7 @@ xfs_set_li_failed(
>  {
>  	lockdep_assert_held(&lip->li_ailp->ail_lock);
>  
> -	if (!test_and_set_bit(XFS_LI_FAILED, &lip->li_flags)) {
> -		xfs_buf_hold(bp);
> -		lip->li_buf = bp;
> -	}
> +	set_bit(XFS_LI_FAILED, &lip->li_flags);
>  }

Isn't this going to break quotas, which don't always have li_buf set?

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 14/24] xfs: remove IO submission from xfs_reclaim_inode()
  2020-05-22  3:50 ` [PATCH 14/24] xfs: remove IO submission from xfs_reclaim_inode() Dave Chinner
  2020-05-22 23:06   ` Darrick J. Wong
@ 2020-05-23  9:40   ` Christoph Hellwig
  2020-05-23 22:35     ` Dave Chinner
  1 sibling, 1 reply; 91+ messages in thread
From: Christoph Hellwig @ 2020-05-23  9:40 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

> +out_ifunlock:
> +	xfs_ifunlock(ip);
> +out:
> +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +	xfs_iflags_clear(ip, XFS_IRECLAIM);
> +	return false;
>  
>  reclaim:

I find the reordering of the error handling to sit before the actual
reclaim action here really weird to read.  What about something like
this folded in instead?

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 9bb022a3570a3..6f873eca8c916 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1091,18 +1091,10 @@ xfs_reclaim_inode(
 	}
 	if (xfs_ipincount(ip))
 		goto out_ifunlock;
-	if (xfs_iflags_test(ip, XFS_ISTALE) || xfs_inode_clean(ip)) {
-		xfs_ifunlock(ip);
-		goto reclaim;
-	}
+	if (!xfs_inode_clean(ip) && !xfs_iflags_test(ip, XFS_ISTALE))
+		goto out_ifunlock;
 
-out_ifunlock:
 	xfs_ifunlock(ip);
-out_iunlock:
-	xfs_iunlock(ip, XFS_ILOCK_EXCL);
-out:
-	xfs_iflags_clear(ip, XFS_IRECLAIM);
-	return false;
 
 reclaim:
 	ASSERT(!xfs_isiflocked(ip));
@@ -1153,6 +1145,14 @@ xfs_reclaim_inode(
 
 	__xfs_inode_free(ip);
 	return true;
+
+out_ifunlock:
+	xfs_ifunlock(ip);
+out_iunlock:
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+out:
+	xfs_iflags_clear(ip, XFS_IRECLAIM);
+	return false;
 }
 
 /*

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 22/24] xfs: rework xfs_iflush_cluster() dirty inode iteration
  2020-05-22  3:50 ` [PATCH 22/24] xfs: rework xfs_iflush_cluster() dirty inode iteration Dave Chinner
  2020-05-23  0:13   ` Darrick J. Wong
@ 2020-05-23 11:31   ` Christoph Hellwig
  2020-05-23 23:23     ` Dave Chinner
  2020-05-23 11:39   ` Christoph Hellwig
  2 siblings, 1 reply; 91+ messages in thread
From: Christoph Hellwig @ 2020-05-23 11:31 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, May 22, 2020 at 01:50:27PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Now that we have all the dirty inodes attached to the cluster
> buffer, we don't actually have to do radix tree lookups to find
> them. Sure, the radix tree is efficient, but walking a linked list
> of just the dirty inodes attached to the buffer is much better.
> 
> We are also no longer dependent on having a locked inode passed into
> the function to determine where to start the lookup. This means we
> can drop it from the function call and treat all inodes the same.
> 
> We also make xfs_iflush_cluster skip inodes marked with
> XFS_IRECLAIM. This we avoid races with inodes that reclaim is
> actively referencing or are being re-initialised by inode lookup. If
> they are actually dirty, they'll get written by a future cluster
> flush....
> 
> We also add a shutdown check after obtaining the flush lock so that
> we catch inodes that are dirty in memory and may have inconsistent
> state due to the shutdown in progress. We abort these inodes
> directly and so they remove themselves directly from the buffer list
> and the AIL rather than having to wait for the buffer to be failed
> and callbacks run to be processed correctly.

I suspect we should just kill off xfs_iflush_cluster with this, as it
is best split between xfs_iflush and xfs_inode_item_push.  POC patch
below, but as part of that I noticed some really odd error handling,
which I'll bring up separately.

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 6f873eca8c916..8784e70626403 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1102,12 +1102,12 @@ xfs_reclaim_inode(
 	/*
 	 * Because we use RCU freeing we need to ensure the inode always appears
 	 * to be reclaimed with an invalid inode number when in the free state.
-	 * We do this as early as possible under the ILOCK so that
-	 * xfs_iflush_cluster() and xfs_ifree_cluster() can be guaranteed to
-	 * detect races with us here. By doing this, we guarantee that once
-	 * xfs_iflush_cluster() or xfs_ifree_cluster() has locked XFS_ILOCK that
-	 * it will see either a valid inode that will serialise correctly, or it
-	 * will see an invalid inode that it can skip.
+	 * We do this as early as possible under the ILOCK so that xfs_iflush()
+	 * and xfs_ifree_cluster() can be guaranteed to detect races with us
+	 * here. By doing this, we guarantee that once xfs_iflush() or
+	 * xfs_ifree_cluster() has locked XFS_ILOCK that it will see either a
+	 * valid inode that will serialise correctly, or it will see an invalid
+	 * inode that it can skip.
 	 */
 	spin_lock(&ip->i_flags_lock);
 	ip->i_flags = XFS_IRECLAIM;
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 5bbdb0c5ed172..8b0197a386a0f 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3431,19 +3431,70 @@ xfs_rename(
 
 int
 xfs_iflush(
-	struct xfs_inode	*ip,
+	struct xfs_inode_log_item *iip,
 	struct xfs_buf		*bp)
 {
-	struct xfs_inode_log_item *iip = ip->i_itemp;
-	struct xfs_dinode	*dip;
+	struct xfs_inode	*ip = iip->ili_inode;
 	struct xfs_mount	*mp = ip->i_mount;
-	int			error;
+	struct xfs_dinode	*dip;
+	int			error = 0;
+
+	/*
+	 * Quick and dirty check to avoid locks if possible.
+	 */
+	if (__xfs_iflags_test(ip, XFS_IRECLAIM | XFS_IFLOCK))
+		return 0;
+	if (xfs_ipincount(ip))
+		return 0;
+
+	/*
+	 * The inode is still attached to the buffer, which means it is dirty
+	 * but reclaim might try to grab it.  Check carefully for that, and grab
+	 * the ilock while still holding the i_flags_lock to guarantee reclaim
+	 * will not be able to reclaim this inode once we drop the i_flags_lock.
+	 */
+	spin_lock(&ip->i_flags_lock);
+	ASSERT(!__xfs_iflags_test(ip, XFS_ISTALE));
+	if (__xfs_iflags_test(ip, XFS_IRECLAIM | XFS_IFLOCK))
+		goto out_unlock_flags_lock;
+
+	/*
+	 * ILOCK will pin the inode against reclaim and prevent concurrent
+	 * transactions modifying the inode while we are flushing the inode.
+	 */
+	if (!xfs_ilock_nowait(ip, XFS_ILOCK_SHARED))
+		goto out_unlock_flags_lock;
+	spin_unlock(&ip->i_flags_lock);
+
+	/*
+	 * Skip inodes that are already flush locked as they have already been
+	 * written to the buffer.
+	 */
+	if (!xfs_iflock_nowait(ip))
+		goto out_unlock_ilock;
+
+	/*
+	 * If we are shut down, unpin and abort the inode now as there is no
+	 * point in flushing it to the buffer just to get an IO completion to
+	 * abort the buffer and remove it from the AIL.
+	 */
+	if (XFS_FORCED_SHUTDOWN(ip->i_mount)) {
+		xfs_iunpin_wait(ip);
+		/* xfs_iflush_abort() drops the flush lock */
+		xfs_iflush_abort(ip);
+		error = -EIO;
+		goto out_unlock_ilock;
+	}
+
+	/* don't block waiting on a log force to unpin dirty inodes */
+	if (xfs_ipincount(ip) || xfs_inode_clean(ip)) {
+		xfs_ifunlock(ip);
+		goto out_unlock_ilock;
+	}
 
-	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL|XFS_ILOCK_SHARED));
-	ASSERT(xfs_isiflocked(ip));
 	ASSERT(ip->i_df.if_format != XFS_DINODE_FMT_BTREE ||
 	       ip->i_df.if_nextents > XFS_IFORK_MAXEXT(ip, XFS_DATA_FORK));
-	ASSERT(iip != NULL && iip->ili_fields != 0);
+	ASSERT(iip->ili_fields != 0);
 	ASSERT(iip->ili_item.li_buf == bp);
 
 	dip = xfs_buf_offset(bp, ip->i_imap.im_boffset);
@@ -3574,122 +3625,13 @@ xfs_iflush(
 
 	/* generate the checksum. */
 	xfs_dinode_calc_crc(mp, dip);
-	return error;
-}
-
-/*
- * Non-blocking flush of dirty inode metadata into the backing buffer.
- *
- * The caller must have a reference to the inode and hold the cluster buffer
- * locked. The function will walk across all the inodes on the cluster buffer it
- * can find and lock without blocking, and flush them to the cluster buffer.
- *
- * On success, the caller must write out the buffer returned in *bp and
- * release it. On failure, the filesystem will be shut down, the buffer will
- * have been unlocked and released, and EFSCORRUPTED will be returned.
- */
-int
-xfs_iflush_cluster(
-	struct xfs_buf		*bp)
-{
-	struct xfs_mount	*mp = bp->b_mount;
-	struct xfs_log_item	*lip, *n;
-	struct xfs_inode	*ip;
-	struct xfs_inode_log_item *iip;
-	int			clcount = 0;
-	int			error = 0;
 
-	/*
-	 * We must use the safe variant here as on shutdown xfs_iflush_abort()
-	 * can remove itself from the list.
-	 */
-	list_for_each_entry_safe(lip, n, &bp->b_li_list, li_bio_list) {
-		iip = (struct xfs_inode_log_item *)lip;
-		ip = iip->ili_inode;
-
-		/*
-		 * Quick and dirty check to avoid locks if possible.
-		 */
-		if (__xfs_iflags_test(ip, XFS_IRECLAIM | XFS_IFLOCK))
-			continue;
-		if (xfs_ipincount(ip))
-			continue;
-
-		/*
-		 * The inode is still attached to the buffer, which means it is
-		 * dirty but reclaim might try to grab it. Check carefully for
-		 * that, and grab the ilock while still holding the i_flags_lock
-		 * to guarantee reclaim will not be able to reclaim this inode
-		 * once we drop the i_flags_lock.
-		 */
-		spin_lock(&ip->i_flags_lock);
-		ASSERT(!__xfs_iflags_test(ip, XFS_ISTALE));
-		if (__xfs_iflags_test(ip, XFS_IRECLAIM | XFS_IFLOCK)) {
-			spin_unlock(&ip->i_flags_lock);
-			continue;
-		}
-
-		/*
-		 * ILOCK will pin the inode against reclaim and prevent
-		 * concurrent transactions modifying the inode while we are
-		 * flushing the inode.
-		 */
-		if (!xfs_ilock_nowait(ip, XFS_ILOCK_SHARED)) {
-			spin_unlock(&ip->i_flags_lock);
-			continue;
-		}
-		spin_unlock(&ip->i_flags_lock);
-
-		/*
-		 * Skip inodes that are already flush locked as they have
-		 * already been written to the buffer.
-		 */
-		if (!xfs_iflock_nowait(ip)) {
-			xfs_iunlock(ip, XFS_ILOCK_SHARED);
-			continue;
-		}
-
-		/*
-		 * If we are shut down, unpin and abort the inode now as there
-		 * is no point in flushing it to the buffer just to get an IO
-		 * completion to abort the buffer and remove it from the AIL.
-		 */
-		if (XFS_FORCED_SHUTDOWN(mp)) {
-			xfs_iunpin_wait(ip);
-			/* xfs_iflush_abort() drops the flush lock */
-			xfs_iflush_abort(ip);
-			xfs_iunlock(ip, XFS_ILOCK_SHARED);
-			error = EIO;
-			continue;
-		}
-
-		/* don't block waiting on a log force to unpin dirty inodes */
-		if (xfs_ipincount(ip)) {
-			xfs_ifunlock(ip);
-			xfs_iunlock(ip, XFS_ILOCK_SHARED);
-			continue;
-		}
-
-		if (!xfs_inode_clean(ip))
-			error = xfs_iflush(ip, bp);
-		else
-			xfs_ifunlock(ip);
-		xfs_iunlock(ip, XFS_ILOCK_SHARED);
-		if (error)
-			break;
-		clcount++;
-	}
-
-	if (clcount) {
-		XFS_STATS_INC(mp, xs_icluster_flushcnt);
-		XFS_STATS_ADD(mp, xs_icluster_flushinode, clcount);
-	}
+out_unlock_ilock:
+	xfs_iunlock(ip, XFS_ILOCK_SHARED);
+	return error;
 
-	if (error) {
-		bp->b_flags |= XBF_ASYNC;
-		xfs_buf_ioend_fail(bp);
-		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
-	}
+out_unlock_flags_lock:
+	spin_unlock(&ip->i_flags_lock);
 	return error;
 }
 
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index b93cf9076df8a..870e6768e119e 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -427,7 +427,7 @@ int		xfs_log_force_inode(struct xfs_inode *ip);
 void		xfs_iunpin_wait(xfs_inode_t *);
 #define xfs_ipincount(ip)	((unsigned int) atomic_read(&ip->i_pincount))
 
-int		xfs_iflush_cluster(struct xfs_buf *);
+int		xfs_iflush(struct xfs_inode_log_item *iip, struct xfs_buf *bp);
 void		xfs_lock_two_inodes(struct xfs_inode *ip0, uint ip0_mode,
 				struct xfs_inode *ip1, uint ip1_mode);
 
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 4dd4f45dcc46e..b82d7f3628745 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -501,8 +501,10 @@ xfs_inode_item_push(
 	struct xfs_inode_log_item *iip = INODE_ITEM(lip);
 	struct xfs_inode	*ip = iip->ili_inode;
 	struct xfs_buf		*bp = lip->li_buf;
+	struct xfs_mount	*mp = bp->b_mount;
+	struct xfs_log_item	*l, *n;
 	uint			rval = XFS_ITEM_SUCCESS;
-	int			error;
+	int			clcount = 0;
 
 	ASSERT(iip->ili_item.li_buf);
 
@@ -530,17 +532,34 @@ xfs_inode_item_push(
 	 * for IO until we queue it for delwri submission.
 	 */
 	xfs_buf_hold(bp);
-	error = xfs_iflush_cluster(bp);
-	if (!error) {
-		if (!xfs_buf_delwri_queue(bp, buffer_list)) {
-			ASSERT(0);
-			rval = XFS_ITEM_FLUSHING;
+
+	/*
+	 * We must use the safe variant here as on shutdown xfs_iflush_abort()
+	 * can remove itself from the list.
+	 */
+	list_for_each_entry_safe(l, n, &bp->b_li_list, li_bio_list) {
+		if (xfs_iflush(INODE_ITEM(l), bp)) {
+			bp->b_flags |= XBF_ASYNC;
+			xfs_buf_ioend_fail(bp);
+			xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
+			rval = XFS_ITEM_LOCKED;
+			goto out_unlock;
 		}
-		xfs_buf_relse(bp);
-	} else {
-		rval = XFS_ITEM_LOCKED;
+		clcount++;
+	}
+
+	if (clcount) {
+		XFS_STATS_INC(mp, xs_icluster_flushcnt);
+		XFS_STATS_ADD(mp, xs_icluster_flushinode, clcount);
+	}
+
+	if (!xfs_buf_delwri_queue(bp, buffer_list)) {
+		ASSERT(0);
+		rval = XFS_ITEM_FLUSHING;
 	}
+	xfs_buf_relse(bp);
 
+out_unlock:
 	spin_lock(&lip->li_ailp->ail_lock);
 	return rval;
 }

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 23/24] xfs: factor xfs_iflush_done
  2020-05-22  3:50 ` [PATCH 23/24] xfs: factor xfs_iflush_done Dave Chinner
  2020-05-23  0:20   ` Darrick J. Wong
@ 2020-05-23 11:35   ` Christoph Hellwig
  1 sibling, 0 replies; 91+ messages in thread
From: Christoph Hellwig @ 2020-05-23 11:35 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, May 22, 2020 at 01:50:28PM +1000, Dave Chinner wrote:
> +xfs_iflush_ail_updates(
> +	struct xfs_ail		*ailp,
> +	struct list_head	*list)
>  {
> +	struct xfs_log_item	*lip;
> +	xfs_lsn_t		tail_lsn = 0;
>  
> +	spin_lock(&ailp->ail_lock);
> +	list_for_each_entry(lip, list, li_bio_list) {
> +		struct xfs_inode_log_item *iip = INODE_ITEM(lip);
> +		xfs_lsn_t	lsn;
>  
> +		if (iip->ili_flush_lsn != lip->li_lsn) {
> +			xfs_clear_li_failed(lip);
>  			continue;
>  		}
>  
> +		lsn = xfs_ail_delete_one(ailp, lip);
> +		if (!tail_lsn && lsn)
> +			tail_lsn = lsn;

iip is only used once, this could be shortened by using INODE_ITEM()
directly in the if expession:

	if (INODE_ITEM(lip)->ili_flush_lsn != lip->li_lsn) {

> +	list_for_each_entry_safe(lip, n, &bp->b_li_list, li_bio_list) {
> +		struct xfs_inode_log_item *iip = INODE_ITEM(lip);
> +		if (!iip->ili_last_fields)
> +			continue;

Please add an empty line after the variable declaration.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 24/24] xfs: remove xfs_inobp_check()
  2020-05-22  3:50 ` [PATCH 24/24] xfs: remove xfs_inobp_check() Dave Chinner
  2020-05-23  0:16   ` Darrick J. Wong
@ 2020-05-23 11:36   ` Christoph Hellwig
  1 sibling, 0 replies; 91+ messages in thread
From: Christoph Hellwig @ 2020-05-23 11:36 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 22/24] xfs: rework xfs_iflush_cluster() dirty inode iteration
  2020-05-22  3:50 ` [PATCH 22/24] xfs: rework xfs_iflush_cluster() dirty inode iteration Dave Chinner
  2020-05-23  0:13   ` Darrick J. Wong
  2020-05-23 11:31   ` Christoph Hellwig
@ 2020-05-23 11:39   ` Christoph Hellwig
  2 siblings, 0 replies; 91+ messages in thread
From: Christoph Hellwig @ 2020-05-23 11:39 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, May 22, 2020 at 01:50:27PM +1000, Dave Chinner wrote:
> +		if (XFS_FORCED_SHUTDOWN(mp)) {
> +			xfs_iunpin_wait(ip);
> +			/* xfs_iflush_abort() drops the flush lock */
> +			xfs_iflush_abort(ip);
> +			xfs_iunlock(ip, XFS_ILOCK_SHARED);
> +			error = EIO;
>  			continue;
>  		}

This looks suspicious.  For one it assigns a postitive errno value to
error. But also it continues into the next iteration, which will then
usually overwrite error again, unless this was the last inode to be
written.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 00/24] xfs: rework inode flushing to make inode reclaim fully asynchronous
  2020-05-22  4:04 ` [PATCH 00/24] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
@ 2020-05-23 16:18   ` Darrick J. Wong
  2020-05-23 21:22     ` Dave Chinner
  0 siblings, 1 reply; 91+ messages in thread
From: Darrick J. Wong @ 2020-05-23 16:18 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, May 22, 2020 at 02:04:01PM +1000, Dave Chinner wrote:
> 
> FWIW, I forgot to put it in the original description - the series
> can be pulled from my git tree here:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git xfs-async-inode-reclaim

Hmm, so I tried this out with quotas enabled and hit this in xfs/438:

MKFS_OPTIONS="-m reflink=1,rmapbt=1 -i sparse=1 /dev/sdf
MOUNT_OPTIONS="-o usrquota,grpquota,prjquota"

 BUG: kernel NULL pointer dereference, address: 0000000000000020
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 0 P4D 0 
 Oops: 0000 [#1] PREEMPT SMP
 CPU: 3 PID: 824887 Comm: xfsaild/dm-0 Tainted: G        W         5.7.0-rc4-djw #rc4
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1 04/01/2014
 RIP: 0010:do_raw_spin_trylock+0x5/0x40
 Code: 64 de 81 48 89 ef e8 ba fe ff ff eb 8b 89 c6 48 89 ef e8 de dc ff ff 66 90 eb 8b 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 <8b> 07 85 c0 75 28 ba 01 00 00 00 f0 0f b1 17 75 1d 65 8b 05 83 d8
 RSP: 0018:ffffc90000afbdc0 EFLAGS: 00010086
 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
 RDX: ffff888070ee0000 RSI: 0000000000000000 RDI: 0000000000000020
 RBP: 0000000000000020 R08: 0000000000000001 R09: 0000000000000001
 R10: 0000000000000000 R11: ffffc90000afbc3d R12: 0000000000000038
 R13: 0000000000000202 R14: 0000000000000003 R15: ffff88800688a600
 FS:  0000000000000000(0000) GS:ffff88807e000000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000020 CR3: 000000003bba2001 CR4: 00000000001606a0
 Call Trace:
  _raw_spin_lock_irqsave+0x47/0x80
  ? down_trylock+0xf/0x30
  down_trylock+0xf/0x30
  xfs_buf_trylock+0x1a/0x1f0 [xfs]
  xfsaild+0xb69/0x1320 [xfs]
  kthread+0x130/0x170
  ? xfs_trans_ail_cursor_first+0x80/0x80 [xfs]
  ? kthread_park+0x90/0x90
  ret_from_fork+0x3a/0x50
 Modules linked in: dm_thin_pool dm_persistent_data dm_bio_prison btrfs blake2b_generic xor zstd_decompress zstd_compress lzo_compress lzo_decompress zlib_deflate raid6_pq dm_snapshot dm_bufio dm_flakey xfs libcrc32c ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 ip_set_hash_ip xt_REDIRECT ip_set_hash_net xt_set ip_set_hash_mac xt_tcpudp ip_set iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nfnetlink ip6table_filter ip6_tables iptable_filter bfq sch_fq_codel ip_tables x_tables nfsv4 af_packet [last unloaded: scsi_debug]
 Dumping ftrace buffer:
    (ftrace buffer empty)
 CR2: 0000000000000020
 ---[ end trace 4ac61a00d1e3b068 ]---

--D

> Cheers,
> 
> Dave.
> 
> On Fri, May 22, 2020 at 01:50:05PM +1000, Dave Chinner wrote:
> > Hi folks,
> > 
> > Inode flushing requires that we first lock an inode, then check it,
> > then lock the underlying buffer, flush the inode to the buffer and
> > finally add the inode to the buffer to be unlocked on IO completion.
> > We then walk all the other cached inodes in the buffer range and
> > optimistically lock and flush them to the buffer without blocking.
> > 
> > This cluster write effectively repeats the same code we do with the
> > initial inode, except now it has to special case that initial inode
> > that is already locked. Hence we have multiple copies of very
> > similar code, and it is a result of inode cluster flushing being
> > based on a specific inode rather than grabbing the buffer and
> > flushing all available inodes to it.
> > 
> > The problem with this at the moment is that we we can't look up the
> > buffer until we have guaranteed that an inode is held exclusively
> > and it's not going away while we get the buffer through an imap
> > lookup. Hence we are kinda stuck locking an inode before we can look
> > up the buffer.
> > 
> > This is also a result of inodes being detached from the cluster
> > buffer except when IO is being done. This has the further problem
> > that the cluster buffer can be reclaimed from memory and then the
> > inode can be dirtied. At this point cleaning the inode requires a
> > read-modify-write cycle on the cluster buffer. If we then are put
> > under memory pressure, cleaning that dirty inode to reclaim it
> > requires allocating memory for the cluster buffer and this leads to
> > all sorts of problems.
> > 
> > We used synchronous inode writeback in reclaim as a throttle that
> > provided a forwards progress mechanism when RMW cycles were required
> > to clean inodes. Async writeback of inodes (e.g. via the AIL) would
> > immediately exhaust remaining memory reserves trying to allocate
> > inode cluster after inode cluster. The synchronous writeback of an
> > inode cluster allowed reclaim to release the inode cluster and have
> > it freed almost immediately which could then be used to allocate the
> > next inode cluster buffer. Hence the IO based throttling mechanism
> > largely guaranteed forwards progress in inode reclaim. By removing
> > the requirement for require memory allocation for inode writeback
> > filesystem level, we can issue writeback asynchrnously and not have
> > to worry about the memory exhaustion anymore.
> > 
> > Another issue is that if we have slow disks, we can build up dirty
> > inodes in memory that can then take hours for an operation like
> > unmount to flush. A RMW cycle per inode on a slow RAID6 device can
> > mean we only clean 50 inodes a second, and when there are hundreds
> > of thousands of dirty inodes that need to be cleaned this can take a
> > long time. PInning the cluster buffers will greatly speed up inode
> > writeback on slow storage systems like this.
> > 
> > These limitations all stem from the same source: inode writeback is
> > inode centric, And they are largely solved by the same architectural
> > change: make inode writeback cluster buffer centric.  This series is
> > makes that architectural change.
> > 
> > Firstly, we start by pinning the inode backing buffer in memory
> > when an inode is marked dirty (i.e. when it is logged). By tracking
> > the number of dirty inodes on a buffer as a counter rather than a
> > flag, we avoid the problem of overlapping inode dirtying and buffer
> > flushing racing to set/clear the dirty flag. Hence as long as there
> > is a dirty inode in memory, the buffer will not be able to be
> > reclaimed. We can safely do this inode cluster buffer lookup when we
> > dirty an inode as we do not hold the buffer locked - we merely take
> > a reference to it and then release it - and hence we don't cause any
> > new lock order issues.
> > 
> > When the inode is finally cleaned, the reference to the buffer can
> > be removed from the inode log item and the buffer released. This is
> > done from the inode completion callbacks that are attached to the
> > buffer when the inode is flushed.
> > 
> > Pinning the cluster buffer in this way immediately avoids the RMW
> > problem in inode writeback and reclaim contexts by moving the memory
> > allocation and the blocking buffer read into the transaction context
> > that dirties the inode.  This inverts our dirty inode throttling
> > mechanism - we now throttle the rate at which we can dirty inodes to
> > rate at which we can allocate memory and read inode cluster buffers
> > into memory rather than via throttling reclaim to rate at which we
> > can clean dirty inodes.
> > 
> > Hence if we are under memory pressure, we'll block on memory
> > allocation when trying to dirty the referenced inode, rather than in
> > the memory reclaim path where we are trying to clean unreferenced
> > inodes to free memory.  Hence we no longer have to guarantee
> > forwards progress in inode reclaim as we aren't doing memory
> > allocation, and that means we can remove inode writeback from the
> > XFS inode shrinker completely without changing the system tolerance
> > for low memory operation.
> > 
> > Tracking the buffers via the inode log item also allows us to
> > completely rework the inode flushing mechanism. While the inode log
> > item is in the AIL, it is safe for the AIL to access any member of
> > the log item. Hence the AIL push mechanisms can access the buffer
> > attached to the inode without first having to lock the inode.
> > 
> > This means we can essentially lock the buffer directly and then
> > call xfs_iflush_cluster() without first going through xfs_iflush()
> > to find the buffer. Hence we can remove xfs_iflush() altogether,
> > because the two places that call it - the inode item push code and
> > inode reclaim - no longer need to flush inodes directly.
> > 
> > This can be further optimised by attaching the inode to the cluster
> > buffer when the inode is dirtied. i.e. when we add the buffer
> > reference to the inode log item, we also attach the inode to the
> > buffer for IO processing. This leads to the dirty inodes always
> > being attached to the buffer and hence we no longer need to add them
> > when we flush the inode and remove them when IO completes. Instead
> > the inodes are attached when the node log item is dirtied, and
> > removed when the inode log item is cleaned.
> > 
> > With this structure in place, we no longer need to do
> > lookups to find the dirty inodes in the cache to attach to the
> > buffer in xfs_iflush_cluster() - they are already attached to the
> > buffer. Hence when the AIL pushes an inode, we just grab the buffer
> > from the log item, and then walk the buffer log item list to lock
> > and flush the dirty inodes attached to the buffer.
> > 
> > This greatly simplifies inode writeback, and removes another memory
> > allocation from the inode writeback path (the array used for the
> > radix tree gang lookup). And while the radix tree lookups are fast,
> > walking the linked list of dirty inodes is faster.
> > 
> > There is followup work I am doing that uses the inode cluster buffer
> > as a replacement in the AIL for tracking dirty inodes. This part of
> > the series is not ready yet as it has some intricate locking
> > requirements. That is an optimisation, so I've left that out because
> > solving the inode reclaim blocking problems is the important part of
> > this work.
> > 
> > In short, this series simplifies inode writeback and fixes the long
> > standing inode reclaim blocking issues without requiring any changes
> > to the memory reclaim infrastructure.
> > 
> > Note: dquots should probably be converted to cluster flushing in a
> > similar way, as they have many of the same issues as inode flushing.
> > 
> > Thoughts, comments and improvemnts welcome.
> > 
> > -Dave.
> > 
> > 
> > 
> > Dave Chinner (24):
> >   xfs: remove logged flag from inode log item
> >   xfs: add an inode item lock
> >   xfs: mark inode buffers in cache
> >   xfs: mark dquot buffers in cache
> >   xfs: mark log recovery buffers for completion
> >   xfs: call xfs_buf_iodone directly
> >   xfs: clean up whacky buffer log item list reinit
> >   xfs: fold xfs_istale_done into xfs_iflush_done
> >   xfs: use direct calls for dquot IO completion
> >   xfs: clean up the buffer iodone callback functions
> >   xfs: get rid of log item callbacks
> >   xfs: pin inode backing buffer to the inode log item
> >   xfs: make inode reclaim almost non-blocking
> >   xfs: remove IO submission from xfs_reclaim_inode()
> >   xfs: allow multiple reclaimers per AG
> >   xfs: don't block inode reclaim on the ILOCK
> >   xfs: remove SYNC_TRYLOCK from inode reclaim
> >   xfs: clean up inode reclaim comments
> >   xfs: attach inodes to the cluster buffer when dirtied
> >   xfs: xfs_iflush() is no longer necessary
> >   xfs: rename xfs_iflush_int()
> >   xfs: rework xfs_iflush_cluster() dirty inode iteration
> >   xfs: factor xfs_iflush_done
> >   xfs: remove xfs_inobp_check()
> > 
> >  fs/xfs/libxfs/xfs_inode_buf.c   |  27 +-
> >  fs/xfs/libxfs/xfs_inode_buf.h   |   6 -
> >  fs/xfs/libxfs/xfs_trans_inode.c | 108 +++++--
> >  fs/xfs/xfs_buf.c                |  44 ++-
> >  fs/xfs/xfs_buf.h                |  49 +--
> >  fs/xfs/xfs_buf_item.c           | 205 +++++--------
> >  fs/xfs/xfs_buf_item.h           |   8 +-
> >  fs/xfs/xfs_buf_item_recover.c   |   5 +-
> >  fs/xfs/xfs_dquot.c              |  32 +-
> >  fs/xfs/xfs_dquot.h              |   1 +
> >  fs/xfs/xfs_dquot_item_recover.c |   2 +-
> >  fs/xfs/xfs_file.c               |   9 +-
> >  fs/xfs/xfs_icache.c             | 293 +++++-------------
> >  fs/xfs/xfs_inode.c              | 515 +++++++++++---------------------
> >  fs/xfs/xfs_inode.h              |   2 +-
> >  fs/xfs/xfs_inode_item.c         | 281 ++++++++---------
> >  fs/xfs/xfs_inode_item.h         |   9 +-
> >  fs/xfs/xfs_inode_item_recover.c |   2 +-
> >  fs/xfs/xfs_log_recover.c        |   5 +-
> >  fs/xfs/xfs_mount.c              |   4 -
> >  fs/xfs/xfs_mount.h              |   1 -
> >  fs/xfs/xfs_trans.h              |   3 -
> >  fs/xfs/xfs_trans_buf.c          |  15 +-
> >  fs/xfs/xfs_trans_priv.h         |  12 +-
> >  24 files changed, 680 insertions(+), 958 deletions(-)
> > 
> > -- 
> > 2.26.2.761.g0e0b3e54be
> > 
> > 
> 
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 00/24] xfs: rework inode flushing to make inode reclaim fully asynchronous
  2020-05-23 16:18   ` Darrick J. Wong
@ 2020-05-23 21:22     ` Dave Chinner
  0 siblings, 0 replies; 91+ messages in thread
From: Dave Chinner @ 2020-05-23 21:22 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Sat, May 23, 2020 at 09:18:33AM -0700, Darrick J. Wong wrote:
> On Fri, May 22, 2020 at 02:04:01PM +1000, Dave Chinner wrote:
> > 
> > FWIW, I forgot to put it in the original description - the series
> > can be pulled from my git tree here:
> > 
> > git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git xfs-async-inode-reclaim
> 
> Hmm, so I tried this out with quotas enabled and hit this in xfs/438:

Yeah, I found another bug about 2 hours after I send this - the
iodone error ->li_error callouts are not handled correctly, but I
haven't seen this one.

> 
> MKFS_OPTIONS="-m reflink=1,rmapbt=1 -i sparse=1 /dev/sdf
> MOUNT_OPTIONS="-o usrquota,grpquota,prjquota"
> 
>  BUG: kernel NULL pointer dereference, address: 0000000000000020
>  #PF: supervisor read access in kernel mode
>  #PF: error_code(0x0000) - not-present page
>  PGD 0 P4D 0 
>  Oops: 0000 [#1] PREEMPT SMP
>  CPU: 3 PID: 824887 Comm: xfsaild/dm-0 Tainted: G        W         5.7.0-rc4-djw #rc4
>  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1 04/01/2014
>  RIP: 0010:do_raw_spin_trylock+0x5/0x40
>  Code: 64 de 81 48 89 ef e8 ba fe ff ff eb 8b 89 c6 48 89 ef e8 de dc ff ff 66 90 eb 8b 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 <8b> 07 85 c0 75 28 ba 01 00 00 00 f0 0f b1 17 75 1d 65 8b 05 83 d8
>  RSP: 0018:ffffc90000afbdc0 EFLAGS: 00010086
>  RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
>  RDX: ffff888070ee0000 RSI: 0000000000000000 RDI: 0000000000000020
>  RBP: 0000000000000020 R08: 0000000000000001 R09: 0000000000000001
>  R10: 0000000000000000 R11: ffffc90000afbc3d R12: 0000000000000038
>  R13: 0000000000000202 R14: 0000000000000003 R15: ffff88800688a600
>  FS:  0000000000000000(0000) GS:ffff88807e000000(0000) knlGS:0000000000000000
>  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>  CR2: 0000000000000020 CR3: 000000003bba2001 CR4: 00000000001606a0
>  Call Trace:
>   _raw_spin_lock_irqsave+0x47/0x80
>   ? down_trylock+0xf/0x30
>   down_trylock+0xf/0x30
>   xfs_buf_trylock+0x1a/0x1f0 [xfs]
>   xfsaild+0xb69/0x1320 [xfs]
>   kthread+0x130/0x170

Where is xfsaild calling xfs_buf_trylock directly?

Oh, resubmission of failed inode and dquot items, which may well be
the same problem as I mentioned above. I'll try to reproduce on
Monday...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 12/24] xfs: pin inode backing buffer to the inode log item
  2020-05-23  9:34   ` Christoph Hellwig
@ 2020-05-23 21:43     ` Dave Chinner
  2020-05-24  5:31       ` Christoph Hellwig
  0 siblings, 1 reply; 91+ messages in thread
From: Dave Chinner @ 2020-05-23 21:43 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Sat, May 23, 2020 at 02:34:51AM -0700, Christoph Hellwig wrote:
> > --- a/fs/xfs/xfs_trans_priv.h
> > +++ b/fs/xfs/xfs_trans_priv.h
> > @@ -143,15 +143,10 @@ static inline void
> >  xfs_clear_li_failed(
> >  	struct xfs_log_item	*lip)
> >  {
> > -	struct xfs_buf	*bp = lip->li_buf;
> > -
> >  	ASSERT(test_bit(XFS_LI_IN_AIL, &lip->li_flags));
> >  	lockdep_assert_held(&lip->li_ailp->ail_lock);
> >  
> > -	if (test_and_clear_bit(XFS_LI_FAILED, &lip->li_flags)) {
> > -		lip->li_buf = NULL;
> > -		xfs_buf_rele(bp);
> > -	}
> > +	clear_bit(XFS_LI_FAILED, &lip->li_flags);
> >  }
> >  
> >  static inline void
> > @@ -161,10 +156,7 @@ xfs_set_li_failed(
> >  {
> >  	lockdep_assert_held(&lip->li_ailp->ail_lock);
> >  
> > -	if (!test_and_set_bit(XFS_LI_FAILED, &lip->li_flags)) {
> > -		xfs_buf_hold(bp);
> > -		lip->li_buf = bp;
> > -	}
> > +	set_bit(XFS_LI_FAILED, &lip->li_flags);
> >  }
> 
> Isn't this going to break quotas, which don't always have li_buf set?

Yup, that'll be the assert fail that Darrick reported. Well spotted,
Christoph!

I've got to rework the error handling code anyway, so I might end up
getting rid of ->li_error and hard coding these like I've done the
iodone functions. That way the different objects can use different
failure mechanisms until the dquot code is converted to the same
"hold at dirty time" flushing mechanism...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 13/24] xfs: make inode reclaim almost non-blocking
  2020-05-22 22:48   ` Darrick J. Wong
@ 2020-05-23 22:29     ` Dave Chinner
  0 siblings, 0 replies; 91+ messages in thread
From: Dave Chinner @ 2020-05-23 22:29 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Fri, May 22, 2020 at 03:48:06PM -0700, Darrick J. Wong wrote:
> On Fri, May 22, 2020 at 01:50:18PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Now that dirty inode writeback doesn't cause read-modify-write
> > cycles on the inode cluster buffer under memory pressure, the need
> > to throttle memory reclaim to the rate at which we can clean dirty
> > inodes goes away. That is due to the fact that we no longer thrash
> > inode cluster buffers under memory pressure to clean dirty inodes.
> > 
> > This means inode writeback no longer stalls on memory allocation
> > or read IO, and hence can be done asynchrnously without generating
> 
> "...asynchronously..."
> 
> > memory pressure. As a result, blocking inode writeback in reclaim is
> > no longer necessary to prevent reclaim priority windup as cleaning
> > dirty inodes is no longer dependent on having memory reserves
> > available for the filesystem to make progress reclaiming inodes.
> > 
> > Hence we can convert inode reclaim to be non-blocking for shrinker
> > callouts, both for direct reclaim and kswapd.
> > 
> > On a vanilla kernel, running a 16-way fsmark create workload on a
> > 4 node/16p/16GB RAM machine, I can reliably pin 14.75GB of RAM via
> > userspace mlock(). The OOM killer gets invoked at 15GB of
> > pinned RAM.
> > 
> > With this patch alone, pinning memory triggers premature OOM
> > killer invocation, sometimes with as much as 45% of RAM being free.
> > It's trivially easy to trigger the OOM killer when reclaim does not
> > block.
> > 
> > With pinning inode clusters in RAM adn then adding this patch, I can
> > reliably pin 14.5GB of RAM and still have the fsmark workload run to
> > completion. The OOM killer gets invoked 14.75GB of pinned RAM, which
> > is only a small amount of memory less than the vanilla kernel. It is
> > much more reliable than just with async reclaim alone.
> 
> So the lack of OOM kills is the result of not having to do RMW and
> ratcheting up the reclaim priority, right?

Effectively. The ratcheting up the reclaim priority without
writeback is a secondary effect of RMW in inode writeback.

That is, the AIL blocks on memory reclaim doing dirty inode
writeback because it has unbound demand (async flushing). Hence it
exhausts memory reserves if there are lots of dirty inodes. It's
also PF_MEMALLOC so, like kswapd, it can dip into certain reserves
that normal allocation can't.

The synchronous write behaviour of reclaim, however, bounds memory
demand at (N * ag count * pages per inode cluster), and hence it is
much more likely to make forwards progress, albeit slowly. The
synchronous write also has the effect of throttling the rate at
which reclaim cycles, hence slowly down the speed at which it ramps
up the reclaim priority rate. IOWs, we get both forwards progress
and lower reclaim priority because we block reclaim like this.

IOWs, removing the synchronous writeback from reclaim does two
things. The first is that it removes the ability to make forwards
progress reclaiming inodes from XFS when there is very low free
memory. This is bad for obvious reasons.

The second is that it allows reclaim to think it can't free
inode memory quickly and that's what causes the increase in reclaim
priority. i.e. it needs more scan loops to free inodes because
writeback of dirty inodes is slow and not making progress. This is
also bad, because we can make progress, just not as fast as memory
reclaim is capable of backing off from.

The sync writeback of inode clusters from reclaim mitigated both of
these issues when they occurred at the cost of increased allocation
latency at extreme OOM conditions...

This is why, despite everyone with OOM latency problems claiming "it
works for them so you should just merge it", just skipping inode
writeback in the shrinker has not been a solution to the problem -
it didn't solve the underlying "reclaim of dirty inodes can create
unbound memory demand" problem that the sync inode writeback
controlled.

Previous attempts to solve this problem had been focussed on
replacing the throttling the shrinker did with backoffs in the core
reclaim algorithms, but that's made no progress on the mm/ side of
things. Hence this patchset - trying to tackle the problem from a
different direction so we are no longer reliant on changing core OS
infrastructure to solve problems XFS users are having.

> And, {con|per}versely, can I run fstests with 400MB of RAM now? :D

If it is bound on sync inode writeback from memory reclaim, then it
will help, otherwise it may make things worse because the trade off
we are making here is that dirty inodes can pin substantially more
memory in cache while they queue to be written back.

Yup, that's the ugly downside of this approach. Rather than have the
core memory reclaim throttle and wait according to what we need it
to do, we simply make the XFS cache footprint larger every time we
dirty an inode. It also costs us 1-2% extra CPU per transaction, so
this change certainly isn't free. IMO, it's most definitely not the
most efficient, performant or desirable solution to the problem, but
it's one that works and is wholly contained within XFS.

> > simoops shows that allocation stalls go away when async reclaim is
> > used. Vanilla kernel:
> > 
> > Run time: 1924 seconds
> > Read latency (p50: 3,305,472) (p95: 3,723,264) (p99: 4,001,792)
> > Write latency (p50: 184,064) (p95: 553,984) (p99: 807,936)
> > Allocation latency (p50: 2,641,920) (p95: 3,911,680) (p99: 4,464,640)
> > work rate = 13.45/sec (avg 13.44/sec) (p50: 13.46) (p95: 13.58) (p99: 13.70)
> > alloc stall rate = 3.80/sec (avg: 2.59) (p50: 2.54) (p95: 2.96) (p99: 3.02)
> > 
> > With inode cluster pinning and async reclaim:
> > 
> > Run time: 1924 seconds
> > Read latency (p50: 3,305,472) (p95: 3,715,072) (p99: 3,977,216)
> > Write latency (p50: 187,648) (p95: 553,984) (p99: 789,504)
> > Allocation latency (p50: 2,748,416) (p95: 3,919,872) (p99: 4,448,256)
> 
> I'm not familiar with simoops, and ElGoog is not helpful.  What are the
> units here?

Microseconds, IIRC.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 14/24] xfs: remove IO submission from xfs_reclaim_inode()
  2020-05-23  9:40   ` Christoph Hellwig
@ 2020-05-23 22:35     ` Dave Chinner
  0 siblings, 0 replies; 91+ messages in thread
From: Dave Chinner @ 2020-05-23 22:35 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Sat, May 23, 2020 at 02:40:55AM -0700, Christoph Hellwig wrote:
> > +out_ifunlock:
> > +	xfs_ifunlock(ip);
> > +out:
> > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > +	xfs_iflags_clear(ip, XFS_IRECLAIM);
> > +	return false;
> >  
> >  reclaim:
> 
> I find the reordering of the error handling to sit before the actual
> reclaim action here really weird to read.  What about something like
> this folded in instead?

Yeah, once the writeback restart goes away, it does look somewhat
weird.  I've been considering pulling the actual inode reclaim code
into it's own function, just to separate it from the "try-lock and
check if we can reclaim this inode" operations.

I'll have a look and see what falls out...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 15/24] xfs: allow multiple reclaimers per AG
  2020-05-22 23:10   ` Darrick J. Wong
@ 2020-05-23 22:35     ` Dave Chinner
  0 siblings, 0 replies; 91+ messages in thread
From: Dave Chinner @ 2020-05-23 22:35 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Fri, May 22, 2020 at 04:10:07PM -0700, Darrick J. Wong wrote:
> On Fri, May 22, 2020 at 01:50:20PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Inode reclaim will still throttle direct reclaim on the per-ag
> > reclaim locks. This is no longer necessary as reclaim can run
> > non-blocking now. Hence we can remove these locks so that we don't
> > arbitrarily block reclaimers just because there are more direct
> > reclaimers than there are AGs.
> > 
> > This can result in multiple reclaimers working on the same range of
> > an AG, but this doesn't cause any apparent issues. Optimising the
> > spread of concurrent reclaimers for best efficiency can be done in a
> > future patchset.
> 
> "Future patchset" as in "not in the 9 patches that I have left to read"?

Yes.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 17/24] xfs: remove SYNC_TRYLOCK from inode reclaim
  2020-05-22 23:14   ` Darrick J. Wong
@ 2020-05-23 22:42     ` Dave Chinner
  0 siblings, 0 replies; 91+ messages in thread
From: Dave Chinner @ 2020-05-23 22:42 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Fri, May 22, 2020 at 04:14:35PM -0700, Darrick J. Wong wrote:
> On Fri, May 22, 2020 at 01:50:22PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > All background reclaim is SYNC_TRYLOCK already, and even blocking
> > reclaim (SYNC_WAIT) can use trylock mechanisms as
> > xfs_reclaim_inodes_ag() will keep cycling until there are no more
> > reclaimable inodes. Hence we can kill SYNC_TRYLOCK from inode
> > reclaim and make everything unconditionally non-blocking.
> 
> Random question: Does xfs_quiesce_attr need to call xfs_reclaim_inodes
> twice now, or does the second SYNC_WAIT call suffice now?

I have another patch that drops it completely from
xfs_quiesce_attr() because it is largely unnecessary. That got
tangled up in a fixing other bugs in the patchset, so I dropped it
to get this out the door rather than have to rewrite the patch
-again-.

Essentially, xfs_quiesce_attr() used inode reclaim to flush dirty
inodes that the VFS didn't know about before we froze the
filesystem. It's a historical artifact that dates back to before we
logged every change to inodes, as AIL pushing couldn't clean
unlogged inode changes.  We don't actually need to reclaim inodes
here now that xfs_log_quiesce() pushes the AIL and waits for all
metadata writeback to complete.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 19/24] xfs: attach inodes to the cluster buffer when dirtied
  2020-05-22 23:48   ` Darrick J. Wong
@ 2020-05-23 22:59     ` Dave Chinner
  0 siblings, 0 replies; 91+ messages in thread
From: Dave Chinner @ 2020-05-23 22:59 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Fri, May 22, 2020 at 04:48:59PM -0700, Darrick J. Wong wrote:
> On Fri, May 22, 2020 at 01:50:24PM +1000, Dave Chinner wrote:
> > @@ -2649,53 +2650,12 @@ xfs_ifree_cluster(
> >  		bp->b_ops = &xfs_inode_buf_ops;
> >  
> >  		/*
> > -		 * Walk the inodes already attached to the buffer and mark them
> > -		 * stale. These will all have the flush locks held, so an
> > -		 * in-memory inode walk can't lock them. By marking them all
> > -		 * stale first, we will not attempt to lock them in the loop
> > -		 * below as the XFS_ISTALE flag will be set.
> > -		 */
> > -		list_for_each_entry(lip, &bp->b_li_list, li_bio_list) {
> > -			if (lip->li_type == XFS_LI_INODE) {
> > -				iip = (struct xfs_inode_log_item *)lip;
> > -				xfs_trans_ail_copy_lsn(mp->m_ail,
> > -							&iip->ili_flush_lsn,
> > -							&iip->ili_item.li_lsn);
> > -				xfs_iflags_set(iip->ili_inode, XFS_ISTALE);
> > -			}
> > -		}
> 
> Hm.  I think I'm a little confused here.  I think the consequence of
> attaching inode items to the buffer whenever we dirty the inode is that
> we no longer need to travel the inode list to set ISTALE because we know
> that the lookup loop below is sufficient to catch all of the inodes that
> are still hanging around in memory?

Yes. The issue here is that we now have inodes on this list that are
not flush locked, and so we can't just walk it assuming we can
change the flush state without holding the ILOCK to first ensure
the inode is not racing with reclaim, etc.

> We don't call xfs_ifree_cluster if any of the inodes in it are
> allocated, which means that all the on-disk inodes are either (a) not
> represented in memory, in which case we don't have to stale them; or (b)
> they're in memory and dirty (because they've recently been freed).  But
> if that's true, then surely you could find them all via b_li_list?

No, we can have inodes that are free but clean in memory when we
free the cluster. do an unlink of an inode, commit, push the AIL, it
gets written back and is now clean in memory with mode == 0 and
state XFS_IRECLAIMABLE. That inode is not reclaimed until memory
reclaim or the background reclaimer finds it and reclaims it. Those
are the inodes we want to mark XFS_ISTALE so they get treated by
lookup correctly if the cluster is reallocated and the inode
reinstantiated before it is reclaimed from memory.

> On the other hand, it seems redundant to iterate the list /and/ do the
> lookups and we no longer need to "set it up for being staled", so the
> second loop survives?

Yeah, the second loop is used because we have to look up the cache
anyway, and there's no point in iterating the buffer list because
it now has to do all the same "is it clean" checks and flush
locking, etc that the cache lookup has to do. A future patchset can
make the cache lookup use a gang lookup on the radix tree to
optimise it if necessary...

> > -
> > -
> > -		/*
> > -		 * For each inode in memory attempt to add it to the inode
> > -		 * buffer and set it up for being staled on buffer IO
> > -		 * completion.  This is safe as we've locked out tail pushing
> > -		 * and flushing by locking the buffer.
> > -		 *
> > -		 * We have already marked every inode that was part of a
> > -		 * transaction stale above, which means there is no point in
> > -		 * even trying to lock them.
> > +		 * Now we need to set all the cached clean inodes as XFS_ISTALE,
> > +		 * too. This requires lookups, and will skip inodes that we've
> > +		 * already marked XFS_ISTALE.
> >  		 */
> > -		for (i = 0; i < igeo->inodes_per_cluster; i++) {
> > -			ip = xfs_ifree_get_one_inode(pag, free_ip, inum + i);
> > -			if (!ip)
> > -				continue;
> > -
> > -			iip = ip->i_itemp;
> > -			spin_lock(&iip->ili_lock);
> > -			iip->ili_last_fields = iip->ili_fields;
> > -			iip->ili_fields = 0;
> > -			iip->ili_fsync_fields = 0;
> > -			spin_unlock(&iip->ili_lock);
> > -			xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,
> > -						&iip->ili_item.li_lsn);
> > -
> > -			list_add_tail(&iip->ili_item.li_bio_list,
> > -						&bp->b_li_list);
> > -
> > -			if (ip != free_ip)
> > -				xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > -		}
> > +		for (i = 0; i < igeo->inodes_per_cluster; i++)
> > +			xfs_ifree_mark_inode_stale(pag, free_ip, inum + i);
> 
> I get that we're mostly just hoisting everything in the loop body to the
> end of xfs_ifree_get_one_inode, but can that be part of a separate hoist
> patch?

Ok. Not really happy about breaking it up to even more fine grained
patches as this patchset is already a nightmare to keep up to date
against all the random cleanups going into for-next.....

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 22/24] xfs: rework xfs_iflush_cluster() dirty inode iteration
  2020-05-23  0:13   ` Darrick J. Wong
@ 2020-05-23 23:14     ` Dave Chinner
  0 siblings, 0 replies; 91+ messages in thread
From: Dave Chinner @ 2020-05-23 23:14 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Fri, May 22, 2020 at 05:13:34PM -0700, Darrick J. Wong wrote:
> On Fri, May 22, 2020 at 01:50:27PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Now that we have all the dirty inodes attached to the cluster
> > buffer, we don't actually have to do radix tree lookups to find
> > them. Sure, the radix tree is efficient, but walking a linked list
> > of just the dirty inodes attached to the buffer is much better.
> > 
> > We are also no longer dependent on having a locked inode passed into
> > the function to determine where to start the lookup. This means we
> > can drop it from the function call and treat all inodes the same.
> > 
> > We also make xfs_iflush_cluster skip inodes marked with
> > XFS_IRECLAIM. This we avoid races with inodes that reclaim is
> > actively referencing or are being re-initialised by inode lookup. If
> > they are actually dirty, they'll get written by a future cluster
> > flush....
> > 
> > We also add a shutdown check after obtaining the flush lock so that
> > we catch inodes that are dirty in memory and may have inconsistent
> > state due to the shutdown in progress. We abort these inodes
> > directly and so they remove themselves directly from the buffer list
> > and the AIL rather than having to wait for the buffer to be failed
> > and callbacks run to be processed correctly.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_inode.c      | 150 ++++++++++++++++------------------------
> >  fs/xfs/xfs_inode.h      |   2 +-
> >  fs/xfs/xfs_inode_item.c |  15 +++-
> >  3 files changed, 74 insertions(+), 93 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > index cbf8edf62d102..7db0f97e537e3 100644
> > --- a/fs/xfs/xfs_inode.c
> > +++ b/fs/xfs/xfs_inode.c
> > @@ -3428,7 +3428,7 @@ xfs_rename(
> >  	return error;
> >  }
> >  
> > -static int
> > +int
> >  xfs_iflush(
> 
> Not sure why this drops the static?

Stray hunk from reordering the patchset. This used to be before the
removal of writeback from reclaim, so it needed to be called from
there. But that ordering made no sense as it required heaps of
temporary changes to the reclaim code, so I put it first and got
rid of of all the reclaim writeback futzing. Clearly I forgot to
remove this hunk when I cleared xfs_iflush() out of the header file.

> > -		if (!cip->i_ino) {
> > -			xfs_ifunlock(cip);
> > -			xfs_iunlock(cip, XFS_ILOCK_SHARED);
> > +		if (XFS_FORCED_SHUTDOWN(mp)) {
> > +			xfs_iunpin_wait(ip);
> > +			/* xfs_iflush_abort() drops the flush lock */
> > +			xfs_iflush_abort(ip);
> > +			xfs_iunlock(ip, XFS_ILOCK_SHARED);
> > +			error = EIO;
> 
> error = -EIO?

Yup, good catch.

> > -	error = xfs_iflush_cluster(ip, bp);
> > +	/*
> > +	 * We need to hold a reference for flushing the cluster buffer as it may
> > +	 * fail the buffer without IO submission. In which case, we better have
> > +	 * a reference for that completion as otherwise we don't get a reference
> > +	 * for IO until we queue it for delwri submission.
> 
> <confused>
> 
> What completion are we talking about?  Does this refer to the fact that
> xfs_iflush_cluster handles a flush failure by simulating an async write
> failure which could result in us giving away the inode log item's
> reference to the buffer?

Yes. This is the way the code currently works before this patchset.
The IO reference used by the AIL writeback comes from
xfs_imap_to_bp(), and getting rid of that from the writeback path
means we no longer have an IO reference to the buffer before calling
xfs_iflush_cluster(). And, yes, the xfs_buf_ioend_fail() call in
xfs_iflush_cluster() implicitly relies on this reference existing.
Hence we have to take one ourselves and we cannot rely on the
IO reference we get from adding the buffer to the delwri list.

This was a bug I found a couple of hours before I posted the
patchset, and the bisect pointed at this patch as the cause. It may
be that it should be in the patch that gets rid of the xfs_iflush()
call and propagate through that way. However, after 3 days of
continuous bisecting to find bugs as a result of implicit,
undocumented stuff like this over and over again, the novelty was
starting to wear a bit thin.

I'll revisit this when I've regained some patience....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 22/24] xfs: rework xfs_iflush_cluster() dirty inode iteration
  2020-05-23 11:31   ` Christoph Hellwig
@ 2020-05-23 23:23     ` Dave Chinner
  2020-05-24  5:32       ` Christoph Hellwig
  0 siblings, 1 reply; 91+ messages in thread
From: Dave Chinner @ 2020-05-23 23:23 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Sat, May 23, 2020 at 04:31:31AM -0700, Christoph Hellwig wrote:
> On Fri, May 22, 2020 at 01:50:27PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Now that we have all the dirty inodes attached to the cluster
> > buffer, we don't actually have to do radix tree lookups to find
> > them. Sure, the radix tree is efficient, but walking a linked list
> > of just the dirty inodes attached to the buffer is much better.
> > 
> > We are also no longer dependent on having a locked inode passed into
> > the function to determine where to start the lookup. This means we
> > can drop it from the function call and treat all inodes the same.
> > 
> > We also make xfs_iflush_cluster skip inodes marked with
> > XFS_IRECLAIM. This we avoid races with inodes that reclaim is
> > actively referencing or are being re-initialised by inode lookup. If
> > they are actually dirty, they'll get written by a future cluster
> > flush....
> > 
> > We also add a shutdown check after obtaining the flush lock so that
> > we catch inodes that are dirty in memory and may have inconsistent
> > state due to the shutdown in progress. We abort these inodes
> > directly and so they remove themselves directly from the buffer list
> > and the AIL rather than having to wait for the buffer to be failed
> > and callbacks run to be processed correctly.
> 
> I suspect we should just kill off xfs_iflush_cluster with this, as it
> is best split between xfs_iflush and xfs_inode_item_push.  POC patch
> below, but as part of that I noticed some really odd error handling,
> which I'll bring up separately.

That's the exact opposite way the follow-on patchset I have goes.
xfs_inode_item_push() goes away entirely because tracking 30+ inodes
in the AIL for a single writeback IO event is amazingly inefficient
and consumes huge amounts of CPU unnecessarily.

IOWs, the original point of pinning the inode cluster buffer
directly at inode dirtying was so we could track dirty inodes in the
AIL via the cluster buffer instead of individually as inodes.  The
AIL drops by a factor of 30 in size, and instead of seeing 10,000
inode item push "success" reports and 300,000 "flushing" reports a
second, we only see the 10,000 success reports.

IOWs, the xfsaild is CPU bound in highly concurrent inode dirtying
workloads because of the massive numbers of inodes we cycle through
it. Tracking them by cluster buffer reduces the CPU over of the
xfsaild substantially. The whole "hang on, this solves the OOM kill
problem with async reclaim" was a happy accident - I didn't start
this patch set with the intent of fixing inode reclaim, it started
because inode tracking in the AIL is highly inefficient...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 12/24] xfs: pin inode backing buffer to the inode log item
  2020-05-23 21:43     ` Dave Chinner
@ 2020-05-24  5:31       ` Christoph Hellwig
  2020-05-24 23:13         ` Dave Chinner
  0 siblings, 1 reply; 91+ messages in thread
From: Christoph Hellwig @ 2020-05-24  5:31 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, linux-xfs

On Sun, May 24, 2020 at 07:43:34AM +1000, Dave Chinner wrote:
> I've got to rework the error handling code anyway, so I might end up
> getting rid of ->li_error and hard coding these like I've done the
> iodone functions. That way the different objects can use different
> failure mechanisms until the dquot code is converted to the same
> "hold at dirty time" flushing mechanism...

FYI, while reviewing your series I looked at that area a bit, and
found the (pre-existing) code structure a little weird:

 - xfs_buf_iodone_callback_errorl  deals with the buffer itself and
   thus should sit in xfs_buf.c, not xfs_buf_item.c
 - xfs_buf_do_callbacks_fail really nees to be a buffer level
   methods instead of polig into b_li_list, which nothing else in
   "common" code does.  My though was to either add another method
   or overload the b_write_done method to pass the error back to
   the buffer owner and let the owner deal with the list iteration
   an exact error handling method.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 22/24] xfs: rework xfs_iflush_cluster() dirty inode iteration
  2020-05-23 23:23     ` Dave Chinner
@ 2020-05-24  5:32       ` Christoph Hellwig
  0 siblings, 0 replies; 91+ messages in thread
From: Christoph Hellwig @ 2020-05-24  5:32 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, linux-xfs

On Sun, May 24, 2020 at 09:23:19AM +1000, Dave Chinner wrote:
> That's the exact opposite way the follow-on patchset I have goes.
> xfs_inode_item_push() goes away entirely because tracking 30+ inodes
> in the AIL for a single writeback IO event is amazingly inefficient
> and consumes huge amounts of CPU unnecessarily.

Ok.  It might still make sense to move the checks and locking into
xfs_iflush, as the code flow is much easier to follow that way.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 12/24] xfs: pin inode backing buffer to the inode log item
  2020-05-24  5:31       ` Christoph Hellwig
@ 2020-05-24 23:13         ` Dave Chinner
  0 siblings, 0 replies; 91+ messages in thread
From: Dave Chinner @ 2020-05-24 23:13 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Sat, May 23, 2020 at 10:31:24PM -0700, Christoph Hellwig wrote:
> On Sun, May 24, 2020 at 07:43:34AM +1000, Dave Chinner wrote:
> > I've got to rework the error handling code anyway, so I might end up
> > getting rid of ->li_error and hard coding these like I've done the
> > iodone functions. That way the different objects can use different
> > failure mechanisms until the dquot code is converted to the same
> > "hold at dirty time" flushing mechanism...
> 
> FYI, while reviewing your series I looked at that area a bit, and
> found the (pre-existing) code structure a little weird:
> 
>  - xfs_buf_iodone_callback_errorl  deals with the buffer itself and
>    thus should sit in xfs_buf.c, not xfs_buf_item.c
>  - xfs_buf_do_callbacks_fail really nees to be a buffer level
>    methods instead of polig into b_li_list, which nothing else in
>    "common" code does.  My though was to either add another method
>    or overload the b_write_done method to pass the error back to
>    the buffer owner and let the owner deal with the list iteration
>    an exact error handling method.

No.

Just no.

Please stop with the "we have to clean up all this irrelevant stuff
before we land a feature/bug fix" idiocy already.

We do not need to completely rework the way the infrastructure is
laid out to fix this problem. That is not a priority for me, nor is
it important in any way to solving this problem. This patchset
already removes a huge amount of code so it cleans up a lot in the
process of fixing the important problem. But the reality is that it
also touches many very important areas in the code base and so we
need to -minimise- the unnecessary changes in the patchset, not add
more to it.

The most important thing we need to do here is that we get the
change correct. We do not need to completely rewrite how the code is
laid out, nor do we need to move hundreds of lines of code form one
file to another just to clean up some non-critical code. It's
completely unnecessary and irrelevant to fixing the problem the
patchset is trying to address.

Yes, I will rework the bits needed to fix the problems that have
been found, but I'm not going to go and make wholesale changes to
the buffer and buffer item IO completion infrastructure because it
is *not necessary* to fix the problems.

This patchset has been a nightmare so far precisely because of the
frequent cleanup patchsets merged in the past couple of months that
have caused widespread churn in the codebase. Almost none of these
cleanups have done anything other than change the code - most
haven't even been necessary for bug fixes to be applied, either.
They've just been "change" and that's caused me repeated problems
with severe patch conflicts.

Code cleanups *are not free*. They might be easy to do and review so
there's no big upfront cost to them. The cost to cleanups are in the
downstream effects - developer patch sets no longer apply, code is
no longer recognisable at a glance to experienced developers,
failure modes are different, bugs can be introduced, etc. All of
these things add time and resources to the work that other
developers not involved in the cleanup process are trying to do.

And when the work those developers are trying to address long term
problems and are full of complex, subtle interactions and changes?
Cleanups that keep overlapping with that work are actively harmful
to the process of fixing such problems.

The problem here is all these cleanups are reactive patchsets -
someone sends a patchset for review, and then immediately the list
is filled with cleanup patchsets that hit the exact area of code
that the original patchset modified.  This is not a one-off incident
- over the past few months this has happened almost every time every
time someone has posted a substantial feature or bug-fix patchset.

So, can we please stop with the "clean up before original patchset
lands" reviews and patch postings. If anyone has cleanup patches,
please send them out when you do them, not in response to someone
else trying to fix a problem. If anyone wants to make significant
clean ups around someone elses work as a result of reviewing that
code, we need to do it -after- the current patchset has been
reviewed and merged.

We will still get the code cleanup done, but we need to prioritise
the work we do appropriate. i.e. we need to land the important
thing first, then worry about the little stuff that isn't critical
to addressing the immediate issue. Code cleanups are definitely
necessary, but they are most definitely are not the most important
thing we need to do...

/end rant

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 03/24] xfs: mark inode buffers in cache
  2020-05-22 21:35   ` Darrick J. Wong
@ 2020-05-24 23:41     ` Dave Chinner
  0 siblings, 0 replies; 91+ messages in thread
From: Dave Chinner @ 2020-05-24 23:41 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Fri, May 22, 2020 at 02:35:17PM -0700, Darrick J. Wong wrote:
> On Fri, May 22, 2020 at 01:50:08PM +1000, Dave Chinner wrote:
> >  #define XFS_BUF_FLAGS \
> > @@ -50,12 +54,13 @@ typedef unsigned int xfs_buf_flags_t;
> >  	{ XBF_DONE,		"DONE" }, \
> >  	{ XBF_STALE,		"STALE" }, \
> >  	{ XBF_WRITE_FAIL,	"WRITE_FAIL" }, \
> > -	{ XBF_TRYLOCK,		"TRYLOCK" },	/* should never be set */\
> > -	{ XBF_UNMAPPED,		"UNMAPPED" },	/* ditto */\
> > +	{ _XBF_INODES,		"INODES" }, \
> 
> This a toughie.  On the one hand if you're going to go introducing what
> amounts to two-bit buffer io completion type in the middle of b_flags
> then (like Amir says) this ideally would have a mask and switch
> statements and whatnot.

I didn't really want a mask, because these are not unconditional
"buffer type" flags. They are only set on buffers with inodes
attached for IO completion and used that way in this patchset.
Everywhere else that uses them only checks for a single type...

> I also wonder if we could tell the buffer type given all the
> xfs_trans_buf_set_type calls, but I think the answer is that not every
> buffer is guaranteed to have a buffer log item attached and a type code
> set correctly?

Correct - that's what I looked at first, but inode cluster buffers
that have not just been created or have an unlinked inode logged
against them don't have buffer log items. Similarly dquot buffers
may not have a log item, either.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 05/24] xfs: mark log recovery buffers for completion
  2020-05-22  7:41   ` Amir Goldstein
@ 2020-05-24 23:54     ` Dave Chinner
  0 siblings, 0 replies; 91+ messages in thread
From: Dave Chinner @ 2020-05-24 23:54 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: linux-xfs

On Fri, May 22, 2020 at 10:41:21AM +0300, Amir Goldstein wrote:
> On Fri, May 22, 2020 at 6:51 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > Log recovery has it's own buffer write completion handler for
> > buffers that it directly recovers. Convert these to direct calls by
> > flagging these buffers as being log recovery buffers. The flag will
> > get cleared by the log recovery IO completion routine, so it will
> > never leak out of log recovery.
> >
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_buf.c                | 10 ++++++++++
> >  fs/xfs/xfs_buf.h                |  2 ++
> >  fs/xfs/xfs_buf_item_recover.c   |  5 ++---
> >  fs/xfs/xfs_dquot_item_recover.c |  2 +-
> >  fs/xfs/xfs_inode_item_recover.c |  2 +-
> >  fs/xfs/xfs_log_recover.c        |  5 ++---
> >  6 files changed, 18 insertions(+), 8 deletions(-)
> >
> > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > index 77d40eb4a11db..b89685ce8519d 100644
> > --- a/fs/xfs/xfs_buf.c
> > +++ b/fs/xfs/xfs_buf.c
> > @@ -14,6 +14,7 @@
> >  #include "xfs_mount.h"
> >  #include "xfs_trace.h"
> >  #include "xfs_log.h"
> > +#include "xfs_log_recover.h"
> >  #include "xfs_trans.h"
> >  #include "xfs_buf_item.h"
> >  #include "xfs_errortag.h"
> > @@ -1207,6 +1208,15 @@ xfs_buf_ioend(
> >         if (read)
> >                 goto out_finish;
> >
> > +       /*
> > +        * If this is a log recovery buffer, we aren't doing transactional IO
> > +        * yet so we need to let it handle IO completions.
> > +        */
> > +       if (bp->b_flags & _XBF_LOGRCVY) {
> > +               xlog_recover_iodone(bp);
> > +               return;
> > +       }
> > +
> >         /* inodes always have a callback on write */
> >         if (bp->b_flags & _XBF_INODES) {
> >                 xfs_buf_inode_iodone(bp);
> 
> This turns out to be a "static calls" pattern.

Yes, that was one of the intents of this set of changes - to remove
a function pointer that is largely unnecessary. The main point of it
was to clearly separate out the different completions that need to
be run.

> I think it would look nicer as a switch statement on
> (bp->b_flags & _XBF_BUFFER_TYPE_MASK)
> It would be also nicer to document near flag definition
> that the type flags are mutually exclusive.

That doesn't actually make the code any more compact.

	case _XBF_LOGRCVY:
	       /*
		* If this is a log recovery buffer, we aren't doing
		* transactional IO yet so we need to let it handle IO
		* completions.
		*/
		xlog_recover_iodone(bp);
		return;

Is actually more lines of additional code than just a set of if()
based flag checks. So adding a mask just to make this handling a
switch statement is largely marginal as to it's benefit here.

> > diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> > index ec015df55b77a..0aa823aeafca9 100644
> > --- a/fs/xfs/xfs_log_recover.c
> > +++ b/fs/xfs/xfs_log_recover.c
> > @@ -287,9 +287,8 @@ xlog_recover_iodone(
> >         if (bp->b_log_item)
> >                 xfs_buf_item_relse(bp);
> >         ASSERT(bp->b_log_item == NULL);
> > -
> > -       bp->b_iodone = NULL;
> > -       xfs_buf_ioend(bp);
> > +       bp->b_flags &= ~_XBF_LOGRCVY;
> > +       xfs_buf_ioend_finish(bp);
> 
> 
> For someone like me who does not know all the assumptions
> about buffers, why is this fag leak prevention needed for log recovery
> buffers and not for inode/dquote buffers?
> 
> Wouldn't it be better to have:
>        bp->b_flags &= ~_XBF_BUFFER_TYPE_MASK;
> 
> inside xfs_buf_ioend_finish()?

No. The flag indicates there is a completion to run on the buffer;
having the completion run does not mean the next time the buffer is
written the completion doesn't need to run. i.e. for writes during
log recovery, the flag is a one-shot. It will be set before each
write, and cleared afterwards. For something like inode cluster
buffers, it will be set when the inode is attached to the buffer,
and cleared when there are no more inodes attached to it. There may
be hundreds of IO completions run on the buffer between attach and
detatch, so clearing the _XBF_INODES flag in xfs_buf_ioend_finish()
is completely wrong.

i.e. the flag is not managed by the buffer cache infrastructure; it's
owned by the subsystem that is attaching things to the buffer...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 03/24] xfs: mark inode buffers in cache
  2020-05-23  8:48   ` Christoph Hellwig
@ 2020-05-25  0:06     ` Dave Chinner
  0 siblings, 0 replies; 91+ messages in thread
From: Dave Chinner @ 2020-05-25  0:06 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Sat, May 23, 2020 at 01:48:27AM -0700, Christoph Hellwig wrote:
> On Fri, May 22, 2020 at 01:50:08PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Inode buffers always have write IO callbacks, so by marking them
> > directly we can avoid needing to attach ->b_iodone functions to
> > them. This avoids an indirect call, and makes future modifications
> > much simpler.
> > 
> > This is largely a rearrangement of the code at this point - no IO
> > completion functionality changes at this point, just how the
> > code is run is modified.
> 
> So I've looked over the whole series where this leads, and I really like
> killing off li_cb and sorting out the mess around b_iodone.  But I think
> the series ends up moving too much out of xfs_buf.c.  I'd rather keep
> a minimal b_write_done callback rather than the checking of the inode
> and dquote flags, and keep most of the logic in xfs_buf.c.  This is the
> patch I cooked up on top of your series to show what I mean - it removes
> the need for both flags and kills about 100 lines of code:

Let's deal with further cleanups after this patchset and the
follow-on patchsets that I have are sorted out. All cleanups will do
right now is invalidate all the stuff I'm still working on that sits
on top of this patchset..

> diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
> index e5f58a4b13eee..4220fee1c9ddb 100644
> --- a/fs/xfs/libxfs/xfs_trans_inode.c
> +++ b/fs/xfs/libxfs/xfs_trans_inode.c
> @@ -169,7 +169,7 @@ xfs_trans_log_inode(
>  		xfs_buf_hold(bp);
>  		spin_lock(&iip->ili_lock);
>  		iip->ili_item.li_buf = bp;
> -		bp->b_flags |= _XBF_INODES;
> +		bp->b_write_done = xfs_iflush_done;

I originally tried this, but then I ended up with places in
follow-on patchsets where I had to do this sort of check:

	if (bp->b_write_done == xfs_iflush_done)

To detect a buffer that already had inodes attached to it. Which
clearly told me that there should be a state flag to indicate that
the buffer has attached inodes....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 10/24] xfs: clean up the buffer iodone callback functions
  2020-05-22 22:26   ` Darrick J. Wong
@ 2020-05-25  0:37     ` Dave Chinner
  0 siblings, 0 replies; 91+ messages in thread
From: Dave Chinner @ 2020-05-25  0:37 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Fri, May 22, 2020 at 03:26:11PM -0700, Darrick J. Wong wrote:
> On Fri, May 22, 2020 at 01:50:15PM +1000, Dave Chinner wrote:
> > @@ -1188,19 +1146,10 @@ void
> >  xfs_buf_inode_iodone(
> >  	struct xfs_buf		*bp)
> >  {
> > -	struct xfs_buf_log_item *blip = bp->b_log_item;
> > -	struct xfs_log_item	*lip;
> > -
> >  	if (xfs_buf_had_callback_errors(bp))
> >  		return;
> >  
> > -	/* If there is a buf_log_item attached, run its callback */
> > -	if (blip) {
> > -		lip = &blip->bli_item;
> > -		lip->li_cb(bp, lip);
> > -		bp->b_log_item = NULL;
> > -	}
> > -
> > +	xfs_buf_item_done(bp);
> >  	xfs_iflush_done(bp);
> 
> Just out of curiosity, we still have a reference to bp here even if
> xfs_buf_item_done calls xfs_buf_rele, right? 

Yes, the IO still has a reference to the buffer that won't be
released until the xfs_buf_ioend_finish() call is made.

> I think the answer is that yes
> we do still have the reference because the inodes themselves hold references
> to the cluster buffer, right?

That too, but the important one is the IO reference as even the
inode references can go away on completion at this point.

> If so,
> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

Ta.

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 14/24] xfs: remove IO submission from xfs_reclaim_inode()
  2020-05-22 23:06   ` Darrick J. Wong
@ 2020-05-25  3:49     ` Dave Chinner
  0 siblings, 0 replies; 91+ messages in thread
From: Dave Chinner @ 2020-05-25  3:49 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Fri, May 22, 2020 at 04:06:42PM -0700, Darrick J. Wong wrote:
> On Fri, May 22, 2020 at 01:50:19PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > We no longer need to issue IO from shrinker based inode reclaim to
> > prevent spurious OOM killer invocation. This leaves only the global
> > filesystem management operations such as unmount needing to
> > writeback dirty inodes and reclaim them.
> > 
> > Instead of using the reclaim pass to write dirty inodes before
> > reclaiming them, use the AIL to push all the dirty inodes before we
> > try to reclaim them. This allows us to remove all the conditional
> > SYNC_WAIT locking and the writeback code from xfs_reclaim_inode()
> > and greatly simplify the checks we need to do to reclaim an inode.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_icache.c | 116 +++++++++++---------------------------------
> >  1 file changed, 29 insertions(+), 87 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > index 0f0f8fcd61b03..ee9bc82a0dfbe 100644
> > --- a/fs/xfs/xfs_icache.c
> > +++ b/fs/xfs/xfs_icache.c
> > @@ -1130,24 +1130,17 @@ xfs_reclaim_inode_grab(
> >   *	dirty, async	=> requeue
> >   *	dirty, sync	=> flush, wait and reclaim
> >   */
> 
> The function comment probably ought to describe what the two return
> values mean.  true when the inode was freed and false if we need to try
> again, right?

The comments are all updated in a later patch. It seemed better to
do that than to have to rewrite the repeatedly as behaviour changes
in each patch.

> > @@ -1272,20 +1219,17 @@ xfs_reclaim_inode(
> >   * then a shut down during filesystem unmount reclaim walk leak all the
> >   * unreclaimed inodes.
> >   */
> > -STATIC int
> > +static int
> 
> The function comment /really/ needs to note that the return value
> here is number of inodes that were skipped, and not just some negative
> error code.

OK, done.

> > @@ -1398,8 +1329,18 @@ xfs_reclaim_inodes(
> >  	int		mode)
> >  {
> >  	int		nr_to_scan = INT_MAX;
> > +	int		skipped;
> >  
> > -	return xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan);
> > +	skipped = xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan);
> > +	if (!(mode & SYNC_WAIT))
> > +		return 0;
> > +
> > +	do {
> > +		xfs_ail_push_all_sync(mp->m_ail);
> > +		skipped = xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan);
> > +	} while (skipped > 0);
> > +
> > +	return 0;
> 
> Might as well kill the return value here since none of the callers care.

I think I did that in the SYNC_WAIT futzing patches that were
causing me problems and I dropped. It was in a separate patch,
anyway.

> >  }
> >  
> >  /*
> > @@ -1420,7 +1361,8 @@ xfs_reclaim_inodes_nr(
> >  	xfs_reclaim_work_queue(mp);
> >  	xfs_ail_push_all(mp->m_ail);
> >  
> > -	return xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK, &nr_to_scan);
> 
> So the old code was returning negative error codes here?  Given that the
> only caller is free_cached_objects which adds it to the 'freed' count...
> wow.

Actually, no. the error from xfs_reclaim_inodes_ag() can only come
from xfs_reclaim_inode(), which always returned 0.

> > +	xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK, &nr_to_scan);
> > +	return 0;
> 
> Why do we return zero freed items here?  The VFS asked us to clear
> shrink_control->nr_to_scan (passed in here as nr_to_scan) and we're
> supposed to report what we did, right?

It's the same behaviour as we currently have.

ISTR that I did the accounting this way originally so we didn't
double count inodes being freed. The inode has already been
accounted as freed by the VFS inode shrinker when it runs
->destroy_inode, so double counting every inode that is freed (once
for the VFS cache removal, once for the XFS cache removal) seemed
like a bad thing to be doing...

> Or is there some odd subtlety here where we hate the shrinker and that's
> why we return zero?

Oh, that too :P

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

end of thread, back to index

Thread overview: 91+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-22  3:50 [PATCH 00/24] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
2020-05-22  3:50 ` [PATCH 01/24] xfs: remove logged flag from inode log item Dave Chinner
2020-05-22  7:25   ` Christoph Hellwig
2020-05-22 21:13   ` Darrick J. Wong
2020-05-22  3:50 ` [PATCH 02/24] xfs: add an inode item lock Dave Chinner
2020-05-22  6:45   ` Amir Goldstein
2020-05-22 21:24   ` Darrick J. Wong
2020-05-23  8:45   ` Christoph Hellwig
2020-05-22  3:50 ` [PATCH 03/24] xfs: mark inode buffers in cache Dave Chinner
2020-05-22  7:45   ` Amir Goldstein
2020-05-22 21:35   ` Darrick J. Wong
2020-05-24 23:41     ` Dave Chinner
2020-05-23  8:48   ` Christoph Hellwig
2020-05-25  0:06     ` Dave Chinner
2020-05-22  3:50 ` [PATCH 04/24] xfs: mark dquot " Dave Chinner
2020-05-22  7:46   ` Amir Goldstein
2020-05-22 21:38   ` Darrick J. Wong
2020-05-22  3:50 ` [PATCH 05/24] xfs: mark log recovery buffers for completion Dave Chinner
2020-05-22  7:41   ` Amir Goldstein
2020-05-24 23:54     ` Dave Chinner
2020-05-22 21:41   ` Darrick J. Wong
2020-05-22  3:50 ` [PATCH 06/24] xfs: call xfs_buf_iodone directly Dave Chinner
2020-05-22  7:56   ` Amir Goldstein
2020-05-22 21:53   ` Darrick J. Wong
2020-05-22  3:50 ` [PATCH 07/24] xfs: clean up whacky buffer log item list reinit Dave Chinner
2020-05-22 22:01   ` Darrick J. Wong
2020-05-23  8:50   ` Christoph Hellwig
2020-05-22  3:50 ` [PATCH 08/24] xfs: fold xfs_istale_done into xfs_iflush_done Dave Chinner
2020-05-22 22:10   ` Darrick J. Wong
2020-05-23  9:12   ` Christoph Hellwig
2020-05-22  3:50 ` [PATCH 09/24] xfs: use direct calls for dquot IO completion Dave Chinner
2020-05-22 22:13   ` Darrick J. Wong
2020-05-23  9:16   ` Christoph Hellwig
2020-05-22  3:50 ` [PATCH 10/24] xfs: clean up the buffer iodone callback functions Dave Chinner
2020-05-22 22:26   ` Darrick J. Wong
2020-05-25  0:37     ` Dave Chinner
2020-05-23  9:19   ` Christoph Hellwig
2020-05-22  3:50 ` [PATCH 11/24] xfs: get rid of log item callbacks Dave Chinner
2020-05-22 22:27   ` Darrick J. Wong
2020-05-23  9:19   ` Christoph Hellwig
2020-05-22  3:50 ` [PATCH 12/24] xfs: pin inode backing buffer to the inode log item Dave Chinner
2020-05-22 22:39   ` Darrick J. Wong
2020-05-23  9:34   ` Christoph Hellwig
2020-05-23 21:43     ` Dave Chinner
2020-05-24  5:31       ` Christoph Hellwig
2020-05-24 23:13         ` Dave Chinner
2020-05-22  3:50 ` [PATCH 13/24] xfs: make inode reclaim almost non-blocking Dave Chinner
2020-05-22 12:19   ` Amir Goldstein
2020-05-22 22:48   ` Darrick J. Wong
2020-05-23 22:29     ` Dave Chinner
2020-05-22  3:50 ` [PATCH 14/24] xfs: remove IO submission from xfs_reclaim_inode() Dave Chinner
2020-05-22 23:06   ` Darrick J. Wong
2020-05-25  3:49     ` Dave Chinner
2020-05-23  9:40   ` Christoph Hellwig
2020-05-23 22:35     ` Dave Chinner
2020-05-22  3:50 ` [PATCH 15/24] xfs: allow multiple reclaimers per AG Dave Chinner
2020-05-22 23:10   ` Darrick J. Wong
2020-05-23 22:35     ` Dave Chinner
2020-05-22  3:50 ` [PATCH 16/24] xfs: don't block inode reclaim on the ILOCK Dave Chinner
2020-05-22 23:11   ` Darrick J. Wong
2020-05-22  3:50 ` [PATCH 17/24] xfs: remove SYNC_TRYLOCK from inode reclaim Dave Chinner
2020-05-22 23:14   ` Darrick J. Wong
2020-05-23 22:42     ` Dave Chinner
2020-05-22  3:50 ` [PATCH 18/24] xfs: clean up inode reclaim comments Dave Chinner
2020-05-22 23:17   ` Darrick J. Wong
2020-05-22  3:50 ` [PATCH 19/24] xfs: attach inodes to the cluster buffer when dirtied Dave Chinner
2020-05-22 23:48   ` Darrick J. Wong
2020-05-23 22:59     ` Dave Chinner
2020-05-22  3:50 ` [PATCH 20/24] xfs: xfs_iflush() is no longer necessary Dave Chinner
2020-05-22 23:54   ` Darrick J. Wong
2020-05-22  3:50 ` [PATCH 21/24] xfs: rename xfs_iflush_int() Dave Chinner
2020-05-22 12:33   ` Amir Goldstein
2020-05-22 23:57   ` Darrick J. Wong
2020-05-22  3:50 ` [PATCH 22/24] xfs: rework xfs_iflush_cluster() dirty inode iteration Dave Chinner
2020-05-23  0:13   ` Darrick J. Wong
2020-05-23 23:14     ` Dave Chinner
2020-05-23 11:31   ` Christoph Hellwig
2020-05-23 23:23     ` Dave Chinner
2020-05-24  5:32       ` Christoph Hellwig
2020-05-23 11:39   ` Christoph Hellwig
2020-05-22  3:50 ` [PATCH 23/24] xfs: factor xfs_iflush_done Dave Chinner
2020-05-23  0:20   ` Darrick J. Wong
2020-05-23 11:35   ` Christoph Hellwig
2020-05-22  3:50 ` [PATCH 24/24] xfs: remove xfs_inobp_check() Dave Chinner
2020-05-23  0:16   ` Darrick J. Wong
2020-05-23 11:36   ` Christoph Hellwig
2020-05-22  4:04 ` [PATCH 00/24] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
2020-05-23 16:18   ` Darrick J. Wong
2020-05-23 21:22     ` Dave Chinner
2020-05-22  6:18 ` Amir Goldstein
2020-05-22 12:01   ` Amir Goldstein

Linux-XFS Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-xfs/0 linux-xfs/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-xfs linux-xfs/ https://lore.kernel.org/linux-xfs \
		linux-xfs@vger.kernel.org
	public-inbox-index linux-xfs

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-xfs


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git