* [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous
@ 2020-06-01 21:42 Dave Chinner
  2020-06-01 21:42 ` [PATCH 01/30] xfs: Don't allow logging of XFS_ISTALE inodes Dave Chinner
                   ` (29 more replies)
  0 siblings, 30 replies; 80+ messages in thread
From: Dave Chinner @ 2020-06-01 21:42 UTC (permalink / raw)
  To: linux-xfs

Hi folks,

Inode flushing requires that we first lock an inode, then check it,
then lock the underlying buffer, flush the inode to the buffer and
finally add the inode to the buffer to be unlocked on IO completion.
We then walk all the other cached inodes in the buffer range and
optimistically lock and flush them to the buffer without blocking.

This cluster write effectively repeats the same code we do with the
initial inode, except now it has to special case that initial inode
that is already locked. Hence we have multiple copies of very
similar code, and it is a result of inode cluster flushing being
based on a specific inode rather than grabbing the buffer and
flushing all available inodes to it.

The problem with this at the moment is that we can't look up the
buffer until we have guaranteed that an inode is held exclusively
and it's not going away while we get the buffer through an imap
lookup. Hence we are kinda stuck locking an inode before we can look
up the buffer.

This is also a result of inodes being detached from the cluster
buffer except when IO is being done. This has the further problem
that the cluster buffer can be reclaimed from memory and then the
inode can be dirtied. At this point cleaning the inode requires a
read-modify-write cycle on the cluster buffer. If we then are put
under memory pressure, cleaning that dirty inode to reclaim it
requires allocating memory for the cluster buffer and this leads to
all sorts of problems.

We used synchronous inode writeback in reclaim as a throttle that
provided a forwards progress mechanism when RMW cycles were required
to clean inodes. Async writeback of inodes (e.g. via the AIL) would
immediately exhaust remaining memory reserves trying to allocate
inode cluster after inode cluster. The synchronous writeback of an
inode cluster allowed reclaim to release the inode cluster and have
it freed almost immediately which could then be used to allocate the
next inode cluster buffer. Hence the IO based throttling mechanism
largely guaranteed forwards progress in inode reclaim. By removing
the requirement for memory allocation for inode writeback at the
filesystem level, we can issue writeback asynchronously and not have
to worry about memory exhaustion anymore.

Another issue is that if we have slow disks, we can build up dirty
inodes in memory that can then take hours for an operation like
unmount to flush. An RMW cycle per inode on a slow RAID6 device can
mean we only clean 50 inodes a second, and when there are hundreds
of thousands of dirty inodes that need to be cleaned this can take a
long time. Pinning the cluster buffers will greatly speed up inode
writeback on slow storage systems like this.

These limitations all stem from the same source: inode writeback is
inode centric, and they are largely solved by the same architectural
change: make inode writeback cluster buffer centric.  This series
makes that architectural change.

Firstly, we start by pinning the inode backing buffer in memory
when an inode is marked dirty (i.e. when it is logged). By tracking
the number of dirty inodes on a buffer as a counter rather than a
flag, we avoid the problem of overlapping inode dirtying and buffer
flushing racing to set/clear the dirty flag. Hence as long as there
is a dirty inode in memory, the buffer will not be able to be
reclaimed. We can safely do this inode cluster buffer lookup when we
dirty an inode as we do not hold the buffer locked - we merely take
a reference to it and then release it - and hence we don't cause any
new lock order issues.
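
As a rough sketch of what this pinning looks like (simplified from
the patches; error handling is elided, and the li_buf field here
stands for the buffer reference described above):

	/*
	 * Sketch only: pin the cluster buffer when the inode is first
	 * dirtied. The buffer is only locked for the duration of the
	 * lookup; the log item keeps a plain reference afterwards, so
	 * no new lock ordering constraints are introduced.
	 */
	if (!iip->ili_item.li_buf) {
		struct xfs_buf	*bp;

		xfs_imap_to_bp(tp->t_mountp, tp, &ip->i_imap, &dip, &bp, 0, 0);
		xfs_buf_hold(bp);		/* reference for the log item */
		iip->ili_item.li_buf = bp;
		xfs_trans_brelse(tp, bp);	/* unlock, keep the reference */
	}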

When the inode is finally cleaned, the reference to the buffer can
be removed from the inode log item and the buffer released. This is
done from the inode completion callbacks that are attached to the
buffer when the inode is flushed.

Pinning the cluster buffer in this way immediately avoids the RMW
problem in inode writeback and reclaim contexts by moving the memory
allocation and the blocking buffer read into the transaction context
that dirties the inode.  This inverts our dirty inode throttling
mechanism - we now throttle the rate at which we can dirty inodes to
the rate at which we can allocate memory and read inode cluster
buffers into memory, rather than throttling reclaim to the rate at
which we can clean dirty inodes.

Hence if we are under memory pressure, we'll block on memory
allocation when trying to dirty the referenced inode, rather than in
the memory reclaim path where we are trying to clean unreferenced
inodes to free memory.  Hence we no longer have to guarantee
forwards progress in inode reclaim as we aren't doing memory
allocation, and that means we can remove inode writeback from the
XFS inode shrinker completely without changing the system tolerance
for low memory operation.

Tracking the buffers via the inode log item also allows us to
completely rework the inode flushing mechanism. While the inode log
item is in the AIL, it is safe for the AIL to access any member of
the log item. Hence the AIL push mechanisms can access the buffer
attached to the inode without first having to lock the inode.

This means we can essentially lock the buffer directly and then
call xfs_iflush_cluster() without first going through xfs_iflush()
to find the buffer. Hence we can remove xfs_iflush() altogether,
because the two places that call it - the inode item push code and
inode reclaim - no longer need to flush inodes directly.
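
In sketch form, the inode item push then looks something like this
(simplified: AIL locking, pin checks and error handling from the
real patches are elided):

	STATIC uint
	xfs_inode_item_push(
		struct xfs_log_item	*lip,
		struct list_head	*buffer_list)
	{
		struct xfs_buf		*bp = lip->li_buf;

		if (!xfs_buf_trylock(bp))
			return XFS_ITEM_LOCKED;
		if (bp->b_flags & _XBF_DELWRI_Q) {
			/* buffer already queued for write */
			xfs_buf_unlock(bp);
			return XFS_ITEM_FLUSHING;
		}

		/* flush all the dirty inodes attached to the buffer */
		xfs_iflush_cluster(bp);
		xfs_buf_delwri_queue(bp, buffer_list);
		xfs_buf_relse(bp);
		return XFS_ITEM_SUCCESS;
	}

Note that there is no inode locking or imap lookup anywhere in this
path - the buffer comes straight from the log item.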

This can be further optimised by attaching the inode to the cluster
buffer when the inode is dirtied. i.e. when we add the buffer
reference to the inode log item, we also attach the inode to the
buffer for IO processing. This leads to the dirty inodes always
being attached to the buffer and hence we no longer need to add them
when we flush the inode and remove them when IO completes. Instead
the inodes are attached when the inode log item is dirtied, and
removed when the inode log item is cleaned.
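
Extending the earlier pinning sketch, the attach at dirtying time
becomes roughly this (again simplified from the actual patch):

	spin_lock(&iip->ili_lock);
	if (!iip->ili_item.li_buf) {
		/*
		 * Take a buffer reference for the log item and put the
		 * item on the buffer's list so IO completion can find
		 * it without any inode cache lookups.
		 */
		xfs_buf_hold(bp);
		list_add_tail(&iip->ili_item.li_bio_list, &bp->b_li_list);
		iip->ili_item.li_buf = bp;
	}
	spin_unlock(&iip->ili_lock);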

With this structure in place, we no longer need to do
lookups to find the dirty inodes in the cache to attach to the
buffer in xfs_iflush_cluster() - they are already attached to the
buffer. Hence when the AIL pushes an inode, we just grab the buffer
from the log item, and then walk the buffer log item list to lock
and flush the dirty inodes attached to the buffer.
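
A minimal sketch of that iteration (the real xfs_iflush_cluster()
rework has different locking details and also handles stale,
reclaimable and already-flushing inodes, all elided here):

	struct xfs_log_item	*lip;

	list_for_each_entry(lip, &bp->b_li_list, li_bio_list) {
		struct xfs_inode *ip = INODE_ITEM(lip)->ili_inode;

		/* skip anything we can't lock without blocking */
		if (!xfs_ilock_nowait(ip, XFS_ILOCK_SHARED))
			continue;
		if (xfs_iflock_nowait(ip)) {
			/* flush lock is released at IO completion */
			xfs_iflush(ip, bp);
		}
		xfs_iunlock(ip, XFS_ILOCK_SHARED);
	}

Here xfs_iflush() is the renamed xfs_iflush_int() from later in the
series, which flushes a single inode to an already-locked cluster
buffer.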

This greatly simplifies inode writeback, and removes another memory
allocation from the inode writeback path (the array used for the
radix tree gang lookup). And while the radix tree lookups are fast,
walking the linked list of dirty inodes is faster.

There is followup work I am doing that uses the inode cluster buffer
as a replacement in the AIL for tracking dirty inodes. This part of
the series is not ready yet as it has some intricate locking
requirements. That is an optimisation, so I've left that out because
solving the inode reclaim blocking problems is the important part of
this work.

In short, this series simplifies inode writeback and fixes the long
standing inode reclaim blocking issues without requiring any changes
to the memory reclaim infrastructure.

Note: dquots should probably be converted to cluster flushing in a
similar way, as they have many of the same issues as inode flushing.

Thoughts, comments and improvements welcome.

-Dave.


Version 2

git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git xfs-async-inode-reclaim-2

- describe ili_lock better (p2)
- clean up inode logging code some more (p2)
- move "early read completion" for xfs_buf_ioend() up into p3 from
  p4.
- fixed conflicts in p4 due to p3 changes.
- fixed conflicts in p5 due to p4 changes.
- s/_XBF_LOGRCVY/_XBF_LOG_RECOVERY/ (p5)
- renamed the buf log item iodone callback to xfs_buf_item_iodone and
  reused the xfs_buf_iodone() name for the catch-all buffer write
  iodone completion. (p6)
- history update for commit message (p7)
- subject update for p8
- rework loop in xfs_dquot_done() (p9)
- Fixed conflicts in p10 due to p6 changes
- got rid of entire comments around li_cb (p11)
- new patch to rework buffer io error callbacks
- new patch to unwind ->iop_error calls and remove ->iop_error
- new patch to lift xfs_clear_li_failed() out of
  xfs_ail_delete_one()
- rebased p12 on all the prior changes
- reworked LI_FAILED handling when pinning inodes to the cluster
  buffer (p12) 
- fixed comment about holding buffer references in
  xfs_trans_log_inode() (p12)
- fixed indenting of xfs_iflush_abort() (p12)
- added comments explaining "skipped" inode reclaim return value
  (p14)
- cleaned up error return stack in xfs_reclaim_inode() (p14)
- cleaned up skipped return in xfs_reclaim_inodes() (p14)
- fixed bug where skipped wasn't incremented if reclaim cursor was
  not zero. This could leave inodes between the start of the AG and
  the cursor unreclaimed (p15)
- reinstate the patch removing SYNC_WAIT from xfs_reclaim_inodes().
  Exposed "skipped" bug in p15.
- cleaned up inode reclaim comments (p18)
- split p19 into two - one to change xfs_ifree_cluster(), one
  for the buffer pinning.
- xfs_ifree_mark_inode_stale() now takes the cluster buffer and we
  get the perag from that rather than having to do a lookup in
  xfs_ifree_cluster().
- moved extra IO reference for xfs_iflush_cluster() from AIL pushing
  to initial xfs_iflush_cluster rework (p22 -> p20)
- fixed static declaration on xfs_iflush() (p22)
- fixed incorrect EIO return from xfs_iflush_cluster()
- rebase p23 because it all rejects now.
- fix INODE_ITEM() usage in p23
- removed long lines from commit message in p24
- new patch to fix logging of XFS_ISTALE inodes which pushes dirty
  inodes through reclaim.



Dave Chinner (30):
  xfs: Don't allow logging of XFS_ISTALE inodes
  xfs: remove logged flag from inode log item
  xfs: add an inode item lock
  xfs: mark inode buffers in cache
  xfs: mark dquot buffers in cache
  xfs: mark log recovery buffers for completion
  xfs: call xfs_buf_iodone directly
  xfs: clean up whacky buffer log item list reinit
  xfs: make inode IO completion buffer centric
  xfs: use direct calls for dquot IO completion
  xfs: clean up the buffer iodone callback functions
  xfs: get rid of log item callbacks
  xfs: handle buffer log item IO errors directly
  xfs: unwind log item error flagging
  xfs: move xfs_clear_li_failed out of xfs_ail_delete_one()
  xfs: pin inode backing buffer to the inode log item
  xfs: make inode reclaim almost non-blocking
  xfs: remove IO submission from xfs_reclaim_inode()
  xfs: allow multiple reclaimers per AG
  xfs: don't block inode reclaim on the ILOCK
  xfs: remove SYNC_TRYLOCK from inode reclaim
  xfs: remove SYNC_WAIT from xfs_reclaim_inodes()
  xfs: clean up inode reclaim comments
  xfs: rework stale inodes in xfs_ifree_cluster
  xfs: attach inodes to the cluster buffer when dirtied
  xfs: xfs_iflush() is no longer necessary
  xfs: rename xfs_iflush_int()
  xfs: rework xfs_iflush_cluster() dirty inode iteration
  xfs: factor xfs_iflush_done
  xfs: remove xfs_inobp_check()

 fs/xfs/libxfs/xfs_inode_buf.c   |  27 +-
 fs/xfs/libxfs/xfs_inode_buf.h   |   6 -
 fs/xfs/libxfs/xfs_trans_inode.c | 112 +++++--
 fs/xfs/xfs_buf.c                |  40 ++-
 fs/xfs/xfs_buf.h                |  48 +--
 fs/xfs/xfs_buf_item.c           | 376 +++++++++++-----------
 fs/xfs/xfs_buf_item.h           |   8 +-
 fs/xfs/xfs_buf_item_recover.c   |   5 +-
 fs/xfs/xfs_dquot.c              |  29 +-
 fs/xfs/xfs_dquot.h              |   1 +
 fs/xfs/xfs_dquot_item.c         |  18 --
 fs/xfs/xfs_dquot_item_recover.c |   2 +-
 fs/xfs/xfs_file.c               |   9 +-
 fs/xfs/xfs_icache.c             | 333 ++++++-------------
 fs/xfs/xfs_icache.h             |   2 +-
 fs/xfs/xfs_inode.c              | 554 ++++++++++++--------------------
 fs/xfs/xfs_inode.h              |   2 +-
 fs/xfs/xfs_inode_item.c         | 303 ++++++++---------
 fs/xfs/xfs_inode_item.h         |  24 +-
 fs/xfs/xfs_inode_item_recover.c |   2 +-
 fs/xfs/xfs_log_recover.c        |   5 +-
 fs/xfs/xfs_mount.c              |  15 +-
 fs/xfs/xfs_mount.h              |   1 -
 fs/xfs/xfs_super.c              |   3 -
 fs/xfs/xfs_trans.h              |   5 -
 fs/xfs/xfs_trans_ail.c          |  10 +-
 fs/xfs/xfs_trans_buf.c          |  15 +-
 27 files changed, 854 insertions(+), 1101 deletions(-)

-- 
2.26.2.761.g0e0b3e54be



* [PATCH 01/30] xfs: Don't allow logging of XFS_ISTALE inodes
  2020-06-01 21:42 [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
@ 2020-06-01 21:42 ` Dave Chinner
  2020-06-02  4:30   ` Darrick J. Wong
  2020-06-02 16:32   ` Brian Foster
  2020-06-01 21:42 ` [PATCH 02/30] xfs: remove logged flag from inode log item Dave Chinner
                   ` (28 subsequent siblings)
  29 siblings, 2 replies; 80+ messages in thread
From: Dave Chinner @ 2020-06-01 21:42 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

In tracking down a problem in this patchset, I discovered we are
reclaiming dirty stale inodes. This wasn't discovered until inodes
were always attached to the cluster buffer, at which point the rcu
callback that frees inodes started assert failing because the inode
still had an active pointer to the cluster buffer after it had been
reclaimed.

Debugging the issue indicated that this was a pre-existing issue
resulting from the way the inodes are handled in xfs_inactive_ifree.
When we free a cluster buffer from xfs_ifree_cluster, all the inodes
in cache are marked XFS_ISTALE. Those that are clean have nothing
else done to them and so eventually get cleaned up by background
reclaim. i.e. it is assumed we'll never dirty/relog an inode marked
XFS_ISTALE.

On journal commit, dirty stale inodes are handled by both the
buffer and inode log items: either they are run through
xfs_istale_done() and removed from the AIL (buffer log item commit),
or the inode log item simply unpins the inode because the buffer log
item will clean it. What happens
to any specific inode is entirely dependent on which log item wins
the commit race, but the result is the same - stale inodes are
clean, not attached to the cluster buffer, and not in the AIL. Hence
inode reclaim can just free these inodes without further care.

However, if the stale inode is relogged, it gets dirtied again and
relogged into the CIL. Most of the time this isn't an issue, because
relogging simply changes the inode's location in the current
checkpoint. Problems arise, however, when the CIL checkpoints
between two transactions in the xfs_inactive_ifree() deferops
processing. This results in the XFS_ISTALE inode being redirtied
and inserted into the CIL without any of the other stale cluster
buffer infrastructure being in place.

Hence on journal commit, it simply gets unpinned, so it remains
dirty in memory. Everything in inode writeback avoids XFS_ISTALE
inodes so it can't be written back, and it is not tracked in the AIL
so there's not even a trigger to attempt to clean the inode. Hence
the inode just sits dirty in memory until inode reclaim comes along,
sees that it is XFS_ISTALE, and goes to reclaim it. This reclaiming
of a dirty inode caused use after free, list corruptions and other
nasty issues later in this patchset.

Hence this patch addresses a violation of the "never log XFS_ISTALE
inodes" rule caused by the deferops processing rolling a transaction
and relogging a stale inode in xfs_inactive_ifree(). It also adds a
bunch of asserts to catch this problem in debug kernels so that
we don't reintroduce this problem in future.

Reproducer for this issue was generic/558 on a v4 filesystem.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_trans_inode.c |  2 ++
 fs/xfs/xfs_icache.c             |  3 ++-
 fs/xfs/xfs_inode.c              | 25 ++++++++++++++++++++++---
 3 files changed, 26 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
index b5dfb66548422..4504d215cd590 100644
--- a/fs/xfs/libxfs/xfs_trans_inode.c
+++ b/fs/xfs/libxfs/xfs_trans_inode.c
@@ -36,6 +36,7 @@ xfs_trans_ijoin(
 
 	ASSERT(iip->ili_lock_flags == 0);
 	iip->ili_lock_flags = lock_flags;
+	ASSERT(!xfs_iflags_test(ip, XFS_ISTALE));
 
 	/*
 	 * Get a log_item_desc to point at the new item.
@@ -89,6 +90,7 @@ xfs_trans_log_inode(
 
 	ASSERT(ip->i_itemp != NULL);
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
+	ASSERT(!xfs_iflags_test(ip, XFS_ISTALE));
 
 	/*
 	 * Don't bother with i_lock for the I_DIRTY_TIME check here, as races
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 0a5ac6f9a5834..dbba4c1946386 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1141,7 +1141,7 @@ xfs_reclaim_inode(
 			goto out_ifunlock;
 		xfs_iunpin_wait(ip);
 	}
-	if (xfs_iflags_test(ip, XFS_ISTALE) || xfs_inode_clean(ip)) {
+	if (xfs_inode_clean(ip)) {
 		xfs_ifunlock(ip);
 		goto reclaim;
 	}
@@ -1228,6 +1228,7 @@ xfs_reclaim_inode(
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
 	xfs_qm_dqdetach(ip);
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	ASSERT(xfs_inode_clean(ip));
 
 	__xfs_inode_free(ip);
 	return error;
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 64f5f9a440aed..53a1d64782c35 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1740,10 +1740,31 @@ xfs_inactive_ifree(
 		return error;
 	}
 
+	/*
+	 * We do not hold the inode locked across the entire rolling transaction
+	 * here. We only need to hold it for the first transaction that
+	 * xfs_ifree() builds, which may mark the inode XFS_ISTALE if the
+	 * underlying cluster buffer is freed. Relogging an XFS_ISTALE inode
+	 * here breaks the relationship between cluster buffer invalidation and
+	 * stale inode invalidation on cluster buffer item journal commit
+	 * completion, and can result in leaving dirty stale inodes hanging
+	 * around in memory.
+	 *
+	 * We have no need for serialising this inode operation against other
+	 * operations - we freed the inode and hence reallocation is required
+	 * and that will serialise on reallocating the space the deferops need
+	 * to free. Hence we can unlock the inode on the first commit of
+	 * the transaction rather than roll it right through the deferops. This
+	 * avoids relogging the XFS_ISTALE inode.
+	 *
+	 * We check that xfs_ifree() hasn't grown an internal transaction roll
+	 * by asserting that the inode is still locked when it returns.
+	 */
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
-	xfs_trans_ijoin(tp, ip, 0);
+	xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
 
 	error = xfs_ifree(tp, ip);
+	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
 	if (error) {
 		/*
 		 * If we fail to free the inode, shut down.  The cancel
@@ -1756,7 +1777,6 @@ xfs_inactive_ifree(
 			xfs_force_shutdown(mp, SHUTDOWN_META_IO_ERROR);
 		}
 		xfs_trans_cancel(tp);
-		xfs_iunlock(ip, XFS_ILOCK_EXCL);
 		return error;
 	}
 
@@ -1774,7 +1794,6 @@ xfs_inactive_ifree(
 		xfs_notice(mp, "%s: xfs_trans_commit returned error %d",
 			__func__, error);
 
-	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 	return 0;
 }
 
-- 
2.26.2.761.g0e0b3e54be



* [PATCH 02/30] xfs: remove logged flag from inode log item
  2020-06-01 21:42 [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
  2020-06-01 21:42 ` [PATCH 01/30] xfs: Don't allow logging of XFS_ISTALE inodes Dave Chinner
@ 2020-06-01 21:42 ` Dave Chinner
  2020-06-02 16:32   ` Brian Foster
  2020-06-01 21:42 ` [PATCH 03/30] xfs: add an inode item lock Dave Chinner
                   ` (27 subsequent siblings)
  29 siblings, 1 reply; 80+ messages in thread
From: Dave Chinner @ 2020-06-01 21:42 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

This was used to track if the item had logged fields being flushed
to disk. We log everything in the inode these days, so this logic is
no longer needed. Remove it.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_inode.c      | 13 ++++---------
 fs/xfs/xfs_inode_item.c | 35 ++++++++++-------------------------
 fs/xfs/xfs_inode_item.h |  1 -
 3 files changed, 14 insertions(+), 35 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 53a1d64782c35..4fa12775ac146 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2677,7 +2677,6 @@ xfs_ifree_cluster(
 		list_for_each_entry(lip, &bp->b_li_list, li_bio_list) {
 			if (lip->li_type == XFS_LI_INODE) {
 				iip = (struct xfs_inode_log_item *)lip;
-				ASSERT(iip->ili_logged == 1);
 				lip->li_cb = xfs_istale_done;
 				xfs_trans_ail_copy_lsn(mp->m_ail,
 							&iip->ili_flush_lsn,
@@ -2706,7 +2705,6 @@ xfs_ifree_cluster(
 			iip->ili_last_fields = iip->ili_fields;
 			iip->ili_fields = 0;
 			iip->ili_fsync_fields = 0;
-			iip->ili_logged = 1;
 			xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,
 						&iip->ili_item.li_lsn);
 
@@ -3838,19 +3836,16 @@ xfs_iflush_int(
 	 *
 	 * We can play with the ili_fields bits here, because the inode lock
 	 * must be held exclusively in order to set bits there and the flush
-	 * lock protects the ili_last_fields bits.  Set ili_logged so the flush
-	 * done routine can tell whether or not to look in the AIL.  Also, store
-	 * the current LSN of the inode so that we can tell whether the item has
-	 * moved in the AIL from xfs_iflush_done().  In order to read the lsn we
-	 * need the AIL lock, because it is a 64 bit value that cannot be read
-	 * atomically.
+	 * lock protects the ili_last_fields bits.  Store the current LSN of the
+	 * inode so that we can tell whether the item has moved in the AIL from
+	 * xfs_iflush_done().  In order to read the lsn we need the AIL lock,
+	 * because it is a 64 bit value that cannot be read atomically.
 	 */
 	error = 0;
 flush_out:
 	iip->ili_last_fields = iip->ili_fields;
 	iip->ili_fields = 0;
 	iip->ili_fsync_fields = 0;
-	iip->ili_logged = 1;
 
 	xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,
 				&iip->ili_item.li_lsn);
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index ba47bf65b772b..b17384aa8df40 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -528,8 +528,6 @@ xfs_inode_item_push(
 	}
 
 	ASSERT(iip->ili_fields != 0 || XFS_FORCED_SHUTDOWN(ip->i_mount));
-	ASSERT(iip->ili_logged == 0 || XFS_FORCED_SHUTDOWN(ip->i_mount));
-
 	spin_unlock(&lip->li_ailp->ail_lock);
 
 	error = xfs_iflush(ip, &bp);
@@ -690,30 +688,24 @@ xfs_iflush_done(
 			continue;
 
 		list_move_tail(&blip->li_bio_list, &tmp);
-		/*
-		 * while we have the item, do the unlocked check for needing
-		 * the AIL lock.
-		 */
+
+		/* Do an unlocked check for needing the AIL lock. */
 		iip = INODE_ITEM(blip);
-		if ((iip->ili_logged && blip->li_lsn == iip->ili_flush_lsn) ||
+		if (blip->li_lsn == iip->ili_flush_lsn ||
 		    test_bit(XFS_LI_FAILED, &blip->li_flags))
 			need_ail++;
 	}
 
 	/* make sure we capture the state of the initial inode. */
 	iip = INODE_ITEM(lip);
-	if ((iip->ili_logged && lip->li_lsn == iip->ili_flush_lsn) ||
+	if (lip->li_lsn == iip->ili_flush_lsn ||
 	    test_bit(XFS_LI_FAILED, &lip->li_flags))
 		need_ail++;
 
 	/*
-	 * We only want to pull the item from the AIL if it is
-	 * actually there and its location in the log has not
-	 * changed since we started the flush.  Thus, we only bother
-	 * if the ili_logged flag is set and the inode's lsn has not
-	 * changed.  First we check the lsn outside
-	 * the lock since it's cheaper, and then we recheck while
-	 * holding the lock before removing the inode from the AIL.
+	 * We only want to pull the item from the AIL if it is actually there
+	 * and its location in the log has not changed since we started the
+	 * flush.  Thus, we only bother if the inode's lsn has not changed.
 	 */
 	if (need_ail) {
 		xfs_lsn_t	tail_lsn = 0;
@@ -721,8 +713,7 @@ xfs_iflush_done(
 		/* this is an opencoded batch version of xfs_trans_ail_delete */
 		spin_lock(&ailp->ail_lock);
 		list_for_each_entry(blip, &tmp, li_bio_list) {
-			if (INODE_ITEM(blip)->ili_logged &&
-			    blip->li_lsn == INODE_ITEM(blip)->ili_flush_lsn) {
+			if (blip->li_lsn == INODE_ITEM(blip)->ili_flush_lsn) {
 				/*
 				 * xfs_ail_update_finish() only cares about the
 				 * lsn of the first tail item removed, any
@@ -740,14 +731,13 @@ xfs_iflush_done(
 	}
 
 	/*
-	 * clean up and unlock the flush lock now we are done. We can clear the
+	 * Clean up and unlock the flush lock now we are done. We can clear the
 	 * ili_last_fields bits now that we know that the data corresponding to
 	 * them is safely on disk.
 	 */
 	list_for_each_entry_safe(blip, n, &tmp, li_bio_list) {
 		list_del_init(&blip->li_bio_list);
 		iip = INODE_ITEM(blip);
-		iip->ili_logged = 0;
 		iip->ili_last_fields = 0;
 		xfs_ifunlock(iip->ili_inode);
 	}
@@ -768,16 +758,11 @@ xfs_iflush_abort(
 
 	if (iip) {
 		xfs_trans_ail_delete(&iip->ili_item, 0);
-		iip->ili_logged = 0;
-		/*
-		 * Clear the ili_last_fields bits now that we know that the
-		 * data corresponding to them is safely on disk.
-		 */
-		iip->ili_last_fields = 0;
 		/*
 		 * Clear the inode logging fields so no more flushes are
 		 * attempted.
 		 */
+		iip->ili_last_fields = 0;
 		iip->ili_fields = 0;
 		iip->ili_fsync_fields = 0;
 	}
diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h
index 60b34bb66e8ed..4de5070e07655 100644
--- a/fs/xfs/xfs_inode_item.h
+++ b/fs/xfs/xfs_inode_item.h
@@ -19,7 +19,6 @@ struct xfs_inode_log_item {
 	xfs_lsn_t		ili_flush_lsn;	   /* lsn at last flush */
 	xfs_lsn_t		ili_last_lsn;	   /* lsn at last transaction */
 	unsigned short		ili_lock_flags;	   /* lock flags */
-	unsigned short		ili_logged;	   /* flushed logged data */
 	unsigned int		ili_last_fields;   /* fields when flushed */
 	unsigned int		ili_fields;	   /* fields to be logged */
 	unsigned int		ili_fsync_fields;  /* logged since last fsync */
-- 
2.26.2.761.g0e0b3e54be



* [PATCH 03/30] xfs: add an inode item lock
  2020-06-01 21:42 [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
  2020-06-01 21:42 ` [PATCH 01/30] xfs: Don't allow logging of XFS_ISTALE inodes Dave Chinner
  2020-06-01 21:42 ` [PATCH 02/30] xfs: remove logged flag from inode log item Dave Chinner
@ 2020-06-01 21:42 ` Dave Chinner
  2020-06-02 16:34   ` Brian Foster
  2020-06-01 21:42 ` [PATCH 04/30] xfs: mark inode buffers in cache Dave Chinner
                   ` (26 subsequent siblings)
  29 siblings, 1 reply; 80+ messages in thread
From: Dave Chinner @ 2020-06-01 21:42 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

The inode log item is kind of special in that it can be aggregating
new changes in memory at the same time that existing changes are
being written back to disk. This means there are fields in the log
item that are accessed concurrently from contexts that don't share
any locking at all.

e.g. ili_last_fields is updated at flush time under both the
ILOCK_EXCL and the flush lock, is cleared at IO completion time
under only the flush lock, and is read under the ILOCK_EXCL when the
inode is logged.  Hence there is no actual serialisation between
reading the field during logging of the inode in transactions vs
clearing the field in IO completion.

We currently get away with this by the fact that we are only
clearing fields in IO completion, and nothing bad happens if we
accidentally log more of the inode than we actually modify. Worst
case is we consume a tiny bit more memory and log bandwidth.

However, if we want to do more complex state manipulations on the
log item that requires updates at all three of these potential
locations, we need to have some mechanism of serialising those
operations. To do this, introduce a spinlock into the log item to
serialise internal state.

This could be done via the xfs_inode i_flags_lock, but this then
leads to potential lock inversion issues where inode flag updates
need to occur inside locks that best nest inside the inode log item
locks (e.g. marking inodes stale during inode cluster freeing).
Using a separate spinlock avoids these sorts of problems and
simplifies future code.

This does not touch the use of ili_fields in the item formatting
code - that is entirely protected by the ILOCK_EXCL at this point in
time, so it remains untouched.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_trans_inode.c | 54 +++++++++++++++++----------------
 fs/xfs/xfs_file.c               |  9 ++++--
 fs/xfs/xfs_inode.c              | 20 +++++++-----
 fs/xfs/xfs_inode_item.c         |  7 +++++
 fs/xfs/xfs_inode_item.h         | 18 +++++++++--
 5 files changed, 68 insertions(+), 40 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
index 4504d215cd590..fe6c2e39be85d 100644
--- a/fs/xfs/libxfs/xfs_trans_inode.c
+++ b/fs/xfs/libxfs/xfs_trans_inode.c
@@ -82,16 +82,20 @@ xfs_trans_ichgtime(
  */
 void
 xfs_trans_log_inode(
-	xfs_trans_t	*tp,
-	xfs_inode_t	*ip,
-	uint		flags)
+	struct xfs_trans	*tp,
+	struct xfs_inode	*ip,
+	uint			flags)
 {
-	struct inode	*inode = VFS_I(ip);
+	struct xfs_inode_log_item *iip = ip->i_itemp;
+	struct inode		*inode = VFS_I(ip);
+	uint			iversion_flags = 0;
 
-	ASSERT(ip->i_itemp != NULL);
+	ASSERT(iip);
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
 	ASSERT(!xfs_iflags_test(ip, XFS_ISTALE));
 
+	tp->t_flags |= XFS_TRANS_DIRTY;
+
 	/*
 	 * Don't bother with i_lock for the I_DIRTY_TIME check here, as races
 	 * don't matter - we either will need an extra transaction in 24 hours
@@ -104,15 +108,6 @@ xfs_trans_log_inode(
 		spin_unlock(&inode->i_lock);
 	}
 
-	/*
-	 * Record the specific change for fdatasync optimisation. This
-	 * allows fdatasync to skip log forces for inodes that are only
-	 * timestamp dirty. We do this before the change count so that
-	 * the core being logged in this case does not impact on fdatasync
-	 * behaviour.
-	 */
-	ip->i_itemp->ili_fsync_fields |= flags;
-
 	/*
 	 * First time we log the inode in a transaction, bump the inode change
 	 * counter if it is configured for this to occur. While we have the
@@ -122,23 +117,30 @@ xfs_trans_log_inode(
 	 * set however, then go ahead and bump the i_version counter
 	 * unconditionally.
 	 */
-	if (!test_and_set_bit(XFS_LI_DIRTY, &ip->i_itemp->ili_item.li_flags) &&
-	    IS_I_VERSION(VFS_I(ip))) {
-		if (inode_maybe_inc_iversion(VFS_I(ip), flags & XFS_ILOG_CORE))
-			flags |= XFS_ILOG_CORE;
+	if (!test_and_set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags)) {
+		if (IS_I_VERSION(inode) &&
+		    inode_maybe_inc_iversion(inode, flags & XFS_ILOG_CORE))
+			iversion_flags = XFS_ILOG_CORE;
 	}
 
-	tp->t_flags |= XFS_TRANS_DIRTY;
+	/*
+	 * Record the specific change for fdatasync optimisation. This allows
+	 * fdatasync to skip log forces for inodes that are only timestamp
+	 * dirty. We do this before the change count so that the core being
+	 * logged in this case does not impact on fdatasync behaviour.
+	 */
+	spin_lock(&iip->ili_lock);
+	iip->ili_fsync_fields |= flags;
 
 	/*
-	 * Always OR in the bits from the ili_last_fields field.
-	 * This is to coordinate with the xfs_iflush() and xfs_iflush_done()
-	 * routines in the eventual clearing of the ili_fields bits.
-	 * See the big comment in xfs_iflush() for an explanation of
-	 * this coordination mechanism.
+	 * Always OR in the bits from the ili_last_fields field.  This is to
+	 * coordinate with the xfs_iflush() and xfs_iflush_done() routines in
+	 * the eventual clearing of the ili_fields bits.  See the big comment in
+	 * xfs_iflush() for an explanation of this coordination mechanism.
 	 */
-	flags |= ip->i_itemp->ili_last_fields;
-	ip->i_itemp->ili_fields |= flags;
+	iip->ili_fields |= (flags | iip->ili_last_fields |
+			    iversion_flags);
+	spin_unlock(&iip->ili_lock);
 }
 
 int
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 403c90309a8ff..0abf770b77498 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -94,6 +94,7 @@ xfs_file_fsync(
 {
 	struct inode		*inode = file->f_mapping->host;
 	struct xfs_inode	*ip = XFS_I(inode);
+	struct xfs_inode_log_item *iip = ip->i_itemp;
 	struct xfs_mount	*mp = ip->i_mount;
 	int			error = 0;
 	int			log_flushed = 0;
@@ -137,13 +138,15 @@ xfs_file_fsync(
 	xfs_ilock(ip, XFS_ILOCK_SHARED);
 	if (xfs_ipincount(ip)) {
 		if (!datasync ||
-		    (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
-			lsn = ip->i_itemp->ili_last_lsn;
+		    (iip->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
+			lsn = iip->ili_last_lsn;
 	}
 
 	if (lsn) {
 		error = xfs_log_force_lsn(mp, lsn, XFS_LOG_SYNC, &log_flushed);
-		ip->i_itemp->ili_fsync_fields = 0;
+		spin_lock(&iip->ili_lock);
+		iip->ili_fsync_fields = 0;
+		spin_unlock(&iip->ili_lock);
 	}
 	xfs_iunlock(ip, XFS_ILOCK_SHARED);
 
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 4fa12775ac146..ac3c8af8c9a14 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2702,9 +2702,11 @@ xfs_ifree_cluster(
 				continue;
 
 			iip = ip->i_itemp;
+			spin_lock(&iip->ili_lock);
 			iip->ili_last_fields = iip->ili_fields;
 			iip->ili_fields = 0;
 			iip->ili_fsync_fields = 0;
+			spin_unlock(&iip->ili_lock);
 			xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,
 						&iip->ili_item.li_lsn);
 
@@ -2740,6 +2742,7 @@ xfs_ifree(
 {
 	int			error;
 	struct xfs_icluster	xic = { 0 };
+	struct xfs_inode_log_item *iip = ip->i_itemp;
 
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
 	ASSERT(VFS_I(ip)->i_nlink == 0);
@@ -2777,7 +2780,9 @@ xfs_ifree(
 	ip->i_df.if_format = XFS_DINODE_FMT_EXTENTS;
 
 	/* Don't attempt to replay owner changes for a deleted inode */
-	ip->i_itemp->ili_fields &= ~(XFS_ILOG_AOWNER|XFS_ILOG_DOWNER);
+	spin_lock(&iip->ili_lock);
+	iip->ili_fields &= ~(XFS_ILOG_AOWNER | XFS_ILOG_DOWNER);
+	spin_unlock(&iip->ili_lock);
 
 	/*
 	 * Bump the generation count so no one will be confused
@@ -3833,20 +3838,19 @@ xfs_iflush_int(
 	 * know that the information those bits represent is permanently on
 	 * disk.  As long as the flush completes before the inode is logged
 	 * again, then both ili_fields and ili_last_fields will be cleared.
-	 *
-	 * We can play with the ili_fields bits here, because the inode lock
-	 * must be held exclusively in order to set bits there and the flush
-	 * lock protects the ili_last_fields bits.  Store the current LSN of the
-	 * inode so that we can tell whether the item has moved in the AIL from
-	 * xfs_iflush_done().  In order to read the lsn we need the AIL lock,
-	 * because it is a 64 bit value that cannot be read atomically.
 	 */
 	error = 0;
 flush_out:
+	spin_lock(&iip->ili_lock);
 	iip->ili_last_fields = iip->ili_fields;
 	iip->ili_fields = 0;
 	iip->ili_fsync_fields = 0;
+	spin_unlock(&iip->ili_lock);
 
+	/*
+	 * Store the current LSN of the inode so that we can tell whether the
+	 * item has moved in the AIL from xfs_iflush_done().
+	 */
 	xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,
 				&iip->ili_item.li_lsn);
 
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index b17384aa8df40..6ef9cbcfc94a7 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -637,6 +637,7 @@ xfs_inode_item_init(
 	iip = ip->i_itemp = kmem_zone_zalloc(xfs_ili_zone, 0);
 
 	iip->ili_inode = ip;
+	spin_lock_init(&iip->ili_lock);
 	xfs_log_item_init(mp, &iip->ili_item, XFS_LI_INODE,
 						&xfs_inode_item_ops);
 }
@@ -738,7 +739,11 @@ xfs_iflush_done(
 	list_for_each_entry_safe(blip, n, &tmp, li_bio_list) {
 		list_del_init(&blip->li_bio_list);
 		iip = INODE_ITEM(blip);
+
+		spin_lock(&iip->ili_lock);
 		iip->ili_last_fields = 0;
+		spin_unlock(&iip->ili_lock);
+
 		xfs_ifunlock(iip->ili_inode);
 	}
 	list_del(&tmp);
@@ -762,9 +767,11 @@ xfs_iflush_abort(
 		 * Clear the inode logging fields so no more flushes are
 		 * attempted.
 		 */
+		spin_lock(&iip->ili_lock);
 		iip->ili_last_fields = 0;
 		iip->ili_fields = 0;
 		iip->ili_fsync_fields = 0;
+		spin_unlock(&iip->ili_lock);
 	}
 	/*
 	 * Release the inode's flush lock since we're done with it.
diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h
index 4de5070e07655..44c47c08b0b59 100644
--- a/fs/xfs/xfs_inode_item.h
+++ b/fs/xfs/xfs_inode_item.h
@@ -16,12 +16,24 @@ struct xfs_mount;
 struct xfs_inode_log_item {
 	struct xfs_log_item	ili_item;	   /* common portion */
 	struct xfs_inode	*ili_inode;	   /* inode ptr */
-	xfs_lsn_t		ili_flush_lsn;	   /* lsn at last flush */
-	xfs_lsn_t		ili_last_lsn;	   /* lsn at last transaction */
-	unsigned short		ili_lock_flags;	   /* lock flags */
+	unsigned short		ili_lock_flags;	   /* inode lock flags */
+	/*
+	 * The ili_lock protects the interactions between the dirty state and
+	 * the flush state of the inode log item. This allows us to do atomic
+	 * modifications of multiple state fields without having to hold a
+	 * specific inode lock to serialise them.
+	 *
+	 * We need atomic changes between inode dirtying, inode flushing and
+	 * inode completion, but these all hold different combinations of
+	 * ILOCK and iflock and hence we need some other method of serialising
+	 * updates to the flush state.
+	 */
+	spinlock_t		ili_lock;	   /* flush state lock */
 	unsigned int		ili_last_fields;   /* fields when flushed */
 	unsigned int		ili_fields;	   /* fields to be logged */
 	unsigned int		ili_fsync_fields;  /* logged since last fsync */
+	xfs_lsn_t		ili_flush_lsn;	   /* lsn at last flush */
+	xfs_lsn_t		ili_last_lsn;	   /* lsn at last transaction */
 };
 
 static inline int xfs_inode_clean(xfs_inode_t *ip)
-- 
2.26.2.761.g0e0b3e54be



* [PATCH 04/30] xfs: mark inode buffers in cache
  2020-06-01 21:42 [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (2 preceding siblings ...)
  2020-06-01 21:42 ` [PATCH 03/30] xfs: add an inode item lock Dave Chinner
@ 2020-06-01 21:42 ` Dave Chinner
  2020-06-02 16:45   ` Brian Foster
  2020-06-01 21:42 ` [PATCH 05/30] xfs: mark dquot " Dave Chinner
                   ` (25 subsequent siblings)
  29 siblings, 1 reply; 80+ messages in thread
From: Dave Chinner @ 2020-06-01 21:42 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Inode buffers always have write IO callbacks, so by marking them
directly we can avoid needing to attach ->b_iodone functions to
them. This avoids an indirect call, and makes future modifications
much simpler.

This is largely a rearrangement of the code at this point - there
are no IO completion functionality changes, just modifications to
how the code is run.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_buf.c       | 21 ++++++++++++++++-----
 fs/xfs/xfs_buf.h       | 38 +++++++++++++++++++++++++-------------
 fs/xfs/xfs_buf_item.c  | 42 +++++++++++++++++++++++++++++++-----------
 fs/xfs/xfs_buf_item.h  |  1 +
 fs/xfs/xfs_inode.c     |  2 +-
 fs/xfs/xfs_trans_buf.c |  3 +++
 6 files changed, 77 insertions(+), 30 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 9c2fbb6bbf89d..fcf650575be61 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -14,6 +14,8 @@
 #include "xfs_mount.h"
 #include "xfs_trace.h"
 #include "xfs_log.h"
+#include "xfs_trans.h"
+#include "xfs_buf_item.h"
 #include "xfs_errortag.h"
 #include "xfs_error.h"
 
@@ -1202,12 +1204,21 @@ xfs_buf_ioend(
 		bp->b_flags |= XBF_DONE;
 	}
 
-	if (bp->b_iodone)
+	if (read)
+		goto out_finish;
+
+	if (bp->b_flags & _XBF_INODES) {
+		xfs_buf_inode_iodone(bp);
+		return;
+	}
+
+	if (bp->b_iodone) {
 		(*(bp->b_iodone))(bp);
-	else if (bp->b_flags & XBF_ASYNC)
-		xfs_buf_relse(bp);
-	else
-		complete(&bp->b_iowait);
+		return;
+	}
+
+out_finish:
+	xfs_buf_ioend_finish(bp);
 }
 
 static void
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index 050c53b739e24..2400cb90a04c6 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -30,15 +30,18 @@
 #define XBF_STALE	 (1 << 6) /* buffer has been staled, do not find it */
 #define XBF_WRITE_FAIL	 (1 << 7) /* async writes have failed on this buffer */
 
-/* flags used only as arguments to access routines */
-#define XBF_TRYLOCK	 (1 << 16)/* lock requested, but do not wait */
-#define XBF_UNMAPPED	 (1 << 17)/* do not map the buffer */
+/* buffer type flags for write callbacks */
+#define _XBF_INODES	 (1 << 16)/* inode buffer */
 
 /* flags used only internally */
 #define _XBF_PAGES	 (1 << 20)/* backed by refcounted pages */
 #define _XBF_KMEM	 (1 << 21)/* backed by heap memory */
 #define _XBF_DELWRI_Q	 (1 << 22)/* buffer on a delwri queue */
 
+/* flags used only as arguments to access routines */
+#define XBF_TRYLOCK	 (1 << 30)/* lock requested, but do not wait */
+#define XBF_UNMAPPED	 (1 << 31)/* do not map the buffer */
+
 typedef unsigned int xfs_buf_flags_t;
 
 #define XFS_BUF_FLAGS \
@@ -50,12 +53,13 @@ typedef unsigned int xfs_buf_flags_t;
 	{ XBF_DONE,		"DONE" }, \
 	{ XBF_STALE,		"STALE" }, \
 	{ XBF_WRITE_FAIL,	"WRITE_FAIL" }, \
-	{ XBF_TRYLOCK,		"TRYLOCK" },	/* should never be set */\
-	{ XBF_UNMAPPED,		"UNMAPPED" },	/* ditto */\
+	{ _XBF_INODES,		"INODES" }, \
 	{ _XBF_PAGES,		"PAGES" }, \
 	{ _XBF_KMEM,		"KMEM" }, \
-	{ _XBF_DELWRI_Q,	"DELWRI_Q" }
-
+	{ _XBF_DELWRI_Q,	"DELWRI_Q" }, \
+	/* The following interface flags should never be set */ \
+	{ XBF_TRYLOCK,		"TRYLOCK" }, \
+	{ XBF_UNMAPPED,		"UNMAPPED" }
 
 /*
  * Internal state flags.
@@ -257,9 +261,23 @@ extern void xfs_buf_unlock(xfs_buf_t *);
 #define xfs_buf_islocked(bp) \
 	((bp)->b_sema.count <= 0)
 
+static inline void xfs_buf_relse(xfs_buf_t *bp)
+{
+	xfs_buf_unlock(bp);
+	xfs_buf_rele(bp);
+}
+
 /* Buffer Read and Write Routines */
 extern int xfs_bwrite(struct xfs_buf *bp);
 extern void xfs_buf_ioend(struct xfs_buf *bp);
+static inline void xfs_buf_ioend_finish(struct xfs_buf *bp)
+{
+	if (bp->b_flags & XBF_ASYNC)
+		xfs_buf_relse(bp);
+	else
+		complete(&bp->b_iowait);
+}
+
 extern void __xfs_buf_ioerror(struct xfs_buf *bp, int error,
 		xfs_failaddr_t failaddr);
 #define xfs_buf_ioerror(bp, err) __xfs_buf_ioerror((bp), (err), __this_address)
@@ -324,12 +342,6 @@ static inline int xfs_buf_ispinned(struct xfs_buf *bp)
 	return atomic_read(&bp->b_pin_count);
 }
 
-static inline void xfs_buf_relse(xfs_buf_t *bp)
-{
-	xfs_buf_unlock(bp);
-	xfs_buf_rele(bp);
-}
-
 static inline int
 xfs_buf_verify_cksum(struct xfs_buf *bp, unsigned long cksum_offset)
 {
diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 9e75e8d6042ec..8659cf4282a64 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -1158,20 +1158,15 @@ xfs_buf_iodone_callback_error(
 	return false;
 }
 
-/*
- * This is the iodone() function for buffers which have had callbacks attached
- * to them by xfs_buf_attach_iodone(). We need to iterate the items on the
- * callback list, mark the buffer as having no more callbacks and then push the
- * buffer through IO completion processing.
- */
-void
-xfs_buf_iodone_callbacks(
+static void
+xfs_buf_run_callbacks(
 	struct xfs_buf		*bp)
 {
+
 	/*
-	 * If there is an error, process it. Some errors require us
-	 * to run callbacks after failure processing is done so we
-	 * detect that and take appropriate action.
+	 * If there is an error, process it. Some errors require us to run
+	 * callbacks after failure processing is done so we detect that and take
+	 * appropriate action.
 	 */
 	if (bp->b_error && xfs_buf_iodone_callback_error(bp))
 		return;
@@ -1188,9 +1183,34 @@ xfs_buf_iodone_callbacks(
 	bp->b_log_item = NULL;
 	list_del_init(&bp->b_li_list);
 	bp->b_iodone = NULL;
+}
+
+/*
+ * This is the iodone() function for buffers which have had callbacks attached
+ * to them by xfs_buf_attach_iodone(). We need to iterate the items on the
+ * callback list, mark the buffer as having no more callbacks and then push the
+ * buffer through IO completion processing.
+ */
+void
+xfs_buf_iodone_callbacks(
+	struct xfs_buf		*bp)
+{
+	xfs_buf_run_callbacks(bp);
 	xfs_buf_ioend(bp);
 }
 
+/*
+ * Inode buffer iodone callback function.
+ */
+void
+xfs_buf_inode_iodone(
+	struct xfs_buf		*bp)
+{
+	xfs_buf_run_callbacks(bp);
+	xfs_buf_ioend_finish(bp);
+}
+
+
 /*
  * This is the iodone() function for buffers which have been
  * logged.  It is called when they are eventually flushed out.
diff --git a/fs/xfs/xfs_buf_item.h b/fs/xfs/xfs_buf_item.h
index c9c57e2da9327..a342933ad9b8d 100644
--- a/fs/xfs/xfs_buf_item.h
+++ b/fs/xfs/xfs_buf_item.h
@@ -59,6 +59,7 @@ void	xfs_buf_attach_iodone(struct xfs_buf *,
 			      struct xfs_log_item *);
 void	xfs_buf_iodone_callbacks(struct xfs_buf *);
 void	xfs_buf_iodone(struct xfs_buf *, struct xfs_log_item *);
+void	xfs_buf_inode_iodone(struct xfs_buf *);
 bool	xfs_buf_log_check_iovec(struct xfs_log_iovec *iovec);
 
 extern kmem_zone_t	*xfs_buf_item_zone;
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index ac3c8af8c9a14..d5dee57f914a9 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3860,13 +3860,13 @@ xfs_iflush_int(
 	 * completion on the buffer to remove the inode from the AIL and release
 	 * the flush lock.
 	 */
+	bp->b_flags |= _XBF_INODES;
 	xfs_buf_attach_iodone(bp, xfs_iflush_done, &iip->ili_item);
 
 	/* generate the checksum. */
 	xfs_dinode_calc_crc(mp, dip);
 
 	ASSERT(!list_empty(&bp->b_li_list));
-	ASSERT(bp->b_iodone != NULL);
 	return error;
 }
 
diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c
index 08174ffa21189..552d0869aa0fe 100644
--- a/fs/xfs/xfs_trans_buf.c
+++ b/fs/xfs/xfs_trans_buf.c
@@ -626,6 +626,7 @@ xfs_trans_inode_buf(
 	ASSERT(atomic_read(&bip->bli_refcount) > 0);
 
 	bip->bli_flags |= XFS_BLI_INODE_BUF;
+	bp->b_flags |= _XBF_INODES;
 	xfs_trans_buf_set_type(tp, bp, XFS_BLFT_DINO_BUF);
 }
 
@@ -651,6 +652,7 @@ xfs_trans_stale_inode_buf(
 
 	bip->bli_flags |= XFS_BLI_STALE_INODE;
 	bip->bli_item.li_cb = xfs_buf_iodone;
+	bp->b_flags |= _XBF_INODES;
 	xfs_trans_buf_set_type(tp, bp, XFS_BLFT_DINO_BUF);
 }
 
@@ -675,6 +677,7 @@ xfs_trans_inode_alloc_buf(
 	ASSERT(atomic_read(&bip->bli_refcount) > 0);
 
 	bip->bli_flags |= XFS_BLI_INODE_ALLOC_BUF;
+	bp->b_flags |= _XBF_INODES;
 	xfs_trans_buf_set_type(tp, bp, XFS_BLFT_DINO_BUF);
 }
 
-- 
2.26.2.761.g0e0b3e54be



* [PATCH 05/30] xfs: mark dquot buffers in cache
  2020-06-01 21:42 [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (3 preceding siblings ...)
  2020-06-01 21:42 ` [PATCH 04/30] xfs: mark inode buffers in cache Dave Chinner
@ 2020-06-01 21:42 ` Dave Chinner
  2020-06-02 16:45   ` Brian Foster
  2020-06-02 19:00   ` Darrick J. Wong
  2020-06-01 21:42 ` [PATCH 06/30] xfs: mark log recovery buffers for completion Dave Chinner
                   ` (24 subsequent siblings)
  29 siblings, 2 replies; 80+ messages in thread
From: Dave Chinner @ 2020-06-01 21:42 UTC (permalink / raw)
  To: linux-xfs

dquot buffers always have write IO callbacks, so by marking them
directly we can avoid needing to attach ->b_iodone functions to
them. This avoids an indirect call, and makes future modifications
much simpler.

This is largely a rearrangement of the code at this point - there
are no IO completion functionality changes, just modifications to
how the code is run.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_buf.c       |  5 +++++
 fs/xfs/xfs_buf.h       |  2 ++
 fs/xfs/xfs_buf_item.c  | 10 ++++++++++
 fs/xfs/xfs_buf_item.h  |  1 +
 fs/xfs/xfs_dquot.c     |  1 +
 fs/xfs/xfs_trans_buf.c |  1 +
 6 files changed, 20 insertions(+)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index fcf650575be61..3bffde8640a52 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1212,6 +1212,11 @@ xfs_buf_ioend(
 		return;
 	}
 
+	if (bp->b_flags & _XBF_DQUOTS) {
+		xfs_buf_dquot_iodone(bp);
+		return;
+	}
+
 	if (bp->b_iodone) {
 		(*(bp->b_iodone))(bp);
 		return;
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index 2400cb90a04c6..c1d0843206dd6 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -32,6 +32,7 @@
 
 /* buffer type flags for write callbacks */
 #define _XBF_INODES	 (1 << 16)/* inode buffer */
+#define _XBF_DQUOTS	 (1 << 17)/* dquot buffer */
 
 /* flags used only internally */
 #define _XBF_PAGES	 (1 << 20)/* backed by refcounted pages */
@@ -54,6 +55,7 @@ typedef unsigned int xfs_buf_flags_t;
 	{ XBF_STALE,		"STALE" }, \
 	{ XBF_WRITE_FAIL,	"WRITE_FAIL" }, \
 	{ _XBF_INODES,		"INODES" }, \
+	{ _XBF_DQUOTS,		"DQUOTS" }, \
 	{ _XBF_PAGES,		"PAGES" }, \
 	{ _XBF_KMEM,		"KMEM" }, \
 	{ _XBF_DELWRI_Q,	"DELWRI_Q" }, \
diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 8659cf4282a64..a42cdf9ccc47d 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -1210,6 +1210,16 @@ xfs_buf_inode_iodone(
 	xfs_buf_ioend_finish(bp);
 }
 
+/*
+ * Dquot buffer iodone callback function.
+ */
+void
+xfs_buf_dquot_iodone(
+	struct xfs_buf		*bp)
+{
+	xfs_buf_run_callbacks(bp);
+	xfs_buf_ioend_finish(bp);
+}
 
 /*
  * This is the iodone() function for buffers which have been
diff --git a/fs/xfs/xfs_buf_item.h b/fs/xfs/xfs_buf_item.h
index a342933ad9b8d..27d13d29b5bbb 100644
--- a/fs/xfs/xfs_buf_item.h
+++ b/fs/xfs/xfs_buf_item.h
@@ -60,6 +60,7 @@ void	xfs_buf_attach_iodone(struct xfs_buf *,
 void	xfs_buf_iodone_callbacks(struct xfs_buf *);
 void	xfs_buf_iodone(struct xfs_buf *, struct xfs_log_item *);
 void	xfs_buf_inode_iodone(struct xfs_buf *);
+void	xfs_buf_dquot_iodone(struct xfs_buf *);
 bool	xfs_buf_log_check_iovec(struct xfs_log_iovec *iovec);
 
 extern kmem_zone_t	*xfs_buf_item_zone;
diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
index d5b7f03e93c8d..2e2146fa0914c 100644
--- a/fs/xfs/xfs_dquot.c
+++ b/fs/xfs/xfs_dquot.c
@@ -1179,6 +1179,7 @@ xfs_qm_dqflush(
 	 * Attach an iodone routine so that we can remove this dquot from the
 	 * AIL and release the flush lock once the dquot is synced to disk.
 	 */
+	bp->b_flags |= _XBF_DQUOTS;
 	xfs_buf_attach_iodone(bp, xfs_qm_dqflush_done,
 				  &dqp->q_logitem.qli_item);
 
diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c
index 552d0869aa0fe..93d62cb864c15 100644
--- a/fs/xfs/xfs_trans_buf.c
+++ b/fs/xfs/xfs_trans_buf.c
@@ -788,5 +788,6 @@ xfs_trans_dquot_buf(
 		break;
 	}
 
+	bp->b_flags |= _XBF_DQUOTS;
 	xfs_trans_buf_set_type(tp, bp, type);
 }
-- 
2.26.2.761.g0e0b3e54be



* [PATCH 06/30] xfs: mark log recovery buffers for completion
  2020-06-01 21:42 [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (4 preceding siblings ...)
  2020-06-01 21:42 ` [PATCH 05/30] xfs: mark dquot " Dave Chinner
@ 2020-06-01 21:42 ` Dave Chinner
  2020-06-02 16:45   ` Brian Foster
  2020-06-02 19:24   ` Darrick J. Wong
  2020-06-01 21:42 ` [PATCH 07/30] xfs: call xfs_buf_iodone directly Dave Chinner
                   ` (23 subsequent siblings)
  29 siblings, 2 replies; 80+ messages in thread
From: Dave Chinner @ 2020-06-01 21:42 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Log recovery has its own buffer write completion handler for
buffers that it directly recovers. Convert these to direct calls by
flagging these buffers as being log recovery buffers. The flag will
get cleared by the log recovery IO completion routine, so it will
never leak out of log recovery.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_buf.c                | 10 ++++++++++
 fs/xfs/xfs_buf.h                |  2 ++
 fs/xfs/xfs_buf_item_recover.c   |  5 ++---
 fs/xfs/xfs_dquot_item_recover.c |  2 +-
 fs/xfs/xfs_inode_item_recover.c |  2 +-
 fs/xfs/xfs_log_recover.c        |  5 ++---
 6 files changed, 18 insertions(+), 8 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 3bffde8640a52..0a69de674af9d 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -14,6 +14,7 @@
 #include "xfs_mount.h"
 #include "xfs_trace.h"
 #include "xfs_log.h"
+#include "xfs_log_recover.h"
 #include "xfs_trans.h"
 #include "xfs_buf_item.h"
 #include "xfs_errortag.h"
@@ -1207,6 +1208,15 @@ xfs_buf_ioend(
 	if (read)
 		goto out_finish;
 
+	/*
+	 * If this is a log recovery buffer, we aren't doing transactional IO
+	 * yet so we need to let it handle IO completions.
+	 */
+	if (bp->b_flags & _XBF_LOGRECOVERY) {
+		xlog_recover_iodone(bp);
+		return;
+	}
+
 	if (bp->b_flags & _XBF_INODES) {
 		xfs_buf_inode_iodone(bp);
 		return;
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index c1d0843206dd6..30dabc5bae96d 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -33,6 +33,7 @@
 /* buffer type flags for write callbacks */
 #define _XBF_INODES	 (1 << 16)/* inode buffer */
 #define _XBF_DQUOTS	 (1 << 17)/* dquot buffer */
+#define _XBF_LOGRECOVERY	 (1 << 18)/* log recovery buffer */
 
 /* flags used only internally */
 #define _XBF_PAGES	 (1 << 20)/* backed by refcounted pages */
@@ -56,6 +57,7 @@ typedef unsigned int xfs_buf_flags_t;
 	{ XBF_WRITE_FAIL,	"WRITE_FAIL" }, \
 	{ _XBF_INODES,		"INODES" }, \
 	{ _XBF_DQUOTS,		"DQUOTS" }, \
+	{ _XBF_LOGRECOVERY,		"LOG_RECOVERY" }, \
 	{ _XBF_PAGES,		"PAGES" }, \
 	{ _XBF_KMEM,		"KMEM" }, \
 	{ _XBF_DELWRI_Q,	"DELWRI_Q" }, \
diff --git a/fs/xfs/xfs_buf_item_recover.c b/fs/xfs/xfs_buf_item_recover.c
index 04faa7310c4f0..74c851f60eeeb 100644
--- a/fs/xfs/xfs_buf_item_recover.c
+++ b/fs/xfs/xfs_buf_item_recover.c
@@ -419,8 +419,7 @@ xlog_recover_validate_buf_type(
 	if (bp->b_ops) {
 		struct xfs_buf_log_item	*bip;
 
-		ASSERT(!bp->b_iodone || bp->b_iodone == xlog_recover_iodone);
-		bp->b_iodone = xlog_recover_iodone;
+		bp->b_flags |= _XBF_LOGRECOVERY;
 		xfs_buf_item_init(bp, mp);
 		bip = bp->b_log_item;
 		bip->bli_item.li_lsn = current_lsn;
@@ -963,7 +962,7 @@ xlog_recover_buf_commit_pass2(
 		error = xfs_bwrite(bp);
 	} else {
 		ASSERT(bp->b_mount == mp);
-		bp->b_iodone = xlog_recover_iodone;
+		bp->b_flags |= _XBF_LOGRECOVERY;
 		xfs_buf_delwri_queue(bp, buffer_list);
 	}
 
diff --git a/fs/xfs/xfs_dquot_item_recover.c b/fs/xfs/xfs_dquot_item_recover.c
index 3400be4c88f08..f9ea9f55aa7cc 100644
--- a/fs/xfs/xfs_dquot_item_recover.c
+++ b/fs/xfs/xfs_dquot_item_recover.c
@@ -153,7 +153,7 @@ xlog_recover_dquot_commit_pass2(
 
 	ASSERT(dq_f->qlf_size == 2);
 	ASSERT(bp->b_mount == mp);
-	bp->b_iodone = xlog_recover_iodone;
+	bp->b_flags |= _XBF_LOGRECOVERY;
 	xfs_buf_delwri_queue(bp, buffer_list);
 
 out_release:
diff --git a/fs/xfs/xfs_inode_item_recover.c b/fs/xfs/xfs_inode_item_recover.c
index dc3e26ff16c90..5e0d291835b35 100644
--- a/fs/xfs/xfs_inode_item_recover.c
+++ b/fs/xfs/xfs_inode_item_recover.c
@@ -376,7 +376,7 @@ xlog_recover_inode_commit_pass2(
 	xfs_dinode_calc_crc(log->l_mp, dip);
 
 	ASSERT(bp->b_mount == mp);
-	bp->b_iodone = xlog_recover_iodone;
+	bp->b_flags |= _XBF_LOGRECOVERY;
 	xfs_buf_delwri_queue(bp, buffer_list);
 
 out_release:
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index ec015df55b77a..52a65a74208ff 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -287,9 +287,8 @@ xlog_recover_iodone(
 	if (bp->b_log_item)
 		xfs_buf_item_relse(bp);
 	ASSERT(bp->b_log_item == NULL);
-
-	bp->b_iodone = NULL;
-	xfs_buf_ioend(bp);
+	bp->b_flags &= ~_XBF_LOGRECOVERY;
+	xfs_buf_ioend_finish(bp);
 }
 
 /*
-- 
2.26.2.761.g0e0b3e54be



* [PATCH 07/30] xfs: call xfs_buf_iodone directly
  2020-06-01 21:42 [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (5 preceding siblings ...)
  2020-06-01 21:42 ` [PATCH 06/30] xfs: mark log recovery buffers for completion Dave Chinner
@ 2020-06-01 21:42 ` Dave Chinner
  2020-06-02 16:47   ` Brian Foster
  2020-06-01 21:42 ` [PATCH 08/30] xfs: clean up whacky buffer log item list reinit Dave Chinner
                   ` (22 subsequent siblings)
  29 siblings, 1 reply; 80+ messages in thread
From: Dave Chinner @ 2020-06-01 21:42 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

All unmarked dirty buffers should be in the AIL and have log items
attached to them. Hence when they are written, we will run a
callback to remove the item from the AIL if appropriate. Now that
we've handled inode and dquot buffers, all remaining calls are to
xfs_buf_iodone() and so we can hard code this rather than use an
indirect call.
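
To make the shape of the change concrete, here's a minimal userspace
sketch of the pattern this moves to - completion dispatch keyed off
buffer type flags with a hard-coded default, rather than an indirect
b_iodone pointer. All names here are illustrative, not the kernel
implementation:

#include <stdio.h>

#define _XBF_INODES     (1 << 0)
#define _XBF_DQUOTS     (1 << 1)

struct xbuf {
        unsigned int    flags;
};

static void inode_iodone(struct xbuf *bp) { (void)bp; puts("inode completion"); }
static void dquot_iodone(struct xbuf *bp) { (void)bp; puts("dquot completion"); }
static void buf_iodone(struct xbuf *bp)   { (void)bp; puts("default completion"); }

/* Route completion by buffer type; everything else is hard-coded. */
static void buf_ioend(struct xbuf *bp)
{
        if (bp->flags & _XBF_INODES) {
                inode_iodone(bp);
                return;
        }
        if (bp->flags & _XBF_DQUOTS) {
                dquot_iodone(bp);
                return;
        }
        buf_iodone(bp);
}

int main(void)
{
        struct xbuf ibuf = { _XBF_INODES };
        struct xbuf dbuf = { 0 };

        buf_ioend(&ibuf);       /* "inode completion" */
        buf_ioend(&dbuf);       /* "default completion" */
        return 0;
}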

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/xfs/xfs_buf.c       | 24 ++++++++----------------
 fs/xfs/xfs_buf.h       |  6 +-----
 fs/xfs/xfs_buf_item.c  | 40 ++++++++++------------------------------
 fs/xfs/xfs_buf_item.h  |  4 ++--
 fs/xfs/xfs_trans_buf.c | 13 +++----------
 5 files changed, 24 insertions(+), 63 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 0a69de674af9d..d7695b638e994 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -658,7 +658,6 @@ xfs_buf_find(
 	 */
 	if (bp->b_flags & XBF_STALE) {
 		ASSERT((bp->b_flags & _XBF_DELWRI_Q) == 0);
-		ASSERT(bp->b_iodone == NULL);
 		bp->b_flags &= _XBF_KMEM | _XBF_PAGES;
 		bp->b_ops = NULL;
 	}
@@ -1194,10 +1193,13 @@ xfs_buf_ioend(
 	if (!bp->b_error && bp->b_io_error)
 		xfs_buf_ioerror(bp, bp->b_io_error);
 
-	/* Only validate buffers that were read without errors */
-	if (read && !bp->b_error && bp->b_ops) {
-		ASSERT(!bp->b_iodone);
-		bp->b_ops->verify_read(bp);
+	if (read) {
+		if (!bp->b_error && bp->b_ops)
+			bp->b_ops->verify_read(bp);
+		if (!bp->b_error)
+			bp->b_flags |= XBF_DONE;
+		xfs_buf_ioend_finish(bp);
+		return;
 	}
 
 	if (!bp->b_error) {
@@ -1205,9 +1207,6 @@ xfs_buf_ioend(
 		bp->b_flags |= XBF_DONE;
 	}
 
-	if (read)
-		goto out_finish;
-
 	/*
 	 * If this is a log recovery buffer, we aren't doing transactional IO
 	 * yet so we need to let it handle IO completions.
@@ -1226,14 +1225,7 @@ xfs_buf_ioend(
 		xfs_buf_dquot_iodone(bp);
 		return;
 	}
-
-	if (bp->b_iodone) {
-		(*(bp->b_iodone))(bp);
-		return;
-	}
-
-out_finish:
-	xfs_buf_ioend_finish(bp);
+	xfs_buf_iodone(bp);
 }
 
 static void
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index 30dabc5bae96d..755b652e695ac 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -18,6 +18,7 @@
 /*
  *	Base types
  */
+struct xfs_buf;
 
 #define XFS_BUF_DADDR_NULL	((xfs_daddr_t) (-1LL))
 
@@ -102,10 +103,6 @@ typedef struct xfs_buftarg {
 	struct ratelimit_state	bt_ioerror_rl;
 } xfs_buftarg_t;
 
-struct xfs_buf;
-typedef void (*xfs_buf_iodone_t)(struct xfs_buf *);
-
-
 #define XB_PAGES	2
 
 struct xfs_buf_map {
@@ -158,7 +155,6 @@ typedef struct xfs_buf {
 	xfs_buftarg_t		*b_target;	/* buffer target (device) */
 	void			*b_addr;	/* virtual address of buffer */
 	struct work_struct	b_ioend_work;
-	xfs_buf_iodone_t	b_iodone;	/* I/O completion function */
 	struct completion	b_iowait;	/* queue for I/O waiters */
 	struct xfs_buf_log_item	*b_log_item;
 	struct list_head	b_li_list;	/* Log items list head */
diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index a42cdf9ccc47d..d87ae6363a130 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -460,7 +460,6 @@ xfs_buf_item_unpin(
 			xfs_buf_do_callbacks(bp);
 			bp->b_log_item = NULL;
 			list_del_init(&bp->b_li_list);
-			bp->b_iodone = NULL;
 		} else {
 			xfs_trans_ail_delete(lip, SHUTDOWN_LOG_IO_ERROR);
 			xfs_buf_item_relse(bp);
@@ -936,11 +935,7 @@ xfs_buf_item_free(
 }
 
 /*
- * This is called when the buf log item is no longer needed.  It should
- * free the buf log item associated with the given buffer and clear
- * the buffer's pointer to the buf log item.  If there are no more
- * items in the list, clear the b_iodone field of the buffer (see
- * xfs_buf_attach_iodone() below).
+ * xfs_buf_item_relse() is called when the buf log item is no longer needed.
  */
 void
 xfs_buf_item_relse(
@@ -952,9 +947,6 @@ xfs_buf_item_relse(
 	ASSERT(!test_bit(XFS_LI_IN_AIL, &bip->bli_item.li_flags));
 
 	bp->b_log_item = NULL;
-	if (list_empty(&bp->b_li_list))
-		bp->b_iodone = NULL;
-
 	xfs_buf_rele(bp);
 	xfs_buf_item_free(bip);
 }
@@ -962,10 +954,7 @@ xfs_buf_item_relse(
 
 /*
  * Add the given log item with its callback to the list of callbacks
- * to be called when the buffer's I/O completes.  If it is not set
- * already, set the buffer's b_iodone() routine to be
- * xfs_buf_iodone_callbacks() and link the log item into the list of
- * items rooted at b_li_list.
+ * to be called when the buffer's I/O completes.
  */
 void
 xfs_buf_attach_iodone(
@@ -977,10 +966,6 @@ xfs_buf_attach_iodone(
 
 	lip->li_cb = cb;
 	list_add_tail(&lip->li_bio_list, &bp->b_li_list);
-
-	ASSERT(bp->b_iodone == NULL ||
-	       bp->b_iodone == xfs_buf_iodone_callbacks);
-	bp->b_iodone = xfs_buf_iodone_callbacks;
 }
 
 /*
@@ -1096,7 +1081,6 @@ xfs_buf_iodone_callback_error(
 		goto out_stale;
 
 	trace_xfs_buf_item_iodone_async(bp, _RET_IP_);
-	ASSERT(bp->b_iodone != NULL);
 
 	cfg = xfs_error_get_cfg(mp, XFS_ERR_METADATA, bp->b_error);
 
@@ -1182,28 +1166,24 @@ xfs_buf_run_callbacks(
 	xfs_buf_do_callbacks(bp);
 	bp->b_log_item = NULL;
 	list_del_init(&bp->b_li_list);
-	bp->b_iodone = NULL;
 }
 
 /*
- * This is the iodone() function for buffers which have had callbacks attached
- * to them by xfs_buf_attach_iodone(). We need to iterate the items on the
- * callback list, mark the buffer as having no more callbacks and then push the
- * buffer through IO completion processing.
+ * Inode buffer iodone callback function.
  */
 void
-xfs_buf_iodone_callbacks(
+xfs_buf_inode_iodone(
 	struct xfs_buf		*bp)
 {
 	xfs_buf_run_callbacks(bp);
-	xfs_buf_ioend(bp);
+	xfs_buf_ioend_finish(bp);
 }
 
 /*
- * Inode buffer iodone callback function.
+ * Dquot buffer iodone callback function.
  */
 void
-xfs_buf_inode_iodone(
+xfs_buf_dquot_iodone(
 	struct xfs_buf		*bp)
 {
 	xfs_buf_run_callbacks(bp);
@@ -1211,10 +1191,10 @@ xfs_buf_inode_iodone(
 }
 
 /*
- * Dquot buffer iodone callback function.
+ * Dirty buffer iodone callback function.
  */
 void
-xfs_buf_dquot_iodone(
+xfs_buf_iodone(
 	struct xfs_buf		*bp)
 {
 	xfs_buf_run_callbacks(bp);
@@ -1229,7 +1209,7 @@ xfs_buf_dquot_iodone(
  * care of cleaning up the buffer itself.
  */
 void
-xfs_buf_iodone(
+xfs_buf_item_iodone(
 	struct xfs_buf		*bp,
 	struct xfs_log_item	*lip)
 {
diff --git a/fs/xfs/xfs_buf_item.h b/fs/xfs/xfs_buf_item.h
index 27d13d29b5bbb..610cd00193289 100644
--- a/fs/xfs/xfs_buf_item.h
+++ b/fs/xfs/xfs_buf_item.h
@@ -57,10 +57,10 @@ bool	xfs_buf_item_dirty_format(struct xfs_buf_log_item *);
 void	xfs_buf_attach_iodone(struct xfs_buf *,
 			      void(*)(struct xfs_buf *, struct xfs_log_item *),
 			      struct xfs_log_item *);
-void	xfs_buf_iodone_callbacks(struct xfs_buf *);
-void	xfs_buf_iodone(struct xfs_buf *, struct xfs_log_item *);
+void	xfs_buf_item_iodone(struct xfs_buf *, struct xfs_log_item *);
 void	xfs_buf_inode_iodone(struct xfs_buf *);
 void	xfs_buf_dquot_iodone(struct xfs_buf *);
+void	xfs_buf_iodone(struct xfs_buf *);
 bool	xfs_buf_log_check_iovec(struct xfs_log_iovec *iovec);
 
 extern kmem_zone_t	*xfs_buf_item_zone;
diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c
index 93d62cb864c15..6752676b94fe7 100644
--- a/fs/xfs/xfs_trans_buf.c
+++ b/fs/xfs/xfs_trans_buf.c
@@ -465,24 +465,17 @@ xfs_trans_dirty_buf(
 
 	ASSERT(bp->b_transp == tp);
 	ASSERT(bip != NULL);
-	ASSERT(bp->b_iodone == NULL ||
-	       bp->b_iodone == xfs_buf_iodone_callbacks);
 
 	/*
 	 * Mark the buffer as needing to be written out eventually,
 	 * and set its iodone function to remove the buffer's buf log
 	 * item from the AIL and free it when the buffer is flushed
-	 * to disk.  See xfs_buf_attach_iodone() for more details
-	 * on li_cb and xfs_buf_iodone_callbacks().
-	 * If we end up aborting this transaction, we trap this buffer
-	 * inside the b_bdstrat callback so that this won't get written to
-	 * disk.
+	 * to disk.
 	 */
 	bp->b_flags |= XBF_DONE;
 
 	ASSERT(atomic_read(&bip->bli_refcount) > 0);
-	bp->b_iodone = xfs_buf_iodone_callbacks;
-	bip->bli_item.li_cb = xfs_buf_iodone;
+	bip->bli_item.li_cb = xfs_buf_item_iodone;
 
 	/*
 	 * If we invalidated the buffer within this transaction, then
@@ -651,7 +644,7 @@ xfs_trans_stale_inode_buf(
 	ASSERT(atomic_read(&bip->bli_refcount) > 0);
 
 	bip->bli_flags |= XFS_BLI_STALE_INODE;
-	bip->bli_item.li_cb = xfs_buf_iodone;
+	bip->bli_item.li_cb = xfs_buf_item_iodone;
 	bp->b_flags |= _XBF_INODES;
 	xfs_trans_buf_set_type(tp, bp, XFS_BLFT_DINO_BUF);
 }
-- 
2.26.2.761.g0e0b3e54be



* [PATCH 08/30] xfs: clean up whacky buffer log item list reinit
  2020-06-01 21:42 [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (6 preceding siblings ...)
  2020-06-01 21:42 ` [PATCH 07/30] xfs: call xfs_buf_iodone directly Dave Chinner
@ 2020-06-01 21:42 ` Dave Chinner
  2020-06-02 16:47   ` Brian Foster
  2020-06-01 21:42 ` [PATCH 09/30] xfs: make inode IO completion buffer centric Dave Chinner
                   ` (21 subsequent siblings)
  29 siblings, 1 reply; 80+ messages in thread
From: Dave Chinner @ 2020-06-01 21:42 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

When we've emptied the buffer log item list, the code does a
list_del_init on the list head to reset its pointers to itself. This
is unnecessary as the list is already empty at this point - it is a
left-over fragment from the list_head conversion of the buffer log
item list. Remove these calls.
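
For reference, a minimal sketch of the list_head idiom (simplified
from the kernel's <linux/list.h>) shows why the call is a no-op once
the list is empty - the head already points at itself:

#include <assert.h>

struct list_head { struct list_head *next, *prev; };

static void INIT_LIST_HEAD(struct list_head *h) { h->next = h->prev = h; }

static void list_del_init(struct list_head *e)
{
        e->prev->next = e->next;
        e->next->prev = e->prev;
        INIT_LIST_HEAD(e);
}

static int list_empty(const struct list_head *h) { return h->next == h; }

int main(void)
{
        struct list_head head;

        INIT_LIST_HEAD(&head);
        assert(list_empty(&head));
        list_del_init(&head);   /* no-op: head already points at itself */
        assert(head.next == &head && head.prev == &head);
        return 0;
}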

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_buf_item.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index d87ae6363a130..5b3cd5e90947c 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -459,7 +459,6 @@ xfs_buf_item_unpin(
 		if (bip->bli_flags & XFS_BLI_STALE_INODE) {
 			xfs_buf_do_callbacks(bp);
 			bp->b_log_item = NULL;
-			list_del_init(&bp->b_li_list);
 		} else {
 			xfs_trans_ail_delete(lip, SHUTDOWN_LOG_IO_ERROR);
 			xfs_buf_item_relse(bp);
@@ -1165,7 +1164,6 @@ xfs_buf_run_callbacks(
 
 	xfs_buf_do_callbacks(bp);
 	bp->b_log_item = NULL;
-	list_del_init(&bp->b_li_list);
 }
 
 /*
-- 
2.26.2.761.g0e0b3e54be



* [PATCH 09/30] xfs: make inode IO completion buffer centric
  2020-06-01 21:42 [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (7 preceding siblings ...)
  2020-06-01 21:42 ` [PATCH 08/30] xfs: clean up whacky buffer log item list reinit Dave Chinner
@ 2020-06-01 21:42 ` Dave Chinner
  2020-06-03 14:58   ` Brian Foster
  2020-06-01 21:42 ` [PATCH 10/30] xfs: use direct calls for dquot IO completion Dave Chinner
                   ` (20 subsequent siblings)
  29 siblings, 1 reply; 80+ messages in thread
From: Dave Chinner @ 2020-06-01 21:42 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Having different IO completion callbacks for different inode states
makes things complex. We can detect if the inode is stale via the
XFS_ISTALE flag in IO completion, so we don't need a special
callback just for this.

This means inodes only have a single iodone callback, and inode IO
completion is entirely buffer centric at this point. Hence we no
longer need to use a log item callback at all as we can just call
xfs_iflush_done() directly from the buffer completions and walk the
buffer log item list to complete all the inodes under IO.
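
A simplified userspace sketch of the resulting completion walk -
illustrative names only, with the AIL interactions and locking
omitted - one pass over the attached items, aborting stale inodes
inline:

#include <stdio.h>

#define XFS_ISTALE      (1 << 0)

struct inode_item {
        const char      *name;
        unsigned int    flags;
};

static void iflush_abort(struct inode_item *iip)
{
        printf("%s: stale, abort flush\n", iip->name);
}

static void iflush_complete(struct inode_item *iip)
{
        printf("%s: clean, remove from AIL and unlock\n", iip->name);
}

/*
 * One pass over everything attached to the buffer: stale inodes are
 * detected inline and aborted, all others complete normally. No
 * special-cased "initial" inode, no per-item callback.
 */
static void iflush_done(struct inode_item *items, int nr)
{
        for (int i = 0; i < nr; i++) {
                if (items[i].flags & XFS_ISTALE) {
                        iflush_abort(&items[i]);
                        continue;
                }
                iflush_complete(&items[i]);
        }
}

int main(void)
{
        struct inode_item items[] = {
                { "inode A", 0 },
                { "inode B", XFS_ISTALE },
                { "inode C", 0 },
        };

        iflush_done(items, 3);
        return 0;
}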

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_buf_item.c   | 35 ++++++++++++++++++----
 fs/xfs/xfs_inode.c      |  6 ++--
 fs/xfs/xfs_inode_item.c | 65 ++++++++++++++---------------------------
 fs/xfs/xfs_inode_item.h |  5 ++--
 4 files changed, 56 insertions(+), 55 deletions(-)

diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 5b3cd5e90947c..a4e416af5c614 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -13,6 +13,8 @@
 #include "xfs_mount.h"
 #include "xfs_trans.h"
 #include "xfs_buf_item.h"
+#include "xfs_inode.h"
+#include "xfs_inode_item.h"
 #include "xfs_trans_priv.h"
 #include "xfs_trace.h"
 #include "xfs_log.h"
@@ -457,7 +459,8 @@ xfs_buf_item_unpin(
 		 * the AIL lock.
 		 */
 		if (bip->bli_flags & XFS_BLI_STALE_INODE) {
-			xfs_buf_do_callbacks(bp);
+			lip->li_cb(bp, lip);
+			xfs_iflush_done(bp);
 			bp->b_log_item = NULL;
 		} else {
 			xfs_trans_ail_delete(lip, SHUTDOWN_LOG_IO_ERROR);
@@ -1141,8 +1144,8 @@ xfs_buf_iodone_callback_error(
 	return false;
 }
 
-static void
-xfs_buf_run_callbacks(
+static inline bool
+xfs_buf_had_callback_errors(
 	struct xfs_buf		*bp)
 {
 
@@ -1152,7 +1155,7 @@ xfs_buf_run_callbacks(
 	 * appropriate action.
 	 */
 	if (bp->b_error && xfs_buf_iodone_callback_error(bp))
-		return;
+		return true;
 
 	/*
 	 * Successful IO or permanent error. Either way, we can clear the
@@ -1161,7 +1164,16 @@ xfs_buf_run_callbacks(
 	bp->b_last_error = 0;
 	bp->b_retries = 0;
 	bp->b_first_retry_time = 0;
+	return false;
+}
 
+static void
+xfs_buf_run_callbacks(
+	struct xfs_buf		*bp)
+{
+
+	if (xfs_buf_had_callback_errors(bp))
+		return;
 	xfs_buf_do_callbacks(bp);
 	bp->b_log_item = NULL;
 }
@@ -1173,7 +1185,20 @@ void
 xfs_buf_inode_iodone(
 	struct xfs_buf		*bp)
 {
-	xfs_buf_run_callbacks(bp);
+	struct xfs_buf_log_item *blip = bp->b_log_item;
+	struct xfs_log_item	*lip;
+
+	if (xfs_buf_had_callback_errors(bp))
+		return;
+
+	/* If there is a buf_log_item attached, run its callback */
+	if (blip) {
+		lip = &blip->bli_item;
+		lip->li_cb(bp, lip);
+		bp->b_log_item = NULL;
+	}
+
+	xfs_iflush_done(bp);
 	xfs_buf_ioend_finish(bp);
 }
 
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index d5dee57f914a9..1b4e8e0bb0cf0 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2677,7 +2677,6 @@ xfs_ifree_cluster(
 		list_for_each_entry(lip, &bp->b_li_list, li_bio_list) {
 			if (lip->li_type == XFS_LI_INODE) {
 				iip = (struct xfs_inode_log_item *)lip;
-				lip->li_cb = xfs_istale_done;
 				xfs_trans_ail_copy_lsn(mp->m_ail,
 							&iip->ili_flush_lsn,
 							&iip->ili_item.li_lsn);
@@ -2710,8 +2709,7 @@ xfs_ifree_cluster(
 			xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,
 						&iip->ili_item.li_lsn);
 
-			xfs_buf_attach_iodone(bp, xfs_istale_done,
-						  &iip->ili_item);
+			xfs_buf_attach_iodone(bp, NULL, &iip->ili_item);
 
 			if (ip != free_ip)
 				xfs_iunlock(ip, XFS_ILOCK_EXCL);
@@ -3861,7 +3859,7 @@ xfs_iflush_int(
 	 * the flush lock.
 	 */
 	bp->b_flags |= _XBF_INODES;
-	xfs_buf_attach_iodone(bp, xfs_iflush_done, &iip->ili_item);
+	xfs_buf_attach_iodone(bp, NULL, &iip->ili_item);
 
 	/* generate the checksum. */
 	xfs_dinode_calc_crc(mp, dip);
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 6ef9cbcfc94a7..7049f2ae8d186 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -668,40 +668,34 @@ xfs_inode_item_destroy(
  */
 void
 xfs_iflush_done(
-	struct xfs_buf		*bp,
-	struct xfs_log_item	*lip)
+	struct xfs_buf		*bp)
 {
 	struct xfs_inode_log_item *iip;
-	struct xfs_log_item	*blip, *n;
-	struct xfs_ail		*ailp = lip->li_ailp;
+	struct xfs_log_item	*lip, *n;
+	struct xfs_ail		*ailp = bp->b_mount->m_ail;
 	int			need_ail = 0;
 	LIST_HEAD(tmp);
 
 	/*
-	 * Scan the buffer IO completions for other inodes being completed and
-	 * attach them to the current inode log item.
+	 * Pull the attached inodes from the buffer one at a time and take the
+	 * appropriate action on them.
 	 */
-
-	list_add_tail(&lip->li_bio_list, &tmp);
-
-	list_for_each_entry_safe(blip, n, &bp->b_li_list, li_bio_list) {
-		if (lip->li_cb != xfs_iflush_done)
+	list_for_each_entry_safe(lip, n, &bp->b_li_list, li_bio_list) {
+		iip = INODE_ITEM(lip);
+		if (xfs_iflags_test(iip->ili_inode, XFS_ISTALE)) {
+			list_del_init(&lip->li_bio_list);
+			xfs_iflush_abort(iip->ili_inode);
 			continue;
+		}
 
-		list_move_tail(&blip->li_bio_list, &tmp);
+		list_move_tail(&lip->li_bio_list, &tmp);
 
 		/* Do an unlocked check for needing the AIL lock. */
-		iip = INODE_ITEM(blip);
-		if (blip->li_lsn == iip->ili_flush_lsn ||
-		    test_bit(XFS_LI_FAILED, &blip->li_flags))
+		if (lip->li_lsn == iip->ili_flush_lsn ||
+		    test_bit(XFS_LI_FAILED, &lip->li_flags))
 			need_ail++;
 	}
-
-	/* make sure we capture the state of the initial inode. */
-	iip = INODE_ITEM(lip);
-	if (lip->li_lsn == iip->ili_flush_lsn ||
-	    test_bit(XFS_LI_FAILED, &lip->li_flags))
-		need_ail++;
+	ASSERT(list_empty(&bp->b_li_list));
 
 	/*
 	 * We only want to pull the item from the AIL if it is actually there
@@ -713,19 +707,13 @@ xfs_iflush_done(
 
 		/* this is an opencoded batch version of xfs_trans_ail_delete */
 		spin_lock(&ailp->ail_lock);
-		list_for_each_entry(blip, &tmp, li_bio_list) {
-			if (blip->li_lsn == INODE_ITEM(blip)->ili_flush_lsn) {
-				/*
-				 * xfs_ail_update_finish() only cares about the
-				 * lsn of the first tail item removed, any
-				 * others will be at the same or higher lsn so
-				 * we just ignore them.
-				 */
-				xfs_lsn_t lsn = xfs_ail_delete_one(ailp, blip);
+		list_for_each_entry(lip, &tmp, li_bio_list) {
+			if (lip->li_lsn == INODE_ITEM(lip)->ili_flush_lsn) {
+				xfs_lsn_t lsn = xfs_ail_delete_one(ailp, lip);
 				if (!tail_lsn && lsn)
 					tail_lsn = lsn;
 			} else {
-				xfs_clear_li_failed(blip);
+				xfs_clear_li_failed(lip);
 			}
 		}
 		xfs_ail_update_finish(ailp, tail_lsn);
@@ -736,9 +724,9 @@ xfs_iflush_done(
 	 * ili_last_fields bits now that we know that the data corresponding to
 	 * them is safely on disk.
 	 */
-	list_for_each_entry_safe(blip, n, &tmp, li_bio_list) {
-		list_del_init(&blip->li_bio_list);
-		iip = INODE_ITEM(blip);
+	list_for_each_entry_safe(lip, n, &tmp, li_bio_list) {
+		list_del_init(&lip->li_bio_list);
+		iip = INODE_ITEM(lip);
 
 		spin_lock(&iip->ili_lock);
 		iip->ili_last_fields = 0;
@@ -746,7 +734,6 @@ xfs_iflush_done(
 
 		xfs_ifunlock(iip->ili_inode);
 	}
-	list_del(&tmp);
 }
 
 /*
@@ -779,14 +766,6 @@ xfs_iflush_abort(
 	xfs_ifunlock(ip);
 }
 
-void
-xfs_istale_done(
-	struct xfs_buf		*bp,
-	struct xfs_log_item	*lip)
-{
-	xfs_iflush_abort(INODE_ITEM(lip)->ili_inode);
-}
-
 /*
  * convert an xfs_inode_log_format struct from the old 32 bit version
  * (which can have different field alignments) to the native 64 bit version
diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h
index 44c47c08b0b59..1545fccad4eeb 100644
--- a/fs/xfs/xfs_inode_item.h
+++ b/fs/xfs/xfs_inode_item.h
@@ -36,15 +36,14 @@ struct xfs_inode_log_item {
 	xfs_lsn_t		ili_last_lsn;	   /* lsn at last transaction */
 };
 
-static inline int xfs_inode_clean(xfs_inode_t *ip)
+static inline int xfs_inode_clean(struct xfs_inode *ip)
 {
 	return !ip->i_itemp || !(ip->i_itemp->ili_fields & XFS_ILOG_ALL);
 }
 
 extern void xfs_inode_item_init(struct xfs_inode *, struct xfs_mount *);
 extern void xfs_inode_item_destroy(struct xfs_inode *);
-extern void xfs_iflush_done(struct xfs_buf *, struct xfs_log_item *);
-extern void xfs_istale_done(struct xfs_buf *, struct xfs_log_item *);
+extern void xfs_iflush_done(struct xfs_buf *);
 extern void xfs_iflush_abort(struct xfs_inode *);
 extern int xfs_inode_item_format_convert(xfs_log_iovec_t *,
 					 struct xfs_inode_log_format *);
-- 
2.26.2.761.g0e0b3e54be



* [PATCH 10/30] xfs: use direct calls for dquot IO completion
  2020-06-01 21:42 [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (8 preceding siblings ...)
  2020-06-01 21:42 ` [PATCH 09/30] xfs: make inode IO completion buffer centric Dave Chinner
@ 2020-06-01 21:42 ` Dave Chinner
  2020-06-02 19:25   ` Darrick J. Wong
  2020-06-03 14:58   ` Brian Foster
  2020-06-01 21:42 ` [PATCH 11/30] xfs: clean up the buffer iodone callback functions Dave Chinner
                   ` (19 subsequent siblings)
  29 siblings, 2 replies; 80+ messages in thread
From: Dave Chinner @ 2020-06-01 21:42 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Similar to inodes, we can call the dquot IO completion functions
directly from the buffer completion code, removing another user of
log item callbacks for IO completion processing.
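
The completion loop follows the same detach-then-complete pattern as
the inode case. A simplified sketch with invented names (the real
code uses list_for_each_entry_safe() and list_del_init()): each item
must be off the buffer list before its completion runs, since the
completion may free or requeue it:

#include <stdio.h>
#include <stddef.h>

struct dq_item {
        struct dq_item  *next;
        const char      *name;
};

static void dqflush_done(struct dq_item *qip)
{
        printf("%s: remove from AIL, release flush lock\n", qip->name);
}

static void dquot_done(struct dq_item **list)
{
        while (*list) {
                struct dq_item *qip = *list;

                *list = qip->next;      /* detach first... */
                qip->next = NULL;
                dqflush_done(qip);      /* ...then complete */
        }
}

int main(void)
{
        struct dq_item b = { NULL, "dquot B" };
        struct dq_item a = { &b, "dquot A" };
        struct dq_item *list = &a;

        dquot_done(&list);
        return 0;
}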

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_buf_item.c | 18 +++++++++++++++++-
 fs/xfs/xfs_dquot.c    | 18 ++++++++++++++----
 fs/xfs/xfs_dquot.h    |  1 +
 3 files changed, 32 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index a4e416af5c614..f46e5ec28111c 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -15,6 +15,9 @@
 #include "xfs_buf_item.h"
 #include "xfs_inode.h"
 #include "xfs_inode_item.h"
+#include "xfs_quota.h"
+#include "xfs_dquot_item.h"
+#include "xfs_dquot.h"
 #include "xfs_trans_priv.h"
 #include "xfs_trace.h"
 #include "xfs_log.h"
@@ -1209,7 +1212,20 @@ void
 xfs_buf_dquot_iodone(
 	struct xfs_buf		*bp)
 {
-	xfs_buf_run_callbacks(bp);
+	struct xfs_buf_log_item *blip = bp->b_log_item;
+	struct xfs_log_item	*lip;
+
+	if (xfs_buf_had_callback_errors(bp))
+		return;
+
+	/* a newly allocated dquot buffer might have a log item attached */
+	if (blip) {
+		lip = &blip->bli_item;
+		lip->li_cb(bp, lip);
+		bp->b_log_item = NULL;
+	}
+
+	xfs_dquot_done(bp);
 	xfs_buf_ioend_finish(bp);
 }
 
diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
index 2e2146fa0914c..403bc4e9f21ff 100644
--- a/fs/xfs/xfs_dquot.c
+++ b/fs/xfs/xfs_dquot.c
@@ -1048,9 +1048,8 @@ xfs_qm_dqrele(
  * from the AIL if it has not been re-logged, and unlocking the dquot's
  * flush lock. This behavior is very similar to that of inodes..
  */
-STATIC void
+static void
 xfs_qm_dqflush_done(
-	struct xfs_buf		*bp,
 	struct xfs_log_item	*lip)
 {
 	struct xfs_dq_logitem	*qip = (struct xfs_dq_logitem *)lip;
@@ -1091,6 +1090,18 @@ xfs_qm_dqflush_done(
 	xfs_dqfunlock(dqp);
 }
 
+void
+xfs_dquot_done(
+	struct xfs_buf		*bp)
+{
+	struct xfs_log_item	*lip, *n;
+
+	list_for_each_entry_safe(lip, n, &bp->b_li_list, li_bio_list) {
+		list_del_init(&lip->li_bio_list);
+		xfs_qm_dqflush_done(lip);
+	}
+}
+
 /*
  * Write a modified dquot to disk.
  * The dquot must be locked and the flush lock too taken by caller.
@@ -1180,8 +1191,7 @@ xfs_qm_dqflush(
 	 * AIL and release the flush lock once the dquot is synced to disk.
 	 */
 	bp->b_flags |= _XBF_DQUOTS;
-	xfs_buf_attach_iodone(bp, xfs_qm_dqflush_done,
-				  &dqp->q_logitem.qli_item);
+	xfs_buf_attach_iodone(bp, NULL, &dqp->q_logitem.qli_item);
 
 	/*
 	 * If the buffer is pinned then push on the log so we won't
diff --git a/fs/xfs/xfs_dquot.h b/fs/xfs/xfs_dquot.h
index 71e36c85e20b6..fe9cc3e08ed6d 100644
--- a/fs/xfs/xfs_dquot.h
+++ b/fs/xfs/xfs_dquot.h
@@ -174,6 +174,7 @@ void		xfs_qm_dqput(struct xfs_dquot *dqp);
 void		xfs_dqlock2(struct xfs_dquot *, struct xfs_dquot *);
 
 void		xfs_dquot_set_prealloc_limits(struct xfs_dquot *);
+void		xfs_dquot_done(struct xfs_buf *);
 
 static inline struct xfs_dquot *xfs_qm_dqhold(struct xfs_dquot *dqp)
 {
-- 
2.26.2.761.g0e0b3e54be



* [PATCH 11/30] xfs: clean up the buffer iodone callback functions
  2020-06-01 21:42 [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (9 preceding siblings ...)
  2020-06-01 21:42 ` [PATCH 10/30] xfs: use direct calls for dquot IO completion Dave Chinner
@ 2020-06-01 21:42 ` Dave Chinner
  2020-06-03 14:58   ` Brian Foster
  2020-06-01 21:42 ` [PATCH 12/30] xfs: get rid of log item callbacks Dave Chinner
                   ` (18 subsequent siblings)
  29 siblings, 1 reply; 80+ messages in thread
From: Dave Chinner @ 2020-06-01 21:42 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Now that we've sorted inode and dquot buffers, we can apply the same
cleanups to dirty buffers with buffer log items. They only have one
callback, too, so we don't need the log item callback. Collapse the
iodone functions and remove all the now unnecessary infrastructure
around callback processing.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_buf_item.c  | 140 +++++++++--------------------------------
 fs/xfs/xfs_buf_item.h  |   1 -
 fs/xfs/xfs_trans_buf.c |   2 -
 3 files changed, 29 insertions(+), 114 deletions(-)

diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index f46e5ec28111c..0ece5de9dd711 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -30,7 +30,7 @@ static inline struct xfs_buf_log_item *BUF_ITEM(struct xfs_log_item *lip)
 	return container_of(lip, struct xfs_buf_log_item, bli_item);
 }
 
-STATIC void	xfs_buf_do_callbacks(struct xfs_buf *bp);
+static void xfs_buf_item_done(struct xfs_buf *bp);
 
 /* Is this log iovec plausibly large enough to contain the buffer log format? */
 bool
@@ -462,9 +462,8 @@ xfs_buf_item_unpin(
 		 * the AIL lock.
 		 */
 		if (bip->bli_flags & XFS_BLI_STALE_INODE) {
-			lip->li_cb(bp, lip);
+			xfs_buf_item_done(bp);
 			xfs_iflush_done(bp);
-			bp->b_log_item = NULL;
 		} else {
 			xfs_trans_ail_delete(lip, SHUTDOWN_LOG_IO_ERROR);
 			xfs_buf_item_relse(bp);
@@ -973,46 +972,6 @@ xfs_buf_attach_iodone(
 	list_add_tail(&lip->li_bio_list, &bp->b_li_list);
 }
 
-/*
- * We can have many callbacks on a buffer. Running the callbacks individually
- * can cause a lot of contention on the AIL lock, so we allow for a single
- * callback to be able to scan the remaining items in bp->b_li_list for other
- * items of the same type and callback to be processed in the first call.
- *
- * As a result, the loop walking the callback list below will also modify the
- * list. it removes the first item from the list and then runs the callback.
- * The loop then restarts from the new first item int the list. This allows the
- * callback to scan and modify the list attached to the buffer and we don't
- * have to care about maintaining a next item pointer.
- */
-STATIC void
-xfs_buf_do_callbacks(
-	struct xfs_buf		*bp)
-{
-	struct xfs_buf_log_item *blip = bp->b_log_item;
-	struct xfs_log_item	*lip;
-
-	/* If there is a buf_log_item attached, run its callback */
-	if (blip) {
-		lip = &blip->bli_item;
-		lip->li_cb(bp, lip);
-	}
-
-	while (!list_empty(&bp->b_li_list)) {
-		lip = list_first_entry(&bp->b_li_list, struct xfs_log_item,
-				       li_bio_list);
-
-		/*
-		 * Remove the item from the list, so we don't have any
-		 * confusion if the item is added to another buf.
-		 * Don't touch the log item after calling its
-		 * callback, because it could have freed itself.
-		 */
-		list_del_init(&lip->li_bio_list);
-		lip->li_cb(bp, lip);
-	}
-}
-
 /*
  * Invoke the error state callback for each log item affected by the failed I/O.
  *
@@ -1025,8 +984,8 @@ STATIC void
 xfs_buf_do_callbacks_fail(
 	struct xfs_buf		*bp)
 {
+	struct xfs_ail		*ailp = bp->b_mount->m_ail;
 	struct xfs_log_item	*lip;
-	struct xfs_ail		*ailp;
 
 	/*
 	 * Buffer log item errors are handled directly by xfs_buf_item_push()
@@ -1036,9 +995,6 @@ xfs_buf_do_callbacks_fail(
 	if (list_empty(&bp->b_li_list))
 		return;
 
-	lip = list_first_entry(&bp->b_li_list, struct xfs_log_item,
-			li_bio_list);
-	ailp = lip->li_ailp;
 	spin_lock(&ailp->ail_lock);
 	list_for_each_entry(lip, &bp->b_li_list, li_bio_list) {
 		if (lip->li_ops->iop_error)
@@ -1051,22 +1007,11 @@ static bool
 xfs_buf_iodone_callback_error(
 	struct xfs_buf		*bp)
 {
-	struct xfs_buf_log_item	*bip = bp->b_log_item;
-	struct xfs_log_item	*lip;
-	struct xfs_mount	*mp;
+	struct xfs_mount	*mp = bp->b_mount;
 	static ulong		lasttime;
 	static xfs_buftarg_t	*lasttarg;
 	struct xfs_error_cfg	*cfg;
 
-	/*
-	 * The failed buffer might not have a buf_log_item attached or the
-	 * log_item list might be empty. Get the mp from the available
-	 * xfs_log_item
-	 */
-	lip = list_first_entry_or_null(&bp->b_li_list, struct xfs_log_item,
-				       li_bio_list);
-	mp = lip ? lip->li_mountp : bip->bli_item.li_mountp;
-
 	/*
 	 * If we've already decided to shutdown the filesystem because of
 	 * I/O errors, there's no point in giving this a retry.
@@ -1171,14 +1116,27 @@ xfs_buf_had_callback_errors(
 }
 
 static void
-xfs_buf_run_callbacks(
+xfs_buf_item_done(
 	struct xfs_buf		*bp)
 {
+	struct xfs_buf_log_item	*bip = bp->b_log_item;
 
-	if (xfs_buf_had_callback_errors(bp))
+	if (!bip)
 		return;
-	xfs_buf_do_callbacks(bp);
+
+	/*
+	 * If we are forcibly shutting down, this may well be off the AIL
+	 * already. That's because we simulate the log-committed callbacks to
+	 * unpin these buffers. Or we may never have put this item on AIL
+	 * because the transaction was aborted forcibly.
+	 * xfs_trans_ail_delete() takes care of these.
+	 *
+	 * Either way, AIL is useless if we're forcing a shutdown.
+	 */
+	xfs_trans_ail_delete(&bip->bli_item, SHUTDOWN_CORRUPT_INCORE);
 	bp->b_log_item = NULL;
+	xfs_buf_item_free(bip);
+	xfs_buf_rele(bp);
 }
 
 /*
@@ -1188,19 +1146,10 @@ void
 xfs_buf_inode_iodone(
 	struct xfs_buf		*bp)
 {
-	struct xfs_buf_log_item *blip = bp->b_log_item;
-	struct xfs_log_item	*lip;
-
 	if (xfs_buf_had_callback_errors(bp))
 		return;
 
-	/* If there is a buf_log_item attached, run its callback */
-	if (blip) {
-		lip = &blip->bli_item;
-		lip->li_cb(bp, lip);
-		bp->b_log_item = NULL;
-	}
-
+	xfs_buf_item_done(bp);
 	xfs_iflush_done(bp);
 	xfs_buf_ioend_finish(bp);
 }
@@ -1212,59 +1161,28 @@ void
 xfs_buf_dquot_iodone(
 	struct xfs_buf		*bp)
 {
-	struct xfs_buf_log_item *blip = bp->b_log_item;
-	struct xfs_log_item	*lip;
-
 	if (xfs_buf_had_callback_errors(bp))
 		return;
 
 	/* a newly allocated dquot buffer might have a log item attached */
-	if (blip) {
-		lip = &blip->bli_item;
-		lip->li_cb(bp, lip);
-		bp->b_log_item = NULL;
-	}
-
+	xfs_buf_item_done(bp);
 	xfs_dquot_done(bp);
 	xfs_buf_ioend_finish(bp);
 }
 
 /*
  * Dirty buffer iodone callback function.
+ *
+ * Note that for things like remote attribute buffers, there may not be a buffer
+ * log item here, so processing the buffer log item must remain optional.
  */
 void
 xfs_buf_iodone(
 	struct xfs_buf		*bp)
 {
-	xfs_buf_run_callbacks(bp);
-	xfs_buf_ioend_finish(bp);
-}
-
-/*
- * This is the iodone() function for buffers which have been
- * logged.  It is called when they are eventually flushed out.
- * It should remove the buf item from the AIL, and free the buf item.
- * It is called by xfs_buf_iodone_callbacks() above which will take
- * care of cleaning up the buffer itself.
- */
-void
-xfs_buf_item_iodone(
-	struct xfs_buf		*bp,
-	struct xfs_log_item	*lip)
-{
-	ASSERT(BUF_ITEM(lip)->bli_buf == bp);
-
-	xfs_buf_rele(bp);
+	if (xfs_buf_had_callback_errors(bp))
+		return;
 
-	/*
-	 * If we are forcibly shutting down, this may well be off the AIL
-	 * already. That's because we simulate the log-committed callbacks to
-	 * unpin these buffers. Or we may never have put this item on AIL
-	 * because of the transaction was aborted forcibly.
-	 * xfs_trans_ail_delete() takes care of these.
-	 *
-	 * Either way, AIL is useless if we're forcing a shutdown.
-	 */
-	xfs_trans_ail_delete(lip, SHUTDOWN_CORRUPT_INCORE);
-	xfs_buf_item_free(BUF_ITEM(lip));
+	xfs_buf_item_done(bp);
+	xfs_buf_ioend_finish(bp);
 }
diff --git a/fs/xfs/xfs_buf_item.h b/fs/xfs/xfs_buf_item.h
index 610cd00193289..7c0bd2a210aff 100644
--- a/fs/xfs/xfs_buf_item.h
+++ b/fs/xfs/xfs_buf_item.h
@@ -57,7 +57,6 @@ bool	xfs_buf_item_dirty_format(struct xfs_buf_log_item *);
 void	xfs_buf_attach_iodone(struct xfs_buf *,
 			      void(*)(struct xfs_buf *, struct xfs_log_item *),
 			      struct xfs_log_item *);
-void	xfs_buf_item_iodone(struct xfs_buf *, struct xfs_log_item *);
 void	xfs_buf_inode_iodone(struct xfs_buf *);
 void	xfs_buf_dquot_iodone(struct xfs_buf *);
 void	xfs_buf_iodone(struct xfs_buf *);
diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c
index 6752676b94fe7..11cd666cd99a6 100644
--- a/fs/xfs/xfs_trans_buf.c
+++ b/fs/xfs/xfs_trans_buf.c
@@ -475,7 +475,6 @@ xfs_trans_dirty_buf(
 	bp->b_flags |= XBF_DONE;
 
 	ASSERT(atomic_read(&bip->bli_refcount) > 0);
-	bip->bli_item.li_cb = xfs_buf_item_iodone;
 
 	/*
 	 * If we invalidated the buffer within this transaction, then
@@ -644,7 +643,6 @@ xfs_trans_stale_inode_buf(
 	ASSERT(atomic_read(&bip->bli_refcount) > 0);
 
 	bip->bli_flags |= XFS_BLI_STALE_INODE;
-	bip->bli_item.li_cb = xfs_buf_item_iodone;
 	bp->b_flags |= _XBF_INODES;
 	xfs_trans_buf_set_type(tp, bp, XFS_BLFT_DINO_BUF);
 }
-- 
2.26.2.761.g0e0b3e54be



* [PATCH 12/30] xfs: get rid of log item callbacks
  2020-06-01 21:42 [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (10 preceding siblings ...)
  2020-06-01 21:42 ` [PATCH 11/30] xfs: clean up the buffer iodone callback functions Dave Chinner
@ 2020-06-01 21:42 ` Dave Chinner
  2020-06-03 14:58   ` Brian Foster
  2020-06-01 21:42 ` [PATCH 13/30] xfs: handle buffer log item IO errors directly Dave Chinner
                   ` (17 subsequent siblings)
  29 siblings, 1 reply; 80+ messages in thread
From: Dave Chinner @ 2020-06-01 21:42 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

They are not used anymore, so remove them from the log item and the
buffer iodone attachment interfaces.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_buf_item.c | 17 -----------------
 fs/xfs/xfs_buf_item.h |  3 ---
 fs/xfs/xfs_dquot.c    |  6 +++---
 fs/xfs/xfs_inode.c    |  5 +++--
 fs/xfs/xfs_trans.h    |  4 ----
 5 files changed, 6 insertions(+), 29 deletions(-)

diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 0ece5de9dd711..09bfe9c52dbdb 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -955,23 +955,6 @@ xfs_buf_item_relse(
 	xfs_buf_item_free(bip);
 }
 
-
-/*
- * Add the given log item with its callback to the list of callbacks
- * to be called when the buffer's I/O completes.
- */
-void
-xfs_buf_attach_iodone(
-	struct xfs_buf		*bp,
-	void			(*cb)(struct xfs_buf *, struct xfs_log_item *),
-	struct xfs_log_item	*lip)
-{
-	ASSERT(xfs_buf_islocked(bp));
-
-	lip->li_cb = cb;
-	list_add_tail(&lip->li_bio_list, &bp->b_li_list);
-}
-
 /*
  * Invoke the error state callback for each log item affected by the failed I/O.
  *
diff --git a/fs/xfs/xfs_buf_item.h b/fs/xfs/xfs_buf_item.h
index 7c0bd2a210aff..23507cbb4c413 100644
--- a/fs/xfs/xfs_buf_item.h
+++ b/fs/xfs/xfs_buf_item.h
@@ -54,9 +54,6 @@ void	xfs_buf_item_relse(struct xfs_buf *);
 bool	xfs_buf_item_put(struct xfs_buf_log_item *);
 void	xfs_buf_item_log(struct xfs_buf_log_item *, uint, uint);
 bool	xfs_buf_item_dirty_format(struct xfs_buf_log_item *);
-void	xfs_buf_attach_iodone(struct xfs_buf *,
-			      void(*)(struct xfs_buf *, struct xfs_log_item *),
-			      struct xfs_log_item *);
 void	xfs_buf_inode_iodone(struct xfs_buf *);
 void	xfs_buf_dquot_iodone(struct xfs_buf *);
 void	xfs_buf_iodone(struct xfs_buf *);
diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
index 403bc4e9f21ff..d5984a926d1d0 100644
--- a/fs/xfs/xfs_dquot.c
+++ b/fs/xfs/xfs_dquot.c
@@ -1187,11 +1187,11 @@ xfs_qm_dqflush(
 	}
 
 	/*
-	 * Attach an iodone routine so that we can remove this dquot from the
-	 * AIL and release the flush lock once the dquot is synced to disk.
+	 * Attach the dquot to the buffer so that we can remove this dquot from
+	 * the AIL and release the flush lock once the dquot is synced to disk.
 	 */
 	bp->b_flags |= _XBF_DQUOTS;
-	xfs_buf_attach_iodone(bp, NULL, &dqp->q_logitem.qli_item);
+	list_add_tail(&dqp->q_logitem.qli_item.li_bio_list, &bp->b_li_list);
 
 	/*
 	 * If the buffer is pinned then push on the log so we won't
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 1b4e8e0bb0cf0..272b54cf97000 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2709,7 +2709,8 @@ xfs_ifree_cluster(
 			xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,
 						&iip->ili_item.li_lsn);
 
-			xfs_buf_attach_iodone(bp, NULL, &iip->ili_item);
+			list_add_tail(&iip->ili_item.li_bio_list,
+						&bp->b_li_list);
 
 			if (ip != free_ip)
 				xfs_iunlock(ip, XFS_ILOCK_EXCL);
@@ -3859,7 +3860,7 @@ xfs_iflush_int(
 	 * the flush lock.
 	 */
 	bp->b_flags |= _XBF_INODES;
-	xfs_buf_attach_iodone(bp, NULL, &iip->ili_item);
+	list_add_tail(&iip->ili_item.li_bio_list, &bp->b_li_list);
 
 	/* generate the checksum. */
 	xfs_dinode_calc_crc(mp, dip);
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index 8308bf6d7e404..99a9ab9cab25b 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -37,10 +37,6 @@ struct xfs_log_item {
 	unsigned long			li_flags;	/* misc flags */
 	struct xfs_buf			*li_buf;	/* real buffer pointer */
 	struct list_head		li_bio_list;	/* buffer item list */
-	void				(*li_cb)(struct xfs_buf *,
-						 struct xfs_log_item *);
-							/* buffer item iodone */
-							/* callback func */
 	const struct xfs_item_ops	*li_ops;	/* function list */
 
 	/* delayed logging */
-- 
2.26.2.761.g0e0b3e54be



* [PATCH 13/30] xfs: handle buffer log item IO errors directly
  2020-06-01 21:42 [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (11 preceding siblings ...)
  2020-06-01 21:42 ` [PATCH 12/30] xfs: get rid of log item callbacks Dave Chinner
@ 2020-06-01 21:42 ` Dave Chinner
  2020-06-02 20:39   ` Darrick J. Wong
  2020-06-03 15:02   ` Brian Foster
  2020-06-01 21:42 ` [PATCH 14/30] xfs: unwind log item error flagging Dave Chinner
                   ` (16 subsequent siblings)
  29 siblings, 2 replies; 80+ messages in thread
From: Dave Chinner @ 2020-06-01 21:42 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Currently when a buffer with attached log items has an IO error it
calls ->iop_error for each attached log item. These all call
xfs_set_li_failed() to handle the error, but we are about to change
the way log items manage buffers. Hence we first need to remove the
per-item dependency on buffer handling done by xfs_set_li_failed().

We already have specific buffer type IO completion routines, so move
the log item error handling out of the generic error handling and
into the log item specific functions so we can implement per-type
error handling easily.

This requires a more complex return value from the error handling
code so that we can take the correct action the failure handling
requires.  This results in some repeated boilerplate in the
functions, but that can be cleaned up later once all the changes
cascade through this code.
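
A compilable sketch of the resulting calling convention - the names
and the retry policy here are illustrative only - showing how each
iodone function consumes the multi-state return value:

#include <stdio.h>

/* Multi-state return values, modelled on the patch: 0, 1 and 2. */
enum { IOERR_FINISH = 0, IOERR_RESUBMITTED = 1, IOERR_TRANSIENT = 2 };

/* Toy error policy: always retry a first failure once. */
static int buf_iodone_error(int error, int retries)
{
        if (!error)
                return IOERR_FINISH;
        if (retries == 0)
                return IOERR_RESUBMITTED;
        return IOERR_TRANSIENT;
}

/* The boilerplate each buffer type iodone function repeats. */
static void buf_type_iodone(int error, int retries)
{
        switch (buf_iodone_error(error, retries)) {
        case IOERR_FINISH:
                puts("clear retry state, run completions");
                break;
        case IOERR_RESUBMITTED:
                puts("resubmitted, nothing more to do");
                break;
        case IOERR_TRANSIENT:
                puts("run failure callbacks, release buffer");
                break;
        }
}

int main(void)
{
        buf_type_iodone(0, 0);  /* success */
        buf_type_iodone(5, 0);  /* first EIO: resubmit */
        buf_type_iodone(5, 1);  /* repeated EIO: transient handling */
        return 0;
}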

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_buf_item.c | 167 ++++++++++++++++++++++++++++--------------
 1 file changed, 112 insertions(+), 55 deletions(-)

diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 09bfe9c52dbdb..b6995719e877b 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -987,20 +987,18 @@ xfs_buf_do_callbacks_fail(
 }
 
 static bool
-xfs_buf_iodone_callback_error(
+xfs_buf_ioerror_sync(
 	struct xfs_buf		*bp)
 {
 	struct xfs_mount	*mp = bp->b_mount;
 	static ulong		lasttime;
 	static xfs_buftarg_t	*lasttarg;
-	struct xfs_error_cfg	*cfg;
-
 	/*
 	 * If we've already decided to shutdown the filesystem because of
 	 * I/O errors, there's no point in giving this a retry.
 	 */
 	if (XFS_FORCED_SHUTDOWN(mp))
-		goto out_stale;
+		return true;
 
 	if (bp->b_target != lasttarg ||
 	    time_after(jiffies, (lasttime + 5*HZ))) {
@@ -1011,19 +1009,15 @@ xfs_buf_iodone_callback_error(
 
 	/* synchronous writes will have callers process the error */
 	if (!(bp->b_flags & XBF_ASYNC))
-		goto out_stale;
-
-	trace_xfs_buf_item_iodone_async(bp, _RET_IP_);
-
-	cfg = xfs_error_get_cfg(mp, XFS_ERR_METADATA, bp->b_error);
+		return true;
+	return false;
+}
 
-	/*
-	 * If the write was asynchronous then no one will be looking for the
-	 * error.  If this is the first failure of this type, clear the error
-	 * state and write the buffer out again. This means we always retry an
-	 * async write failure at least once, but we also need to set the buffer
-	 * up to behave correctly now for repeated failures.
-	 */
+static bool
+xfs_buf_ioerror_retry(
+	struct xfs_buf		*bp,
+	struct xfs_error_cfg	*cfg)
+{
 	if (!(bp->b_flags & (XBF_STALE | XBF_WRITE_FAIL)) ||
 	     bp->b_last_error != bp->b_error) {
 		bp->b_flags |= (XBF_WRITE | XBF_DONE | XBF_WRITE_FAIL);
@@ -1031,36 +1025,80 @@ xfs_buf_iodone_callback_error(
 		if (cfg->retry_timeout != XFS_ERR_RETRY_FOREVER &&
 		    !bp->b_first_retry_time)
 			bp->b_first_retry_time = jiffies;
-
-		xfs_buf_ioerror(bp, 0);
-		xfs_buf_submit(bp);
 		return true;
 	}
+	return false;
+}
 
-	/*
-	 * Repeated failure on an async write. Take action according to the
-	 * error configuration we have been set up to use.
-	 */
+static bool
+xfs_buf_ioerror_permanent(
+	struct xfs_buf		*bp,
+	struct xfs_error_cfg	*cfg)
+{
+	struct xfs_mount	*mp = bp->b_mount;
 
 	if (cfg->max_retries != XFS_ERR_RETRY_FOREVER &&
 	    ++bp->b_retries > cfg->max_retries)
-			goto permanent_error;
+			return true;
 	if (cfg->retry_timeout != XFS_ERR_RETRY_FOREVER &&
 	    time_after(jiffies, cfg->retry_timeout + bp->b_first_retry_time))
-			goto permanent_error;
+			return true;
 
 	/* At unmount we may treat errors differently */
 	if ((mp->m_flags & XFS_MOUNT_UNMOUNTING) && mp->m_fail_unmount)
+		return true;
+
+	return false;
+}
+
+/*
+ * On a sync write or shutdown we just want to stale the buffer and let the
+ * caller handle the error in bp->b_error appropriately.
+ *
+ * If the write was asynchronous then no one will be looking for the error.  If
+ * this is the first failure of this type, clear the error state and write the
+ * buffer out again. This means we always retry an async write failure at least
+ * once, but we also need to set the buffer up to behave correctly now for
+ * repeated failures.
+ *
+ * If we get repeated async write failures, then we take action according to the
+ * error configuration we have been set up to use.
+ *
+ * Multi-state return value:
+ *
+ * 0: clear IO error retry state and run callback completions
+ * 1: resubmitted immediately, do not run any completions
+ * 2: transient error, run failure callback completions and then
+ *    release the buffer
+ */
+static int
+xfs_buf_iodone_error(
+	struct xfs_buf		*bp)
+{
+	struct xfs_mount	*mp = bp->b_mount;
+	struct xfs_error_cfg	*cfg;
+
+	if (xfs_buf_ioerror_sync(bp))
+		goto out_stale;
+
+	trace_xfs_buf_item_iodone_async(bp, _RET_IP_);
+
+	cfg = xfs_error_get_cfg(mp, XFS_ERR_METADATA, bp->b_error);
+	if (xfs_buf_ioerror_retry(bp, cfg)) {
+		xfs_buf_ioerror(bp, 0);
+		xfs_buf_submit(bp);
+		return 1;
+	}
+
+	if (xfs_buf_ioerror_permanent(bp, cfg))
 		goto permanent_error;
 
 	/*
 	 * Still a transient error, run IO completion failure callbacks and let
 	 * the higher layers retry the buffer.
 	 */
-	xfs_buf_do_callbacks_fail(bp);
 	xfs_buf_ioerror(bp, 0);
-	xfs_buf_relse(bp);
-	return true;
+	return 2;
 
 	/*
 	 * Permanent error - we need to trigger a shutdown if we haven't already
@@ -1072,30 +1110,7 @@ xfs_buf_iodone_callback_error(
 	xfs_buf_stale(bp);
 	bp->b_flags |= XBF_DONE;
 	trace_xfs_buf_error_relse(bp, _RET_IP_);
-	return false;
-}
-
-static inline bool
-xfs_buf_had_callback_errors(
-	struct xfs_buf		*bp)
-{
-
-	/*
-	 * If there is an error, process it. Some errors require us to run
-	 * callbacks after failure processing is done so we detect that and take
-	 * appropriate action.
-	 */
-	if (bp->b_error && xfs_buf_iodone_callback_error(bp))
-		return true;
-
-	/*
-	 * Successful IO or permanent error. Either way, we can clear the
-	 * retry state here in preparation for the next error that may occur.
-	 */
-	bp->b_last_error = 0;
-	bp->b_retries = 0;
-	bp->b_first_retry_time = 0;
-	return false;
+	return 0;
 }
 
 static void
@@ -1122,6 +1137,15 @@ xfs_buf_item_done(
 	xfs_buf_rele(bp);
 }
 
+static inline void
+xfs_buf_clear_ioerror_retry_state(
+	struct xfs_buf		*bp)
+{
+	bp->b_last_error = 0;
+	bp->b_retries = 0;
+	bp->b_first_retry_time = 0;
+}
+
 /*
  * Inode buffer iodone callback function.
  */
@@ -1129,9 +1153,20 @@ void
 xfs_buf_inode_iodone(
 	struct xfs_buf		*bp)
 {
-	if (xfs_buf_had_callback_errors(bp))
+	if (bp->b_error) {
+		int ret = xfs_buf_iodone_error(bp);
+		if (!ret)
+			goto finish_iodone;
+		if (ret == 1)
+			return;
+		ASSERT(ret == 2);
+		xfs_buf_do_callbacks_fail(bp);
+		xfs_buf_relse(bp);
 		return;
+	}
 
+finish_iodone:
+	xfs_buf_clear_ioerror_retry_state(bp);
 	xfs_buf_item_done(bp);
 	xfs_iflush_done(bp);
 	xfs_buf_ioend_finish(bp);
@@ -1144,9 +1179,20 @@ void
 xfs_buf_dquot_iodone(
 	struct xfs_buf		*bp)
 {
-	if (xfs_buf_had_callback_errors(bp))
+	if (bp->b_error) {
+		int ret = xfs_buf_iodone_error(bp);
+		if (!ret)
+			goto finish_iodone;
+		if (ret == 1)
+			return;
+		ASSERT(ret == 2);
+		xfs_buf_do_callbacks_fail(bp);
+		xfs_buf_relse(bp);
 		return;
+	}
 
+finish_iodone:
+	xfs_buf_clear_ioerror_retry_state(bp);
 	/* a newly allocated dquot buffer might have a log item attached */
 	xfs_buf_item_done(bp);
 	xfs_dquot_done(bp);
@@ -1163,9 +1209,20 @@ void
 xfs_buf_iodone(
 	struct xfs_buf		*bp)
 {
-	if (xfs_buf_had_callback_errors(bp))
+	if (bp->b_error) {
+		int ret = xfs_buf_iodone_error(bp);
+		if (!ret)
+			goto finish_iodone;
+		if (ret == 1)
+			return;
+		ASSERT(ret == 2);
+		xfs_buf_do_callbacks_fail(bp);
+		xfs_buf_relse(bp);
 		return;
+	}
 
+finish_iodone:
+	xfs_buf_clear_ioerror_retry_state(bp);
 	xfs_buf_item_done(bp);
 	xfs_buf_ioend_finish(bp);
 }
-- 
2.26.2.761.g0e0b3e54be



* [PATCH 14/30] xfs: unwind log item error flagging
  2020-06-01 21:42 [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (12 preceding siblings ...)
  2020-06-01 21:42 ` [PATCH 13/30] xfs: handle buffer log item IO errors directly Dave Chinner
@ 2020-06-01 21:42 ` Dave Chinner
  2020-06-02 20:45   ` Darrick J. Wong
  2020-06-03 15:02   ` Brian Foster
  2020-06-01 21:42 ` [PATCH 15/30] xfs: move xfs_clear_li_failed out of xfs_ail_delete_one() Dave Chinner
                   ` (15 subsequent siblings)
  29 siblings, 2 replies; 80+ messages in thread
From: Dave Chinner @ 2020-06-01 21:42 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

When a buffer IO error occurs, we want to mark all the log items
attached to the buffer as failed. Open code the error handling loop
so that we can modify the flagging for the different types of
objects directly and independently of each other.

This also allows us to remove the ->iop_error method from the log
item operations.
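
An illustrative userspace sketch of the open-coded loop, with a
plain mutex standing in for the AIL lock and all names invented for
the example:

#include <pthread.h>
#include <stdio.h>

#define LI_FAILED       (1 << 0)

struct log_item {
        unsigned long   flags;
        const char      *name;
};

static pthread_mutex_t ail_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Take the AIL lock once and mark every attached item failed
 * directly, rather than bouncing through a per-item ->iop_error
 * method.
 */
static void buf_mark_items_failed(struct log_item *items, int nr)
{
        pthread_mutex_lock(&ail_lock);
        for (int i = 0; i < nr; i++) {
                items[i].flags |= LI_FAILED;
                printf("%s marked failed\n", items[i].name);
        }
        pthread_mutex_unlock(&ail_lock);
}

int main(void)
{
        struct log_item items[] = {
                { 0, "inode item A" },
                { 0, "inode item B" },
        };

        buf_mark_items_failed(items, 2);
        return 0;
}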

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_buf_item.c   | 48 ++++++++++++-----------------------------
 fs/xfs/xfs_dquot_item.c | 18 ----------------
 fs/xfs/xfs_inode_item.c | 18 ----------------
 fs/xfs/xfs_trans.h      |  1 -
 4 files changed, 14 insertions(+), 71 deletions(-)

diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index b6995719e877b..2364a9aa2d71a 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -12,6 +12,7 @@
 #include "xfs_bit.h"
 #include "xfs_mount.h"
 #include "xfs_trans.h"
+#include "xfs_trans_priv.h"
 #include "xfs_buf_item.h"
 #include "xfs_inode.h"
 #include "xfs_inode_item.h"
@@ -955,37 +956,6 @@ xfs_buf_item_relse(
 	xfs_buf_item_free(bip);
 }
 
-/*
- * Invoke the error state callback for each log item affected by the failed I/O.
- *
- * If a metadata buffer write fails with a non-permanent error, the buffer is
- * eventually resubmitted and so the completion callbacks are not run. The error
- * state may need to be propagated to the log items attached to the buffer,
- * however, so the next AIL push of the item knows hot to handle it correctly.
- */
-STATIC void
-xfs_buf_do_callbacks_fail(
-	struct xfs_buf		*bp)
-{
-	struct xfs_ail		*ailp = bp->b_mount->m_ail;
-	struct xfs_log_item	*lip;
-
-	/*
-	 * Buffer log item errors are handled directly by xfs_buf_item_push()
-	 * and xfs_buf_iodone_callback_error, and they have no IO error
-	 * callbacks. Check only for items in b_li_list.
-	 */
-	if (list_empty(&bp->b_li_list))
-		return;
-
-	spin_lock(&ailp->ail_lock);
-	list_for_each_entry(lip, &bp->b_li_list, li_bio_list) {
-		if (lip->li_ops->iop_error)
-			lip->li_ops->iop_error(lip, bp);
-	}
-	spin_unlock(&ailp->ail_lock);
-}
-
 static bool
 xfs_buf_ioerror_sync(
 	struct xfs_buf		*bp)
@@ -1154,13 +1124,18 @@ xfs_buf_inode_iodone(
 	struct xfs_buf		*bp)
 {
 	if (bp->b_error) {
+		struct xfs_log_item *lip;
 		int ret = xfs_buf_iodone_error(bp);
 		if (!ret)
 			goto finish_iodone;
 		if (ret == 1)
 			return;
 		ASSERT(ret == 2);
-		xfs_buf_do_callbacks_fail(bp);
+		spin_lock(&bp->b_mount->m_ail->ail_lock);
+		list_for_each_entry(lip, &bp->b_li_list, li_bio_list) {
+			xfs_set_li_failed(lip, bp);
+		}
+		spin_unlock(&bp->b_mount->m_ail->ail_lock);
 		xfs_buf_relse(bp);
 		return;
 	}
@@ -1180,13 +1155,18 @@ xfs_buf_dquot_iodone(
 	struct xfs_buf		*bp)
 {
 	if (bp->b_error) {
+		struct xfs_log_item *lip;
 		int ret = xfs_buf_iodone_error(bp);
 		if (!ret)
 			goto finish_iodone;
 		if (ret == 1)
 			return;
 		ASSERT(ret == 2);
-		xfs_buf_do_callbacks_fail(bp);
+		spin_lock(&bp->b_mount->m_ail->ail_lock);
+		list_for_each_entry(lip, &bp->b_li_list, li_bio_list) {
+			xfs_set_li_failed(lip, bp);
+		}
+		spin_unlock(&bp->b_mount->m_ail->ail_lock);
 		xfs_buf_relse(bp);
 		return;
 	}
@@ -1216,7 +1196,7 @@ xfs_buf_iodone(
 		if (ret == 1)
 			return;
 		ASSERT(ret == 2);
-		xfs_buf_do_callbacks_fail(bp);
+		ASSERT(list_empty(&bp->b_li_list));
 		xfs_buf_relse(bp);
 		return;
 	}
diff --git a/fs/xfs/xfs_dquot_item.c b/fs/xfs/xfs_dquot_item.c
index 349c92d26570c..d7e4de7151d7f 100644
--- a/fs/xfs/xfs_dquot_item.c
+++ b/fs/xfs/xfs_dquot_item.c
@@ -113,23 +113,6 @@ xfs_qm_dqunpin_wait(
 	wait_event(dqp->q_pinwait, (atomic_read(&dqp->q_pincount) == 0));
 }
 
-/*
- * Callback used to mark a buffer with XFS_LI_FAILED when items in the buffer
- * have been failed during writeback
- *
- * this informs the AIL that the dquot is already flush locked on the next push,
- * and acquires a hold on the buffer to ensure that it isn't reclaimed before
- * dirty data makes it to disk.
- */
-STATIC void
-xfs_dquot_item_error(
-	struct xfs_log_item	*lip,
-	struct xfs_buf		*bp)
-{
-	ASSERT(!completion_done(&DQUOT_ITEM(lip)->qli_dquot->q_flush));
-	xfs_set_li_failed(lip, bp);
-}
-
 STATIC uint
 xfs_qm_dquot_logitem_push(
 	struct xfs_log_item	*lip,
@@ -216,7 +199,6 @@ static const struct xfs_item_ops xfs_dquot_item_ops = {
 	.iop_release	= xfs_qm_dquot_logitem_release,
 	.iop_committing	= xfs_qm_dquot_logitem_committing,
 	.iop_push	= xfs_qm_dquot_logitem_push,
-	.iop_error	= xfs_dquot_item_error
 };
 
 /*
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 7049f2ae8d186..86c783dec2bac 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -464,23 +464,6 @@ xfs_inode_item_unpin(
 		wake_up_bit(&ip->i_flags, __XFS_IPINNED_BIT);
 }
 
-/*
- * Callback used to mark a buffer with XFS_LI_FAILED when items in the buffer
- * have been failed during writeback
- *
- * This informs the AIL that the inode is already flush locked on the next push,
- * and acquires a hold on the buffer to ensure that it isn't reclaimed before
- * dirty data makes it to disk.
- */
-STATIC void
-xfs_inode_item_error(
-	struct xfs_log_item	*lip,
-	struct xfs_buf		*bp)
-{
-	ASSERT(xfs_isiflocked(INODE_ITEM(lip)->ili_inode));
-	xfs_set_li_failed(lip, bp);
-}
-
 STATIC uint
 xfs_inode_item_push(
 	struct xfs_log_item	*lip,
@@ -619,7 +602,6 @@ static const struct xfs_item_ops xfs_inode_item_ops = {
 	.iop_committed	= xfs_inode_item_committed,
 	.iop_push	= xfs_inode_item_push,
 	.iop_committing	= xfs_inode_item_committing,
-	.iop_error	= xfs_inode_item_error
 };
 
 
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index 99a9ab9cab25b..b752501818d25 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -74,7 +74,6 @@ struct xfs_item_ops {
 	void (*iop_committing)(struct xfs_log_item *, xfs_lsn_t commit_lsn);
 	void (*iop_release)(struct xfs_log_item *);
 	xfs_lsn_t (*iop_committed)(struct xfs_log_item *, xfs_lsn_t);
-	void (*iop_error)(struct xfs_log_item *, xfs_buf_t *);
 	int (*iop_recover)(struct xfs_log_item *lip, struct xfs_trans *tp);
 	bool (*iop_match)(struct xfs_log_item *item, uint64_t id);
 };
-- 
2.26.2.761.g0e0b3e54be



* [PATCH 15/30] xfs: move xfs_clear_li_failed out of xfs_ail_delete_one()
  2020-06-01 21:42 [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (13 preceding siblings ...)
  2020-06-01 21:42 ` [PATCH 14/30] xfs: unwind log item error flagging Dave Chinner
@ 2020-06-01 21:42 ` Dave Chinner
  2020-06-02 20:47   ` Darrick J. Wong
  2020-06-03 15:02   ` Brian Foster
  2020-06-01 21:42 ` [PATCH 16/30] xfs: pin inode backing buffer to the inode log item Dave Chinner
                   ` (14 subsequent siblings)
  29 siblings, 2 replies; 80+ messages in thread
From: Dave Chinner @ 2020-06-01 21:42 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

xfs_ail_delete_one() is called directly from dquot and inode IO
completion, as well as from the generic xfs_trans_ail_delete()
function. Inodes are about to have their own failure handling, and
dquots will in future, too. Pull the clearing of the LI_FAILED flag
up into the callers so we can customise the code appropriately.
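
The refactoring pattern, reduced to a compilable sketch with
invented names: the helper stops clearing the flag itself, and each
caller clears it explicitly so the paths are free to diverge later:

#include <stdio.h>

#define LI_FAILED       (1 << 0)

struct log_item {
        unsigned long   flags;
};

static void clear_li_failed(struct log_item *lip)
{
        lip->flags &= ~LI_FAILED;
        puts("LI_FAILED cleared");
}

static void ail_delete_one(struct log_item *lip)
{
        (void)lip;              /* no longer touches LI_FAILED */
        puts("removed from AIL");
}

/* Generic path: clear the flag, then delete. */
static void trans_ail_delete(struct log_item *lip)
{
        clear_li_failed(lip);
        ail_delete_one(lip);
}

/* Inode completion path: free to customise the two steps. */
static void iflush_done_one(struct log_item *lip, int relogged)
{
        clear_li_failed(lip);
        if (!relogged)
                ail_delete_one(lip);
}

int main(void)
{
        struct log_item a = { LI_FAILED }, b = { LI_FAILED };

        trans_ail_delete(&a);
        iflush_done_one(&b, 1); /* relogged: clear flag, stay in AIL */
        return 0;
}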

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_dquot.c      | 6 +-----
 fs/xfs/xfs_inode_item.c | 3 +--
 fs/xfs/xfs_trans_ail.c  | 2 +-
 3 files changed, 3 insertions(+), 8 deletions(-)

diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
index d5984a926d1d0..76353c9a723ee 100644
--- a/fs/xfs/xfs_dquot.c
+++ b/fs/xfs/xfs_dquot.c
@@ -1070,16 +1070,12 @@ xfs_qm_dqflush_done(
 	     test_bit(XFS_LI_FAILED, &lip->li_flags))) {
 
 		spin_lock(&ailp->ail_lock);
+		xfs_clear_li_failed(lip);
 		if (lip->li_lsn == qip->qli_flush_lsn) {
 			/* xfs_ail_update_finish() drops the AIL lock */
 			tail_lsn = xfs_ail_delete_one(ailp, lip);
 			xfs_ail_update_finish(ailp, tail_lsn);
 		} else {
-			/*
-			 * Clear the failed state since we are about to drop the
-			 * flush lock
-			 */
-			xfs_clear_li_failed(lip);
 			spin_unlock(&ailp->ail_lock);
 		}
 	}
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 86c783dec2bac..0ba75764a8dc5 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -690,12 +690,11 @@ xfs_iflush_done(
 		/* this is an opencoded batch version of xfs_trans_ail_delete */
 		spin_lock(&ailp->ail_lock);
 		list_for_each_entry(lip, &tmp, li_bio_list) {
+			xfs_clear_li_failed(lip);
 			if (lip->li_lsn == INODE_ITEM(lip)->ili_flush_lsn) {
 				xfs_lsn_t lsn = xfs_ail_delete_one(ailp, lip);
 				if (!tail_lsn && lsn)
 					tail_lsn = lsn;
-			} else {
-				xfs_clear_li_failed(lip);
 			}
 		}
 		xfs_ail_update_finish(ailp, tail_lsn);
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index ac5019361a139..ac33f6393f99c 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -843,7 +843,6 @@ xfs_ail_delete_one(
 
 	trace_xfs_ail_delete(lip, mlip->li_lsn, lip->li_lsn);
 	xfs_ail_delete(ailp, lip);
-	xfs_clear_li_failed(lip);
 	clear_bit(XFS_LI_IN_AIL, &lip->li_flags);
 	lip->li_lsn = 0;
 
@@ -874,6 +873,7 @@ xfs_trans_ail_delete(
 	}
 
 	/* xfs_ail_update_finish() drops the AIL lock */
+	xfs_clear_li_failed(lip);
 	tail_lsn = xfs_ail_delete_one(ailp, lip);
 	xfs_ail_update_finish(ailp, tail_lsn);
 }
-- 
2.26.2.761.g0e0b3e54be



* [PATCH 16/30] xfs: pin inode backing buffer to the inode log item
  2020-06-01 21:42 [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (14 preceding siblings ...)
  2020-06-01 21:42 ` [PATCH 15/30] xfs: move xfs_clear_li_failed out of xfs_ail_delete_one() Dave Chinner
@ 2020-06-01 21:42 ` Dave Chinner
  2020-06-02 22:30   ` Darrick J. Wong
  2020-06-03 18:58   ` Brian Foster
  2020-06-01 21:42 ` [PATCH 17/30] xfs: make inode reclaim almost non-blocking Dave Chinner
                   ` (13 subsequent siblings)
  29 siblings, 2 replies; 80+ messages in thread
From: Dave Chinner @ 2020-06-01 21:42 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

When we dirty an inode, we are going to have to write it to disk at
some point in the near future. This requires the inode cluster
backing buffer to be present in memory. Unfortunately, under severe
memory pressure we can reclaim the inode backing buffer while the
inode is dirty in memory, resulting in stalling the AIL pushing
because it has to do a read-modify-write cycle on the cluster
buffer.

When we have no memory available, the read of the cluster buffer
blocks the AIL pushing process, and this causes all sorts of issues
for memory reclaim as it requires inode writeback to make forwards
progress. Allocating a cluster buffer causes more memory pressure,
which results in more cluster buffers being reclaimed, which in turn
means more RMW cycles being done in the AIL context, and everything
then backs up on AIL progress. Only the synchronous inode cluster
writeback in the inode reclaim code provides some level of forwards
progress guarantee that prevents OOM-killer rampages in this
situation.

Fix this by pinning the inode backing buffer to the inode log item
when the inode is first dirtied (i.e. in xfs_trans_log_inode()).
This means the first modification of an inode that has been held in
cache for a long time may block on a cluster buffer read, but we can
do that in transaction context and block safely until the buffer has
been allocated and read.

Once we have the cluster buffer, the inode log item takes a
reference to it, pinning it in memory, and attaches it to the log
item for future reference. This means we can always grab the cluster
buffer from the inode log item when we need it.

When the inode is finally cleaned and removed from the AIL, we can
drop the reference the inode log item holds on the cluster buffer.
Once all inodes on the cluster buffer are clean, the cluster buffer
will be unpinned and will once again be available for memory reclaim.

This avoids the issues with needing to do RMW cycles in the AIL
pushing context, and hence allows complete non-blocking inode
flushing to be performed by the AIL pushing context.
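
As a rough lifecycle sketch (paraphrasing the diff below, with
locking, error handling and the stale inode cases elided), the
buffer reference now travels with the log item like this:

    /* First modification: xfs_trans_log_inode() pins the buffer. */
    if (!iip->ili_item.li_buf) {
            error = xfs_imap_to_bp(ip->i_mount, tp, &ip->i_imap,
                            NULL, &bp, 0);  /* may block on read */
            xfs_buf_hold(bp);               /* log item's reference */
            xfs_trans_brelse(tp, bp);       /* drop the tp reference */
            iip->ili_item.li_buf = bp;
    }

    /* Final flush completion: xfs_iflush_done() unpins it. */
    if (!iip->ili_fields) {                 /* inode clean in memory */
            iip->ili_item.li_buf = NULL;
            xfs_buf_rele(bp);               /* reclaimable again */
    }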

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_inode_buf.c   |  3 +-
 fs/xfs/libxfs/xfs_trans_inode.c | 53 +++++++++++++++++++++---
 fs/xfs/xfs_buf_item.c           |  4 +-
 fs/xfs/xfs_inode_item.c         | 73 +++++++++++++++++++++++++++------
 fs/xfs/xfs_trans_ail.c          |  8 +++-
 5 files changed, 117 insertions(+), 24 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index 6f84ea85fdd83..1af97235785c8 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -176,7 +176,8 @@ xfs_imap_to_bp(
 	}
 
 	*bpp = bp;
-	*dipp = xfs_buf_offset(bp, imap->im_boffset);
+	if (dipp)
+		*dipp = xfs_buf_offset(bp, imap->im_boffset);
 	return 0;
 }
 
diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
index fe6c2e39be85d..1e7147b90725e 100644
--- a/fs/xfs/libxfs/xfs_trans_inode.c
+++ b/fs/xfs/libxfs/xfs_trans_inode.c
@@ -8,6 +8,8 @@
 #include "xfs_shared.h"
 #include "xfs_format.h"
 #include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
 #include "xfs_inode.h"
 #include "xfs_trans.h"
 #include "xfs_trans_priv.h"
@@ -72,13 +74,19 @@ xfs_trans_ichgtime(
 }
 
 /*
- * This is called to mark the fields indicated in fieldmask as needing
- * to be logged when the transaction is committed.  The inode must
- * already be associated with the given transaction.
+ * This is called to mark the fields indicated in fieldmask as needing to be
+ * logged when the transaction is committed.  The inode must already be
+ * associated with the given transaction.
  *
- * The values for fieldmask are defined in xfs_inode_item.h.  We always
- * log all of the core inode if any of it has changed, and we always log
- * all of the inline data/extents/b-tree root if any of them has changed.
+ * The values for fieldmask are defined in xfs_inode_item.h.  We always log all
+ * of the core inode if any of it has changed, and we always log all of the
+ * inline data/extents/b-tree root if any of them has changed.
+ *
+ * Grab and pin the cluster buffer associated with this inode to avoid RMW
+ * cycles at inode writeback time. Avoid the need to add error handling to every
+ * xfs_trans_log_inode() call by shutting down on read error.  This will cause
+ * transactions to fail and everything to error out, just like if we return a
+ * read error in a dirty transaction and cancel it.
  */
 void
 xfs_trans_log_inode(
@@ -132,6 +140,39 @@ xfs_trans_log_inode(
 	spin_lock(&iip->ili_lock);
 	iip->ili_fsync_fields |= flags;
 
+	if (!iip->ili_item.li_buf) {
+		struct xfs_buf	*bp;
+		int		error;
+
+		/*
+		 * We hold the ILOCK here, so this inode is not going to be
+		 * flushed while we are here. Further, because there is no
+		 * buffer attached to the item, we know that there is no IO in
+		 * progress, so nothing will clear the ili_fields while we read
+		 * in the buffer. Hence we can safely drop the spin lock and
+		 * read the buffer knowing that the state will not change from
+		 * here.
+		 */
+		spin_unlock(&iip->ili_lock);
+		error = xfs_imap_to_bp(ip->i_mount, tp, &ip->i_imap, NULL,
+					&bp, 0);
+		if (error) {
+			xfs_force_shutdown(ip->i_mount, SHUTDOWN_META_IO_ERROR);
+			return;
+		}
+
+		/*
+		 * We need an explicit buffer reference for the log item but
+		 * don't want the buffer to remain attached to the transaction.
+		 * Hold the buffer but release the transaction reference.
+		 */
+		xfs_buf_hold(bp);
+		xfs_trans_brelse(tp, bp);
+
+		spin_lock(&iip->ili_lock);
+		iip->ili_item.li_buf = bp;
+	}
+
 	/*
 	 * Always OR in the bits from the ili_last_fields field.  This is to
 	 * coordinate with the xfs_iflush() and xfs_iflush_done() routines in
diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 2364a9aa2d71a..9739d64a46443 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -1131,11 +1131,9 @@ xfs_buf_inode_iodone(
 		if (ret == 1)
 			return;
 		ASSERT(ret == 2);
-		spin_lock(&bp->b_mount->m_ail->ail_lock);
 		list_for_each_entry(lip, &bp->b_li_list, li_bio_list) {
-			xfs_set_li_failed(lip, bp);
+			set_bit(XFS_LI_FAILED, &lip->li_flags);
 		}
-		spin_unlock(&bp->b_mount->m_ail->ail_lock);
 		xfs_buf_relse(bp);
 		return;
 	}
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 0ba75764a8dc5..0a7720b7a821a 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -130,6 +130,8 @@ xfs_inode_item_size(
 	xfs_inode_item_data_fork_size(iip, nvecs, nbytes);
 	if (XFS_IFORK_Q(ip))
 		xfs_inode_item_attr_fork_size(iip, nvecs, nbytes);
+
+	ASSERT(iip->ili_item.li_buf);
 }
 
 STATIC void
@@ -439,6 +441,7 @@ xfs_inode_item_pin(
 	struct xfs_inode	*ip = INODE_ITEM(lip)->ili_inode;
 
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
+	ASSERT(lip->li_buf);
 
 	trace_xfs_inode_pin(ip, _RET_IP_);
 	atomic_inc(&ip->i_pincount);
@@ -450,6 +453,12 @@ xfs_inode_item_pin(
  * item which was previously pinned with a call to xfs_inode_item_pin().
  *
  * Also wake up anyone in xfs_iunpin_wait() if the count goes to 0.
+ *
+ * Note that unpin can race with inode cluster buffer freeing marking the buffer
+ * stale. In that case, flush completions are run from the buffer unpin call,
+ * which may happen before the inode is unpinned. If we lose the race, there
+ * will be no buffer attached to the log item, but the inode will be marked
+ * XFS_ISTALE.
  */
 STATIC void
 xfs_inode_item_unpin(
@@ -459,6 +468,7 @@ xfs_inode_item_unpin(
 	struct xfs_inode	*ip = INODE_ITEM(lip)->ili_inode;
 
 	trace_xfs_inode_unpin(ip, _RET_IP_);
+	ASSERT(lip->li_buf || xfs_iflags_test(ip, XFS_ISTALE));
 	ASSERT(atomic_read(&ip->i_pincount) > 0);
 	if (atomic_dec_and_test(&ip->i_pincount))
 		wake_up_bit(&ip->i_flags, __XFS_IPINNED_BIT);
@@ -629,10 +639,15 @@ xfs_inode_item_init(
  */
 void
 xfs_inode_item_destroy(
-	xfs_inode_t	*ip)
+	struct xfs_inode	*ip)
 {
-	kmem_free(ip->i_itemp->ili_item.li_lv_shadow);
-	kmem_cache_free(xfs_ili_zone, ip->i_itemp);
+	struct xfs_inode_log_item *iip = ip->i_itemp;
+
+	ASSERT(iip->ili_item.li_buf == NULL);
+
+	ip->i_itemp = NULL;
+	kmem_free(iip->ili_item.li_lv_shadow);
+	kmem_cache_free(xfs_ili_zone, iip);
 }
 
 
@@ -647,6 +662,13 @@ xfs_inode_item_destroy(
  * list for other inodes that will run this function. We remove them from the
  * buffer list so we can process all the inode IO completions in one AIL lock
  * traversal.
+ *
+ * Note: Now that we attach the log item to the buffer when we first log the
+ * inode in memory, we can have unflushed inodes on the buffer list here. These
+ * inodes will have a zero ili_last_fields, so skip over them here. We do
+ * this check -after- we've checked for stale inodes, because we're guaranteed
+ * to have XFS_ISTALE set in the case that dirty inodes are in the CIL and have
+ * not yet had their dirtying transactions committed to disk.
  */
 void
 xfs_iflush_done(
@@ -670,14 +692,16 @@ xfs_iflush_done(
 			continue;
 		}
 
+		if (!iip->ili_last_fields)
+			continue;
+
 		list_move_tail(&lip->li_bio_list, &tmp);
 
 		/* Do an unlocked check for needing the AIL lock. */
-		if (lip->li_lsn == iip->ili_flush_lsn ||
+		if (iip->ili_flush_lsn == lip->li_lsn ||
 		    test_bit(XFS_LI_FAILED, &lip->li_flags))
 			need_ail++;
 	}
-	ASSERT(list_empty(&bp->b_li_list));
 
 	/*
 	 * We only want to pull the item from the AIL if it is actually there
@@ -690,7 +714,7 @@ xfs_iflush_done(
 		/* this is an opencoded batch version of xfs_trans_ail_delete */
 		spin_lock(&ailp->ail_lock);
 		list_for_each_entry(lip, &tmp, li_bio_list) {
-			xfs_clear_li_failed(lip);
+			clear_bit(XFS_LI_FAILED, &lip->li_flags);
 			if (lip->li_lsn == INODE_ITEM(lip)->ili_flush_lsn) {
 				xfs_lsn_t lsn = xfs_ail_delete_one(ailp, lip);
 				if (!tail_lsn && lsn)
@@ -706,14 +730,29 @@ xfs_iflush_done(
 	 * them is safely on disk.
 	 */
 	list_for_each_entry_safe(lip, n, &tmp, li_bio_list) {
+		bool	drop_buffer = false;
+
 		list_del_init(&lip->li_bio_list);
 		iip = INODE_ITEM(lip);
 
 		spin_lock(&iip->ili_lock);
+
+		/*
+		 * Remove the reference to the cluster buffer if the inode is
+		 * clean in memory. Drop the buffer reference once we've dropped
+		 * the locks we hold.
+		 */
+		ASSERT(iip->ili_item.li_buf == bp);
+		if (!iip->ili_fields) {
+			iip->ili_item.li_buf = NULL;
+			drop_buffer = true;
+		}
 		iip->ili_last_fields = 0;
+		iip->ili_flush_lsn = 0;
 		spin_unlock(&iip->ili_lock);
-
 		xfs_ifunlock(iip->ili_inode);
+		if (drop_buffer)
+			xfs_buf_rele(bp);
 	}
 }
 
@@ -725,12 +764,20 @@ xfs_iflush_done(
  */
 void
 xfs_iflush_abort(
-	struct xfs_inode		*ip)
+	struct xfs_inode	*ip)
 {
-	struct xfs_inode_log_item	*iip = ip->i_itemp;
+	struct xfs_inode_log_item *iip = ip->i_itemp;
+	struct xfs_buf		*bp = NULL;
 
 	if (iip) {
+		/*
+		 * Clear the failed bit before removing the item from the AIL so
+		 * xfs_trans_ail_delete() doesn't try to clear and release the
+		 * buffer attached to the log item before we are done with it.
+		 */
+		clear_bit(XFS_LI_FAILED, &iip->ili_item.li_flags);
 		xfs_trans_ail_delete(&iip->ili_item, 0);
+
 		/*
 		 * Clear the inode logging fields so no more flushes are
 		 * attempted.
@@ -739,12 +786,14 @@ xfs_iflush_abort(
 		iip->ili_last_fields = 0;
 		iip->ili_fields = 0;
 		iip->ili_fsync_fields = 0;
+		iip->ili_flush_lsn = 0;
+		bp = iip->ili_item.li_buf;
+		iip->ili_item.li_buf = NULL;
 		spin_unlock(&iip->ili_lock);
 	}
-	/*
-	 * Release the inode's flush lock since we're done with it.
-	 */
 	xfs_ifunlock(ip);
+	if (bp)
+		xfs_buf_rele(bp);
 }
 
 /*
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index ac33f6393f99c..c3be6e4401343 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -377,8 +377,12 @@ xfsaild_resubmit_item(
 	}
 
 	/* protected by ail_lock */
-	list_for_each_entry(lip, &bp->b_li_list, li_bio_list)
-		xfs_clear_li_failed(lip);
+	list_for_each_entry(lip, &bp->b_li_list, li_bio_list) {
+		if (bp->b_flags & _XBF_INODES)
+			clear_bit(XFS_LI_FAILED, &lip->li_flags);
+		else
+			xfs_clear_li_failed(lip);
+	}
 
 	xfs_buf_unlock(bp);
 	return XFS_ITEM_SUCCESS;
-- 
2.26.2.761.g0e0b3e54be


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 17/30] xfs: make inode reclaim almost non-blocking
  2020-06-01 21:42 [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (15 preceding siblings ...)
  2020-06-01 21:42 ` [PATCH 16/30] xfs: pin inode backing buffer to the inode log item Dave Chinner
@ 2020-06-01 21:42 ` Dave Chinner
  2020-06-01 21:42 ` [PATCH 18/30] xfs: remove IO submission from xfs_reclaim_inode() Dave Chinner
                   ` (12 subsequent siblings)
  29 siblings, 0 replies; 80+ messages in thread
From: Dave Chinner @ 2020-06-01 21:42 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Now that dirty inode writeback doesn't cause read-modify-write
cycles on the inode cluster buffer under memory pressure, the need
to throttle memory reclaim to the rate at which we can clean dirty
inodes goes away. This is because we no longer thrash inode cluster
buffers under memory pressure to clean dirty inodes.

This means inode writeback no longer stalls on memory allocation
or read IO, and hence can be done asynchronously without generating
memory pressure. As a result, blocking inode writeback in reclaim is
no longer necessary to prevent reclaim priority windup as cleaning
dirty inodes is no longer dependent on having memory reserves
available for the filesystem to make progress reclaiming inodes.

Hence we can convert inode reclaim to be non-blocking for shrinker
callouts, both for direct reclaim and kswapd.

On a vanilla kernel, running a 16-way fsmark create workload on a
4 node/16p/16GB RAM machine, I can reliably pin 14.75GB of RAM via
userspace mlock(). The OOM killer gets invoked at 15GB of
pinned RAM.

With this patch alone, pinning memory triggers premature OOM
killer invocation, sometimes with as much as 45% of RAM being free.
It's trivially easy to trigger the OOM killer when reclaim does not
block.

With pinning inode clusters in RAM and then adding this patch, I can
reliably pin 14.5GB of RAM and still have the fsmark workload run to
completion. The OOM killer gets invoked at 14.75GB of pinned RAM,
which is only a small amount of memory less than the vanilla kernel.
It is much more reliable than just with async reclaim alone.

simoops shows that allocation stalls go away when async reclaim is
used. Vanilla kernel:

Run time: 1924 seconds
Read latency (p50: 3,305,472) (p95: 3,723,264) (p99: 4,001,792)
Write latency (p50: 184,064) (p95: 553,984) (p99: 807,936)
Allocation latency (p50: 2,641,920) (p95: 3,911,680) (p99: 4,464,640)
work rate = 13.45/sec (avg 13.44/sec) (p50: 13.46) (p95: 13.58) (p99: 13.70)
alloc stall rate = 3.80/sec (avg: 2.59) (p50: 2.54) (p95: 2.96) (p99: 3.02)

With inode cluster pinning and async reclaim:

Run time: 1924 seconds
Read latency (p50: 3,305,472) (p95: 3,715,072) (p99: 3,977,216)
Write latency (p50: 187,648) (p95: 553,984) (p99: 789,504)
Allocation latency (p50: 2,748,416) (p95: 3,919,872) (p99: 4,448,256)
work rate = 13.28/sec (avg 13.32/sec) (p50: 13.26) (p95: 13.34) (p99: 13.34)
alloc stall rate = 0.02/sec (avg: 0.02) (p50: 0.01) (p95: 0.03) (p99: 0.03)

Latencies don't really change much, nor does the work rate. However,
allocation almost never stalls with these changes, whilst the
vanilla kernel is sometimes reporting 20 stalls/s over a 60s sample
period. This difference is due to inode reclaim being largely
non-blocking now.

IOWs, once we have pinned inode cluster buffers, we can make inode
reclaim non-blocking without a major risk of premature and/or
spurious OOM killer invocation, and without any changes to memory
reclaim infrastructure.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_icache.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index dbba4c1946386..a6780942034fc 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1402,7 +1402,7 @@ xfs_reclaim_inodes_nr(
 	xfs_reclaim_work_queue(mp);
 	xfs_ail_push_all(mp->m_ail);
 
-	return xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK | SYNC_WAIT, &nr_to_scan);
+	return xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK, &nr_to_scan);
 }
 
 /*
-- 
2.26.2.761.g0e0b3e54be


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 18/30] xfs: remove IO submission from xfs_reclaim_inode()
  2020-06-01 21:42 [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (16 preceding siblings ...)
  2020-06-01 21:42 ` [PATCH 17/30] xfs: make inode reclaim almost non-blocking Dave Chinner
@ 2020-06-01 21:42 ` Dave Chinner
  2020-06-02 22:36   ` Darrick J. Wong
  2020-06-01 21:42 ` [PATCH 19/30] xfs: allow multiple reclaimers per AG Dave Chinner
                   ` (11 subsequent siblings)
  29 siblings, 1 reply; 80+ messages in thread
From: Dave Chinner @ 2020-06-01 21:42 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

We no longer need to issue IO from shrinker-based inode reclaim to
prevent spurious OOM killer invocation. This leaves only the global
filesystem management operations, such as unmount, needing to write
back dirty inodes and reclaim them.

Instead of using the reclaim pass to write dirty inodes before
reclaiming them, use the AIL to push all the dirty inodes before we
try to reclaim them. This allows us to remove all the conditional
SYNC_WAIT locking and the writeback code from xfs_reclaim_inode()
and greatly simplify the checks we need to do to reclaim an inode.
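
The resulting blocking reclaim loop is simple; as a sketch
(paraphrasing the diff below), callers that need all inodes
reclaimed just cycle AIL pushing and reclaim until nothing is
skipped:

    do {
            /* write back all dirty inodes via the AIL */
            xfs_ail_push_all_sync(mp->m_ail);
            /* now everything should be clean and reclaimable */
            skipped = xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan);
    } while (skipped > 0);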

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_icache.c | 117 ++++++++++++--------------------------------
 1 file changed, 31 insertions(+), 86 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index a6780942034fc..74032316ce5cc 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1111,24 +1111,17 @@ xfs_reclaim_inode_grab(
  *	dirty, async	=> requeue
  *	dirty, sync	=> flush, wait and reclaim
  */
-STATIC int
+static bool
 xfs_reclaim_inode(
 	struct xfs_inode	*ip,
 	struct xfs_perag	*pag,
 	int			sync_mode)
 {
-	struct xfs_buf		*bp = NULL;
 	xfs_ino_t		ino = ip->i_ino; /* for radix_tree_delete */
-	int			error;
 
-restart:
-	error = 0;
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
-	if (!xfs_iflock_nowait(ip)) {
-		if (!(sync_mode & SYNC_WAIT))
-			goto out;
-		xfs_iflock(ip);
-	}
+	if (!xfs_iflock_nowait(ip))
+		goto out;
 
 	if (XFS_FORCED_SHUTDOWN(ip->i_mount)) {
 		xfs_iunpin_wait(ip);
@@ -1136,52 +1129,12 @@ xfs_reclaim_inode(
 		xfs_iflush_abort(ip);
 		goto reclaim;
 	}
-	if (xfs_ipincount(ip)) {
-		if (!(sync_mode & SYNC_WAIT))
-			goto out_ifunlock;
-		xfs_iunpin_wait(ip);
-	}
-	if (xfs_inode_clean(ip)) {
-		xfs_ifunlock(ip);
-		goto reclaim;
-	}
-
-	/*
-	 * Never flush out dirty data during non-blocking reclaim, as it would
-	 * just contend with AIL pushing trying to do the same job.
-	 */
-	if (!(sync_mode & SYNC_WAIT))
+	if (xfs_ipincount(ip))
+		goto out_ifunlock;
+	if (!xfs_inode_clean(ip))
 		goto out_ifunlock;
 
-	/*
-	 * Now we have an inode that needs flushing.
-	 *
-	 * Note that xfs_iflush will never block on the inode buffer lock, as
-	 * xfs_ifree_cluster() can lock the inode buffer before it locks the
-	 * ip->i_lock, and we are doing the exact opposite here.  As a result,
-	 * doing a blocking xfs_imap_to_bp() to get the cluster buffer would
-	 * result in an ABBA deadlock with xfs_ifree_cluster().
-	 *
-	 * As xfs_ifree_cluser() must gather all inodes that are active in the
-	 * cache to mark them stale, if we hit this case we don't actually want
-	 * to do IO here - we want the inode marked stale so we can simply
-	 * reclaim it.  Hence if we get an EAGAIN error here,  just unlock the
-	 * inode, back off and try again.  Hopefully the next pass through will
-	 * see the stale flag set on the inode.
-	 */
-	error = xfs_iflush(ip, &bp);
-	if (error == -EAGAIN) {
-		xfs_iunlock(ip, XFS_ILOCK_EXCL);
-		/* backoff longer than in xfs_ifree_cluster */
-		delay(2);
-		goto restart;
-	}
-
-	if (!error) {
-		error = xfs_bwrite(bp);
-		xfs_buf_relse(bp);
-	}
-
+	xfs_ifunlock(ip);
 reclaim:
 	ASSERT(!xfs_isiflocked(ip));
 
@@ -1231,21 +1184,14 @@ xfs_reclaim_inode(
 	ASSERT(xfs_inode_clean(ip));
 
 	__xfs_inode_free(ip);
-	return error;
+	return true;
 
 out_ifunlock:
 	xfs_ifunlock(ip);
 out:
-	xfs_iflags_clear(ip, XFS_IRECLAIM);
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
-	/*
-	 * We could return -EAGAIN here to make reclaim rescan the inode tree in
-	 * a short while. However, this just burns CPU time scanning the tree
-	 * waiting for IO to complete and the reclaim work never goes back to
-	 * the idle state. Instead, return 0 to let the next scheduled
-	 * background reclaim attempt to reclaim the inode again.
-	 */
-	return 0;
+	xfs_iflags_clear(ip, XFS_IRECLAIM);
+	return false;
 }
 
 /*
@@ -1253,21 +1199,22 @@ xfs_reclaim_inode(
  * corrupted, we still want to try to reclaim all the inodes. If we don't,
  * then a shut down during filesystem unmount reclaim walk leak all the
  * unreclaimed inodes.
+ *
+ * Returns non-zero if any AGs or inodes were skipped in the reclaim pass
+ * so that callers that want to block until all dirty inodes are written back
+ * and reclaimed can sanely loop.
  */
-STATIC int
+static int
 xfs_reclaim_inodes_ag(
 	struct xfs_mount	*mp,
 	int			flags,
 	int			*nr_to_scan)
 {
 	struct xfs_perag	*pag;
-	int			error = 0;
-	int			last_error = 0;
 	xfs_agnumber_t		ag;
 	int			trylock = flags & SYNC_TRYLOCK;
 	int			skipped;
 
-restart:
 	ag = 0;
 	skipped = 0;
 	while ((pag = xfs_perag_get_tag(mp, ag, XFS_ICI_RECLAIM_TAG))) {
@@ -1341,9 +1288,8 @@ xfs_reclaim_inodes_ag(
 			for (i = 0; i < nr_found; i++) {
 				if (!batch[i])
 					continue;
-				error = xfs_reclaim_inode(batch[i], pag, flags);
-				if (error && last_error != -EFSCORRUPTED)
-					last_error = error;
+				if (!xfs_reclaim_inode(batch[i], pag, flags))
+					skipped++;
 			}
 
 			*nr_to_scan -= XFS_LOOKUP_BATCH;
@@ -1359,19 +1305,7 @@ xfs_reclaim_inodes_ag(
 		mutex_unlock(&pag->pag_ici_reclaim_lock);
 		xfs_perag_put(pag);
 	}
-
-	/*
-	 * if we skipped any AG, and we still have scan count remaining, do
-	 * another pass this time using blocking reclaim semantics (i.e
-	 * waiting on the reclaim locks and ignoring the reclaim cursors). This
-	 * ensure that when we get more reclaimers than AGs we block rather
-	 * than spin trying to execute reclaim.
-	 */
-	if (skipped && (flags & SYNC_WAIT) && *nr_to_scan > 0) {
-		trylock = 0;
-		goto restart;
-	}
-	return last_error;
+	return skipped;
 }
 
 int
@@ -1380,8 +1314,18 @@ xfs_reclaim_inodes(
 	int		mode)
 {
 	int		nr_to_scan = INT_MAX;
+	int		skipped;
 
-	return xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan);
+	xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan);
+	if (!(mode & SYNC_WAIT))
+		return 0;
+
+	do {
+		xfs_ail_push_all_sync(mp->m_ail);
+		skipped = xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan);
+	} while (skipped > 0);
+
+	return 0;
 }
 
 /*
@@ -1402,7 +1346,8 @@ xfs_reclaim_inodes_nr(
 	xfs_reclaim_work_queue(mp);
 	xfs_ail_push_all(mp->m_ail);
 
-	return xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK, &nr_to_scan);
+	xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK, &nr_to_scan);
+	return 0;
 }
 
 /*
-- 
2.26.2.761.g0e0b3e54be


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 19/30] xfs: allow multiple reclaimers per AG
  2020-06-01 21:42 [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (17 preceding siblings ...)
  2020-06-01 21:42 ` [PATCH 18/30] xfs: remove IO submission from xfs_reclaim_inode() Dave Chinner
@ 2020-06-01 21:42 ` Dave Chinner
  2020-06-01 21:42 ` [PATCH 20/30] xfs: don't block inode reclaim on the ILOCK Dave Chinner
                   ` (10 subsequent siblings)
  29 siblings, 0 replies; 80+ messages in thread
From: Dave Chinner @ 2020-06-01 21:42 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Inode reclaim still throttles direct reclaim on the per-AG reclaim
locks. This is no longer necessary as reclaim can run non-blocking
now. Hence we can remove these locks so that we don't arbitrarily
block reclaimers just because there are more direct reclaimers than
there are AGs.

This can result in multiple reclaimers working on the same range of
an AG, but this doesn't cause any apparent issues. Optimising the
spread of concurrent reclaimers for best efficiency can be done in a
future patchset.
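
As a sketch of the lock-free cursor handling (paraphrasing the diff
below), READ_ONCE/WRITE_ONCE replace the per-AG mutex; concurrent
reclaimers may see a stale cursor and rescan the same range, which
is harmless:

    /* Non-zero cursor: we haven't scanned the whole AG. */
    first_index = READ_ONCE(pag->pag_ici_reclaim_cursor);
    if (first_index)
            skipped++;      /* tell blocking callers to loop again */

    /* ... batched radix tree scan from first_index ... */

    if (done)
            first_index = 0;
    WRITE_ONCE(pag->pag_ici_reclaim_cursor, first_index);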

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_icache.c | 31 ++++++++++++-------------------
 fs/xfs/xfs_mount.c  |  4 ----
 fs/xfs/xfs_mount.h  |  1 -
 3 files changed, 12 insertions(+), 24 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 74032316ce5cc..c4ba8d7bc45bc 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1211,12 +1211,9 @@ xfs_reclaim_inodes_ag(
 	int			*nr_to_scan)
 {
 	struct xfs_perag	*pag;
-	xfs_agnumber_t		ag;
-	int			trylock = flags & SYNC_TRYLOCK;
-	int			skipped;
+	xfs_agnumber_t		ag = 0;
+	int			skipped = 0;
 
-	ag = 0;
-	skipped = 0;
 	while ((pag = xfs_perag_get_tag(mp, ag, XFS_ICI_RECLAIM_TAG))) {
 		unsigned long	first_index = 0;
 		int		done = 0;
@@ -1224,15 +1221,13 @@ xfs_reclaim_inodes_ag(
 
 		ag = pag->pag_agno + 1;
 
-		if (trylock) {
-			if (!mutex_trylock(&pag->pag_ici_reclaim_lock)) {
-				skipped++;
-				xfs_perag_put(pag);
-				continue;
-			}
-			first_index = pag->pag_ici_reclaim_cursor;
-		} else
-			mutex_lock(&pag->pag_ici_reclaim_lock);
+		/*
+		 * If the cursor is not zero, we haven't scanned the whole AG
+		 * so we might have skipped inodes here.
+		 */
+		first_index = READ_ONCE(pag->pag_ici_reclaim_cursor);
+		if (first_index)
+			skipped++;
 
 		do {
 			struct xfs_inode *batch[XFS_LOOKUP_BATCH];
@@ -1298,11 +1293,9 @@ xfs_reclaim_inodes_ag(
 
 		} while (nr_found && !done && *nr_to_scan > 0);
 
-		if (trylock && !done)
-			pag->pag_ici_reclaim_cursor = first_index;
-		else
-			pag->pag_ici_reclaim_cursor = 0;
-		mutex_unlock(&pag->pag_ici_reclaim_lock);
+		if (done)
+			first_index = 0;
+		WRITE_ONCE(pag->pag_ici_reclaim_cursor, first_index);
 		xfs_perag_put(pag);
 	}
 	return skipped;
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index d5dcf98698600..03158b42a1943 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -148,7 +148,6 @@ xfs_free_perag(
 		ASSERT(atomic_read(&pag->pag_ref) == 0);
 		xfs_iunlink_destroy(pag);
 		xfs_buf_hash_destroy(pag);
-		mutex_destroy(&pag->pag_ici_reclaim_lock);
 		call_rcu(&pag->rcu_head, __xfs_free_perag);
 	}
 }
@@ -200,7 +199,6 @@ xfs_initialize_perag(
 		pag->pag_agno = index;
 		pag->pag_mount = mp;
 		spin_lock_init(&pag->pag_ici_lock);
-		mutex_init(&pag->pag_ici_reclaim_lock);
 		INIT_RADIX_TREE(&pag->pag_ici_root, GFP_ATOMIC);
 		if (xfs_buf_hash_init(pag))
 			goto out_free_pag;
@@ -242,7 +240,6 @@ xfs_initialize_perag(
 out_hash_destroy:
 	xfs_buf_hash_destroy(pag);
 out_free_pag:
-	mutex_destroy(&pag->pag_ici_reclaim_lock);
 	kmem_free(pag);
 out_unwind_new_pags:
 	/* unwind any prior newly initialized pags */
@@ -252,7 +249,6 @@ xfs_initialize_perag(
 			break;
 		xfs_buf_hash_destroy(pag);
 		xfs_iunlink_destroy(pag);
-		mutex_destroy(&pag->pag_ici_reclaim_lock);
 		kmem_free(pag);
 	}
 	return error;
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 3725d25ad97e8..a72cfcaa4ad12 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -354,7 +354,6 @@ typedef struct xfs_perag {
 	spinlock_t	pag_ici_lock;	/* incore inode cache lock */
 	struct radix_tree_root pag_ici_root;	/* incore inode cache root */
 	int		pag_ici_reclaimable;	/* reclaimable inodes */
-	struct mutex	pag_ici_reclaim_lock;	/* serialisation point */
 	unsigned long	pag_ici_reclaim_cursor;	/* reclaim restart point */
 
 	/* buffer cache index */
-- 
2.26.2.761.g0e0b3e54be


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 20/30] xfs: don't block inode reclaim on the ILOCK
  2020-06-01 21:42 [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (18 preceding siblings ...)
  2020-06-01 21:42 ` [PATCH 19/30] xfs: allow multiple reclaimers per AG Dave Chinner
@ 2020-06-01 21:42 ` Dave Chinner
  2020-06-01 21:42 ` [PATCH 21/30] xfs: remove SYNC_TRYLOCK from inode reclaim Dave Chinner
                   ` (9 subsequent siblings)
  29 siblings, 0 replies; 80+ messages in thread
From: Dave Chinner @ 2020-06-01 21:42 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

When we attempt to reclaim an inode, the first thing we do is take
the inode lock. This is blocking right now, so if the inode is being
accessed by something else (e.g. being flushed to the cluster
buffer) we will block here.

Change this to a trylock so that we do not block inode reclaim
unnecessarily here.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_icache.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index c4ba8d7bc45bc..d1c47a0e0b0ec 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1119,9 +1119,10 @@ xfs_reclaim_inode(
 {
 	xfs_ino_t		ino = ip->i_ino; /* for radix_tree_delete */
 
-	xfs_ilock(ip, XFS_ILOCK_EXCL);
-	if (!xfs_iflock_nowait(ip))
+	if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
 		goto out;
+	if (!xfs_iflock_nowait(ip))
+		goto out_iunlock;
 
 	if (XFS_FORCED_SHUTDOWN(ip->i_mount)) {
 		xfs_iunpin_wait(ip);
@@ -1188,8 +1189,9 @@ xfs_reclaim_inode(
 
 out_ifunlock:
 	xfs_ifunlock(ip);
-out:
+out_iunlock:
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+out:
 	xfs_iflags_clear(ip, XFS_IRECLAIM);
 	return false;
 }
-- 
2.26.2.761.g0e0b3e54be


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 21/30] xfs: remove SYNC_TRYLOCK from inode reclaim
  2020-06-01 21:42 [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (19 preceding siblings ...)
  2020-06-01 21:42 ` [PATCH 20/30] xfs: don't block inode reclaim on the ILOCK Dave Chinner
@ 2020-06-01 21:42 ` Dave Chinner
  2020-06-01 21:42 ` [PATCH 22/30] xfs: remove SYNC_WAIT from xfs_reclaim_inodes() Dave Chinner
                   ` (8 subsequent siblings)
  29 siblings, 0 replies; 80+ messages in thread
From: Dave Chinner @ 2020-06-01 21:42 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

All background reclaim is SYNC_TRYLOCK already, and even blocking
reclaim (SYNC_WAIT) can use trylock mechanisms as
xfs_reclaim_inodes_ag() will keep cycling until there are no more
reclaimable inodes. Hence we can kill SYNC_TRYLOCK from inode
reclaim and make everything unconditionally non-blocking.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_icache.c | 29 ++++++++++++-----------------
 1 file changed, 12 insertions(+), 17 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index d1c47a0e0b0ec..ebe55124d6cb8 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -174,7 +174,7 @@ xfs_reclaim_worker(
 	struct xfs_mount *mp = container_of(to_delayed_work(work),
 					struct xfs_mount, m_reclaim_work);
 
-	xfs_reclaim_inodes(mp, SYNC_TRYLOCK);
+	xfs_reclaim_inodes(mp, 0);
 	xfs_reclaim_work_queue(mp);
 }
 
@@ -1030,10 +1030,9 @@ xfs_cowblocks_worker(
  * Grab the inode for reclaim exclusively.
  * Return 0 if we grabbed it, non-zero otherwise.
  */
-STATIC int
+static int
 xfs_reclaim_inode_grab(
-	struct xfs_inode	*ip,
-	int			flags)
+	struct xfs_inode	*ip)
 {
 	ASSERT(rcu_read_lock_held());
 
@@ -1042,12 +1041,10 @@ xfs_reclaim_inode_grab(
 		return 1;
 
 	/*
-	 * If we are asked for non-blocking operation, do unlocked checks to
-	 * see if the inode already is being flushed or in reclaim to avoid
-	 * lock traffic.
+	 * Do unlocked checks to see if the inode already is being flushed or in
+	 * reclaim to avoid lock traffic.
 	 */
-	if ((flags & SYNC_TRYLOCK) &&
-	    __xfs_iflags_test(ip, XFS_IFLOCK | XFS_IRECLAIM))
+	if (__xfs_iflags_test(ip, XFS_IFLOCK | XFS_IRECLAIM))
 		return 1;
 
 	/*
@@ -1114,8 +1111,7 @@ xfs_reclaim_inode_grab(
 static bool
 xfs_reclaim_inode(
 	struct xfs_inode	*ip,
-	struct xfs_perag	*pag,
-	int			sync_mode)
+	struct xfs_perag	*pag)
 {
 	xfs_ino_t		ino = ip->i_ino; /* for radix_tree_delete */
 
@@ -1209,7 +1205,6 @@ xfs_reclaim_inode(
 static int
 xfs_reclaim_inodes_ag(
 	struct xfs_mount	*mp,
-	int			flags,
 	int			*nr_to_scan)
 {
 	struct xfs_perag	*pag;
@@ -1254,7 +1249,7 @@ xfs_reclaim_inodes_ag(
 			for (i = 0; i < nr_found; i++) {
 				struct xfs_inode *ip = batch[i];
 
-				if (done || xfs_reclaim_inode_grab(ip, flags))
+				if (done || xfs_reclaim_inode_grab(ip))
 					batch[i] = NULL;
 
 				/*
@@ -1285,7 +1280,7 @@ xfs_reclaim_inodes_ag(
 			for (i = 0; i < nr_found; i++) {
 				if (!batch[i])
 					continue;
-				if (!xfs_reclaim_inode(batch[i], pag, flags))
+				if (!xfs_reclaim_inode(batch[i], pag))
 					skipped++;
 			}
 
@@ -1311,13 +1306,13 @@ xfs_reclaim_inodes(
 	int		nr_to_scan = INT_MAX;
 	int		skipped;
 
-	xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan);
+	xfs_reclaim_inodes_ag(mp, &nr_to_scan);
 	if (!(mode & SYNC_WAIT))
 		return 0;
 
 	do {
 		xfs_ail_push_all_sync(mp->m_ail);
-		skipped = xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan);
+		skipped = xfs_reclaim_inodes_ag(mp, &nr_to_scan);
 	} while (skipped > 0);
 
 	return 0;
@@ -1341,7 +1336,7 @@ xfs_reclaim_inodes_nr(
 	xfs_reclaim_work_queue(mp);
 	xfs_ail_push_all(mp->m_ail);
 
-	xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK, &nr_to_scan);
+	xfs_reclaim_inodes_ag(mp, &nr_to_scan);
 	return 0;
 }
 
-- 
2.26.2.761.g0e0b3e54be


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 22/30] xfs: remove SYNC_WAIT from xfs_reclaim_inodes()
  2020-06-01 21:42 [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (20 preceding siblings ...)
  2020-06-01 21:42 ` [PATCH 21/30] xfs: remove SYNC_TRYLOCK from inode reclaim Dave Chinner
@ 2020-06-01 21:42 ` Dave Chinner
  2020-06-02 22:43   ` Darrick J. Wong
  2020-06-01 21:42 ` [PATCH 23/30] xfs: clean up inode reclaim comments Dave Chinner
                   ` (7 subsequent siblings)
  29 siblings, 1 reply; 80+ messages in thread
From: Dave Chinner @ 2020-06-01 21:42 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Clean up xfs_reclaim_inodes() callers. Most callers want blocking
behaviour, so just make the existing SYNC_WAIT behaviour the
default.

For the xfs_reclaim_worker(), just call xfs_reclaim_inodes_ag()
directly because we just want optimistic clean inode reclaim to be
done in the background.

For xfs_quiesce_attr() we can just remove the inode reclaim calls as
they are a historic relic that was required to flush dirty inodes
that contained unlogged changes. We now log all changes to the
inodes, so the sync AIL push from xfs_log_quiesce() called by
xfs_quiesce_attr() will do all the required inode writeback for
freeze.
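
As a rough summary of the resulting split (an illustrative sketch,
not code from the patch):

    /*
     * unmount/mount failure paths: xfs_reclaim_inodes(mp) pushes
     * the AIL and loops until every inode has been reclaimed.
     *
     * background worker: calls xfs_reclaim_inodes_ag() directly for
     * a single optimistic, non-blocking pass over clean inodes.
     */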

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_icache.c | 48 ++++++++++++++++++++-------------------------
 fs/xfs/xfs_icache.h |  2 +-
 fs/xfs/xfs_mount.c  | 11 +++++------
 fs/xfs/xfs_super.c  |  3 ---
 4 files changed, 27 insertions(+), 37 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index ebe55124d6cb8..a27470fc201ff 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -160,24 +160,6 @@ xfs_reclaim_work_queue(
 	rcu_read_unlock();
 }
 
-/*
- * This is a fast pass over the inode cache to try to get reclaim moving on as
- * many inodes as possible in a short period of time. It kicks itself every few
- * seconds, as well as being kicked by the inode cache shrinker when memory
- * goes low. It scans as quickly as possible avoiding locked inodes or those
- * already being flushed, and once done schedules a future pass.
- */
-void
-xfs_reclaim_worker(
-	struct work_struct *work)
-{
-	struct xfs_mount *mp = container_of(to_delayed_work(work),
-					struct xfs_mount, m_reclaim_work);
-
-	xfs_reclaim_inodes(mp, 0);
-	xfs_reclaim_work_queue(mp);
-}
-
 static void
 xfs_perag_set_reclaim_tag(
 	struct xfs_perag	*pag)
@@ -1298,24 +1280,17 @@ xfs_reclaim_inodes_ag(
 	return skipped;
 }
 
-int
+void
 xfs_reclaim_inodes(
-	xfs_mount_t	*mp,
-	int		mode)
+	struct xfs_mount	*mp)
 {
 	int		nr_to_scan = INT_MAX;
 	int		skipped;
 
-	xfs_reclaim_inodes_ag(mp, &nr_to_scan);
-	if (!(mode & SYNC_WAIT))
-		return 0;
-
 	do {
 		xfs_ail_push_all_sync(mp->m_ail);
 		skipped = xfs_reclaim_inodes_ag(mp, &nr_to_scan);
 	} while (skipped > 0);
-
-	return 0;
 }
 
 /*
@@ -1434,6 +1409,25 @@ xfs_inode_matches_eofb(
 	return true;
 }
 
+/*
+ * This is a fast pass over the inode cache to try to get reclaim moving on as
+ * many inodes as possible in a short period of time. It kicks itself every few
+ * seconds, as well as being kicked by the inode cache shrinker when memory
+ * goes low. It scans as quickly as possible avoiding locked inodes or those
+ * already being flushed, and once done schedules a future pass.
+ */
+void
+xfs_reclaim_worker(
+	struct work_struct *work)
+{
+	struct xfs_mount *mp = container_of(to_delayed_work(work),
+					struct xfs_mount, m_reclaim_work);
+	int		nr_to_scan = INT_MAX;
+
+	xfs_reclaim_inodes_ag(mp, &nr_to_scan);
+	xfs_reclaim_work_queue(mp);
+}
+
 STATIC int
 xfs_inode_free_eofblocks(
 	struct xfs_inode	*ip,
diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
index 93b54e7d55f0d..ae92ca53de423 100644
--- a/fs/xfs/xfs_icache.h
+++ b/fs/xfs/xfs_icache.h
@@ -51,7 +51,7 @@ void xfs_inode_free(struct xfs_inode *ip);
 
 void xfs_reclaim_worker(struct work_struct *work);
 
-int xfs_reclaim_inodes(struct xfs_mount *mp, int mode);
+void xfs_reclaim_inodes(struct xfs_mount *mp);
 int xfs_reclaim_inodes_count(struct xfs_mount *mp);
 long xfs_reclaim_inodes_nr(struct xfs_mount *mp, int nr_to_scan);
 
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 03158b42a1943..c8ae49a1e99c3 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -1011,7 +1011,7 @@ xfs_mountfs(
 	 * quota inodes.
 	 */
 	cancel_delayed_work_sync(&mp->m_reclaim_work);
-	xfs_reclaim_inodes(mp, SYNC_WAIT);
+	xfs_reclaim_inodes(mp);
 	xfs_health_unmount(mp);
  out_log_dealloc:
 	mp->m_flags |= XFS_MOUNT_UNMOUNTING;
@@ -1088,13 +1088,12 @@ xfs_unmountfs(
 	xfs_ail_push_all_sync(mp->m_ail);
 
 	/*
-	 * And reclaim all inodes.  At this point there should be no dirty
-	 * inodes and none should be pinned or locked, but use synchronous
-	 * reclaim just to be sure. We can stop background inode reclaim
-	 * here as well if it is still running.
+	 * Reclaim all inodes. At this point there should be no dirty inodes and
+	 * none should be pinned or locked. Stop background inode reclaim here
+	 * if it is still running.
 	 */
 	cancel_delayed_work_sync(&mp->m_reclaim_work);
-	xfs_reclaim_inodes(mp, SYNC_WAIT);
+	xfs_reclaim_inodes(mp);
 	xfs_health_unmount(mp);
 
 	xfs_qm_unmount(mp);
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index fa58cb07c8fdf..9b03ea43f4fe7 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -890,9 +890,6 @@ xfs_quiesce_attr(
 	/* force the log to unpin objects from the now complete transactions */
 	xfs_log_force(mp, XFS_LOG_SYNC);
 
-	/* reclaim inodes to do any IO before the freeze completes */
-	xfs_reclaim_inodes(mp, 0);
-	xfs_reclaim_inodes(mp, SYNC_WAIT);
 
 	/* Push the superblock and write an unmount record */
 	error = xfs_log_sbcount(mp);
-- 
2.26.2.761.g0e0b3e54be


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 23/30] xfs: clean up inode reclaim comments
  2020-06-01 21:42 [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (21 preceding siblings ...)
  2020-06-01 21:42 ` [PATCH 22/30] xfs: remove SYNC_WAIT from xfs_reclaim_inodes() Dave Chinner
@ 2020-06-01 21:42 ` Dave Chinner
  2020-06-02 22:45   ` Darrick J. Wong
  2020-06-01 21:42 ` [PATCH 24/30] xfs: rework stale inodes in xfs_ifree_cluster Dave Chinner
                   ` (6 subsequent siblings)
  29 siblings, 1 reply; 80+ messages in thread
From: Dave Chinner @ 2020-06-01 21:42 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Inode reclaim is quite different now to the way described in various
comments, so update all the comments explaining what it does and how
it works.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_icache.c | 128 ++++++++++++--------------------------------
 1 file changed, 35 insertions(+), 93 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index a27470fc201ff..4fe6f250e8448 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -141,11 +141,8 @@ xfs_inode_free(
 }
 
 /*
- * Queue a new inode reclaim pass if there are reclaimable inodes and there
- * isn't a reclaim pass already in progress. By default it runs every 5s based
- * on the xfs periodic sync default of 30s. Perhaps this should have it's own
- * tunable, but that can be done if this method proves to be ineffective or too
- * aggressive.
+ * Queue background inode reclaim work if there are reclaimable inodes and there
+ * isn't reclaim work already scheduled or in progress.
  */
 static void
 xfs_reclaim_work_queue(
@@ -600,48 +597,31 @@ xfs_iget_cache_miss(
 }
 
 /*
- * Look up an inode by number in the given file system.
- * The inode is looked up in the cache held in each AG.
- * If the inode is found in the cache, initialise the vfs inode
- * if necessary.
+ * Look up an inode by number in the given file system.  The inode is looked up
+ * in the cache held in each AG.  If the inode is found in the cache, initialise
+ * the vfs inode if necessary.
  *
- * If it is not in core, read it in from the file system's device,
- * add it to the cache and initialise the vfs inode.
+ * If it is not in core, read it in from the file system's device, add it to the
+ * cache and initialise the vfs inode.
  *
  * The inode is locked according to the value of the lock_flags parameter.
- * This flag parameter indicates how and if the inode's IO lock and inode lock
- * should be taken.
- *
- * mp -- the mount point structure for the current file system.  It points
- *       to the inode hash table.
- * tp -- a pointer to the current transaction if there is one.  This is
- *       simply passed through to the xfs_iread() call.
- * ino -- the number of the inode desired.  This is the unique identifier
- *        within the file system for the inode being requested.
- * lock_flags -- flags indicating how to lock the inode.  See the comment
- *		 for xfs_ilock() for a list of valid values.
+ * Inode lookup is only done during metadata operations and not as part of the
+ * data IO path. Hence we only allow locking of the XFS_ILOCK during lookup.
  */
 int
 xfs_iget(
-	xfs_mount_t	*mp,
-	xfs_trans_t	*tp,
-	xfs_ino_t	ino,
-	uint		flags,
-	uint		lock_flags,
-	xfs_inode_t	**ipp)
+	struct xfs_mount	*mp,
+	struct xfs_trans	*tp,
+	xfs_ino_t		ino,
+	uint			flags,
+	uint			lock_flags,
+	struct xfs_inode	**ipp)
 {
-	xfs_inode_t	*ip;
-	int		error;
-	xfs_perag_t	*pag;
-	xfs_agino_t	agino;
+	struct xfs_inode	*ip;
+	struct xfs_perag	*pag;
+	xfs_agino_t		agino;
+	int			error;
 
-	/*
-	 * xfs_reclaim_inode() uses the ILOCK to ensure an inode
-	 * doesn't get freed while it's being referenced during a
-	 * radix tree traversal here.  It assumes this function
-	 * aqcuires only the ILOCK (and therefore it has no need to
-	 * involve the IOLOCK in this synchronization).
-	 */
 	ASSERT((lock_flags & (XFS_IOLOCK_EXCL | XFS_IOLOCK_SHARED)) == 0);
 
 	/* reject inode numbers outside existing AGs */
@@ -758,15 +738,7 @@ xfs_inode_walk_ag_grab(
 
 	ASSERT(rcu_read_lock_held());
 
-	/*
-	 * check for stale RCU freed inode
-	 *
-	 * If the inode has been reallocated, it doesn't matter if it's not in
-	 * the AG we are walking - we are walking for writeback, so if it
-	 * passes all the "valid inode" checks and is dirty, then we'll write
-	 * it back anyway.  If it has been reallocated and still being
-	 * initialised, the XFS_INEW check below will catch it.
-	 */
+	/* Check for stale RCU freed inode */
 	spin_lock(&ip->i_flags_lock);
 	if (!ip->i_ino)
 		goto out_unlock_noent;
@@ -1052,43 +1024,16 @@ xfs_reclaim_inode_grab(
 }
 
 /*
- * Inodes in different states need to be treated differently. The following
- * table lists the inode states and the reclaim actions necessary:
- *
- *	inode state	     iflush ret		required action
- *      ---------------      ----------         ---------------
- *	bad			-		reclaim
- *	shutdown		EIO		unpin and reclaim
- *	clean, unpinned		0		reclaim
- *	stale, unpinned		0		reclaim
- *	clean, pinned(*)	0		requeue
- *	stale, pinned		EAGAIN		requeue
- *	dirty, async		-		requeue
- *	dirty, sync		0		reclaim
+ * Inode reclaim is non-blocking, so the default action if progress cannot be
+ * made is to "requeue" the inode for reclaim by unlocking it and clearing the
+ * XFS_IRECLAIM flag.  If we are in a shutdown state, we don't care about
+ * blocking anymore and hence we can wait for the inode to be able to reclaim
+ * it.
  *
- * (*) dgc: I don't think the clean, pinned state is possible but it gets
- * handled anyway given the order of checks implemented.
- *
- * Also, because we get the flush lock first, we know that any inode that has
- * been flushed delwri has had the flush completed by the time we check that
- * the inode is clean.
- *
- * Note that because the inode is flushed delayed write by AIL pushing, the
- * flush lock may already be held here and waiting on it can result in very
- * long latencies.  Hence for sync reclaims, where we wait on the flush lock,
- * the caller should push the AIL first before trying to reclaim inodes to
- * minimise the amount of time spent waiting.  For background relaim, we only
- * bother to reclaim clean inodes anyway.
- *
- * Hence the order of actions after gaining the locks should be:
- *	bad		=> reclaim
- *	shutdown	=> unpin and reclaim
- *	pinned, async	=> requeue
- *	pinned, sync	=> unpin
- *	stale		=> reclaim
- *	clean		=> reclaim
- *	dirty, async	=> requeue
- *	dirty, sync	=> flush, wait and reclaim
+ * We do no IO here - if callers require inodes to be cleaned they must push the
+ * AIL first to trigger writeback of dirty inodes.  This enables writeback to be
+ * done in the background in a non-blocking manner, and enables memory reclaim
+ * to make progress without blocking.
  */
 static bool
 xfs_reclaim_inode(
@@ -1294,13 +1239,11 @@ xfs_reclaim_inodes(
 }
 
 /*
- * Scan a certain number of inodes for reclaim.
- *
- * When called we make sure that there is a background (fast) inode reclaim in
- * progress, while we will throttle the speed of reclaim via doing synchronous
- * reclaim of inodes. That means if we come across dirty inodes, we wait for
- * them to be cleaned, which we hope will not be very long due to the
- * background walker having already kicked the IO off on those dirty inodes.
+ * The shrinker infrastructure determines how many inodes we should scan for
+ * reclaim. We want as many clean inodes ready to reclaim as possible, so we
+ * push the AIL here. We also want to proactively free up memory if we can to
+ * minimise the amount of work memory reclaim has to do so we kick the
+ * background reclaim if it isn't already scheduled.
  */
 long
 xfs_reclaim_inodes_nr(
@@ -1413,8 +1356,7 @@ xfs_inode_matches_eofb(
  * This is a fast pass over the inode cache to try to get reclaim moving on as
  * many inodes as possible in a short period of time. It kicks itself every few
  * seconds, as well as being kicked by the inode cache shrinker when memory
- * goes low. It scans as quickly as possible avoiding locked inodes or those
- * already being flushed, and once done schedules a future pass.
+ * goes low.
  */
 void
 xfs_reclaim_worker(
-- 
2.26.2.761.g0e0b3e54be


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 24/30] xfs: rework stale inodes in xfs_ifree_cluster
  2020-06-01 21:42 [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (22 preceding siblings ...)
  2020-06-01 21:42 ` [PATCH 23/30] xfs: clean up inode reclaim comments Dave Chinner
@ 2020-06-01 21:42 ` Dave Chinner
  2020-06-02 23:01   ` Darrick J. Wong
  2020-06-01 21:42 ` [PATCH 25/30] xfs: attach inodes to the cluster buffer when dirtied Dave Chinner
                   ` (5 subsequent siblings)
  29 siblings, 1 reply; 80+ messages in thread
From: Dave Chinner @ 2020-06-01 21:42 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Once we have inodes pinning the cluster buffer and attached whenever
they are dirty, we no longer have a guarantee that the items are
flush locked when we lock the cluster buffer. Hence we cannot just
walk the buffer log item list and modify the attached inodes.

If the inode is not flush locked, we have to ILOCK it first and
then flush lock it and do all the prerequisite checks needed to
avoid races with other code. This is already handled by
xfs_ifree_get_one_inode(), so rework the inode iteration loop and
function to update all inodes in cache whether they are attached to
the buffer or not.

Note: we also remove the copying of the log item lsn to the
ili_flush_lsn as xfs_iflush_done() now uses the XFS_ISTALE flag to
trigger aborts and so flush lsn matching is not needed in IO
completion for processing freed inodes.
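
The reworked per-inode logic can be summarised as a decision table
(a paraphrase of xfs_ifree_mark_inode_stale() in the diff below):

    /*
     * inode not in cache, or reallocated -> nothing to do
     * already XFS_ISTALE                 -> nothing to do
     * flush lock held (IO in progress)   -> already attached to the
     *                                       buffer; XFS_ISTALE makes
     *                                       IO completion abort it
     * clean                              -> mark stale, drop flush lock
     * dirty                              -> mark stale, move ili_fields
     *                                       to ili_last_fields, attach
     *                                       to bp->b_li_list
     */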

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_inode.c | 158 ++++++++++++++++++---------------------------
 1 file changed, 62 insertions(+), 96 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 272b54cf97000..fb4c614c64fda 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2517,17 +2517,19 @@ xfs_iunlink_remove(
 }
 
 /*
- * Look up the inode number specified and mark it stale if it is found. If it is
- * dirty, return the inode so it can be attached to the cluster buffer so it can
- * be processed appropriately when the cluster free transaction completes.
+ * Look up the inode number specified and if it is not already marked XFS_ISTALE
+ * mark it stale. We should only find clean inodes in this lookup that aren't
+ * already stale.
  */
-static struct xfs_inode *
-xfs_ifree_get_one_inode(
-	struct xfs_perag	*pag,
+static void
+xfs_ifree_mark_inode_stale(
+	struct xfs_buf		*bp,
 	struct xfs_inode	*free_ip,
 	xfs_ino_t		inum)
 {
-	struct xfs_mount	*mp = pag->pag_mount;
+	struct xfs_mount	*mp = bp->b_mount;
+	struct xfs_perag	*pag = bp->b_pag;
+	struct xfs_inode_log_item *iip;
 	struct xfs_inode	*ip;
 
 retry:
@@ -2535,8 +2537,10 @@ xfs_ifree_get_one_inode(
 	ip = radix_tree_lookup(&pag->pag_ici_root, XFS_INO_TO_AGINO(mp, inum));
 
 	/* Inode not in memory, nothing to do */
-	if (!ip)
-		goto out_rcu_unlock;
+	if (!ip) {
+		rcu_read_unlock();
+		return;
+	}
 
 	/*
 	 * because this is an RCU protected lookup, we could find a recently
@@ -2547,9 +2551,9 @@ xfs_ifree_get_one_inode(
 	spin_lock(&ip->i_flags_lock);
 	if (ip->i_ino != inum || __xfs_iflags_test(ip, XFS_ISTALE)) {
 		spin_unlock(&ip->i_flags_lock);
-		goto out_rcu_unlock;
+		rcu_read_unlock();
+		return;
 	}
-	spin_unlock(&ip->i_flags_lock);
 
 	/*
 	 * Don't try to lock/unlock the current inode, but we _cannot_ skip the
@@ -2559,43 +2563,53 @@ xfs_ifree_get_one_inode(
 	 */
 	if (ip != free_ip) {
 		if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL)) {
+			spin_unlock(&ip->i_flags_lock);
 			rcu_read_unlock();
 			delay(1);
 			goto retry;
 		}
-
-		/*
-		 * Check the inode number again in case we're racing with
-		 * freeing in xfs_reclaim_inode().  See the comments in that
-		 * function for more information as to why the initial check is
-		 * not sufficient.
-		 */
-		if (ip->i_ino != inum) {
-			xfs_iunlock(ip, XFS_ILOCK_EXCL);
-			goto out_rcu_unlock;
-		}
 	}
+	ip->i_flags |= XFS_ISTALE;
+	spin_unlock(&ip->i_flags_lock);
 	rcu_read_unlock();
 
-	xfs_iflock(ip);
-	xfs_iflags_set(ip, XFS_ISTALE);
+	/*
+	 * If we can't get the flush lock, the inode is already attached.  All
+	 * we needed to do here is mark the inode stale so buffer IO completion
+	 * will remove it from the AIL.
+	 */
+	iip = ip->i_itemp;
+	if (!xfs_iflock_nowait(ip)) {
+		ASSERT(!list_empty(&iip->ili_item.li_bio_list));
+		ASSERT(iip->ili_last_fields);
+		goto out_iunlock;
+	}
+	ASSERT(!iip || list_empty(&iip->ili_item.li_bio_list));
 
 	/*
-	 * We don't need to attach clean inodes or those only with unlogged
-	 * changes (which we throw away, anyway).
+	 * Clean inodes can be released immediately.  Everything else has to go
+	 * through xfs_iflush_abort() on journal commit as the flock
+	 * synchronises removal of the inode from the cluster buffer against
+	 * inode reclaim.
 	 */
-	if (!ip->i_itemp || xfs_inode_clean(ip)) {
-		ASSERT(ip != free_ip);
+	if (xfs_inode_clean(ip)) {
 		xfs_ifunlock(ip);
-		xfs_iunlock(ip, XFS_ILOCK_EXCL);
-		goto out_no_inode;
+		goto out_iunlock;
 	}
-	return ip;
 
-out_rcu_unlock:
-	rcu_read_unlock();
-out_no_inode:
-	return NULL;
+	/* we have a dirty inode in memory that has not yet been flushed. */
+	ASSERT(iip->ili_fields);
+	spin_lock(&iip->ili_lock);
+	iip->ili_last_fields = iip->ili_fields;
+	iip->ili_fields = 0;
+	iip->ili_fsync_fields = 0;
+	spin_unlock(&iip->ili_lock);
+	list_add_tail(&iip->ili_item.li_bio_list, &bp->b_li_list);
+	ASSERT(iip->ili_last_fields);
+
+out_iunlock:
+	if (ip != free_ip)
+		xfs_iunlock(ip, XFS_ILOCK_EXCL);
 }
 
 /*
@@ -2605,26 +2619,20 @@ xfs_ifree_get_one_inode(
  */
 STATIC int
 xfs_ifree_cluster(
-	xfs_inode_t		*free_ip,
-	xfs_trans_t		*tp,
+	struct xfs_inode	*free_ip,
+	struct xfs_trans	*tp,
 	struct xfs_icluster	*xic)
 {
-	xfs_mount_t		*mp = free_ip->i_mount;
+	struct xfs_mount	*mp = free_ip->i_mount;
+	struct xfs_ino_geometry	*igeo = M_IGEO(mp);
+	struct xfs_buf		*bp;
+	xfs_daddr_t		blkno;
+	xfs_ino_t		inum = xic->first_ino;
 	int			nbufs;
 	int			i, j;
 	int			ioffset;
-	xfs_daddr_t		blkno;
-	xfs_buf_t		*bp;
-	xfs_inode_t		*ip;
-	struct xfs_inode_log_item *iip;
-	struct xfs_log_item	*lip;
-	struct xfs_perag	*pag;
-	struct xfs_ino_geometry	*igeo = M_IGEO(mp);
-	xfs_ino_t		inum;
 	int			error;
 
-	inum = xic->first_ino;
-	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, inum));
 	nbufs = igeo->ialloc_blks / igeo->blocks_per_cluster;
 
 	for (j = 0; j < nbufs; j++, inum += igeo->inodes_per_cluster) {
@@ -2668,59 +2676,16 @@ xfs_ifree_cluster(
 		bp->b_ops = &xfs_inode_buf_ops;
 
 		/*
-		 * Walk the inodes already attached to the buffer and mark them
-		 * stale. These will all have the flush locks held, so an
-		 * in-memory inode walk can't lock them. By marking them all
-		 * stale first, we will not attempt to lock them in the loop
-		 * below as the XFS_ISTALE flag will be set.
-		 */
-		list_for_each_entry(lip, &bp->b_li_list, li_bio_list) {
-			if (lip->li_type == XFS_LI_INODE) {
-				iip = (struct xfs_inode_log_item *)lip;
-				xfs_trans_ail_copy_lsn(mp->m_ail,
-							&iip->ili_flush_lsn,
-							&iip->ili_item.li_lsn);
-				xfs_iflags_set(iip->ili_inode, XFS_ISTALE);
-			}
-		}
-
-
-		/*
-		 * For each inode in memory attempt to add it to the inode
-		 * buffer and set it up for being staled on buffer IO
-		 * completion.  This is safe as we've locked out tail pushing
-		 * and flushing by locking the buffer.
-		 *
-		 * We have already marked every inode that was part of a
-		 * transaction stale above, which means there is no point in
-		 * even trying to lock them.
+		 * Now we need to set all the cached clean inodes as XFS_ISTALE,
+		 * too. This requires lookups, and will skip inodes that we've
+		 * already marked XFS_ISTALE.
 		 */
-		for (i = 0; i < igeo->inodes_per_cluster; i++) {
-			ip = xfs_ifree_get_one_inode(pag, free_ip, inum + i);
-			if (!ip)
-				continue;
-
-			iip = ip->i_itemp;
-			spin_lock(&iip->ili_lock);
-			iip->ili_last_fields = iip->ili_fields;
-			iip->ili_fields = 0;
-			iip->ili_fsync_fields = 0;
-			spin_unlock(&iip->ili_lock);
-			xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,
-						&iip->ili_item.li_lsn);
-
-			list_add_tail(&iip->ili_item.li_bio_list,
-						&bp->b_li_list);
-
-			if (ip != free_ip)
-				xfs_iunlock(ip, XFS_ILOCK_EXCL);
-		}
+		for (i = 0; i < igeo->inodes_per_cluster; i++)
+			xfs_ifree_mark_inode_stale(bp, free_ip, inum + i);
 
 		xfs_trans_stale_inode_buf(tp, bp);
 		xfs_trans_binval(tp, bp);
 	}
-
-	xfs_perag_put(pag);
 	return 0;
 }
 
@@ -3845,6 +3810,7 @@ xfs_iflush_int(
 	iip->ili_fields = 0;
 	iip->ili_fsync_fields = 0;
 	spin_unlock(&iip->ili_lock);
+	ASSERT(iip->ili_last_fields);
 
 	/*
 	 * Store the current LSN of the inode so that we can tell whether the
-- 
2.26.2.761.g0e0b3e54be



* [PATCH 25/30] xfs: attach inodes to the cluster buffer when dirtied
  2020-06-01 21:42 [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (23 preceding siblings ...)
  2020-06-01 21:42 ` [PATCH 24/30] xfs: rework stale inodes in xfs_ifree_cluster Dave Chinner
@ 2020-06-01 21:42 ` Dave Chinner
  2020-06-02 23:03   ` Darrick J. Wong
  2020-06-01 21:42 ` [PATCH 26/30] xfs: xfs_iflush() is no longer necessary Dave Chinner
                   ` (4 subsequent siblings)
  29 siblings, 1 reply; 80+ messages in thread
From: Dave Chinner @ 2020-06-01 21:42 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Rather than attach inodes to the cluster buffer just when we are
doing IO, attach the inodes to the cluster buffer when they are
dirtied. This means the buffer always carries a list of dirty inodes
that reference it, and we can use that list to make more fundamental
changes to inode writeback that aren't otherwise possible.
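
As a side note for readers of the archive, the invariant this
establishes is small enough to model stand-alone. The sketch below is
a user-space illustration only - the struct names and the counter
standing in for b_li_list are made up, not the kernel types:

  #include <stdio.h>

  struct model_buf {
          int nrefs;              /* buffer reference count */
          int nr_attached;        /* stand-in for the b_li_list length */
  };

  struct model_log_item {
          struct model_buf *li_buf;       /* NULL until first dirtied */
  };

  static void model_log_inode(struct model_log_item *iip,
                              struct model_buf *bp)
  {
          if (iip->li_buf)        /* relogging: attachment already done */
                  return;
          bp->nrefs++;            /* xfs_buf_hold() */
          bp->nr_attached++;      /* list_add_tail(..., &bp->b_li_list) */
          iip->li_buf = bp;
  }

  int main(void)
  {
          struct model_buf bp = { .nrefs = 1 };
          struct model_log_item iip = { 0 };

          model_log_inode(&iip, &bp);     /* first dirtying attaches */
          model_log_inode(&iip, &bp);     /* relog: no extra reference */
          printf("refs=%d attached=%d\n", bp.nrefs, bp.nr_attached);
          return 0;
  }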

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_trans_inode.c |  9 ++++++---
 fs/xfs/xfs_buf_item.c           |  1 +
 fs/xfs/xfs_icache.c             |  1 +
 fs/xfs/xfs_inode.c              | 24 +++++-------------------
 fs/xfs/xfs_inode_item.c         | 14 ++++++++------
 5 files changed, 21 insertions(+), 28 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
index 1e7147b90725e..5e7634c13ce78 100644
--- a/fs/xfs/libxfs/xfs_trans_inode.c
+++ b/fs/xfs/libxfs/xfs_trans_inode.c
@@ -164,13 +164,16 @@ xfs_trans_log_inode(
 		/*
 		 * We need an explicit buffer reference for the log item but
 		 * don't want the buffer to remain attached to the transaction.
-		 * Hold the buffer but release the transaction reference.
+		 * Hold the buffer but release the transaction reference once
+		 * we've attached the inode log item to the buffer log item
+		 * list.
 		 */
 		xfs_buf_hold(bp);
-		xfs_trans_brelse(tp, bp);
-
 		spin_lock(&iip->ili_lock);
 		iip->ili_item.li_buf = bp;
+		bp->b_flags |= _XBF_INODES;
+		list_add_tail(&iip->ili_item.li_bio_list, &bp->b_li_list);
+		xfs_trans_brelse(tp, bp);
 	}
 
 	/*
diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 9739d64a46443..6e7a2d460a675 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -465,6 +465,7 @@ xfs_buf_item_unpin(
 		if (bip->bli_flags & XFS_BLI_STALE_INODE) {
 			xfs_buf_item_done(bp);
 			xfs_iflush_done(bp);
+			ASSERT(list_empty(&bp->b_li_list));
 		} else {
 			xfs_trans_ail_delete(lip, SHUTDOWN_LOG_IO_ERROR);
 			xfs_buf_item_relse(bp);
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 4fe6f250e8448..ed386bc930977 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -115,6 +115,7 @@ __xfs_inode_free(
 {
 	/* asserts to verify all state is correct here */
 	ASSERT(atomic_read(&ip->i_pincount) == 0);
+	ASSERT(!ip->i_itemp || list_empty(&ip->i_itemp->ili_item.li_bio_list));
 	XFS_STATS_DEC(ip->i_mount, vn_active);
 
 	call_rcu(&VFS_I(ip)->i_rcu, xfs_inode_free_callback);
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index fb4c614c64fda..af65acd24ec4e 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2584,27 +2584,24 @@ xfs_ifree_mark_inode_stale(
 		ASSERT(iip->ili_last_fields);
 		goto out_iunlock;
 	}
-	ASSERT(!iip || list_empty(&iip->ili_item.li_bio_list));
 
 	/*
-	 * Clean inodes can be released immediately.  Everything else has to go
-	 * through xfs_iflush_abort() on journal commit as the flock
-	 * synchronises removal of the inode from the cluster buffer against
-	 * inode reclaim.
+	 * Inodes not attached to the buffer can be released immediately.
+	 * Everything else has to go through xfs_iflush_abort() on journal
+	 * commit as the flock synchronises removal of the inode from the
+	 * cluster buffer against inode reclaim.
 	 */
-	if (xfs_inode_clean(ip)) {
+	if (!iip || list_empty(&iip->ili_item.li_bio_list)) {
 		xfs_ifunlock(ip);
 		goto out_iunlock;
 	}
 
 	/* we have a dirty inode in memory that has not yet been flushed. */
-	ASSERT(iip->ili_fields);
 	spin_lock(&iip->ili_lock);
 	iip->ili_last_fields = iip->ili_fields;
 	iip->ili_fields = 0;
 	iip->ili_fsync_fields = 0;
 	spin_unlock(&iip->ili_lock);
-	list_add_tail(&iip->ili_item.li_bio_list, &bp->b_li_list);
 	ASSERT(iip->ili_last_fields);
 
 out_iunlock:
@@ -3819,19 +3816,8 @@ xfs_iflush_int(
 	xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,
 				&iip->ili_item.li_lsn);
 
-	/*
-	 * Attach the inode item callback to the buffer whether the flush
-	 * succeeded or not. If not, the caller will shut down and fail I/O
-	 * completion on the buffer to remove the inode from the AIL and release
-	 * the flush lock.
-	 */
-	bp->b_flags |= _XBF_INODES;
-	list_add_tail(&iip->ili_item.li_bio_list, &bp->b_li_list);
-
 	/* generate the checksum. */
 	xfs_dinode_calc_crc(mp, dip);
-
-	ASSERT(!list_empty(&bp->b_li_list));
 	return error;
 }
 
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 0a7720b7a821a..66675b75de3ec 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -665,10 +665,7 @@ xfs_inode_item_destroy(
  *
  * Note: Now that we attach the log item to the buffer when we first log the
  * inode in memory, we can have unflushed inodes on the buffer list here. These
- * inodes will have a zero ili_last_fields, so skip over them here. We do
- * this check -after- we've checked for stale inodes, because we're guaranteed
- * to have XFS_ISTALE set in the case that dirty inodes are in the CIL and have
- * not yet had their dirtying transactions committed to disk.
+ * inodes will have a zero ili_last_fields, so skip over them here.
  */
 void
 xfs_iflush_done(
@@ -686,8 +683,8 @@ xfs_iflush_done(
 	 */
 	list_for_each_entry_safe(lip, n, &bp->b_li_list, li_bio_list) {
 		iip = INODE_ITEM(lip);
+
 		if (xfs_iflags_test(iip->ili_inode, XFS_ISTALE)) {
-			list_del_init(&lip->li_bio_list);
 			xfs_iflush_abort(iip->ili_inode);
 			continue;
 		}
@@ -740,12 +737,16 @@ xfs_iflush_done(
 		/*
 		 * Remove the reference to the cluster buffer if the inode is
 		 * clean in memory. Drop the buffer reference once we've dropped
-		 * the locks we hold.
+		 * the locks we hold. If the inode is dirty in memory, we need
+		 * to put the inode item back on the buffer list for another
+		 * pass through the flush machinery.
 		 */
 		ASSERT(iip->ili_item.li_buf == bp);
 		if (!iip->ili_fields) {
 			iip->ili_item.li_buf = NULL;
 			drop_buffer = true;
+		} else {
+			list_add(&lip->li_bio_list, &bp->b_li_list);
 		}
 		iip->ili_last_fields = 0;
 		iip->ili_flush_lsn = 0;
@@ -789,6 +790,7 @@ xfs_iflush_abort(
 		iip->ili_flush_lsn = 0;
 		bp = iip->ili_item.li_buf;
 		iip->ili_item.li_buf = NULL;
+		list_del_init(&iip->ili_item.li_bio_list);
 		spin_unlock(&iip->ili_lock);
 	}
 	xfs_ifunlock(ip);
-- 
2.26.2.761.g0e0b3e54be



* [PATCH 26/30] xfs: xfs_iflush() is no longer necessary
  2020-06-01 21:42 [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (24 preceding siblings ...)
  2020-06-01 21:42 ` [PATCH 25/30] xfs: attach inodes to the cluster buffer when dirtied Dave Chinner
@ 2020-06-01 21:42 ` Dave Chinner
  2020-06-01 21:42 ` [PATCH 27/30] xfs: rename xfs_iflush_int() Dave Chinner
                   ` (3 subsequent siblings)
  29 siblings, 0 replies; 80+ messages in thread
From: Dave Chinner @ 2020-06-01 21:42 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Now that we have a cached buffer on inode log items, we don't need
to do buffer lookups when flushing inodes anymore - all we need
to do is lock the buffer and we are ready to go.

This largely gets rid of the need for xfs_iflush(), which is
essentially just a mechanism to look up the buffer and flush the
inode to it. Instead, we can just call xfs_iflush_cluster() with a
few modifications to ensure it also flushes the inode we already
hold locked.

This allows the AIL inode item pushing to be almost entirely
non-blocking in XFS - we won't block unless memory allocation
for the cluster inode lookup blocks or the block device queues are
full.

Writeback during inode reclaim becomes a little more complex because
we now have to lock the buffer ourselves, but otherwise this change
is largely a functional no-op that removes a whole lot of code.
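
For a bird's eye view, the reworked push path reduces to a chain of
non-blocking checks. This is an illustrative stand-alone model - the
booleans stand in for the real locks, flags and counters, not the
kernel API:

  #include <stdbool.h>
  #include <stdio.h>

  enum push_rc { PUSH_SUCCESS, PUSH_PINNED, PUSH_LOCKED, PUSH_FLUSHING };

  struct push_state {
          bool inode_pinned;      /* xfs_ipincount(ip) > 0 */
          bool buf_pinned;        /* xfs_buf_ispinned(bp) */
          bool stale;             /* XFS_ISTALE */
          bool flush_locked;      /* xfs_isiflocked(ip) */
          bool buf_trylock_ok;    /* xfs_buf_trylock(bp) succeeded */
          bool on_delwri_queue;   /* _XBF_DELWRI_Q already set */
  };

  /* every case that could block becomes an early return instead */
  static enum push_rc model_inode_item_push(const struct push_state *s)
  {
          if (s->inode_pinned || s->buf_pinned || s->stale)
                  return PUSH_PINNED;     /* needs a log force first */
          if (s->flush_locked)
                  return PUSH_FLUSHING;   /* flush already in progress */
          if (!s->buf_trylock_ok)
                  return PUSH_LOCKED;     /* never wait for the buffer */
          if (s->on_delwri_queue)
                  return PUSH_FLUSHING;   /* writeback already queued */
          return PUSH_SUCCESS;    /* xfs_iflush_cluster() + delwri queue */
  }

  int main(void)
  {
          struct push_state s = { .buf_trylock_ok = true };
          printf("rc=%d\n", model_inode_item_push(&s)); /* 0: success */
          return 0;
  }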

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_inode.c      | 106 ++++++----------------------------------
 fs/xfs/xfs_inode.h      |   2 +-
 fs/xfs/xfs_inode_item.c |  54 +++++++++-----------
 3 files changed, 37 insertions(+), 125 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index af65acd24ec4e..61c872e4ee157 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3450,7 +3450,18 @@ xfs_rename(
 	return error;
 }
 
-STATIC int
+/*
+ * Non-blocking flush of dirty inode metadata into the backing buffer.
+ *
+ * The caller must have a reference to the inode and hold the cluster buffer
+ * locked. The function will walk across all the inodes on the cluster buffer it
+ * can find and lock without blocking, and flush them to the cluster buffer.
+ *
+ * On success, the caller must write out the buffer returned in *bp and
+ * release it. On failure, the filesystem will be shut down, the buffer will
+ * have been unlocked and released, and EFSCORRUPTED will be returned.
+ */
+int
 xfs_iflush_cluster(
 	struct xfs_inode	*ip,
 	struct xfs_buf		*bp)
@@ -3485,8 +3496,6 @@ xfs_iflush_cluster(
 
 	for (i = 0; i < nr_found; i++) {
 		cip = cilist[i];
-		if (cip == ip)
-			continue;
 
 		/*
 		 * because this is an RCU protected lookup, we could find a
@@ -3577,99 +3586,11 @@ xfs_iflush_cluster(
 	kmem_free(cilist);
 out_put:
 	xfs_perag_put(pag);
-	return error;
-}
-
-/*
- * Flush dirty inode metadata into the backing buffer.
- *
- * The caller must have the inode lock and the inode flush lock held.  The
- * inode lock will still be held upon return to the caller, and the inode
- * flush lock will be released after the inode has reached the disk.
- *
- * The caller must write out the buffer returned in *bpp and release it.
- */
-int
-xfs_iflush(
-	struct xfs_inode	*ip,
-	struct xfs_buf		**bpp)
-{
-	struct xfs_mount	*mp = ip->i_mount;
-	struct xfs_buf		*bp = NULL;
-	struct xfs_dinode	*dip;
-	int			error;
-
-	XFS_STATS_INC(mp, xs_iflush_count);
-
-	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL|XFS_ILOCK_SHARED));
-	ASSERT(xfs_isiflocked(ip));
-	ASSERT(ip->i_df.if_format != XFS_DINODE_FMT_BTREE ||
-	       ip->i_df.if_nextents > XFS_IFORK_MAXEXT(ip, XFS_DATA_FORK));
-
-	*bpp = NULL;
-
-	xfs_iunpin_wait(ip);
-
-	/*
-	 * For stale inodes we cannot rely on the backing buffer remaining
-	 * stale in cache for the remaining life of the stale inode and so
-	 * xfs_imap_to_bp() below may give us a buffer that no longer contains
-	 * inodes below. We have to check this after ensuring the inode is
-	 * unpinned so that it is safe to reclaim the stale inode after the
-	 * flush call.
-	 */
-	if (xfs_iflags_test(ip, XFS_ISTALE)) {
-		xfs_ifunlock(ip);
-		return 0;
-	}
-
-	/*
-	 * Get the buffer containing the on-disk inode. We are doing a try-lock
-	 * operation here, so we may get an EAGAIN error. In that case, return
-	 * leaving the inode dirty.
-	 *
-	 * If we get any other error, we effectively have a corruption situation
-	 * and we cannot flush the inode. Abort the flush and shut down.
-	 */
-	error = xfs_imap_to_bp(mp, NULL, &ip->i_imap, &dip, &bp, XBF_TRYLOCK);
-	if (error == -EAGAIN) {
-		xfs_ifunlock(ip);
-		return error;
-	}
-	if (error)
-		goto abort;
-
-	/*
-	 * If the buffer is pinned then push on the log now so we won't
-	 * get stuck waiting in the write for too long.
-	 */
-	if (xfs_buf_ispinned(bp))
-		xfs_log_force(mp, 0);
-
-	/*
-	 * Flush the provided inode then attempt to gather others from the
-	 * cluster into the write.
-	 *
-	 * Note: Once we attempt to flush an inode, we must run buffer
-	 * completion callbacks on any failure. If this fails, simulate an I/O
-	 * failure on the buffer and shut down.
-	 */
-	error = xfs_iflush_int(ip, bp);
-	if (!error)
-		error = xfs_iflush_cluster(ip, bp);
 	if (error) {
 		bp->b_flags |= XBF_ASYNC;
 		xfs_buf_ioend_fail(bp);
-		goto shutdown;
+		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
 	}
-
-	*bpp = bp;
-	return 0;
-
-abort:
-	xfs_iflush_abort(ip);
-shutdown:
-	xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
 	return error;
 }
 
@@ -3688,6 +3609,7 @@ xfs_iflush_int(
 	ASSERT(ip->i_df.if_format != XFS_DINODE_FMT_BTREE ||
 	       ip->i_df.if_nextents > XFS_IFORK_MAXEXT(ip, XFS_DATA_FORK));
 	ASSERT(iip != NULL && iip->ili_fields != 0);
+	ASSERT(iip->ili_item.li_buf == bp);
 
 	dip = xfs_buf_offset(bp, ip->i_imap.im_boffset);
 
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index dadcf19458960..d1109eb13ba2e 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -427,7 +427,7 @@ int		xfs_log_force_inode(struct xfs_inode *ip);
 void		xfs_iunpin_wait(xfs_inode_t *);
 #define xfs_ipincount(ip)	((unsigned int) atomic_read(&ip->i_pincount))
 
-int		xfs_iflush(struct xfs_inode *, struct xfs_buf **);
+int		xfs_iflush_cluster(struct xfs_inode *, struct xfs_buf *);
 void		xfs_lock_two_inodes(struct xfs_inode *ip0, uint ip0_mode,
 				struct xfs_inode *ip1, uint ip1_mode);
 
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 66675b75de3ec..e679fac944725 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -487,53 +487,42 @@ xfs_inode_item_push(
 	uint			rval = XFS_ITEM_SUCCESS;
 	int			error;
 
-	if (xfs_ipincount(ip) > 0)
+	ASSERT(iip->ili_item.li_buf);
+
+	if (xfs_ipincount(ip) > 0 || xfs_buf_ispinned(bp) ||
+	    (ip->i_flags & XFS_ISTALE))
 		return XFS_ITEM_PINNED;
 
-	if (!xfs_ilock_nowait(ip, XFS_ILOCK_SHARED))
-		return XFS_ITEM_LOCKED;
+	/* If the inode is already flush locked, we're already flushing. */
+	if (xfs_isiflocked(ip))
+		return XFS_ITEM_FLUSHING;
 
-	/*
-	 * Re-check the pincount now that we stabilized the value by
-	 * taking the ilock.
-	 */
-	if (xfs_ipincount(ip) > 0) {
-		rval = XFS_ITEM_PINNED;
-		goto out_unlock;
-	}
+	if (!xfs_buf_trylock(bp))
+		return XFS_ITEM_LOCKED;
 
-	/*
-	 * Stale inode items should force out the iclog.
-	 */
-	if (ip->i_flags & XFS_ISTALE) {
-		rval = XFS_ITEM_PINNED;
-		goto out_unlock;
+	if (bp->b_flags & _XBF_DELWRI_Q) {
+		xfs_buf_unlock(bp);
+		return XFS_ITEM_FLUSHING;
 	}
+	spin_unlock(&lip->li_ailp->ail_lock);
 
 	/*
-	 * Someone else is already flushing the inode.  Nothing we can do
-	 * here but wait for the flush to finish and remove the item from
-	 * the AIL.
+	 * We need to hold a reference for flushing the cluster buffer as it may
+	 * fail the buffer without IO submission. In which case, we better get a
+	 * reference for that completion because otherwise we don't get a
+	 * reference for IO until we queue the buffer for delwri submission.
 	 */
-	if (!xfs_iflock_nowait(ip)) {
-		rval = XFS_ITEM_FLUSHING;
-		goto out_unlock;
-	}
-
-	ASSERT(iip->ili_fields != 0 || XFS_FORCED_SHUTDOWN(ip->i_mount));
-	spin_unlock(&lip->li_ailp->ail_lock);
-
-	error = xfs_iflush(ip, &bp);
+	xfs_buf_hold(bp);
+	error = xfs_iflush_cluster(ip, bp);
 	if (!error) {
 		if (!xfs_buf_delwri_queue(bp, buffer_list))
 			rval = XFS_ITEM_FLUSHING;
 		xfs_buf_relse(bp);
-	} else if (error == -EAGAIN)
+	} else {
 		rval = XFS_ITEM_LOCKED;
+	}
 
 	spin_lock(&lip->li_ailp->ail_lock);
-out_unlock:
-	xfs_iunlock(ip, XFS_ILOCK_SHARED);
 	return rval;
 }
 
@@ -550,6 +539,7 @@ xfs_inode_item_release(
 
 	ASSERT(ip->i_itemp != NULL);
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
+	ASSERT(lip->li_buf || !test_bit(XFS_LI_DIRTY, &lip->li_flags));
 
 	lock_flags = iip->ili_lock_flags;
 	iip->ili_lock_flags = 0;
-- 
2.26.2.761.g0e0b3e54be



* [PATCH 27/30] xfs: rename xfs_iflush_int()
  2020-06-01 21:42 [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (25 preceding siblings ...)
  2020-06-01 21:42 ` [PATCH 26/30] xfs: xfs_iflush() is no longer necessary Dave Chinner
@ 2020-06-01 21:42 ` Dave Chinner
  2020-06-01 21:42 ` [PATCH 28/30] xfs: rework xfs_iflush_cluster() dirty inode iteration Dave Chinner
                   ` (2 subsequent siblings)
  29 siblings, 0 replies; 80+ messages in thread
From: Dave Chinner @ 2020-06-01 21:42 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

With xfs_iflush() gone, we can rename xfs_iflush_int() back to
xfs_iflush(). Also move it up above xfs_iflush_cluster() so we don't
need the forward declaration any more.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_inode.c | 293 ++++++++++++++++++++++-----------------------
 1 file changed, 146 insertions(+), 147 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 61c872e4ee157..8566bd0f4334d 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -44,7 +44,6 @@ kmem_zone_t *xfs_inode_zone;
  */
 #define	XFS_ITRUNC_MAX_EXTENTS	2
 
-STATIC int xfs_iflush_int(struct xfs_inode *, struct xfs_buf *);
 STATIC int xfs_iunlink(struct xfs_trans *, struct xfs_inode *);
 STATIC int xfs_iunlink_remove(struct xfs_trans *, struct xfs_inode *);
 
@@ -3450,152 +3449,8 @@ xfs_rename(
 	return error;
 }
 
-/*
- * Non-blocking flush of dirty inode metadata into the backing buffer.
- *
- * The caller must have a reference to the inode and hold the cluster buffer
- * locked. The function will walk across all the inodes on the cluster buffer it
- * can find and lock without blocking, and flush them to the cluster buffer.
- *
- * On success, the caller must write out the buffer returned in *bp and
- * release it. On failure, the filesystem will be shut down, the buffer will
- * have been unlocked and released, and EFSCORRUPTED will be returned.
- */
-int
-xfs_iflush_cluster(
-	struct xfs_inode	*ip,
-	struct xfs_buf		*bp)
-{
-	struct xfs_mount	*mp = ip->i_mount;
-	struct xfs_perag	*pag;
-	unsigned long		first_index, mask;
-	int			cilist_size;
-	struct xfs_inode	**cilist;
-	struct xfs_inode	*cip;
-	struct xfs_ino_geometry	*igeo = M_IGEO(mp);
-	int			error = 0;
-	int			nr_found;
-	int			clcount = 0;
-	int			i;
-
-	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
-
-	cilist_size = igeo->inodes_per_cluster * sizeof(struct xfs_inode *);
-	cilist = kmem_alloc(cilist_size, KM_MAYFAIL|KM_NOFS);
-	if (!cilist)
-		goto out_put;
-
-	mask = ~(igeo->inodes_per_cluster - 1);
-	first_index = XFS_INO_TO_AGINO(mp, ip->i_ino) & mask;
-	rcu_read_lock();
-	/* really need a gang lookup range call here */
-	nr_found = radix_tree_gang_lookup(&pag->pag_ici_root, (void**)cilist,
-					first_index, igeo->inodes_per_cluster);
-	if (nr_found == 0)
-		goto out_free;
-
-	for (i = 0; i < nr_found; i++) {
-		cip = cilist[i];
-
-		/*
-		 * because this is an RCU protected lookup, we could find a
-		 * recently freed or even reallocated inode during the lookup.
-		 * We need to check under the i_flags_lock for a valid inode
-		 * here. Skip it if it is not valid or the wrong inode.
-		 */
-		spin_lock(&cip->i_flags_lock);
-		if (!cip->i_ino ||
-		    __xfs_iflags_test(cip, XFS_ISTALE)) {
-			spin_unlock(&cip->i_flags_lock);
-			continue;
-		}
-
-		/*
-		 * Once we fall off the end of the cluster, no point checking
-		 * any more inodes in the list because they will also all be
-		 * outside the cluster.
-		 */
-		if ((XFS_INO_TO_AGINO(mp, cip->i_ino) & mask) != first_index) {
-			spin_unlock(&cip->i_flags_lock);
-			break;
-		}
-		spin_unlock(&cip->i_flags_lock);
-
-		/*
-		 * Do an un-protected check to see if the inode is dirty and
-		 * is a candidate for flushing.  These checks will be repeated
-		 * later after the appropriate locks are acquired.
-		 */
-		if (xfs_inode_clean(cip) && xfs_ipincount(cip) == 0)
-			continue;
-
-		/*
-		 * Try to get locks.  If any are unavailable or it is pinned,
-		 * then this inode cannot be flushed and is skipped.
-		 */
-
-		if (!xfs_ilock_nowait(cip, XFS_ILOCK_SHARED))
-			continue;
-		if (!xfs_iflock_nowait(cip)) {
-			xfs_iunlock(cip, XFS_ILOCK_SHARED);
-			continue;
-		}
-		if (xfs_ipincount(cip)) {
-			xfs_ifunlock(cip);
-			xfs_iunlock(cip, XFS_ILOCK_SHARED);
-			continue;
-		}
-
-
-		/*
-		 * Check the inode number again, just to be certain we are not
-		 * racing with freeing in xfs_reclaim_inode(). See the comments
-		 * in that function for more information as to why the initial
-		 * check is not sufficient.
-		 */
-		if (!cip->i_ino) {
-			xfs_ifunlock(cip);
-			xfs_iunlock(cip, XFS_ILOCK_SHARED);
-			continue;
-		}
-
-		/*
-		 * arriving here means that this inode can be flushed.  First
-		 * re-check that it's dirty before flushing.
-		 */
-		if (!xfs_inode_clean(cip)) {
-			error = xfs_iflush_int(cip, bp);
-			if (error) {
-				xfs_iunlock(cip, XFS_ILOCK_SHARED);
-				goto out_free;
-			}
-			clcount++;
-		} else {
-			xfs_ifunlock(cip);
-		}
-		xfs_iunlock(cip, XFS_ILOCK_SHARED);
-	}
-
-	if (clcount) {
-		XFS_STATS_INC(mp, xs_icluster_flushcnt);
-		XFS_STATS_ADD(mp, xs_icluster_flushinode, clcount);
-	}
-
-out_free:
-	rcu_read_unlock();
-	kmem_free(cilist);
-out_put:
-	xfs_perag_put(pag);
-	if (error) {
-		bp->b_flags |= XBF_ASYNC;
-		xfs_buf_ioend_fail(bp);
-		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
-	}
-	return error;
-}
-
-STATIC int
-xfs_iflush_int(
+static int
+xfs_iflush(
 	struct xfs_inode	*ip,
 	struct xfs_buf		*bp)
 {
@@ -3743,6 +3598,150 @@ xfs_iflush_int(
 	return error;
 }
 
+/*
+ * Non-blocking flush of dirty inode metadata into the backing buffer.
+ *
+ * The caller must have a reference to the inode and hold the cluster buffer
+ * locked. The function will walk across all the inodes on the cluster buffer it
+ * can find and lock without blocking, and flush them to the cluster buffer.
+ *
+ * On success, the caller must write out the buffer returned in *bp and
+ * release it. On failure, the filesystem will be shut down, the buffer will
+ * have been unlocked and released, and EFSCORRUPTED will be returned.
+ */
+int
+xfs_iflush_cluster(
+	struct xfs_inode	*ip,
+	struct xfs_buf		*bp)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_perag	*pag;
+	unsigned long		first_index, mask;
+	int			cilist_size;
+	struct xfs_inode	**cilist;
+	struct xfs_inode	*cip;
+	struct xfs_ino_geometry	*igeo = M_IGEO(mp);
+	int			error = 0;
+	int			nr_found;
+	int			clcount = 0;
+	int			i;
+
+	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
+
+	cilist_size = igeo->inodes_per_cluster * sizeof(struct xfs_inode *);
+	cilist = kmem_alloc(cilist_size, KM_MAYFAIL|KM_NOFS);
+	if (!cilist)
+		goto out_put;
+
+	mask = ~(igeo->inodes_per_cluster - 1);
+	first_index = XFS_INO_TO_AGINO(mp, ip->i_ino) & mask;
+	rcu_read_lock();
+	/* really need a gang lookup range call here */
+	nr_found = radix_tree_gang_lookup(&pag->pag_ici_root, (void**)cilist,
+					first_index, igeo->inodes_per_cluster);
+	if (nr_found == 0)
+		goto out_free;
+
+	for (i = 0; i < nr_found; i++) {
+		cip = cilist[i];
+
+		/*
+		 * because this is an RCU protected lookup, we could find a
+		 * recently freed or even reallocated inode during the lookup.
+		 * We need to check under the i_flags_lock for a valid inode
+		 * here. Skip it if it is not valid or the wrong inode.
+		 */
+		spin_lock(&cip->i_flags_lock);
+		if (!cip->i_ino ||
+		    __xfs_iflags_test(cip, XFS_ISTALE)) {
+			spin_unlock(&cip->i_flags_lock);
+			continue;
+		}
+
+		/*
+		 * Once we fall off the end of the cluster, no point checking
+		 * any more inodes in the list because they will also all be
+		 * outside the cluster.
+		 */
+		if ((XFS_INO_TO_AGINO(mp, cip->i_ino) & mask) != first_index) {
+			spin_unlock(&cip->i_flags_lock);
+			break;
+		}
+		spin_unlock(&cip->i_flags_lock);
+
+		/*
+		 * Do an un-protected check to see if the inode is dirty and
+		 * is a candidate for flushing.  These checks will be repeated
+		 * later after the appropriate locks are acquired.
+		 */
+		if (xfs_inode_clean(cip) && xfs_ipincount(cip) == 0)
+			continue;
+
+		/*
+		 * Try to get locks.  If any are unavailable or it is pinned,
+		 * then this inode cannot be flushed and is skipped.
+		 */
+
+		if (!xfs_ilock_nowait(cip, XFS_ILOCK_SHARED))
+			continue;
+		if (!xfs_iflock_nowait(cip)) {
+			xfs_iunlock(cip, XFS_ILOCK_SHARED);
+			continue;
+		}
+		if (xfs_ipincount(cip)) {
+			xfs_ifunlock(cip);
+			xfs_iunlock(cip, XFS_ILOCK_SHARED);
+			continue;
+		}
+
+
+		/*
+		 * Check the inode number again, just to be certain we are not
+		 * racing with freeing in xfs_reclaim_inode(). See the comments
+		 * in that function for more information as to why the initial
+		 * check is not sufficient.
+		 */
+		if (!cip->i_ino) {
+			xfs_ifunlock(cip);
+			xfs_iunlock(cip, XFS_ILOCK_SHARED);
+			continue;
+		}
+
+		/*
+		 * arriving here means that this inode can be flushed.  First
+		 * re-check that it's dirty before flushing.
+		 */
+		if (!xfs_inode_clean(cip)) {
+			error = xfs_iflush(cip, bp);
+			if (error) {
+				xfs_iunlock(cip, XFS_ILOCK_SHARED);
+				goto out_free;
+			}
+			clcount++;
+		} else {
+			xfs_ifunlock(cip);
+		}
+		xfs_iunlock(cip, XFS_ILOCK_SHARED);
+	}
+
+	if (clcount) {
+		XFS_STATS_INC(mp, xs_icluster_flushcnt);
+		XFS_STATS_ADD(mp, xs_icluster_flushinode, clcount);
+	}
+
+out_free:
+	rcu_read_unlock();
+	kmem_free(cilist);
+out_put:
+	xfs_perag_put(pag);
+	if (error) {
+		bp->b_flags |= XBF_ASYNC;
+		xfs_buf_ioend_fail(bp);
+		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
+	}
+	return error;
+}
+
 /* Release an inode. */
 void
 xfs_irele(
-- 
2.26.2.761.g0e0b3e54be



* [PATCH 28/30] xfs: rework xfs_iflush_cluster() dirty inode iteration
  2020-06-01 21:42 [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (26 preceding siblings ...)
  2020-06-01 21:42 ` [PATCH 27/30] xfs: rename xfs_iflush_int() Dave Chinner
@ 2020-06-01 21:42 ` Dave Chinner
  2020-06-02 23:23   ` Darrick J. Wong
  2020-06-01 21:42 ` [PATCH 29/30] xfs: factor xfs_iflush_done Dave Chinner
  2020-06-01 21:42 ` [PATCH 30/30] xfs: remove xfs_inobp_check() Dave Chinner
  29 siblings, 1 reply; 80+ messages in thread
From: Dave Chinner @ 2020-06-01 21:42 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Now that we have all the dirty inodes attached to the cluster
buffer, we don't actually have to do radix tree lookups to find
them. Sure, the radix tree is efficient, but walking a linked list
of just the dirty inodes attached to the buffer is much better.

We are also no longer dependent on having a locked inode passed into
the function to determine where to start the lookup. This means we
can drop it from the function call and treat all inodes the same.

We also make xfs_iflush_cluster skip inodes marked with
XFS_IRECLAIM. This way we avoid races with inodes that reclaim is
actively referencing or are being re-initialised by inode lookup. If
they are actually dirty, they'll get written by a future cluster
flush....

We also add a shutdown check after obtaining the flush lock so that
we catch inodes that are dirty in memory and may have inconsistent
state due to the shutdown in progress. We abort these inodes
directly, so they remove themselves from the buffer list and the AIL
rather than waiting for the buffer to be failed and the completion
callbacks to run.
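
To make the shape of the new loop concrete, here is a stand-alone
model of it - an array in place of the b_li_list walk and booleans in
place of the flags and locks, purely illustrative:

  #include <stdbool.h>
  #include <stddef.h>
  #include <stdio.h>

  struct model_inode {
          bool in_reclaim;        /* XFS_IRECLAIM */
          bool flush_locked;      /* XFS_IFLOCK */
          bool pinned;            /* xfs_ipincount() > 0 */
          bool clean;             /* xfs_inode_clean() */
  };

  /* walk only the dirty inodes already attached to the buffer */
  static int model_iflush_cluster(struct model_inode *attached, size_t n)
  {
          int clcount = 0;

          for (size_t i = 0; i < n; i++) {
                  struct model_inode *ip = &attached[i];

                  /* quick checks; the patch repeats them under locks */
                  if (ip->in_reclaim || ip->flush_locked || ip->pinned)
                          continue;
                  if (!ip->clean)
                          clcount++;      /* xfs_iflush(ip, bp) */
          }
          return clcount;
  }

  int main(void)
  {
          struct model_inode list[] = {
                  { .clean = false },     /* flushed */
                  { .in_reclaim = true }, /* skipped */
                  { .clean = true },      /* nothing to do */
          };
          printf("flushed %d inode(s)\n",
                 model_iflush_cluster(list, 3));
          return 0;
  }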

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_inode.c      | 148 ++++++++++++++++------------------------
 fs/xfs/xfs_inode.h      |   2 +-
 fs/xfs/xfs_inode_item.c |   2 +-
 3 files changed, 62 insertions(+), 90 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 8566bd0f4334d..931a483d5b316 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3611,117 +3611,94 @@ xfs_iflush(
  */
 int
 xfs_iflush_cluster(
-	struct xfs_inode	*ip,
 	struct xfs_buf		*bp)
 {
-	struct xfs_mount	*mp = ip->i_mount;
-	struct xfs_perag	*pag;
-	unsigned long		first_index, mask;
-	int			cilist_size;
-	struct xfs_inode	**cilist;
-	struct xfs_inode	*cip;
-	struct xfs_ino_geometry	*igeo = M_IGEO(mp);
-	int			error = 0;
-	int			nr_found;
+	struct xfs_mount	*mp = bp->b_mount;
+	struct xfs_log_item	*lip, *n;
+	struct xfs_inode	*ip;
+	struct xfs_inode_log_item *iip;
 	int			clcount = 0;
-	int			i;
-
-	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
-
-	cilist_size = igeo->inodes_per_cluster * sizeof(struct xfs_inode *);
-	cilist = kmem_alloc(cilist_size, KM_MAYFAIL|KM_NOFS);
-	if (!cilist)
-		goto out_put;
-
-	mask = ~(igeo->inodes_per_cluster - 1);
-	first_index = XFS_INO_TO_AGINO(mp, ip->i_ino) & mask;
-	rcu_read_lock();
-	/* really need a gang lookup range call here */
-	nr_found = radix_tree_gang_lookup(&pag->pag_ici_root, (void**)cilist,
-					first_index, igeo->inodes_per_cluster);
-	if (nr_found == 0)
-		goto out_free;
+	int			error = 0;
 
-	for (i = 0; i < nr_found; i++) {
-		cip = cilist[i];
+	/*
+	 * We must use the safe variant here as on shutdown xfs_iflush_abort()
+	 * can remove itself from the list.
+	 */
+	list_for_each_entry_safe(lip, n, &bp->b_li_list, li_bio_list) {
+		iip = (struct xfs_inode_log_item *)lip;
+		ip = iip->ili_inode;
 
 		/*
-		 * because this is an RCU protected lookup, we could find a
-		 * recently freed or even reallocated inode during the lookup.
-		 * We need to check under the i_flags_lock for a valid inode
-		 * here. Skip it if it is not valid or the wrong inode.
+		 * Quick and dirty check to avoid locks if possible.
 		 */
-		spin_lock(&cip->i_flags_lock);
-		if (!cip->i_ino ||
-		    __xfs_iflags_test(cip, XFS_ISTALE)) {
-			spin_unlock(&cip->i_flags_lock);
+		if (__xfs_iflags_test(ip, XFS_IRECLAIM | XFS_IFLOCK))
+			continue;
+		if (xfs_ipincount(ip))
 			continue;
-		}
 
 		/*
-		 * Once we fall off the end of the cluster, no point checking
-		 * any more inodes in the list because they will also all be
-		 * outside the cluster.
+		 * The inode is still attached to the buffer, which means it is
+		 * dirty but reclaim might try to grab it. Check carefully for
+		 * that, and grab the ilock while still holding the i_flags_lock
+		 * to guarantee reclaim will not be able to reclaim this inode
+		 * once we drop the i_flags_lock.
 		 */
-		if ((XFS_INO_TO_AGINO(mp, cip->i_ino) & mask) != first_index) {
-			spin_unlock(&cip->i_flags_lock);
-			break;
+		spin_lock(&ip->i_flags_lock);
+		ASSERT(!__xfs_iflags_test(ip, XFS_ISTALE));
+		if (__xfs_iflags_test(ip, XFS_IRECLAIM | XFS_IFLOCK)) {
+			spin_unlock(&ip->i_flags_lock);
+			continue;
 		}
-		spin_unlock(&cip->i_flags_lock);
 
 		/*
-		 * Do an un-protected check to see if the inode is dirty and
-		 * is a candidate for flushing.  These checks will be repeated
-		 * later after the appropriate locks are acquired.
+		 * ILOCK will pin the inode against reclaim and prevent
+		 * concurrent transactions modifying the inode while we are
+		 * flushing the inode.
 		 */
-		if (xfs_inode_clean(cip) && xfs_ipincount(cip) == 0)
+		if (!xfs_ilock_nowait(ip, XFS_ILOCK_SHARED)) {
+			spin_unlock(&ip->i_flags_lock);
 			continue;
+		}
+		spin_unlock(&ip->i_flags_lock);
 
 		/*
-		 * Try to get locks.  If any are unavailable or it is pinned,
-		 * then this inode cannot be flushed and is skipped.
+		 * Skip inodes that are already flush locked as they have
+		 * already been written to the buffer.
 		 */
-
-		if (!xfs_ilock_nowait(cip, XFS_ILOCK_SHARED))
-			continue;
-		if (!xfs_iflock_nowait(cip)) {
-			xfs_iunlock(cip, XFS_ILOCK_SHARED);
-			continue;
-		}
-		if (xfs_ipincount(cip)) {
-			xfs_ifunlock(cip);
-			xfs_iunlock(cip, XFS_ILOCK_SHARED);
+		if (!xfs_iflock_nowait(ip)) {
+			xfs_iunlock(ip, XFS_ILOCK_SHARED);
 			continue;
 		}
 
-
 		/*
-		 * Check the inode number again, just to be certain we are not
-		 * racing with freeing in xfs_reclaim_inode(). See the comments
-		 * in that function for more information as to why the initial
-		 * check is not sufficient.
+		 * If we are shut down, unpin and abort the inode now as there
+		 * is no point in flushing it to the buffer just to get an IO
+		 * completion to abort the buffer and remove it from the AIL.
 		 */
-		if (!cip->i_ino) {
-			xfs_ifunlock(cip);
-			xfs_iunlock(cip, XFS_ILOCK_SHARED);
+		if (XFS_FORCED_SHUTDOWN(mp)) {
+			xfs_iunpin_wait(ip);
+			/* xfs_iflush_abort() drops the flush lock */
+			xfs_iflush_abort(ip);
+			xfs_iunlock(ip, XFS_ILOCK_SHARED);
+			error = -EIO;
 			continue;
 		}
 
-		/*
-		 * arriving here means that this inode can be flushed.  First
-		 * re-check that it's dirty before flushing.
-		 */
-		if (!xfs_inode_clean(cip)) {
-			error = xfs_iflush(cip, bp);
-			if (error) {
-				xfs_iunlock(cip, XFS_ILOCK_SHARED);
-				goto out_free;
-			}
-			clcount++;
-		} else {
-			xfs_ifunlock(cip);
+		/* don't block waiting on a log force to unpin dirty inodes */
+		if (xfs_ipincount(ip)) {
+			xfs_ifunlock(ip);
+			xfs_iunlock(ip, XFS_ILOCK_SHARED);
+			continue;
 		}
-		xfs_iunlock(cip, XFS_ILOCK_SHARED);
+
+		if (!xfs_inode_clean(ip))
+			error = xfs_iflush(ip, bp);
+		else
+			xfs_ifunlock(ip);
+		xfs_iunlock(ip, XFS_ILOCK_SHARED);
+		if (error)
+			break;
+		clcount++;
 	}
 
 	if (clcount) {
@@ -3729,11 +3706,6 @@ xfs_iflush_cluster(
 		XFS_STATS_ADD(mp, xs_icluster_flushinode, clcount);
 	}
 
-out_free:
-	rcu_read_unlock();
-	kmem_free(cilist);
-out_put:
-	xfs_perag_put(pag);
 	if (error) {
 		bp->b_flags |= XBF_ASYNC;
 		xfs_buf_ioend_fail(bp);
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index d1109eb13ba2e..b93cf9076df8a 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -427,7 +427,7 @@ int		xfs_log_force_inode(struct xfs_inode *ip);
 void		xfs_iunpin_wait(xfs_inode_t *);
 #define xfs_ipincount(ip)	((unsigned int) atomic_read(&ip->i_pincount))
 
-int		xfs_iflush_cluster(struct xfs_inode *, struct xfs_buf *);
+int		xfs_iflush_cluster(struct xfs_buf *);
 void		xfs_lock_two_inodes(struct xfs_inode *ip0, uint ip0_mode,
 				struct xfs_inode *ip1, uint ip1_mode);
 
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index e679fac944725..a3a8ae5e39e12 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -513,7 +513,7 @@ xfs_inode_item_push(
 	 * reference for IO until we queue the buffer for delwri submission.
 	 */
 	xfs_buf_hold(bp);
-	error = xfs_iflush_cluster(ip, bp);
+	error = xfs_iflush_cluster(bp);
 	if (!error) {
 		if (!xfs_buf_delwri_queue(bp, buffer_list))
 			rval = XFS_ITEM_FLUSHING;
-- 
2.26.2.761.g0e0b3e54be



* [PATCH 29/30] xfs: factor xfs_iflush_done
  2020-06-01 21:42 [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (27 preceding siblings ...)
  2020-06-01 21:42 ` [PATCH 28/30] xfs: rework xfs_iflush_cluster() dirty inode iteration Dave Chinner
@ 2020-06-01 21:42 ` Dave Chinner
  2020-06-01 21:42 ` [PATCH 30/30] xfs: remove xfs_inobp_check() Dave Chinner
  29 siblings, 0 replies; 80+ messages in thread
From: Dave Chinner @ 2020-06-01 21:42 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

xfs_iflush_done() performs three distinct operations on the inodes
attached to the buffer. Separate these operations out into functions
so that it is easier to modify them independently in future.
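
For orientation, the split boils down to classifying each attached
item once and then handling each class with its own helper. Below is
a simplified stand-alone model of the classification - the enum and
struct are illustrative, only the helper names in the comments come
from the patch:

  #include <stdio.h>

  enum done_class {
          DONE_ABORT_OR_SKIP,     /* XFS_ISTALE abort, or never flushed */
          DONE_NEEDS_AIL,         /* xfs_iflush_ail_updates(), then finish */
          DONE_FINISH_ONLY,       /* no AIL work, xfs_iflush_finish() only */
  };

  struct model_item {
          int stale;              /* XFS_ISTALE set on the inode */
          int last_fields;        /* 0: never flushed, leave as is */
          int lsn_unchanged;      /* ili_flush_lsn == li_lsn */
          int failed;             /* XFS_LI_FAILED */
  };

  static enum done_class classify(const struct model_item *it)
  {
          if (it->stale || !it->last_fields)
                  return DONE_ABORT_OR_SKIP;
          if (it->lsn_unchanged || it->failed)
                  return DONE_NEEDS_AIL;
          return DONE_FINISH_ONLY;
  }

  int main(void)
  {
          struct model_item it = { .last_fields = 1, .lsn_unchanged = 1 };
          printf("class=%d\n", classify(&it));    /* 1: DONE_NEEDS_AIL */
          return 0;
  }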

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_inode_item.c | 154 +++++++++++++++++++++-------------------
 1 file changed, 81 insertions(+), 73 deletions(-)

diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index a3a8ae5e39e12..1749420a9cb97 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -642,101 +642,64 @@ xfs_inode_item_destroy(
 
 
 /*
- * This is the inode flushing I/O completion routine.  It is called
- * from interrupt level when the buffer containing the inode is
- * flushed to disk.  It is responsible for removing the inode item
- * from the AIL if it has not been re-logged, and unlocking the inode's
- * flush lock.
- *
- * To reduce AIL lock traffic as much as possible, we scan the buffer log item
- * list for other inodes that will run this function. We remove them from the
- * buffer list so we can process all the inode IO completions in one AIL lock
- * traversal.
- *
- * Note: Now that we attach the log item to the buffer when we first log the
- * inode in memory, we can have unflushed inodes on the buffer list here. These
- * inodes will have a zero ili_last_fields, so skip over them here.
+ * We only want to pull the item from the AIL if it is actually there
+ * and its location in the log has not changed since we started the
+ * flush.  Thus, we only bother if the inode's lsn has not changed.
  */
 void
-xfs_iflush_done(
-	struct xfs_buf		*bp)
+xfs_iflush_ail_updates(
+	struct xfs_ail		*ailp,
+	struct list_head	*list)
 {
-	struct xfs_inode_log_item *iip;
-	struct xfs_log_item	*lip, *n;
-	struct xfs_ail		*ailp = bp->b_mount->m_ail;
-	int			need_ail = 0;
-	LIST_HEAD(tmp);
+	struct xfs_log_item	*lip;
+	xfs_lsn_t		tail_lsn = 0;
 
-	/*
-	 * Pull the attached inodes from the buffer one at a time and take the
-	 * appropriate action on them.
-	 */
-	list_for_each_entry_safe(lip, n, &bp->b_li_list, li_bio_list) {
-		iip = INODE_ITEM(lip);
+	/* this is an opencoded batch version of xfs_trans_ail_delete */
+	spin_lock(&ailp->ail_lock);
+	list_for_each_entry(lip, list, li_bio_list) {
+		xfs_lsn_t	lsn;
 
-		if (xfs_iflags_test(iip->ili_inode, XFS_ISTALE)) {
-			xfs_iflush_abort(iip->ili_inode);
+		if (INODE_ITEM(lip)->ili_flush_lsn != lip->li_lsn) {
+			clear_bit(XFS_LI_FAILED, &lip->li_flags);
 			continue;
 		}
 
-		if (!iip->ili_last_fields)
-			continue;
-
-		list_move_tail(&lip->li_bio_list, &tmp);
-
-		/* Do an unlocked check for needing the AIL lock. */
-		if (iip->ili_flush_lsn == lip->li_lsn ||
-		    test_bit(XFS_LI_FAILED, &lip->li_flags))
-			need_ail++;
+		lsn = xfs_ail_delete_one(ailp, lip);
+		if (!tail_lsn && lsn)
+			tail_lsn = lsn;
 	}
+	xfs_ail_update_finish(ailp, tail_lsn);
+}
 
-	/*
-	 * We only want to pull the item from the AIL if it is actually there
-	 * and its location in the log has not changed since we started the
-	 * flush.  Thus, we only bother if the inode's lsn has not changed.
-	 */
-	if (need_ail) {
-		xfs_lsn_t	tail_lsn = 0;
-
-		/* this is an opencoded batch version of xfs_trans_ail_delete */
-		spin_lock(&ailp->ail_lock);
-		list_for_each_entry(lip, &tmp, li_bio_list) {
-			clear_bit(XFS_LI_FAILED, &lip->li_flags);
-			if (lip->li_lsn == INODE_ITEM(lip)->ili_flush_lsn) {
-				xfs_lsn_t lsn = xfs_ail_delete_one(ailp, lip);
-				if (!tail_lsn && lsn)
-					tail_lsn = lsn;
-			}
-		}
-		xfs_ail_update_finish(ailp, tail_lsn);
-	}
+/*
+ * Walk the list of inodes that have completed their IOs. If they are clean
+ * remove them from the list and dissociate them from the buffer. Inodes that
+ * are still dirty remain linked to the buffer and on the list. Caller must
+ * handle them appropriately.
+ */
+void
+xfs_iflush_finish(
+	struct xfs_buf		*bp,
+	struct list_head	*list)
+{
+	struct xfs_log_item	*lip, *n;
 
-	/*
-	 * Clean up and unlock the flush lock now we are done. We can clear the
-	 * ili_last_fields bits now that we know that the data corresponding to
-	 * them is safely on disk.
-	 */
-	list_for_each_entry_safe(lip, n, &tmp, li_bio_list) {
+	list_for_each_entry_safe(lip, n, list, li_bio_list) {
+		struct xfs_inode_log_item *iip = INODE_ITEM(lip);
 		bool	drop_buffer = false;
 
-		list_del_init(&lip->li_bio_list);
-		iip = INODE_ITEM(lip);
-
 		spin_lock(&iip->ili_lock);
 
 		/*
 		 * Remove the reference to the cluster buffer if the inode is
-		 * clean in memory. Drop the buffer reference once we've dropped
-		 * the locks we hold. If the inode is dirty in memory, we need
-		 * to put the inode item back on the buffer list for another
-		 * pass through the flush machinery.
+		 * clean in memory and drop the buffer reference once we've
+		 * dropped the locks we hold.
 		 */
 		ASSERT(iip->ili_item.li_buf == bp);
 		if (!iip->ili_fields) {
 			iip->ili_item.li_buf = NULL;
+			list_del_init(&lip->li_bio_list);
 			drop_buffer = true;
-		} else {
-			list_add(&lip->li_bio_list, &bp->b_li_list);
 		}
 		iip->ili_last_fields = 0;
 		iip->ili_flush_lsn = 0;
@@ -747,6 +710,51 @@ xfs_iflush_done(
 	}
 }
 
+/*
+ * Inode buffer IO completion routine.  It is responsible for removing inodes
+ * attached to the buffer from the AIL if they have not been re-logged, as well
+ * as completing the flush and unlocking the inode.
+ */
+void
+xfs_iflush_done(
+	struct xfs_buf		*bp)
+{
+	struct xfs_log_item	*lip, *n;
+	LIST_HEAD(flushed_inodes);
+	LIST_HEAD(ail_updates);
+
+	/*
+	 * Pull the attached inodes from the buffer one at a time and take the
+	 * appropriate action on them.
+	 */
+	list_for_each_entry_safe(lip, n, &bp->b_li_list, li_bio_list) {
+		struct xfs_inode_log_item *iip = INODE_ITEM(lip);
+
+		if (xfs_iflags_test(iip->ili_inode, XFS_ISTALE)) {
+			xfs_iflush_abort(iip->ili_inode);
+			continue;
+		}
+		if (!iip->ili_last_fields)
+			continue;
+
+		/* Do an unlocked check for needing the AIL lock. */
+		if (iip->ili_flush_lsn == lip->li_lsn ||
+		    test_bit(XFS_LI_FAILED, &lip->li_flags))
+			list_move_tail(&lip->li_bio_list, &ail_updates);
+		else
+			list_move_tail(&lip->li_bio_list, &flushed_inodes);
+	}
+
+	if (!list_empty(&ail_updates)) {
+		xfs_iflush_ail_updates(bp->b_mount->m_ail, &ail_updates);
+		list_splice_tail(&ail_updates, &flushed_inodes);
+	}
+
+	xfs_iflush_finish(bp, &flushed_inodes);
+	if (!list_empty(&flushed_inodes))
+		list_splice_tail(&flushed_inodes, &bp->b_li_list);
+}
+
 /*
  * This is the inode flushing abort routine.  It is called from xfs_iflush when
  * the filesystem is shutting down to clean up the inode state.  It is
-- 
2.26.2.761.g0e0b3e54be



* [PATCH 30/30] xfs: remove xfs_inobp_check()
  2020-06-01 21:42 [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (28 preceding siblings ...)
  2020-06-01 21:42 ` [PATCH 29/30] xfs: factor xfs_iflush_done Dave Chinner
@ 2020-06-01 21:42 ` Dave Chinner
  29 siblings, 0 replies; 80+ messages in thread
From: Dave Chinner @ 2020-06-01 21:42 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

This debug code runs on every xfs_iflush() call and checks every
inode in the buffer for a non-zero unlinked list field. Hence it
checks every inode in the cluster buffer every time a single inode
on that cluster is flushed. This results in:

-   38.91%     5.33%  [kernel]  [k] xfs_iflush
   - 17.70% xfs_iflush
      - 9.93% xfs_inobp_check
           4.36% xfs_buf_offset

10% of the CPU time spent flushing inodes is repeatedly checking
unlinked fields in the buffer. We don't need to do this.

The other place we call xfs_inobp_check() is
xfs_iunlink_update_dinode(), and this is after we've done this
assert for the agino we are about to write into that inode:

	ASSERT(xfs_verify_agino_or_null(mp, agno, next_agino));

which means we've already checked that the agino we are about to
write is not 0 on debug kernels. The inode buffer verifiers do
everything else we need, so let's just remove this debug code.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_inode_buf.c | 24 ------------------------
 fs/xfs/libxfs/xfs_inode_buf.h |  6 ------
 fs/xfs/xfs_inode.c            |  2 --
 3 files changed, 32 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index 1af97235785c8..6b6f67595bf4e 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -20,30 +20,6 @@
 
 #include <linux/iversion.h>
 
-/*
- * Check that none of the inode's in the buffer have a next
- * unlinked field of 0.
- */
-#if defined(DEBUG)
-void
-xfs_inobp_check(
-	xfs_mount_t	*mp,
-	xfs_buf_t	*bp)
-{
-	int		i;
-	xfs_dinode_t	*dip;
-
-	for (i = 0; i < M_IGEO(mp)->inodes_per_cluster; i++) {
-		dip = xfs_buf_offset(bp, i * mp->m_sb.sb_inodesize);
-		if (!dip->di_next_unlinked)  {
-			xfs_alert(mp,
-	"Detected bogus zero next_unlinked field in inode %d buffer 0x%llx.",
-				i, (long long)bp->b_bn);
-		}
-	}
-}
-#endif
-
 /*
  * If we are doing readahead on an inode buffer, we might be in log recovery
  * reading an inode allocation buffer that hasn't yet been replayed, and hence
diff --git a/fs/xfs/libxfs/xfs_inode_buf.h b/fs/xfs/libxfs/xfs_inode_buf.h
index 865ac493c72a2..6b08b9d060c2e 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.h
+++ b/fs/xfs/libxfs/xfs_inode_buf.h
@@ -52,12 +52,6 @@ int	xfs_inode_from_disk(struct xfs_inode *ip, struct xfs_dinode *from);
 void	xfs_log_dinode_to_disk(struct xfs_log_dinode *from,
 			       struct xfs_dinode *to);
 
-#if defined(DEBUG)
-void	xfs_inobp_check(struct xfs_mount *, struct xfs_buf *);
-#else
-#define	xfs_inobp_check(mp, bp)
-#endif /* DEBUG */
-
 xfs_failaddr_t xfs_dinode_verify(struct xfs_mount *mp, xfs_ino_t ino,
 			   struct xfs_dinode *dip);
 xfs_failaddr_t xfs_inode_validate_extsize(struct xfs_mount *mp,
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 931a483d5b316..9400c2e0b0c4a 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2165,7 +2165,6 @@ xfs_iunlink_update_dinode(
 	xfs_dinode_calc_crc(mp, dip);
 	xfs_trans_inode_buf(tp, ibp);
 	xfs_trans_log_buf(tp, ibp, offset, offset + sizeof(xfs_agino_t) - 1);
-	xfs_inobp_check(mp, ibp);
 }
 
 /* Set an in-core inode's unlinked pointer and return the old value. */
@@ -3559,7 +3558,6 @@ xfs_iflush(
 	xfs_iflush_fork(ip, dip, iip, XFS_DATA_FORK);
 	if (XFS_IFORK_Q(ip))
 		xfs_iflush_fork(ip, dip, iip, XFS_ATTR_FORK);
-	xfs_inobp_check(mp, bp);
 
 	/*
 	 * We've recorded everything logged in the inode, so we'd like to clear
-- 
2.26.2.761.g0e0b3e54be



* Re: [PATCH 01/30] xfs: Don't allow logging of XFS_ISTALE inodes
  2020-06-01 21:42 ` [PATCH 01/30] xfs: Don't allow logging of XFS_ISTALE inodes Dave Chinner
@ 2020-06-02  4:30   ` Darrick J. Wong
  2020-06-02  7:06     ` Dave Chinner
  2020-06-02 16:32   ` Brian Foster
  1 sibling, 1 reply; 80+ messages in thread
From: Darrick J. Wong @ 2020-06-02  4:30 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Jun 02, 2020 at 07:42:22AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> In tracking down a problem in this patchset, I discovered we are
> reclaiming dirty stale inodes. This wasn't discovered until inodes
> were always attached to the cluster buffer and then the rcu callback
> that freed inodes was assert failing because the inode still had an
> active pointer to the cluster buffer after it had been reclaimed.
> 
> Debugging the issue indicated that this was a pre-existing issue
> resulting from the way the inodes are handled in xfs_inactive_ifree.
> When we free a cluster buffer from xfs_ifree_cluster, all the inodes
> in cache are marked XFS_ISTALE. Those that are clean have nothing
> else done to them and so eventually get cleaned up by background
> reclaim. i.e. it is assumed we'll never dirty/relog an inode marked
> XFS_ISTALE.
> 
> On journal commit, dirty stale inodes are handled by both
> buffer and inode log items: they run through xfs_istale_done() and are
> removed from the AIL (buffer log item commit), or the log item will
> simply unpin it because the buffer log item will clean it. What happens
> to any specific inode is entirely dependent on which log item wins
> the commit race, but the result is the same - stale inodes are
> clean, not attached to the cluster buffer, and not in the AIL. Hence
> inode reclaim can just free these inodes without further care.
> 
> However, if the stale inode is relogged, it gets dirtied again and
> relogged into the CIL. Most of the time this isn't an issue, because
> relogging simply changes the inode's location in the current
> checkpoint. Problems arise, however, when the CIL checkpoints
> between two transactions in the xfs_inactive_ifree() deferops
> processing. This results in the XFS_ISTALE inode being redirtied
> and inserted into the CIL without any of the other stale cluster
> buffer infrastructure being in place.
> 
> Hence on journal commit, it simply gets unpinned, so it remains
> dirty in memory. Everything in inode writeback avoids XFS_ISTALE
> inodes so it can't be written back, and it is not tracked in the AIL
> so there's not even a trigger to attempt to clean the inode. Hence
> the inode just sits dirty in memory until inode reclaim comes along,
> sees that it is XFS_ISTALE, and goes to reclaim it. This reclaiming
> of a dirty inode caused use after free, list corruptions and other
> nasty issues later in this patchset.
> 
> Hence this patch addresses a violation of the "never log XFS_ISTALE
> inodes" caused by the deferops processing rolling a transaction
> and relogging a stale inode in xfs_inactive_ifree. It also adds a
> bunch of asserts to catch this problem in debug kernels so that
> we don't reintroduce this problem in future.
> 
> Reproducer for this issue was generic/558 on a v4 filesystem.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_trans_inode.c |  2 ++
>  fs/xfs/xfs_icache.c             |  3 ++-
>  fs/xfs/xfs_inode.c              | 25 ++++++++++++++++++++++---
>  3 files changed, 26 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
> index b5dfb66548422..4504d215cd590 100644
> --- a/fs/xfs/libxfs/xfs_trans_inode.c
> +++ b/fs/xfs/libxfs/xfs_trans_inode.c
> @@ -36,6 +36,7 @@ xfs_trans_ijoin(
>  
>  	ASSERT(iip->ili_lock_flags == 0);
>  	iip->ili_lock_flags = lock_flags;
> +	ASSERT(!xfs_iflags_test(ip, XFS_ISTALE));
>  
>  	/*
>  	 * Get a log_item_desc to point at the new item.
> @@ -89,6 +90,7 @@ xfs_trans_log_inode(
>  
>  	ASSERT(ip->i_itemp != NULL);
>  	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
> +	ASSERT(!xfs_iflags_test(ip, XFS_ISTALE));
>  
>  	/*
>  	 * Don't bother with i_lock for the I_DIRTY_TIME check here, as races
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 0a5ac6f9a5834..dbba4c1946386 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -1141,7 +1141,7 @@ xfs_reclaim_inode(
>  			goto out_ifunlock;
>  		xfs_iunpin_wait(ip);
>  	}
> -	if (xfs_iflags_test(ip, XFS_ISTALE) || xfs_inode_clean(ip)) {
> +	if (xfs_inode_clean(ip)) {
>  		xfs_ifunlock(ip);
>  		goto reclaim;
>  	}
> @@ -1228,6 +1228,7 @@ xfs_reclaim_inode(
>  	xfs_ilock(ip, XFS_ILOCK_EXCL);
>  	xfs_qm_dqdetach(ip);
>  	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +	ASSERT(xfs_inode_clean(ip));
>  
>  	__xfs_inode_free(ip);
>  	return error;
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 64f5f9a440aed..53a1d64782c35 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -1740,10 +1740,31 @@ xfs_inactive_ifree(
>  		return error;
>  	}
>  
> +	/*
> +	 * We do not hold the inode locked across the entire rolling transaction
> +	 * here. We only need to hold it for the first transaction that
> +	 * xfs_ifree() builds, which may mark the inode XFS_ISTALE if the
> +	 * underlying cluster buffer is freed. Relogging an XFS_ISTALE inode
> +	 * here breaks the relationship between cluster buffer invalidation and
> +	 * stale inode invalidation on cluster buffer item journal commit
> +	 * completion, and can result in leaving dirty stale inodes hanging
> +	 * around in memory.
> +	 *
> +	 * We have no need for serialising this inode operation against other
> +	 * operations - we freed the inode and hence reallocation is required
> +	 * and that will serialise on reallocating the space the deferops need
> +	 * to free. Hence we can unlock the inode on the first commit of
> +	 * the transaction rather than roll it right through the deferops. This
> +	 * avoids relogging the XFS_ISTALE inode.

Hmm.  What defer ops causes a transaction roll?  Is it the EFI that
frees the inode cluster blocks?

> +	 *
> +	 * We check that xfs_ifree() hasn't grown an internal transaction roll
> +	 * by asserting that the inode is still locked when it returns.
> +	 */
>  	xfs_ilock(ip, XFS_ILOCK_EXCL);
> -	xfs_trans_ijoin(tp, ip, 0);
> +	xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);

This looks right to me since we should be marking the inode free in the
first transaction and therefore should not keep it attached to the
transaction...

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D


>  	error = xfs_ifree(tp, ip);
> +	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
>  	if (error) {
>  		/*
>  		 * If we fail to free the inode, shut down.  The cancel
> @@ -1756,7 +1777,6 @@ xfs_inactive_ifree(
>  			xfs_force_shutdown(mp, SHUTDOWN_META_IO_ERROR);
>  		}
>  		xfs_trans_cancel(tp);
> -		xfs_iunlock(ip, XFS_ILOCK_EXCL);
>  		return error;
>  	}
>  
> @@ -1774,7 +1794,6 @@ xfs_inactive_ifree(
>  		xfs_notice(mp, "%s: xfs_trans_commit returned error %d",
>  			__func__, error);
>  
> -	xfs_iunlock(ip, XFS_ILOCK_EXCL);
>  	return 0;
>  }
>  
> -- 
> 2.26.2.761.g0e0b3e54be
> 


* Re: [PATCH 01/30] xfs: Don't allow logging of XFS_ISTALE inodes
  2020-06-02  4:30   ` Darrick J. Wong
@ 2020-06-02  7:06     ` Dave Chinner
  0 siblings, 0 replies; 80+ messages in thread
From: Dave Chinner @ 2020-06-02  7:06 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Mon, Jun 01, 2020 at 09:30:52PM -0700, Darrick J. Wong wrote:
> On Tue, Jun 02, 2020 at 07:42:22AM +1000, Dave Chinner wrote:
> > --- a/fs/xfs/xfs_inode.c
> > +++ b/fs/xfs/xfs_inode.c
> > @@ -1740,10 +1740,31 @@ xfs_inactive_ifree(
> >  		return error;
> >  	}
> >  
> > +	/*
> > +	 * We do not hold the inode locked across the entire rolling transaction
> > +	 * here. We only need to hold it for the first transaction that
> > +	 * xfs_ifree() builds, which may mark the inode XFS_ISTALE if the
> > +	 * underlying cluster buffer is freed. Relogging an XFS_ISTALE inode
> > +	 * here breaks the relationship between cluster buffer invalidation and
> > +	 * stale inode invalidation on cluster buffer item journal commit
> > +	 * completion, and can result in leaving dirty stale inodes hanging
> > +	 * around in memory.
> > +	 *
> > +	 * We have no need for serialising this inode operation against other
> > +	 * operations - we freed the inode and hence reallocation is required
> > +	 * and that will serialise on reallocating the space the deferops need
> > +	 * to free. Hence we can unlock the inode on the first commit of
> > +	 * the transaction rather than roll it right through the deferops. This
> > +	 * avoids relogging the XFS_ISTALE inode.
> 
> Hmm.  What defer ops causes a transaction roll?  Is it the EFI that
> frees the inode cluster blocks?

Yeah, xfs_difree_inode_chunk() calls xfs_bmap_add_free() which goes
through the deferops to free the inode chunk extent(s).

I suspect that we can also get xfs_alloc_fix_freelist() deferring
AGFL frees here too. I basically just assume anything that allocates
or frees extents is likely to have at least one deferop as a result
of this....
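
For anyone following along at home, the rough call chain (paraphrased
from my reading of the tree, so treat the intermediate steps as
approximate) is:

        xfs_ifree()
          -> xfs_difree()
             -> xfs_difree_inode_chunk()
                -> xfs_bmap_add_free()  /* defers the chunk extent free */

so the extent free is deferred out of the first transaction and gets
processed as the transaction rolls.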

> > +	 *
> > +	 * We check that xfs_ifree() hasn't grown an internal transaction roll
> > +	 * by asserting that the inode is still locked when it returns.
> > +	 */
> >  	xfs_ilock(ip, XFS_ILOCK_EXCL);
> > -	xfs_trans_ijoin(tp, ip, 0);
> > +	xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
> 
> This looks right to me since we should be marking the inode free in the
> first transaction and therefore should not keep it attached to the
> transaction...

*nod*

That was my thinking about the problem, too.

> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

Thanks!

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 01/30] xfs: Don't allow logging of XFS_ISTALE inodes
  2020-06-01 21:42 ` [PATCH 01/30] xfs: Don't allow logging of XFS_ISTALE inodes Dave Chinner
  2020-06-02  4:30   ` Darrick J. Wong
@ 2020-06-02 16:32   ` Brian Foster
  1 sibling, 0 replies; 80+ messages in thread
From: Brian Foster @ 2020-06-02 16:32 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Jun 02, 2020 at 07:42:22AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> In tracking down a problem in this patchset, I discovered we are
> reclaiming dirty stale inodes. This wasn't discovered until inodes
> were always attached to the cluster buffer, at which point the rcu
> callback that freed inodes started assert failing because the inode
> still had an active pointer to the cluster buffer after it had been
> reclaimed.
> 
> Debugging the issue indicated that this was a pre-existing issue
> resulting from the way the inodes are handled in xfs_inactive_ifree.
> When we free a cluster buffer from xfs_ifree_cluster, all the inodes
> in cache are marked XFS_ISTALE. Those that are clean have nothing
> else done to them and so eventually get cleaned up by background
> reclaim. i.e. it is assumed we'll never dirty/relog an inode marked
> XFS_ISTALE.
> 
> On journal commit, dirty stale inodes are handled by both the
> buffer and inode log items: either the inode runs through
> xfs_istale_done() and is removed from the AIL (buffer log item
> commit), or the inode log item simply unpins it because the buffer
> log item will clean it. What happens
> to any specific inode is entirely dependent on which log item wins
> the commit race, but the result is the same - stale inodes are
> clean, not attached to the cluster buffer, and not in the AIL. Hence
> inode reclaim can just free these inodes without further care.
> 
> However, if the stale inode is relogged, it gets dirtied again and
> relogged into the CIL. Most of the time this isn't an issue, because
> relogging simply changes the inode's location in the current
> checkpoint. Problems arise, however, when the CIL checkpoints
> between two transactions in the xfs_inactive_ifree() deferops
> processing. This results in the XFS_ISTALE inode being redirtied
> and inserted into the CIL without any of the other stale cluster
> buffer infrastructure being in place.
> 
> Hence on journal commit, it simply gets unpinned, so it remains
> dirty in memory. Everything in inode writeback avoids XFS_ISTALE
> inodes so it can't be written back, and it is not tracked in the AIL
> so there's not even a trigger to attempt to clean the inode. Hence
> the inode just sits dirty in memory until inode reclaim comes along,
> sees that it is XFS_ISTALE, and goes to reclaim it. This reclaiming
> of a dirty inode caused use-after-free, list corruptions and other
> nasty issues later in this patchset.
> 
> Hence this patch addresses a violation of the "never log XFS_ISTALE
> inodes" rule caused by the deferops processing rolling a transaction
> and relogging a stale inode in xfs_inactive_ifree. It also adds a
> bunch of asserts to catch this problem in debug kernels so that
> we don't reintroduce this problem in future.
> 
> Reproducer for this issue was generic/558 on a v4 filesystem.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/libxfs/xfs_trans_inode.c |  2 ++
>  fs/xfs/xfs_icache.c             |  3 ++-
>  fs/xfs/xfs_inode.c              | 25 ++++++++++++++++++++++---
>  3 files changed, 26 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
> index b5dfb66548422..4504d215cd590 100644
> --- a/fs/xfs/libxfs/xfs_trans_inode.c
> +++ b/fs/xfs/libxfs/xfs_trans_inode.c
> @@ -36,6 +36,7 @@ xfs_trans_ijoin(
>  
>  	ASSERT(iip->ili_lock_flags == 0);
>  	iip->ili_lock_flags = lock_flags;
> +	ASSERT(!xfs_iflags_test(ip, XFS_ISTALE));
>  
>  	/*
>  	 * Get a log_item_desc to point at the new item.
> @@ -89,6 +90,7 @@ xfs_trans_log_inode(
>  
>  	ASSERT(ip->i_itemp != NULL);
>  	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
> +	ASSERT(!xfs_iflags_test(ip, XFS_ISTALE));
>  
>  	/*
>  	 * Don't bother with i_lock for the I_DIRTY_TIME check here, as races
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 0a5ac6f9a5834..dbba4c1946386 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -1141,7 +1141,7 @@ xfs_reclaim_inode(
>  			goto out_ifunlock;
>  		xfs_iunpin_wait(ip);
>  	}
> -	if (xfs_iflags_test(ip, XFS_ISTALE) || xfs_inode_clean(ip)) {
> +	if (xfs_inode_clean(ip)) {
>  		xfs_ifunlock(ip);
>  		goto reclaim;
>  	}
> @@ -1228,6 +1228,7 @@ xfs_reclaim_inode(
>  	xfs_ilock(ip, XFS_ILOCK_EXCL);
>  	xfs_qm_dqdetach(ip);
>  	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +	ASSERT(xfs_inode_clean(ip));
>  
>  	__xfs_inode_free(ip);
>  	return error;
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 64f5f9a440aed..53a1d64782c35 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -1740,10 +1740,31 @@ xfs_inactive_ifree(
>  		return error;
>  	}
>  
> +	/*
> +	 * We do not hold the inode locked across the entire rolling transaction
> +	 * here. We only need to hold it for the first transaction that
> +	 * xfs_ifree() builds, which may mark the inode XFS_ISTALE if the
> +	 * underlying cluster buffer is freed. Relogging an XFS_ISTALE inode
> +	 * here breaks the relationship between cluster buffer invalidation and
> +	 * stale inode invalidation on cluster buffer item journal commit
> +	 * completion, and can result in leaving dirty stale inodes hanging
> +	 * around in memory.
> +	 *
> +	 * We have no need for serialising this inode operation against other
> +	 * operations - we freed the inode and hence reallocation is required
> +	 * and that will serialise on reallocating the space the deferops need
> +	 * to free. Hence we can unlock the inode on the first commit of
> +	 * the transaction rather than roll it right through the deferops. This
> +	 * avoids relogging the XFS_ISTALE inode.
> +	 *
> +	 * We check that xfs_ifree() hasn't grown an internal transaction roll
> +	 * by asserting that the inode is still locked when it returns.
> +	 */
>  	xfs_ilock(ip, XFS_ILOCK_EXCL);
> -	xfs_trans_ijoin(tp, ip, 0);
> +	xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
>  
>  	error = xfs_ifree(tp, ip);
> +	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
>  	if (error) {
>  		/*
>  		 * If we fail to free the inode, shut down.  The cancel
> @@ -1756,7 +1777,6 @@ xfs_inactive_ifree(
>  			xfs_force_shutdown(mp, SHUTDOWN_META_IO_ERROR);
>  		}
>  		xfs_trans_cancel(tp);
> -		xfs_iunlock(ip, XFS_ILOCK_EXCL);
>  		return error;
>  	}
>  
> @@ -1774,7 +1794,6 @@ xfs_inactive_ifree(
>  		xfs_notice(mp, "%s: xfs_trans_commit returned error %d",
>  			__func__, error);
>  
> -	xfs_iunlock(ip, XFS_ILOCK_EXCL);
>  	return 0;
>  }
>  
> -- 
> 2.26.2.761.g0e0b3e54be
> 



* Re: [PATCH 02/30] xfs: remove logged flag from inode log item
  2020-06-01 21:42 ` [PATCH 02/30] xfs: remove logged flag from inode log item Dave Chinner
@ 2020-06-02 16:32   ` Brian Foster
  0 siblings, 0 replies; 80+ messages in thread
From: Brian Foster @ 2020-06-02 16:32 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Jun 02, 2020 at 07:42:23AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> This was used to track if the item had logged fields being flushed
> to disk. We log everything in the inode these days, so this logic is
> no longer needed. Remove it.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_inode.c      | 13 ++++---------
>  fs/xfs/xfs_inode_item.c | 35 ++++++++++-------------------------
>  fs/xfs/xfs_inode_item.h |  1 -
>  3 files changed, 14 insertions(+), 35 deletions(-)
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 53a1d64782c35..4fa12775ac146 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -2677,7 +2677,6 @@ xfs_ifree_cluster(
>  		list_for_each_entry(lip, &bp->b_li_list, li_bio_list) {
>  			if (lip->li_type == XFS_LI_INODE) {
>  				iip = (struct xfs_inode_log_item *)lip;
> -				ASSERT(iip->ili_logged == 1);
>  				lip->li_cb = xfs_istale_done;
>  				xfs_trans_ail_copy_lsn(mp->m_ail,
>  							&iip->ili_flush_lsn,
> @@ -2706,7 +2705,6 @@ xfs_ifree_cluster(
>  			iip->ili_last_fields = iip->ili_fields;
>  			iip->ili_fields = 0;
>  			iip->ili_fsync_fields = 0;
> -			iip->ili_logged = 1;
>  			xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,
>  						&iip->ili_item.li_lsn);
>  
> @@ -3838,19 +3836,16 @@ xfs_iflush_int(
>  	 *
>  	 * We can play with the ili_fields bits here, because the inode lock
>  	 * must be held exclusively in order to set bits there and the flush
> -	 * lock protects the ili_last_fields bits.  Set ili_logged so the flush
> -	 * done routine can tell whether or not to look in the AIL.  Also, store
> -	 * the current LSN of the inode so that we can tell whether the item has
> -	 * moved in the AIL from xfs_iflush_done().  In order to read the lsn we
> -	 * need the AIL lock, because it is a 64 bit value that cannot be read
> -	 * atomically.
> +	 * lock protects the ili_last_fields bits.  Store the current LSN of the
> +	 * inode so that we can tell whether the item has moved in the AIL from
> +	 * xfs_iflush_done().  In order to read the lsn we need the AIL lock,
> +	 * because it is a 64 bit value that cannot be read atomically.
>  	 */
>  	error = 0;
>  flush_out:
>  	iip->ili_last_fields = iip->ili_fields;
>  	iip->ili_fields = 0;
>  	iip->ili_fsync_fields = 0;
> -	iip->ili_logged = 1;
>  
>  	xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,
>  				&iip->ili_item.li_lsn);
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index ba47bf65b772b..b17384aa8df40 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -528,8 +528,6 @@ xfs_inode_item_push(
>  	}
>  
>  	ASSERT(iip->ili_fields != 0 || XFS_FORCED_SHUTDOWN(ip->i_mount));
> -	ASSERT(iip->ili_logged == 0 || XFS_FORCED_SHUTDOWN(ip->i_mount));
> -
>  	spin_unlock(&lip->li_ailp->ail_lock);
>  
>  	error = xfs_iflush(ip, &bp);
> @@ -690,30 +688,24 @@ xfs_iflush_done(
>  			continue;
>  
>  		list_move_tail(&blip->li_bio_list, &tmp);
> -		/*
> -		 * while we have the item, do the unlocked check for needing
> -		 * the AIL lock.
> -		 */
> +
> +		/* Do an unlocked check for needing the AIL lock. */
>  		iip = INODE_ITEM(blip);
> -		if ((iip->ili_logged && blip->li_lsn == iip->ili_flush_lsn) ||
> +		if (blip->li_lsn == iip->ili_flush_lsn ||
>  		    test_bit(XFS_LI_FAILED, &blip->li_flags))
>  			need_ail++;
>  	}
>  
>  	/* make sure we capture the state of the initial inode. */
>  	iip = INODE_ITEM(lip);
> -	if ((iip->ili_logged && lip->li_lsn == iip->ili_flush_lsn) ||
> +	if (lip->li_lsn == iip->ili_flush_lsn ||
>  	    test_bit(XFS_LI_FAILED, &lip->li_flags))
>  		need_ail++;
>  
>  	/*
> -	 * We only want to pull the item from the AIL if it is
> -	 * actually there and its location in the log has not
> -	 * changed since we started the flush.  Thus, we only bother
> -	 * if the ili_logged flag is set and the inode's lsn has not
> -	 * changed.  First we check the lsn outside
> -	 * the lock since it's cheaper, and then we recheck while
> -	 * holding the lock before removing the inode from the AIL.
> +	 * We only want to pull the item from the AIL if it is actually there
> +	 * and its location in the log has not changed since we started the
> +	 * flush.  Thus, we only bother if the inode's lsn has not changed.
>  	 */
>  	if (need_ail) {
>  		xfs_lsn_t	tail_lsn = 0;
> @@ -721,8 +713,7 @@ xfs_iflush_done(
>  		/* this is an opencoded batch version of xfs_trans_ail_delete */
>  		spin_lock(&ailp->ail_lock);
>  		list_for_each_entry(blip, &tmp, li_bio_list) {
> -			if (INODE_ITEM(blip)->ili_logged &&
> -			    blip->li_lsn == INODE_ITEM(blip)->ili_flush_lsn) {
> +			if (blip->li_lsn == INODE_ITEM(blip)->ili_flush_lsn) {
>  				/*
>  				 * xfs_ail_update_finish() only cares about the
>  				 * lsn of the first tail item removed, any
> @@ -740,14 +731,13 @@ xfs_iflush_done(
>  	}
>  
>  	/*
> -	 * clean up and unlock the flush lock now we are done. We can clear the
> +	 * Clean up and unlock the flush lock now we are done. We can clear the
>  	 * ili_last_fields bits now that we know that the data corresponding to
>  	 * them is safely on disk.
>  	 */
>  	list_for_each_entry_safe(blip, n, &tmp, li_bio_list) {
>  		list_del_init(&blip->li_bio_list);
>  		iip = INODE_ITEM(blip);
> -		iip->ili_logged = 0;
>  		iip->ili_last_fields = 0;
>  		xfs_ifunlock(iip->ili_inode);
>  	}
> @@ -768,16 +758,11 @@ xfs_iflush_abort(
>  
>  	if (iip) {
>  		xfs_trans_ail_delete(&iip->ili_item, 0);
> -		iip->ili_logged = 0;
> -		/*
> -		 * Clear the ili_last_fields bits now that we know that the
> -		 * data corresponding to them is safely on disk.
> -		 */
> -		iip->ili_last_fields = 0;
>  		/*
>  		 * Clear the inode logging fields so no more flushes are
>  		 * attempted.
>  		 */
> +		iip->ili_last_fields = 0;
>  		iip->ili_fields = 0;
>  		iip->ili_fsync_fields = 0;
>  	}
> diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h
> index 60b34bb66e8ed..4de5070e07655 100644
> --- a/fs/xfs/xfs_inode_item.h
> +++ b/fs/xfs/xfs_inode_item.h
> @@ -19,7 +19,6 @@ struct xfs_inode_log_item {
>  	xfs_lsn_t		ili_flush_lsn;	   /* lsn at last flush */
>  	xfs_lsn_t		ili_last_lsn;	   /* lsn at last transaction */
>  	unsigned short		ili_lock_flags;	   /* lock flags */
> -	unsigned short		ili_logged;	   /* flushed logged data */
>  	unsigned int		ili_last_fields;   /* fields when flushed */
>  	unsigned int		ili_fields;	   /* fields to be logged */
>  	unsigned int		ili_fsync_fields;  /* logged since last fsync */
> -- 
> 2.26.2.761.g0e0b3e54be
> 



* Re: [PATCH 03/30] xfs: add an inode item lock
  2020-06-01 21:42 ` [PATCH 03/30] xfs: add an inode item lock Dave Chinner
@ 2020-06-02 16:34   ` Brian Foster
  2020-06-04  1:54     ` Dave Chinner
  0 siblings, 1 reply; 80+ messages in thread
From: Brian Foster @ 2020-06-02 16:34 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Jun 02, 2020 at 07:42:24AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The inode log item is kind of special in that it can be aggregating
> new changes in memory at the same time that existing changes are
> being written back to disk. This means there are fields in the log
> item that are accessed concurrently from contexts that don't share
> any locking at all.
> 
> e.g. updating ili_last_fields occurs at flush time under the
> ILOCK_EXCL and the flush lock, under the flush lock alone at IO
> completion time, and is read under the ILOCK_EXCL when the inode is
> logged.  Hence there is no actual serialisation between reading the
> field during logging of the inode in transactions vs clearing the
> field in IO completion.
> 
> We currently get away with this because we are only
> clearing fields in IO completion, and nothing bad happens if we
> accidentally log more of the inode than we actually modify. Worst
> case is we consume a tiny bit more memory and log bandwidth.
> 
> However, if we want to do more complex state manipulations on the
> log item that require updates at all three of these potential
> locations, we need to have some mechanism of serialising those
> operations. To do this, introduce a spinlock into the log item to
> serialise internal state.
> 
> This could be done via the xfs_inode i_flags_lock, but this then
> leads to potential lock inversion issues where inode flag updates
> need to occur inside locks that best nest inside the inode log item
> locks (e.g. marking inodes stale during inode cluster freeing).
> Using a separate spinlock avoids these sorts of problems and
> simplifies future code.
> 
> This does not touch the use of ili_fields in the item formatting
> code - that is entirely protected by the ILOCK_EXCL at this point in
> time, so it remains untouched.
> 

Thanks for pointing this out.

> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_trans_inode.c | 54 +++++++++++++++++----------------
>  fs/xfs/xfs_file.c               |  9 ++++--
>  fs/xfs/xfs_inode.c              | 20 +++++++-----
>  fs/xfs/xfs_inode_item.c         |  7 +++++
>  fs/xfs/xfs_inode_item.h         | 18 +++++++++--
>  5 files changed, 68 insertions(+), 40 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
> index 4504d215cd590..fe6c2e39be85d 100644
> --- a/fs/xfs/libxfs/xfs_trans_inode.c
> +++ b/fs/xfs/libxfs/xfs_trans_inode.c
...
> @@ -122,23 +117,30 @@ xfs_trans_log_inode(
>  	 * set however, then go ahead and bump the i_version counter
>  	 * unconditionally.
>  	 */
> -	if (!test_and_set_bit(XFS_LI_DIRTY, &ip->i_itemp->ili_item.li_flags) &&
> -	    IS_I_VERSION(VFS_I(ip))) {
> -		if (inode_maybe_inc_iversion(VFS_I(ip), flags & XFS_ILOG_CORE))
> -			flags |= XFS_ILOG_CORE;
> +	if (!test_and_set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags)) {
> +		if (IS_I_VERSION(inode) &&
> +		    inode_maybe_inc_iversion(inode, flags & XFS_ILOG_CORE))
> +			iversion_flags = XFS_ILOG_CORE;
>  	}
>  
> -	tp->t_flags |= XFS_TRANS_DIRTY;
> +	/*
> +	 * Record the specific change for fdatasync optimisation. This allows
> +	 * fdatasync to skip log forces for inodes that are only timestamp
> +	 * dirty. We do this before the change count so that the core being
> +	 * logged in this case does not impact on fdatasync behaviour.
> +	 */

We no longer do this before the change count logic so that part of the
comment is bogus.
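
i.e. after this patch the hunk above effectively orders things as:

	if (!test_and_set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags)) {
		/* change count / iversion bump now happens first */
	}
	spin_lock(&iip->ili_lock);
	iip->ili_fsync_fields |= flags;	/* fdatasync tracking second */

so the "before the change count" wording no longer matches the code.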

> +	spin_lock(&iip->ili_lock);
> +	iip->ili_fsync_fields |= flags;
>  
>  	/*
> -	 * Always OR in the bits from the ili_last_fields field.
> -	 * This is to coordinate with the xfs_iflush() and xfs_iflush_done()
> -	 * routines in the eventual clearing of the ili_fields bits.
> -	 * See the big comment in xfs_iflush() for an explanation of
> -	 * this coordination mechanism.
> +	 * Always OR in the bits from the ili_last_fields field.  This is to
> +	 * coordinate with the xfs_iflush() and xfs_iflush_done() routines in
> +	 * the eventual clearing of the ili_fields bits.  See the big comment in
> +	 * xfs_iflush() for an explanation of this coordination mechanism.
>  	 */
> -	flags |= ip->i_itemp->ili_last_fields;
> -	ip->i_itemp->ili_fields |= flags;
> +	iip->ili_fields |= (flags | iip->ili_last_fields |
> +			    iversion_flags);
> +	spin_unlock(&iip->ili_lock);
>  }
>  
>  int
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 403c90309a8ff..0abf770b77498 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -94,6 +94,7 @@ xfs_file_fsync(
>  {
>  	struct inode		*inode = file->f_mapping->host;
>  	struct xfs_inode	*ip = XFS_I(inode);
> +	struct xfs_inode_log_item *iip = ip->i_itemp;
>  	struct xfs_mount	*mp = ip->i_mount;
>  	int			error = 0;
>  	int			log_flushed = 0;
> @@ -137,13 +138,15 @@ xfs_file_fsync(
>  	xfs_ilock(ip, XFS_ILOCK_SHARED);
>  	if (xfs_ipincount(ip)) {
>  		if (!datasync ||
> -		    (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
> -			lsn = ip->i_itemp->ili_last_lsn;
> +		    (iip->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
> +			lsn = iip->ili_last_lsn;

I am still a little confused why the lock is elided in other read cases,
such as this one or perhaps the similar check in xfs_bmbt_to_iomap()...?

Similarly, it looks like we set the ili_[flush|last]_lsn fields outside
of this lock (though last_lsn looks like it's also covered by ilock),
yet the update to the inode_log_item struct implies they should be
protected. What's the intent there?
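
For reference (paraphrasing xfs_trans_priv.h from memory), the lsn
copy helper only takes the AIL lock on 32-bit builds, where a 64-bit
lsn store or load can tear:

#if BITS_PER_LONG != 64
static inline void
xfs_trans_ail_copy_lsn(
	struct xfs_ail	*ailp,
	xfs_lsn_t	*dst,
	xfs_lsn_t	*src)
{
	ASSERT(sizeof(xfs_lsn_t) == 8);
	spin_lock(&ailp->ail_lock);
	*dst = *src;
	spin_unlock(&ailp->ail_lock);
}
#else
/* a 64-bit store of an aligned lsn cannot tear */
static inline void
xfs_trans_ail_copy_lsn(
	struct xfs_ail	*ailp,
	xfs_lsn_t	*dst,
	xfs_lsn_t	*src)
{
	ASSERT(sizeof(xfs_lsn_t) == 8);
	*dst = *src;
}
#endif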

>  	}
>  
>  	if (lsn) {
>  		error = xfs_log_force_lsn(mp, lsn, XFS_LOG_SYNC, &log_flushed);
> -		ip->i_itemp->ili_fsync_fields = 0;
> +		spin_lock(&iip->ili_lock);
> +		iip->ili_fsync_fields = 0;
> +		spin_unlock(&iip->ili_lock);
>  	}
>  	xfs_iunlock(ip, XFS_ILOCK_SHARED);
>  
...
> diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h
> index 4de5070e07655..44c47c08b0b59 100644
> --- a/fs/xfs/xfs_inode_item.h
> +++ b/fs/xfs/xfs_inode_item.h
> @@ -16,12 +16,24 @@ struct xfs_mount;
>  struct xfs_inode_log_item {
>  	struct xfs_log_item	ili_item;	   /* common portion */
>  	struct xfs_inode	*ili_inode;	   /* inode ptr */
> -	xfs_lsn_t		ili_flush_lsn;	   /* lsn at last flush */
> -	xfs_lsn_t		ili_last_lsn;	   /* lsn at last transaction */
> -	unsigned short		ili_lock_flags;	   /* lock flags */
> +	unsigned short		ili_lock_flags;	   /* inode lock flags */
> +	/*
> +	 * The ili_lock protects the interactions between the dirty state and
> +	 * the flush state of the inode log item. This allows us to do atomic
> +	 * modifications of multiple state fields without having to hold a
> +	 * specific inode lock to serialise them.
> +	 *
> +	 * We need atomic changes between indoe dirtying, inode flushing and

s/indoe/inode/

Brian

> +	 * inode completion, but these all hold different combinations of
> +	 * ILOCK and iflock and hence we need some other method of serialising
> +	 * updates to the flush state.
> +	 */
> +	spinlock_t		ili_lock;	   /* flush state lock */
>  	unsigned int		ili_last_fields;   /* fields when flushed */
>  	unsigned int		ili_fields;	   /* fields to be logged */
>  	unsigned int		ili_fsync_fields;  /* logged since last fsync */
> +	xfs_lsn_t		ili_flush_lsn;	   /* lsn at last flush */
> +	xfs_lsn_t		ili_last_lsn;	   /* lsn at last transaction */
>  };
>  
>  static inline int xfs_inode_clean(xfs_inode_t *ip)
> -- 
> 2.26.2.761.g0e0b3e54be
> 



* Re: [PATCH 04/30] xfs: mark inode buffers in cache
  2020-06-01 21:42 ` [PATCH 04/30] xfs: mark inode buffers in cache Dave Chinner
@ 2020-06-02 16:45   ` Brian Foster
  2020-06-02 19:22     ` Darrick J. Wong
  2020-06-02 21:29     ` Dave Chinner
  0 siblings, 2 replies; 80+ messages in thread
From: Brian Foster @ 2020-06-02 16:45 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Jun 02, 2020 at 07:42:25AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Inode buffers always have write IO callbacks, so by marking them
> directly we can avoid needing to attach ->b_iodone functions to
> them. This avoids an indirect call, and makes future modifications
> much simpler.
> 
> This is largely a rearrangement of the code at this point - no IO
> completion functionality changes yet, just a change in how the code
> is run.
> 

Ok, I was initially thinking this patch looked incomplete in that we
continue to set ->b_iodone() on inode buffers even though we'd never
call it. Looking ahead, I see that the next few patches continue to
clean that up to eventually remove ->b_iodone(), so that addresses that.

My only other curiosity is that while there may not be any functional
difference, this technically changes callback behavior in that we set
the new flag in some contexts that don't currently attach anything to
the buffer, right? E.g., xfs_trans_inode_alloc_buf() sets the flag on
inode chunk init, which means we can write out an inode buffer without
any attached/flushed inodes. Is the intent of that to support future
changes? If so, a note about that in the commit log would be helpful.

Brian

> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_buf.c       | 21 ++++++++++++++++-----
>  fs/xfs/xfs_buf.h       | 38 +++++++++++++++++++++++++-------------
>  fs/xfs/xfs_buf_item.c  | 42 +++++++++++++++++++++++++++++++-----------
>  fs/xfs/xfs_buf_item.h  |  1 +
>  fs/xfs/xfs_inode.c     |  2 +-
>  fs/xfs/xfs_trans_buf.c |  3 +++
>  6 files changed, 77 insertions(+), 30 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 9c2fbb6bbf89d..fcf650575be61 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -14,6 +14,8 @@
>  #include "xfs_mount.h"
>  #include "xfs_trace.h"
>  #include "xfs_log.h"
> +#include "xfs_trans.h"
> +#include "xfs_buf_item.h"
>  #include "xfs_errortag.h"
>  #include "xfs_error.h"
>  
> @@ -1202,12 +1204,21 @@ xfs_buf_ioend(
>  		bp->b_flags |= XBF_DONE;
>  	}
>  
> -	if (bp->b_iodone)
> +	if (read)
> +		goto out_finish;
> +
> +	if (bp->b_flags & _XBF_INODES) {
> +		xfs_buf_inode_iodone(bp);
> +		return;
> +	}
> +
> +	if (bp->b_iodone) {
>  		(*(bp->b_iodone))(bp);
> -	else if (bp->b_flags & XBF_ASYNC)
> -		xfs_buf_relse(bp);
> -	else
> -		complete(&bp->b_iowait);
> +		return;
> +	}
> +
> +out_finish:
> +	xfs_buf_ioend_finish(bp);
>  }
>  
>  static void
> diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
> index 050c53b739e24..2400cb90a04c6 100644
> --- a/fs/xfs/xfs_buf.h
> +++ b/fs/xfs/xfs_buf.h
> @@ -30,15 +30,18 @@
>  #define XBF_STALE	 (1 << 6) /* buffer has been staled, do not find it */
>  #define XBF_WRITE_FAIL	 (1 << 7) /* async writes have failed on this buffer */
>  
> -/* flags used only as arguments to access routines */
> -#define XBF_TRYLOCK	 (1 << 16)/* lock requested, but do not wait */
> -#define XBF_UNMAPPED	 (1 << 17)/* do not map the buffer */
> +/* buffer type flags for write callbacks */
> +#define _XBF_INODES	 (1 << 16)/* inode buffer */
>  
>  /* flags used only internally */
>  #define _XBF_PAGES	 (1 << 20)/* backed by refcounted pages */
>  #define _XBF_KMEM	 (1 << 21)/* backed by heap memory */
>  #define _XBF_DELWRI_Q	 (1 << 22)/* buffer on a delwri queue */
>  
> +/* flags used only as arguments to access routines */
> +#define XBF_TRYLOCK	 (1 << 30)/* lock requested, but do not wait */
> +#define XBF_UNMAPPED	 (1 << 31)/* do not map the buffer */
> +
>  typedef unsigned int xfs_buf_flags_t;
>  
>  #define XFS_BUF_FLAGS \
> @@ -50,12 +53,13 @@ typedef unsigned int xfs_buf_flags_t;
>  	{ XBF_DONE,		"DONE" }, \
>  	{ XBF_STALE,		"STALE" }, \
>  	{ XBF_WRITE_FAIL,	"WRITE_FAIL" }, \
> -	{ XBF_TRYLOCK,		"TRYLOCK" },	/* should never be set */\
> -	{ XBF_UNMAPPED,		"UNMAPPED" },	/* ditto */\
> +	{ _XBF_INODES,		"INODES" }, \
>  	{ _XBF_PAGES,		"PAGES" }, \
>  	{ _XBF_KMEM,		"KMEM" }, \
> -	{ _XBF_DELWRI_Q,	"DELWRI_Q" }
> -
> +	{ _XBF_DELWRI_Q,	"DELWRI_Q" }, \
> +	/* The following interface flags should never be set */ \
> +	{ XBF_TRYLOCK,		"TRYLOCK" }, \
> +	{ XBF_UNMAPPED,		"UNMAPPED" }
>  
>  /*
>   * Internal state flags.
> @@ -257,9 +261,23 @@ extern void xfs_buf_unlock(xfs_buf_t *);
>  #define xfs_buf_islocked(bp) \
>  	((bp)->b_sema.count <= 0)
>  
> +static inline void xfs_buf_relse(xfs_buf_t *bp)
> +{
> +	xfs_buf_unlock(bp);
> +	xfs_buf_rele(bp);
> +}
> +
>  /* Buffer Read and Write Routines */
>  extern int xfs_bwrite(struct xfs_buf *bp);
>  extern void xfs_buf_ioend(struct xfs_buf *bp);
> +static inline void xfs_buf_ioend_finish(struct xfs_buf *bp)
> +{
> +	if (bp->b_flags & XBF_ASYNC)
> +		xfs_buf_relse(bp);
> +	else
> +		complete(&bp->b_iowait);
> +}
> +
>  extern void __xfs_buf_ioerror(struct xfs_buf *bp, int error,
>  		xfs_failaddr_t failaddr);
>  #define xfs_buf_ioerror(bp, err) __xfs_buf_ioerror((bp), (err), __this_address)
> @@ -324,12 +342,6 @@ static inline int xfs_buf_ispinned(struct xfs_buf *bp)
>  	return atomic_read(&bp->b_pin_count);
>  }
>  
> -static inline void xfs_buf_relse(xfs_buf_t *bp)
> -{
> -	xfs_buf_unlock(bp);
> -	xfs_buf_rele(bp);
> -}
> -
>  static inline int
>  xfs_buf_verify_cksum(struct xfs_buf *bp, unsigned long cksum_offset)
>  {
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index 9e75e8d6042ec..8659cf4282a64 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
> @@ -1158,20 +1158,15 @@ xfs_buf_iodone_callback_error(
>  	return false;
>  }
>  
> -/*
> - * This is the iodone() function for buffers which have had callbacks attached
> - * to them by xfs_buf_attach_iodone(). We need to iterate the items on the
> - * callback list, mark the buffer as having no more callbacks and then push the
> - * buffer through IO completion processing.
> - */
> -void
> -xfs_buf_iodone_callbacks(
> +static void
> +xfs_buf_run_callbacks(
>  	struct xfs_buf		*bp)
>  {
> +
>  	/*
> -	 * If there is an error, process it. Some errors require us
> -	 * to run callbacks after failure processing is done so we
> -	 * detect that and take appropriate action.
> +	 * If there is an error, process it. Some errors require us to run
> +	 * callbacks after failure processing is done so we detect that and take
> +	 * appropriate action.
>  	 */
>  	if (bp->b_error && xfs_buf_iodone_callback_error(bp))
>  		return;
> @@ -1188,9 +1183,34 @@ xfs_buf_iodone_callbacks(
>  	bp->b_log_item = NULL;
>  	list_del_init(&bp->b_li_list);
>  	bp->b_iodone = NULL;
> +}
> +
> +/*
> + * This is the iodone() function for buffers which have had callbacks attached
> + * to them by xfs_buf_attach_iodone(). We need to iterate the items on the
> + * callback list, mark the buffer as having no more callbacks and then push the
> + * buffer through IO completion processing.
> + */
> +void
> +xfs_buf_iodone_callbacks(
> +	struct xfs_buf		*bp)
> +{
> +	xfs_buf_run_callbacks(bp);
>  	xfs_buf_ioend(bp);
>  }
>  
> +/*
> + * Inode buffer iodone callback function.
> + */
> +void
> +xfs_buf_inode_iodone(
> +	struct xfs_buf		*bp)
> +{
> +	xfs_buf_run_callbacks(bp);
> +	xfs_buf_ioend_finish(bp);
> +}
> +
> +
>  /*
>   * This is the iodone() function for buffers which have been
>   * logged.  It is called when they are eventually flushed out.
> diff --git a/fs/xfs/xfs_buf_item.h b/fs/xfs/xfs_buf_item.h
> index c9c57e2da9327..a342933ad9b8d 100644
> --- a/fs/xfs/xfs_buf_item.h
> +++ b/fs/xfs/xfs_buf_item.h
> @@ -59,6 +59,7 @@ void	xfs_buf_attach_iodone(struct xfs_buf *,
>  			      struct xfs_log_item *);
>  void	xfs_buf_iodone_callbacks(struct xfs_buf *);
>  void	xfs_buf_iodone(struct xfs_buf *, struct xfs_log_item *);
> +void	xfs_buf_inode_iodone(struct xfs_buf *);
>  bool	xfs_buf_log_check_iovec(struct xfs_log_iovec *iovec);
>  
>  extern kmem_zone_t	*xfs_buf_item_zone;
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index ac3c8af8c9a14..d5dee57f914a9 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -3860,13 +3860,13 @@ xfs_iflush_int(
>  	 * completion on the buffer to remove the inode from the AIL and release
>  	 * the flush lock.
>  	 */
> +	bp->b_flags |= _XBF_INODES;
>  	xfs_buf_attach_iodone(bp, xfs_iflush_done, &iip->ili_item);
>  
>  	/* generate the checksum. */
>  	xfs_dinode_calc_crc(mp, dip);
>  
>  	ASSERT(!list_empty(&bp->b_li_list));
> -	ASSERT(bp->b_iodone != NULL);
>  	return error;
>  }
>  
> diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c
> index 08174ffa21189..552d0869aa0fe 100644
> --- a/fs/xfs/xfs_trans_buf.c
> +++ b/fs/xfs/xfs_trans_buf.c
> @@ -626,6 +626,7 @@ xfs_trans_inode_buf(
>  	ASSERT(atomic_read(&bip->bli_refcount) > 0);
>  
>  	bip->bli_flags |= XFS_BLI_INODE_BUF;
> +	bp->b_flags |= _XBF_INODES;
>  	xfs_trans_buf_set_type(tp, bp, XFS_BLFT_DINO_BUF);
>  }
>  
> @@ -651,6 +652,7 @@ xfs_trans_stale_inode_buf(
>  
>  	bip->bli_flags |= XFS_BLI_STALE_INODE;
>  	bip->bli_item.li_cb = xfs_buf_iodone;
> +	bp->b_flags |= _XBF_INODES;
>  	xfs_trans_buf_set_type(tp, bp, XFS_BLFT_DINO_BUF);
>  }
>  
> @@ -675,6 +677,7 @@ xfs_trans_inode_alloc_buf(
>  	ASSERT(atomic_read(&bip->bli_refcount) > 0);
>  
>  	bip->bli_flags |= XFS_BLI_INODE_ALLOC_BUF;
> +	bp->b_flags |= _XBF_INODES;
>  	xfs_trans_buf_set_type(tp, bp, XFS_BLFT_DINO_BUF);
>  }
>  
> -- 
> 2.26.2.761.g0e0b3e54be
> 



* Re: [PATCH 05/30] xfs: mark dquot buffers in cache
  2020-06-01 21:42 ` [PATCH 05/30] xfs: mark dquot " Dave Chinner
@ 2020-06-02 16:45   ` Brian Foster
  2020-06-02 19:00   ` Darrick J. Wong
  1 sibling, 0 replies; 80+ messages in thread
From: Brian Foster @ 2020-06-02 16:45 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Jun 02, 2020 at 07:42:26AM +1000, Dave Chinner wrote:
> dquot buffers always have write IO callbacks, so by marking them
> directly we can avoid needing to attach ->b_iodone functions to
> them. This avoids an indirect call, and makes future modifications
> much simpler.
> 
> This is largely a rearrangement of the code at this point - no IO
> completion functionality changes yet, just a change in how the code
> is run.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Similar question as on the previous patch wrt to xfs_trans_dquot_buf(),
but otherwise looks reasonable:

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_buf.c       |  5 +++++
>  fs/xfs/xfs_buf.h       |  2 ++
>  fs/xfs/xfs_buf_item.c  | 10 ++++++++++
>  fs/xfs/xfs_buf_item.h  |  1 +
>  fs/xfs/xfs_dquot.c     |  1 +
>  fs/xfs/xfs_trans_buf.c |  1 +
>  6 files changed, 20 insertions(+)
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index fcf650575be61..3bffde8640a52 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -1212,6 +1212,11 @@ xfs_buf_ioend(
>  		return;
>  	}
>  
> +	if (bp->b_flags & _XBF_DQUOTS) {
> +		xfs_buf_dquot_iodone(bp);
> +		return;
> +	}
> +
>  	if (bp->b_iodone) {
>  		(*(bp->b_iodone))(bp);
>  		return;
> diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
> index 2400cb90a04c6..c1d0843206dd6 100644
> --- a/fs/xfs/xfs_buf.h
> +++ b/fs/xfs/xfs_buf.h
> @@ -32,6 +32,7 @@
>  
>  /* buffer type flags for write callbacks */
>  #define _XBF_INODES	 (1 << 16)/* inode buffer */
> +#define _XBF_DQUOTS	 (1 << 17)/* dquot buffer */
>  
>  /* flags used only internally */
>  #define _XBF_PAGES	 (1 << 20)/* backed by refcounted pages */
> @@ -54,6 +55,7 @@ typedef unsigned int xfs_buf_flags_t;
>  	{ XBF_STALE,		"STALE" }, \
>  	{ XBF_WRITE_FAIL,	"WRITE_FAIL" }, \
>  	{ _XBF_INODES,		"INODES" }, \
> +	{ _XBF_DQUOTS,		"DQUOTS" }, \
>  	{ _XBF_PAGES,		"PAGES" }, \
>  	{ _XBF_KMEM,		"KMEM" }, \
>  	{ _XBF_DELWRI_Q,	"DELWRI_Q" }, \
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index 8659cf4282a64..a42cdf9ccc47d 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
> @@ -1210,6 +1210,16 @@ xfs_buf_inode_iodone(
>  	xfs_buf_ioend_finish(bp);
>  }
>  
> +/*
> + * Dquot buffer iodone callback function.
> + */
> +void
> +xfs_buf_dquot_iodone(
> +	struct xfs_buf		*bp)
> +{
> +	xfs_buf_run_callbacks(bp);
> +	xfs_buf_ioend_finish(bp);
> +}
>  
>  /*
>   * This is the iodone() function for buffers which have been
> diff --git a/fs/xfs/xfs_buf_item.h b/fs/xfs/xfs_buf_item.h
> index a342933ad9b8d..27d13d29b5bbb 100644
> --- a/fs/xfs/xfs_buf_item.h
> +++ b/fs/xfs/xfs_buf_item.h
> @@ -60,6 +60,7 @@ void	xfs_buf_attach_iodone(struct xfs_buf *,
>  void	xfs_buf_iodone_callbacks(struct xfs_buf *);
>  void	xfs_buf_iodone(struct xfs_buf *, struct xfs_log_item *);
>  void	xfs_buf_inode_iodone(struct xfs_buf *);
> +void	xfs_buf_dquot_iodone(struct xfs_buf *);
>  bool	xfs_buf_log_check_iovec(struct xfs_log_iovec *iovec);
>  
>  extern kmem_zone_t	*xfs_buf_item_zone;
> diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
> index d5b7f03e93c8d..2e2146fa0914c 100644
> --- a/fs/xfs/xfs_dquot.c
> +++ b/fs/xfs/xfs_dquot.c
> @@ -1179,6 +1179,7 @@ xfs_qm_dqflush(
>  	 * Attach an iodone routine so that we can remove this dquot from the
>  	 * AIL and release the flush lock once the dquot is synced to disk.
>  	 */
> +	bp->b_flags |= _XBF_DQUOTS;
>  	xfs_buf_attach_iodone(bp, xfs_qm_dqflush_done,
>  				  &dqp->q_logitem.qli_item);
>  
> diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c
> index 552d0869aa0fe..93d62cb864c15 100644
> --- a/fs/xfs/xfs_trans_buf.c
> +++ b/fs/xfs/xfs_trans_buf.c
> @@ -788,5 +788,6 @@ xfs_trans_dquot_buf(
>  		break;
>  	}
>  
> +	bp->b_flags |= _XBF_DQUOTS;
>  	xfs_trans_buf_set_type(tp, bp, type);
>  }
> -- 
> 2.26.2.761.g0e0b3e54be
> 



* Re: [PATCH 06/30] xfs: mark log recovery buffers for completion
  2020-06-01 21:42 ` [PATCH 06/30] xfs: mark log recovery buffers for completion Dave Chinner
@ 2020-06-02 16:45   ` Brian Foster
  2020-06-02 19:24   ` Darrick J. Wong
  1 sibling, 0 replies; 80+ messages in thread
From: Brian Foster @ 2020-06-02 16:45 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Jun 02, 2020 at 07:42:27AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Log recovery has its own buffer write completion handler for
> buffers that it directly recovers. Convert these to direct calls by
> flagging these buffers as being log recovery buffers. The flag will
> get cleared by the log recovery IO completion routine, so it will
> never leak out of log recovery.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_buf.c                | 10 ++++++++++
>  fs/xfs/xfs_buf.h                |  2 ++
>  fs/xfs/xfs_buf_item_recover.c   |  5 ++---
>  fs/xfs/xfs_dquot_item_recover.c |  2 +-
>  fs/xfs/xfs_inode_item_recover.c |  2 +-
>  fs/xfs/xfs_log_recover.c        |  5 ++---
>  6 files changed, 18 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 3bffde8640a52..0a69de674af9d 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -14,6 +14,7 @@
>  #include "xfs_mount.h"
>  #include "xfs_trace.h"
>  #include "xfs_log.h"
> +#include "xfs_log_recover.h"
>  #include "xfs_trans.h"
>  #include "xfs_buf_item.h"
>  #include "xfs_errortag.h"
> @@ -1207,6 +1208,15 @@ xfs_buf_ioend(
>  	if (read)
>  		goto out_finish;
>  
> +	/*
> +	 * If this is a log recovery buffer, we aren't doing transactional IO
> +	 * yet so we need to let it handle IO completions.
> +	 */
> +	if (bp->b_flags & _XBF_LOGRECOVERY) {
> +		xlog_recover_iodone(bp);
> +		return;
> +	}
> +
>  	if (bp->b_flags & _XBF_INODES) {
>  		xfs_buf_inode_iodone(bp);
>  		return;
> diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
> index c1d0843206dd6..30dabc5bae96d 100644
> --- a/fs/xfs/xfs_buf.h
> +++ b/fs/xfs/xfs_buf.h
> @@ -33,6 +33,7 @@
>  /* buffer type flags for write callbacks */
>  #define _XBF_INODES	 (1 << 16)/* inode buffer */
>  #define _XBF_DQUOTS	 (1 << 17)/* dquot buffer */
> +#define _XBF_LOGRECOVERY	 (1 << 18)/* log recovery buffer */
>  
>  /* flags used only internally */
>  #define _XBF_PAGES	 (1 << 20)/* backed by refcounted pages */
> @@ -56,6 +57,7 @@ typedef unsigned int xfs_buf_flags_t;
>  	{ XBF_WRITE_FAIL,	"WRITE_FAIL" }, \
>  	{ _XBF_INODES,		"INODES" }, \
>  	{ _XBF_DQUOTS,		"DQUOTS" }, \
> +	{ _XBF_LOGRECOVERY,		"LOG_RECOVERY" }, \
>  	{ _XBF_PAGES,		"PAGES" }, \
>  	{ _XBF_KMEM,		"KMEM" }, \
>  	{ _XBF_DELWRI_Q,	"DELWRI_Q" }, \
> diff --git a/fs/xfs/xfs_buf_item_recover.c b/fs/xfs/xfs_buf_item_recover.c
> index 04faa7310c4f0..74c851f60eeeb 100644
> --- a/fs/xfs/xfs_buf_item_recover.c
> +++ b/fs/xfs/xfs_buf_item_recover.c
> @@ -419,8 +419,7 @@ xlog_recover_validate_buf_type(
>  	if (bp->b_ops) {
>  		struct xfs_buf_log_item	*bip;
>  
> -		ASSERT(!bp->b_iodone || bp->b_iodone == xlog_recover_iodone);
> -		bp->b_iodone = xlog_recover_iodone;
> +		bp->b_flags |= _XBF_LOGRECOVERY;
>  		xfs_buf_item_init(bp, mp);
>  		bip = bp->b_log_item;
>  		bip->bli_item.li_lsn = current_lsn;
> @@ -963,7 +962,7 @@ xlog_recover_buf_commit_pass2(
>  		error = xfs_bwrite(bp);
>  	} else {
>  		ASSERT(bp->b_mount == mp);
> -		bp->b_iodone = xlog_recover_iodone;
> +		bp->b_flags |= _XBF_LOGRECOVERY;
>  		xfs_buf_delwri_queue(bp, buffer_list);
>  	}
>  
> diff --git a/fs/xfs/xfs_dquot_item_recover.c b/fs/xfs/xfs_dquot_item_recover.c
> index 3400be4c88f08..f9ea9f55aa7cc 100644
> --- a/fs/xfs/xfs_dquot_item_recover.c
> +++ b/fs/xfs/xfs_dquot_item_recover.c
> @@ -153,7 +153,7 @@ xlog_recover_dquot_commit_pass2(
>  
>  	ASSERT(dq_f->qlf_size == 2);
>  	ASSERT(bp->b_mount == mp);
> -	bp->b_iodone = xlog_recover_iodone;
> +	bp->b_flags |= _XBF_LOGRECOVERY;
>  	xfs_buf_delwri_queue(bp, buffer_list);
>  
>  out_release:
> diff --git a/fs/xfs/xfs_inode_item_recover.c b/fs/xfs/xfs_inode_item_recover.c
> index dc3e26ff16c90..5e0d291835b35 100644
> --- a/fs/xfs/xfs_inode_item_recover.c
> +++ b/fs/xfs/xfs_inode_item_recover.c
> @@ -376,7 +376,7 @@ xlog_recover_inode_commit_pass2(
>  	xfs_dinode_calc_crc(log->l_mp, dip);
>  
>  	ASSERT(bp->b_mount == mp);
> -	bp->b_iodone = xlog_recover_iodone;
> +	bp->b_flags |= _XBF_LOGRECOVERY;
>  	xfs_buf_delwri_queue(bp, buffer_list);
>  
>  out_release:
> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> index ec015df55b77a..52a65a74208ff 100644
> --- a/fs/xfs/xfs_log_recover.c
> +++ b/fs/xfs/xfs_log_recover.c
> @@ -287,9 +287,8 @@ xlog_recover_iodone(
>  	if (bp->b_log_item)
>  		xfs_buf_item_relse(bp);
>  	ASSERT(bp->b_log_item == NULL);
> -
> -	bp->b_iodone = NULL;
> -	xfs_buf_ioend(bp);
> +	bp->b_flags &= ~_XBF_LOGRECOVERY;
> +	xfs_buf_ioend_finish(bp);
>  }
>  
>  /*
> -- 
> 2.26.2.761.g0e0b3e54be
> 



* Re: [PATCH 07/30] xfs: call xfs_buf_iodone directly
  2020-06-01 21:42 ` [PATCH 07/30] xfs: call xfs_buf_iodone directly Dave Chinner
@ 2020-06-02 16:47   ` Brian Foster
  2020-06-02 21:38     ` Dave Chinner
  0 siblings, 1 reply; 80+ messages in thread
From: Brian Foster @ 2020-06-02 16:47 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Jun 02, 2020 at 07:42:28AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> All unmarked dirty buffers should be in the AIL and have log items
> attached to them. Hence when they are written, we will run a
> callback to remove the item from the AIL if appropriate. Now that
> we've handled inode and dquot buffers, all remaining calls are to
> xfs_buf_iodone() and so we can hard code this rather than use an
> indirect call.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
> Reviewed-by: Amir Goldstein <amir73il@gmail.com>
> ---
>  fs/xfs/xfs_buf.c       | 24 ++++++++----------------
>  fs/xfs/xfs_buf.h       |  6 +-----
>  fs/xfs/xfs_buf_item.c  | 40 ++++++++++------------------------------
>  fs/xfs/xfs_buf_item.h  |  4 ++--
>  fs/xfs/xfs_trans_buf.c | 13 +++----------
>  5 files changed, 24 insertions(+), 63 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 0a69de674af9d..d7695b638e994 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
...
> @@ -1226,14 +1225,7 @@ xfs_buf_ioend(
>  		xfs_buf_dquot_iodone(bp);
>  		return;
>  	}
> -
> -	if (bp->b_iodone) {
> -		(*(bp->b_iodone))(bp);
> -		return;
> -	}
> -
> -out_finish:
> -	xfs_buf_ioend_finish(bp);
> +	xfs_buf_iodone(bp);

The way this function ends up would probably look nicer as an if/else
chain rather than a sequence of internal return statements.
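
e.g. something like (based on the hunks in this and the previous
patches):

	if (bp->b_flags & _XBF_LOGRECOVERY)
		xlog_recover_iodone(bp);
	else if (bp->b_flags & _XBF_INODES)
		xfs_buf_inode_iodone(bp);
	else if (bp->b_flags & _XBF_DQUOTS)
		xfs_buf_dquot_iodone(bp);
	else
		xfs_buf_iodone(bp);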

>  }
>  
>  static void
...
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index a42cdf9ccc47d..d87ae6363a130 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
...
> @@ -1182,28 +1166,24 @@ xfs_buf_run_callbacks(
>  	xfs_buf_do_callbacks(bp);
>  	bp->b_log_item = NULL;
>  	list_del_init(&bp->b_li_list);
> -	bp->b_iodone = NULL;
>  }
>  
>  /*
> - * This is the iodone() function for buffers which have had callbacks attached
> - * to them by xfs_buf_attach_iodone(). We need to iterate the items on the
> - * callback list, mark the buffer as having no more callbacks and then push the
> - * buffer through IO completion processing.
> + * Inode buffer iodone callback function.
>   */
>  void
> -xfs_buf_iodone_callbacks(
> +xfs_buf_inode_iodone(
>  	struct xfs_buf		*bp)
>  {
>  	xfs_buf_run_callbacks(bp);
> -	xfs_buf_ioend(bp);
> +	xfs_buf_ioend_finish(bp);
>  }
>  
>  /*
> - * Inode buffer iodone callback function.
> + * Dquot buffer iodone callback function.
>   */
>  void
> -xfs_buf_inode_iodone(
> +xfs_buf_dquot_iodone(
>  	struct xfs_buf		*bp)
>  {
>  	xfs_buf_run_callbacks(bp);
> @@ -1211,10 +1191,10 @@ xfs_buf_inode_iodone(
>  }
>  
>  /*
> - * Dquot buffer iodone callback function.
> + * Dirty buffer iodone callback function.
>   */
>  void
> -xfs_buf_dquot_iodone(
> +xfs_buf_iodone(
>  	struct xfs_buf		*bp)
>  {
>  	xfs_buf_run_callbacks(bp);
> @@ -1229,7 +1209,7 @@ xfs_buf_dquot_iodone(
>   * care of cleaning up the buffer itself.
>   */
>  void
> -xfs_buf_iodone(
> +xfs_buf_item_iodone(
>  	struct xfs_buf		*bp,
>  	struct xfs_log_item	*lip)
>  {

Wow, that's a nasty diff. Another recent instance where 'git show
--patience' comes in handy... :)

BTW, is there a longer term need to have three separate iodone functions
here that do the same thing?
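
At this point all three bodies are the same two calls, so in
principle they could collapse into a single helper (hypothetical
sketch, name invented):

	void
	xfs_buf_callback_iodone(
		struct xfs_buf		*bp)
	{
		xfs_buf_run_callbacks(bp);
		xfs_buf_ioend_finish(bp);
	}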

Brian

> diff --git a/fs/xfs/xfs_buf_item.h b/fs/xfs/xfs_buf_item.h
> index 27d13d29b5bbb..610cd00193289 100644
> --- a/fs/xfs/xfs_buf_item.h
> +++ b/fs/xfs/xfs_buf_item.h
> @@ -57,10 +57,10 @@ bool	xfs_buf_item_dirty_format(struct xfs_buf_log_item *);
>  void	xfs_buf_attach_iodone(struct xfs_buf *,
>  			      void(*)(struct xfs_buf *, struct xfs_log_item *),
>  			      struct xfs_log_item *);
> -void	xfs_buf_iodone_callbacks(struct xfs_buf *);
> -void	xfs_buf_iodone(struct xfs_buf *, struct xfs_log_item *);
> +void	xfs_buf_item_iodone(struct xfs_buf *, struct xfs_log_item *);
>  void	xfs_buf_inode_iodone(struct xfs_buf *);
>  void	xfs_buf_dquot_iodone(struct xfs_buf *);
> +void	xfs_buf_iodone(struct xfs_buf *);
>  bool	xfs_buf_log_check_iovec(struct xfs_log_iovec *iovec);
>  
>  extern kmem_zone_t	*xfs_buf_item_zone;
> diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c
> index 93d62cb864c15..6752676b94fe7 100644
> --- a/fs/xfs/xfs_trans_buf.c
> +++ b/fs/xfs/xfs_trans_buf.c
> @@ -465,24 +465,17 @@ xfs_trans_dirty_buf(
>  
>  	ASSERT(bp->b_transp == tp);
>  	ASSERT(bip != NULL);
> -	ASSERT(bp->b_iodone == NULL ||
> -	       bp->b_iodone == xfs_buf_iodone_callbacks);
>  
>  	/*
>  	 * Mark the buffer as needing to be written out eventually,
>  	 * and set its iodone function to remove the buffer's buf log
>  	 * item from the AIL and free it when the buffer is flushed
> -	 * to disk.  See xfs_buf_attach_iodone() for more details
> -	 * on li_cb and xfs_buf_iodone_callbacks().
> -	 * If we end up aborting this transaction, we trap this buffer
> -	 * inside the b_bdstrat callback so that this won't get written to
> -	 * disk.
> +	 * to disk.
>  	 */
>  	bp->b_flags |= XBF_DONE;
>  
>  	ASSERT(atomic_read(&bip->bli_refcount) > 0);
> -	bp->b_iodone = xfs_buf_iodone_callbacks;
> -	bip->bli_item.li_cb = xfs_buf_iodone;
> +	bip->bli_item.li_cb = xfs_buf_item_iodone;
>  
>  	/*
>  	 * If we invalidated the buffer within this transaction, then
> @@ -651,7 +644,7 @@ xfs_trans_stale_inode_buf(
>  	ASSERT(atomic_read(&bip->bli_refcount) > 0);
>  
>  	bip->bli_flags |= XFS_BLI_STALE_INODE;
> -	bip->bli_item.li_cb = xfs_buf_iodone;
> +	bip->bli_item.li_cb = xfs_buf_item_iodone;
>  	bp->b_flags |= _XBF_INODES;
>  	xfs_trans_buf_set_type(tp, bp, XFS_BLFT_DINO_BUF);
>  }
> -- 
> 2.26.2.761.g0e0b3e54be
> 



* Re: [PATCH 08/30] xfs: clean up whacky buffer log item list reinit
  2020-06-01 21:42 ` [PATCH 08/30] xfs: clean up whacky buffer log item list reinit Dave Chinner
@ 2020-06-02 16:47   ` Brian Foster
  0 siblings, 0 replies; 80+ messages in thread
From: Brian Foster @ 2020-06-02 16:47 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Jun 02, 2020 at 07:42:29AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When we've emptied the buffer log item list, it does a list_del_init
> on itself to reset its pointers to itself. This is unnecessary as
> the list is already empty at this point - it was a left-over
> fragment from the list_head conversion of the buffer log item list.
> Remove them.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_buf_item.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index d87ae6363a130..5b3cd5e90947c 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
> @@ -459,7 +459,6 @@ xfs_buf_item_unpin(
>  		if (bip->bli_flags & XFS_BLI_STALE_INODE) {
>  			xfs_buf_do_callbacks(bp);
>  			bp->b_log_item = NULL;
> -			list_del_init(&bp->b_li_list);
>  		} else {
>  			xfs_trans_ail_delete(lip, SHUTDOWN_LOG_IO_ERROR);
>  			xfs_buf_item_relse(bp);
> @@ -1165,7 +1164,6 @@ xfs_buf_run_callbacks(
>  
>  	xfs_buf_do_callbacks(bp);
>  	bp->b_log_item = NULL;
> -	list_del_init(&bp->b_li_list);
>  }
>  
>  /*
> -- 
> 2.26.2.761.g0e0b3e54be
> 



* Re: [PATCH 05/30] xfs: mark dquot buffers in cache
  2020-06-01 21:42 ` [PATCH 05/30] xfs: mark dquot " Dave Chinner
  2020-06-02 16:45   ` Brian Foster
@ 2020-06-02 19:00   ` Darrick J. Wong
  1 sibling, 0 replies; 80+ messages in thread
From: Darrick J. Wong @ 2020-06-02 19:00 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Jun 02, 2020 at 07:42:26AM +1000, Dave Chinner wrote:
> dquot buffers always have write IO callbacks, so by marking them
> directly we can avoid needing to attach ->b_iodone functions to
> them. This avoids an indirect call, and makes future modifications
> much simpler.
> 
> This is largely a rearrangement of the code at this point - no IO
> completion functionality changes yet, just a change in how the code
> is run.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Seems fine to me,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/xfs/xfs_buf.c       |  5 +++++
>  fs/xfs/xfs_buf.h       |  2 ++
>  fs/xfs/xfs_buf_item.c  | 10 ++++++++++
>  fs/xfs/xfs_buf_item.h  |  1 +
>  fs/xfs/xfs_dquot.c     |  1 +
>  fs/xfs/xfs_trans_buf.c |  1 +
>  6 files changed, 20 insertions(+)
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index fcf650575be61..3bffde8640a52 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -1212,6 +1212,11 @@ xfs_buf_ioend(
>  		return;
>  	}
>  
> +	if (bp->b_flags & _XBF_DQUOTS) {
> +		xfs_buf_dquot_iodone(bp);
> +		return;
> +	}
> +
>  	if (bp->b_iodone) {
>  		(*(bp->b_iodone))(bp);
>  		return;
> diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
> index 2400cb90a04c6..c1d0843206dd6 100644
> --- a/fs/xfs/xfs_buf.h
> +++ b/fs/xfs/xfs_buf.h
> @@ -32,6 +32,7 @@
>  
>  /* buffer type flags for write callbacks */
>  #define _XBF_INODES	 (1 << 16)/* inode buffer */
> +#define _XBF_DQUOTS	 (1 << 17)/* dquot buffer */
>  
>  /* flags used only internally */
>  #define _XBF_PAGES	 (1 << 20)/* backed by refcounted pages */
> @@ -54,6 +55,7 @@ typedef unsigned int xfs_buf_flags_t;
>  	{ XBF_STALE,		"STALE" }, \
>  	{ XBF_WRITE_FAIL,	"WRITE_FAIL" }, \
>  	{ _XBF_INODES,		"INODES" }, \
> +	{ _XBF_DQUOTS,		"DQUOTS" }, \
>  	{ _XBF_PAGES,		"PAGES" }, \
>  	{ _XBF_KMEM,		"KMEM" }, \
>  	{ _XBF_DELWRI_Q,	"DELWRI_Q" }, \
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index 8659cf4282a64..a42cdf9ccc47d 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
> @@ -1210,6 +1210,16 @@ xfs_buf_inode_iodone(
>  	xfs_buf_ioend_finish(bp);
>  }
>  
> +/*
> + * Dquot buffer iodone callback function.
> + */
> +void
> +xfs_buf_dquot_iodone(
> +	struct xfs_buf		*bp)
> +{
> +	xfs_buf_run_callbacks(bp);
> +	xfs_buf_ioend_finish(bp);
> +}
>  
>  /*
>   * This is the iodone() function for buffers which have been
> diff --git a/fs/xfs/xfs_buf_item.h b/fs/xfs/xfs_buf_item.h
> index a342933ad9b8d..27d13d29b5bbb 100644
> --- a/fs/xfs/xfs_buf_item.h
> +++ b/fs/xfs/xfs_buf_item.h
> @@ -60,6 +60,7 @@ void	xfs_buf_attach_iodone(struct xfs_buf *,
>  void	xfs_buf_iodone_callbacks(struct xfs_buf *);
>  void	xfs_buf_iodone(struct xfs_buf *, struct xfs_log_item *);
>  void	xfs_buf_inode_iodone(struct xfs_buf *);
> +void	xfs_buf_dquot_iodone(struct xfs_buf *);
>  bool	xfs_buf_log_check_iovec(struct xfs_log_iovec *iovec);
>  
>  extern kmem_zone_t	*xfs_buf_item_zone;
> diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
> index d5b7f03e93c8d..2e2146fa0914c 100644
> --- a/fs/xfs/xfs_dquot.c
> +++ b/fs/xfs/xfs_dquot.c
> @@ -1179,6 +1179,7 @@ xfs_qm_dqflush(
>  	 * Attach an iodone routine so that we can remove this dquot from the
>  	 * AIL and release the flush lock once the dquot is synced to disk.
>  	 */
> +	bp->b_flags |= _XBF_DQUOTS;
>  	xfs_buf_attach_iodone(bp, xfs_qm_dqflush_done,
>  				  &dqp->q_logitem.qli_item);
>  
> diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c
> index 552d0869aa0fe..93d62cb864c15 100644
> --- a/fs/xfs/xfs_trans_buf.c
> +++ b/fs/xfs/xfs_trans_buf.c
> @@ -788,5 +788,6 @@ xfs_trans_dquot_buf(
>  		break;
>  	}
>  
> +	bp->b_flags |= _XBF_DQUOTS;
>  	xfs_trans_buf_set_type(tp, bp, type);
>  }
> -- 
> 2.26.2.761.g0e0b3e54be
> 


* Re: [PATCH 04/30] xfs: mark inode buffers in cache
  2020-06-02 16:45   ` Brian Foster
@ 2020-06-02 19:22     ` Darrick J. Wong
  2020-06-02 21:29     ` Dave Chinner
  1 sibling, 0 replies; 80+ messages in thread
From: Darrick J. Wong @ 2020-06-02 19:22 UTC (permalink / raw)
  To: Brian Foster; +Cc: Dave Chinner, linux-xfs

On Tue, Jun 02, 2020 at 12:45:35PM -0400, Brian Foster wrote:
> On Tue, Jun 02, 2020 at 07:42:25AM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Inode buffers always have write IO callbacks, so by marking them
> > directly we can avoid needing to attach ->b_iodone functions to
> > them. This avoids an indirect call, and makes future modifications
> > much simpler.
> > 
> > This is largely a rearrangement of the code at this point - no IO
> > completion functionality changes at this point, just how the
> > code is run is modified.
> > 
> 
> Ok, I was initially thinking this patch looked incomplete in that we
> continue to set ->b_iodone() on inode buffers even though we'd never
> call it. Looking ahead, I see that the next few patches continue to
> clean that up to eventually remove ->b_iodone(), so that addresses that.
> 
> My only other curiosity is that while there may not be any functional
> difference, this technically changes callback behavior in that we set
> the new flag in some contexts that don't currently attach anything to
> the buffer, right? E.g., xfs_trans_inode_alloc_buf() sets the flag on
> inode chunk init, which means we can write out an inode buffer without
> any attached/flushed inodes. Is the intent of that to support future
> changes? If so, a note about that in the commit log would be helpful.

I had kinda wondered that myself...  I /think/ in the
xfs_trans_inode_alloc_buf case there won't be any inodes attached
because we mark the buffer delwri (v4) or ordered (v5) so the buffer
should get written out before we ever get the chance to attach inodes;
and in the stale case, the inodes were already staled so we're done
writing them?

--D

> Brian
> 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_buf.c       | 21 ++++++++++++++++-----
> >  fs/xfs/xfs_buf.h       | 38 +++++++++++++++++++++++++-------------
> >  fs/xfs/xfs_buf_item.c  | 42 +++++++++++++++++++++++++++++++-----------
> >  fs/xfs/xfs_buf_item.h  |  1 +
> >  fs/xfs/xfs_inode.c     |  2 +-
> >  fs/xfs/xfs_trans_buf.c |  3 +++
> >  6 files changed, 77 insertions(+), 30 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > index 9c2fbb6bbf89d..fcf650575be61 100644
> > --- a/fs/xfs/xfs_buf.c
> > +++ b/fs/xfs/xfs_buf.c
> > @@ -14,6 +14,8 @@
> >  #include "xfs_mount.h"
> >  #include "xfs_trace.h"
> >  #include "xfs_log.h"
> > +#include "xfs_trans.h"
> > +#include "xfs_buf_item.h"
> >  #include "xfs_errortag.h"
> >  #include "xfs_error.h"
> >  
> > @@ -1202,12 +1204,21 @@ xfs_buf_ioend(
> >  		bp->b_flags |= XBF_DONE;
> >  	}
> >  
> > -	if (bp->b_iodone)
> > +	if (read)
> > +		goto out_finish;
> > +
> > +	if (bp->b_flags & _XBF_INODES) {
> > +		xfs_buf_inode_iodone(bp);
> > +		return;
> > +	}
> > +
> > +	if (bp->b_iodone) {
> >  		(*(bp->b_iodone))(bp);
> > -	else if (bp->b_flags & XBF_ASYNC)
> > -		xfs_buf_relse(bp);
> > -	else
> > -		complete(&bp->b_iowait);
> > +		return;
> > +	}
> > +
> > +out_finish:
> > +	xfs_buf_ioend_finish(bp);
> >  }
> >  
> >  static void
> > diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
> > index 050c53b739e24..2400cb90a04c6 100644
> > --- a/fs/xfs/xfs_buf.h
> > +++ b/fs/xfs/xfs_buf.h
> > @@ -30,15 +30,18 @@
> >  #define XBF_STALE	 (1 << 6) /* buffer has been staled, do not find it */
> >  #define XBF_WRITE_FAIL	 (1 << 7) /* async writes have failed on this buffer */
> >  
> > -/* flags used only as arguments to access routines */
> > -#define XBF_TRYLOCK	 (1 << 16)/* lock requested, but do not wait */
> > -#define XBF_UNMAPPED	 (1 << 17)/* do not map the buffer */
> > +/* buffer type flags for write callbacks */
> > +#define _XBF_INODES	 (1 << 16)/* inode buffer */
> >  
> >  /* flags used only internally */
> >  #define _XBF_PAGES	 (1 << 20)/* backed by refcounted pages */
> >  #define _XBF_KMEM	 (1 << 21)/* backed by heap memory */
> >  #define _XBF_DELWRI_Q	 (1 << 22)/* buffer on a delwri queue */
> >  
> > +/* flags used only as arguments to access routines */
> > +#define XBF_TRYLOCK	 (1 << 30)/* lock requested, but do not wait */
> > +#define XBF_UNMAPPED	 (1 << 31)/* do not map the buffer */
> > +
> >  typedef unsigned int xfs_buf_flags_t;
> >  
> >  #define XFS_BUF_FLAGS \
> > @@ -50,12 +53,13 @@ typedef unsigned int xfs_buf_flags_t;
> >  	{ XBF_DONE,		"DONE" }, \
> >  	{ XBF_STALE,		"STALE" }, \
> >  	{ XBF_WRITE_FAIL,	"WRITE_FAIL" }, \
> > -	{ XBF_TRYLOCK,		"TRYLOCK" },	/* should never be set */\
> > -	{ XBF_UNMAPPED,		"UNMAPPED" },	/* ditto */\
> > +	{ _XBF_INODES,		"INODES" }, \
> >  	{ _XBF_PAGES,		"PAGES" }, \
> >  	{ _XBF_KMEM,		"KMEM" }, \
> > -	{ _XBF_DELWRI_Q,	"DELWRI_Q" }
> > -
> > +	{ _XBF_DELWRI_Q,	"DELWRI_Q" }, \
> > +	/* The following interface flags should never be set */ \
> > +	{ XBF_TRYLOCK,		"TRYLOCK" }, \
> > +	{ XBF_UNMAPPED,		"UNMAPPED" }
> >  
> >  /*
> >   * Internal state flags.
> > @@ -257,9 +261,23 @@ extern void xfs_buf_unlock(xfs_buf_t *);
> >  #define xfs_buf_islocked(bp) \
> >  	((bp)->b_sema.count <= 0)
> >  
> > +static inline void xfs_buf_relse(xfs_buf_t *bp)
> > +{
> > +	xfs_buf_unlock(bp);
> > +	xfs_buf_rele(bp);
> > +}
> > +
> >  /* Buffer Read and Write Routines */
> >  extern int xfs_bwrite(struct xfs_buf *bp);
> >  extern void xfs_buf_ioend(struct xfs_buf *bp);
> > +static inline void xfs_buf_ioend_finish(struct xfs_buf *bp)
> > +{
> > +	if (bp->b_flags & XBF_ASYNC)
> > +		xfs_buf_relse(bp);
> > +	else
> > +		complete(&bp->b_iowait);
> > +}
> > +
> >  extern void __xfs_buf_ioerror(struct xfs_buf *bp, int error,
> >  		xfs_failaddr_t failaddr);
> >  #define xfs_buf_ioerror(bp, err) __xfs_buf_ioerror((bp), (err), __this_address)
> > @@ -324,12 +342,6 @@ static inline int xfs_buf_ispinned(struct xfs_buf *bp)
> >  	return atomic_read(&bp->b_pin_count);
> >  }
> >  
> > -static inline void xfs_buf_relse(xfs_buf_t *bp)
> > -{
> > -	xfs_buf_unlock(bp);
> > -	xfs_buf_rele(bp);
> > -}
> > -
> >  static inline int
> >  xfs_buf_verify_cksum(struct xfs_buf *bp, unsigned long cksum_offset)
> >  {
> > diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> > index 9e75e8d6042ec..8659cf4282a64 100644
> > --- a/fs/xfs/xfs_buf_item.c
> > +++ b/fs/xfs/xfs_buf_item.c
> > @@ -1158,20 +1158,15 @@ xfs_buf_iodone_callback_error(
> >  	return false;
> >  }
> >  
> > -/*
> > - * This is the iodone() function for buffers which have had callbacks attached
> > - * to them by xfs_buf_attach_iodone(). We need to iterate the items on the
> > - * callback list, mark the buffer as having no more callbacks and then push the
> > - * buffer through IO completion processing.
> > - */
> > -void
> > -xfs_buf_iodone_callbacks(
> > +static void
> > +xfs_buf_run_callbacks(
> >  	struct xfs_buf		*bp)
> >  {
> > +
> >  	/*
> > -	 * If there is an error, process it. Some errors require us
> > -	 * to run callbacks after failure processing is done so we
> > -	 * detect that and take appropriate action.
> > +	 * If there is an error, process it. Some errors require us to run
> > +	 * callbacks after failure processing is done so we detect that and take
> > +	 * appropriate action.
> >  	 */
> >  	if (bp->b_error && xfs_buf_iodone_callback_error(bp))
> >  		return;
> > @@ -1188,9 +1183,34 @@ xfs_buf_iodone_callbacks(
> >  	bp->b_log_item = NULL;
> >  	list_del_init(&bp->b_li_list);
> >  	bp->b_iodone = NULL;
> > +}
> > +
> > +/*
> > + * This is the iodone() function for buffers which have had callbacks attached
> > + * to them by xfs_buf_attach_iodone(). We need to iterate the items on the
> > + * callback list, mark the buffer as having no more callbacks and then push the
> > + * buffer through IO completion processing.
> > + */
> > +void
> > +xfs_buf_iodone_callbacks(
> > +	struct xfs_buf		*bp)
> > +{
> > +	xfs_buf_run_callbacks(bp);
> >  	xfs_buf_ioend(bp);
> >  }
> >  
> > +/*
> > + * Inode buffer iodone callback function.
> > + */
> > +void
> > +xfs_buf_inode_iodone(
> > +	struct xfs_buf		*bp)
> > +{
> > +	xfs_buf_run_callbacks(bp);
> > +	xfs_buf_ioend_finish(bp);
> > +}
> > +
> > +
> >  /*
> >   * This is the iodone() function for buffers which have been
> >   * logged.  It is called when they are eventually flushed out.
> > diff --git a/fs/xfs/xfs_buf_item.h b/fs/xfs/xfs_buf_item.h
> > index c9c57e2da9327..a342933ad9b8d 100644
> > --- a/fs/xfs/xfs_buf_item.h
> > +++ b/fs/xfs/xfs_buf_item.h
> > @@ -59,6 +59,7 @@ void	xfs_buf_attach_iodone(struct xfs_buf *,
> >  			      struct xfs_log_item *);
> >  void	xfs_buf_iodone_callbacks(struct xfs_buf *);
> >  void	xfs_buf_iodone(struct xfs_buf *, struct xfs_log_item *);
> > +void	xfs_buf_inode_iodone(struct xfs_buf *);
> >  bool	xfs_buf_log_check_iovec(struct xfs_log_iovec *iovec);
> >  
> >  extern kmem_zone_t	*xfs_buf_item_zone;
> > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > index ac3c8af8c9a14..d5dee57f914a9 100644
> > --- a/fs/xfs/xfs_inode.c
> > +++ b/fs/xfs/xfs_inode.c
> > @@ -3860,13 +3860,13 @@ xfs_iflush_int(
> >  	 * completion on the buffer to remove the inode from the AIL and release
> >  	 * the flush lock.
> >  	 */
> > +	bp->b_flags |= _XBF_INODES;
> >  	xfs_buf_attach_iodone(bp, xfs_iflush_done, &iip->ili_item);
> >  
> >  	/* generate the checksum. */
> >  	xfs_dinode_calc_crc(mp, dip);
> >  
> >  	ASSERT(!list_empty(&bp->b_li_list));
> > -	ASSERT(bp->b_iodone != NULL);
> >  	return error;
> >  }
> >  
> > diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c
> > index 08174ffa21189..552d0869aa0fe 100644
> > --- a/fs/xfs/xfs_trans_buf.c
> > +++ b/fs/xfs/xfs_trans_buf.c
> > @@ -626,6 +626,7 @@ xfs_trans_inode_buf(
> >  	ASSERT(atomic_read(&bip->bli_refcount) > 0);
> >  
> >  	bip->bli_flags |= XFS_BLI_INODE_BUF;
> > +	bp->b_flags |= _XBF_INODES;
> >  	xfs_trans_buf_set_type(tp, bp, XFS_BLFT_DINO_BUF);
> >  }
> >  
> > @@ -651,6 +652,7 @@ xfs_trans_stale_inode_buf(
> >  
> >  	bip->bli_flags |= XFS_BLI_STALE_INODE;
> >  	bip->bli_item.li_cb = xfs_buf_iodone;
> > +	bp->b_flags |= _XBF_INODES;
> >  	xfs_trans_buf_set_type(tp, bp, XFS_BLFT_DINO_BUF);
> >  }
> >  
> > @@ -675,6 +677,7 @@ xfs_trans_inode_alloc_buf(
> >  	ASSERT(atomic_read(&bip->bli_refcount) > 0);
> >  
> >  	bip->bli_flags |= XFS_BLI_INODE_ALLOC_BUF;
> > +	bp->b_flags |= _XBF_INODES;
> >  	xfs_trans_buf_set_type(tp, bp, XFS_BLFT_DINO_BUF);
> >  }
> >  
> > -- 
> > 2.26.2.761.g0e0b3e54be
> > 
> 

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 06/30] xfs: mark log recovery buffers for completion
  2020-06-01 21:42 ` [PATCH 06/30] xfs: mark log recovery buffers for completion Dave Chinner
  2020-06-02 16:45   ` Brian Foster
@ 2020-06-02 19:24   ` Darrick J. Wong
  1 sibling, 0 replies; 80+ messages in thread
From: Darrick J. Wong @ 2020-06-02 19:24 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Jun 02, 2020 at 07:42:27AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Log recovery has its own buffer write completion handler for
> buffers that it directly recovers. Convert these to direct calls by
> flagging these buffers as being log recovery buffers. The flag will
> get cleared by the log recovery IO completion routine, so it will
> never leak out of log recovery.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Yay vowels!
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/xfs/xfs_buf.c                | 10 ++++++++++
>  fs/xfs/xfs_buf.h                |  2 ++
>  fs/xfs/xfs_buf_item_recover.c   |  5 ++---
>  fs/xfs/xfs_dquot_item_recover.c |  2 +-
>  fs/xfs/xfs_inode_item_recover.c |  2 +-
>  fs/xfs/xfs_log_recover.c        |  5 ++---
>  6 files changed, 18 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 3bffde8640a52..0a69de674af9d 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -14,6 +14,7 @@
>  #include "xfs_mount.h"
>  #include "xfs_trace.h"
>  #include "xfs_log.h"
> +#include "xfs_log_recover.h"
>  #include "xfs_trans.h"
>  #include "xfs_buf_item.h"
>  #include "xfs_errortag.h"
> @@ -1207,6 +1208,15 @@ xfs_buf_ioend(
>  	if (read)
>  		goto out_finish;
>  
> +	/*
> +	 * If this is a log recovery buffer, we aren't doing transactional IO
> +	 * yet so we need to let it handle IO completions.
> +	 */
> +	if (bp->b_flags & _XBF_LOGRECOVERY) {
> +		xlog_recover_iodone(bp);
> +		return;
> +	}
> +
>  	if (bp->b_flags & _XBF_INODES) {
>  		xfs_buf_inode_iodone(bp);
>  		return;
> diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
> index c1d0843206dd6..30dabc5bae96d 100644
> --- a/fs/xfs/xfs_buf.h
> +++ b/fs/xfs/xfs_buf.h
> @@ -33,6 +33,7 @@
>  /* buffer type flags for write callbacks */
>  #define _XBF_INODES	 (1 << 16)/* inode buffer */
>  #define _XBF_DQUOTS	 (1 << 17)/* dquot buffer */
> +#define _XBF_LOGRECOVERY	 (1 << 18)/* log recovery buffer */
>  
>  /* flags used only internally */
>  #define _XBF_PAGES	 (1 << 20)/* backed by refcounted pages */
> @@ -56,6 +57,7 @@ typedef unsigned int xfs_buf_flags_t;
>  	{ XBF_WRITE_FAIL,	"WRITE_FAIL" }, \
>  	{ _XBF_INODES,		"INODES" }, \
>  	{ _XBF_DQUOTS,		"DQUOTS" }, \
> +	{ _XBF_LOGRECOVERY,		"LOG_RECOVERY" }, \
>  	{ _XBF_PAGES,		"PAGES" }, \
>  	{ _XBF_KMEM,		"KMEM" }, \
>  	{ _XBF_DELWRI_Q,	"DELWRI_Q" }, \
> diff --git a/fs/xfs/xfs_buf_item_recover.c b/fs/xfs/xfs_buf_item_recover.c
> index 04faa7310c4f0..74c851f60eeeb 100644
> --- a/fs/xfs/xfs_buf_item_recover.c
> +++ b/fs/xfs/xfs_buf_item_recover.c
> @@ -419,8 +419,7 @@ xlog_recover_validate_buf_type(
>  	if (bp->b_ops) {
>  		struct xfs_buf_log_item	*bip;
>  
> -		ASSERT(!bp->b_iodone || bp->b_iodone == xlog_recover_iodone);
> -		bp->b_iodone = xlog_recover_iodone;
> +		bp->b_flags |= _XBF_LOGRECOVERY;
>  		xfs_buf_item_init(bp, mp);
>  		bip = bp->b_log_item;
>  		bip->bli_item.li_lsn = current_lsn;
> @@ -963,7 +962,7 @@ xlog_recover_buf_commit_pass2(
>  		error = xfs_bwrite(bp);
>  	} else {
>  		ASSERT(bp->b_mount == mp);
> -		bp->b_iodone = xlog_recover_iodone;
> +		bp->b_flags |= _XBF_LOGRECOVERY;
>  		xfs_buf_delwri_queue(bp, buffer_list);
>  	}
>  
> diff --git a/fs/xfs/xfs_dquot_item_recover.c b/fs/xfs/xfs_dquot_item_recover.c
> index 3400be4c88f08..f9ea9f55aa7cc 100644
> --- a/fs/xfs/xfs_dquot_item_recover.c
> +++ b/fs/xfs/xfs_dquot_item_recover.c
> @@ -153,7 +153,7 @@ xlog_recover_dquot_commit_pass2(
>  
>  	ASSERT(dq_f->qlf_size == 2);
>  	ASSERT(bp->b_mount == mp);
> -	bp->b_iodone = xlog_recover_iodone;
> +	bp->b_flags |= _XBF_LOGRECOVERY;
>  	xfs_buf_delwri_queue(bp, buffer_list);
>  
>  out_release:
> diff --git a/fs/xfs/xfs_inode_item_recover.c b/fs/xfs/xfs_inode_item_recover.c
> index dc3e26ff16c90..5e0d291835b35 100644
> --- a/fs/xfs/xfs_inode_item_recover.c
> +++ b/fs/xfs/xfs_inode_item_recover.c
> @@ -376,7 +376,7 @@ xlog_recover_inode_commit_pass2(
>  	xfs_dinode_calc_crc(log->l_mp, dip);
>  
>  	ASSERT(bp->b_mount == mp);
> -	bp->b_iodone = xlog_recover_iodone;
> +	bp->b_flags |= _XBF_LOGRECOVERY;
>  	xfs_buf_delwri_queue(bp, buffer_list);
>  
>  out_release:
> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> index ec015df55b77a..52a65a74208ff 100644
> --- a/fs/xfs/xfs_log_recover.c
> +++ b/fs/xfs/xfs_log_recover.c
> @@ -287,9 +287,8 @@ xlog_recover_iodone(
>  	if (bp->b_log_item)
>  		xfs_buf_item_relse(bp);
>  	ASSERT(bp->b_log_item == NULL);
> -
> -	bp->b_iodone = NULL;
> -	xfs_buf_ioend(bp);
> +	bp->b_flags &= ~_XBF_LOGRECOVERY;
> +	xfs_buf_ioend_finish(bp);
>  }
>  
>  /*
> -- 
> 2.26.2.761.g0e0b3e54be
> 

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 10/30] xfs: use direct calls for dquot IO completion
  2020-06-01 21:42 ` [PATCH 10/30] xfs: use direct calls for dquot IO completion Dave Chinner
@ 2020-06-02 19:25   ` Darrick J. Wong
  2020-06-03 14:58   ` Brian Foster
  1 sibling, 0 replies; 80+ messages in thread
From: Darrick J. Wong @ 2020-06-02 19:25 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Jun 02, 2020 at 07:42:31AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Similar to inodes, we can call the dquot IO completion functions
> directly from the buffer completion code, removing another user of
> log item callbacks for IO completion processing.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Looks ok,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/xfs/xfs_buf_item.c | 18 +++++++++++++++++-
>  fs/xfs/xfs_dquot.c    | 18 ++++++++++++++----
>  fs/xfs/xfs_dquot.h    |  1 +
>  3 files changed, 32 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index a4e416af5c614..f46e5ec28111c 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
> @@ -15,6 +15,9 @@
>  #include "xfs_buf_item.h"
>  #include "xfs_inode.h"
>  #include "xfs_inode_item.h"
> +#include "xfs_quota.h"
> +#include "xfs_dquot_item.h"
> +#include "xfs_dquot.h"
>  #include "xfs_trans_priv.h"
>  #include "xfs_trace.h"
>  #include "xfs_log.h"
> @@ -1209,7 +1212,20 @@ void
>  xfs_buf_dquot_iodone(
>  	struct xfs_buf		*bp)
>  {
> -	xfs_buf_run_callbacks(bp);
> +	struct xfs_buf_log_item *blip = bp->b_log_item;
> +	struct xfs_log_item	*lip;
> +
> +	if (xfs_buf_had_callback_errors(bp))
> +		return;
> +
> +	/* a newly allocated dquot buffer might have a log item attached */
> +	if (blip) {
> +		lip = &blip->bli_item;
> +		lip->li_cb(bp, lip);
> +		bp->b_log_item = NULL;
> +	}
> +
> +	xfs_dquot_done(bp);
>  	xfs_buf_ioend_finish(bp);
>  }
>  
> diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
> index 2e2146fa0914c..403bc4e9f21ff 100644
> --- a/fs/xfs/xfs_dquot.c
> +++ b/fs/xfs/xfs_dquot.c
> @@ -1048,9 +1048,8 @@ xfs_qm_dqrele(
>   * from the AIL if it has not been re-logged, and unlocking the dquot's
>   * flush lock. This behavior is very similar to that of inodes..
>   */
> -STATIC void
> +static void
>  xfs_qm_dqflush_done(
> -	struct xfs_buf		*bp,
>  	struct xfs_log_item	*lip)
>  {
>  	struct xfs_dq_logitem	*qip = (struct xfs_dq_logitem *)lip;
> @@ -1091,6 +1090,18 @@ xfs_qm_dqflush_done(
>  	xfs_dqfunlock(dqp);
>  }
>  
> +void
> +xfs_dquot_done(
> +	struct xfs_buf		*bp)
> +{
> +	struct xfs_log_item	*lip, *n;
> +
> +	list_for_each_entry_safe(lip, n, &bp->b_li_list, li_bio_list) {
> +		list_del_init(&lip->li_bio_list);
> +		xfs_qm_dqflush_done(lip);
> +	}
> +}
> +
>  /*
>   * Write a modified dquot to disk.
>   * The dquot must be locked and the flush lock too taken by caller.
> @@ -1180,8 +1191,7 @@ xfs_qm_dqflush(
>  	 * AIL and release the flush lock once the dquot is synced to disk.
>  	 */
>  	bp->b_flags |= _XBF_DQUOTS;
> -	xfs_buf_attach_iodone(bp, xfs_qm_dqflush_done,
> -				  &dqp->q_logitem.qli_item);
> +	xfs_buf_attach_iodone(bp, NULL, &dqp->q_logitem.qli_item);
>  
>  	/*
>  	 * If the buffer is pinned then push on the log so we won't
> diff --git a/fs/xfs/xfs_dquot.h b/fs/xfs/xfs_dquot.h
> index 71e36c85e20b6..fe9cc3e08ed6d 100644
> --- a/fs/xfs/xfs_dquot.h
> +++ b/fs/xfs/xfs_dquot.h
> @@ -174,6 +174,7 @@ void		xfs_qm_dqput(struct xfs_dquot *dqp);
>  void		xfs_dqlock2(struct xfs_dquot *, struct xfs_dquot *);
>  
>  void		xfs_dquot_set_prealloc_limits(struct xfs_dquot *);
> +void		xfs_dquot_done(struct xfs_buf *);
>  
>  static inline struct xfs_dquot *xfs_qm_dqhold(struct xfs_dquot *dqp)
>  {
> -- 
> 2.26.2.761.g0e0b3e54be
> 

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 13/30] xfs: handle buffer log item IO errors directly
  2020-06-01 21:42 ` [PATCH 13/30] xfs: handle buffer log item IO errors directly Dave Chinner
@ 2020-06-02 20:39   ` Darrick J. Wong
  2020-06-02 22:17     ` Dave Chinner
  2020-06-03 15:02   ` Brian Foster
  1 sibling, 1 reply; 80+ messages in thread
From: Darrick J. Wong @ 2020-06-02 20:39 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Jun 02, 2020 at 07:42:34AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Currently when a buffer with attached log items has an IO error
> it calls ->iop_error for each attached log item. These all call
> xfs_set_li_failed() to handle the error, but we are about to change
> the way log items manage buffers. Hence we first need to remove the
> per-item dependency on buffer handling done by xfs_set_li_failed().
> 
> We already have specific buffer type IO completion routines, so move
> the log item error handling out of the generic error handling and
> into the log item specific functions so we can implement per-type
> error handling easily.
> 
> This requires a more complex return value from the error handling
> code so that we can take the correct action the failure handling
> requires.  This results in some repeated boilerplate in the
> functions, but that can be cleaned up later once all the changes
> cascade through this code.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_buf_item.c | 167 ++++++++++++++++++++++++++++--------------
>  1 file changed, 112 insertions(+), 55 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index 09bfe9c52dbdb..b6995719e877b 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
> @@ -987,20 +987,18 @@ xfs_buf_do_callbacks_fail(
>  }
>  
>  static bool
> -xfs_buf_iodone_callback_error(
> +xfs_buf_ioerror_sync(
>  	struct xfs_buf		*bp)
>  {
>  	struct xfs_mount	*mp = bp->b_mount;
>  	static ulong		lasttime;
>  	static xfs_buftarg_t	*lasttarg;
> -	struct xfs_error_cfg	*cfg;
> -
>  	/*

This should preserve the blank line between the declarations and the
start of the code.

>  	 * If we've already decided to shutdown the filesystem because of
>  	 * I/O errors, there's no point in giving this a retry.
>  	 */
>  	if (XFS_FORCED_SHUTDOWN(mp))
> -		goto out_stale;
> +		return true;
>  
>  	if (bp->b_target != lasttarg ||
>  	    time_after(jiffies, (lasttime + 5*HZ))) {
> @@ -1011,19 +1009,15 @@ xfs_buf_iodone_callback_error(
>  
>  	/* synchronous writes will have callers process the error */
>  	if (!(bp->b_flags & XBF_ASYNC))
> -		goto out_stale;
> -
> -	trace_xfs_buf_item_iodone_async(bp, _RET_IP_);
> -
> -	cfg = xfs_error_get_cfg(mp, XFS_ERR_METADATA, bp->b_error);
> +		return true;
> +	return false;

What does the return value mean here?  true means "let the caller deal
with the error", false means "attempt a retry, if desired?  So this
function decides if we're going to fail immediately or not?

	if (xfs_buf_ioerr_fail_immediately(bp))
		goto out_stale;

That's a lengthy name though.  On second inspection, I guess this
function decides if the buffer is going to be sent through the io retry
mechanism, and the next two functions advance it through the retry
states until either the write succeeds or we declare permanent failure?

> +}
>  
> -	/*
> -	 * If the write was asynchronous then no one will be looking for the
> -	 * error.  If this is the first failure of this type, clear the error
> -	 * state and write the buffer out again. This means we always retry an
> -	 * async write failure at least once, but we also need to set the buffer
> -	 * up to behave correctly now for repeated failures.
> -	 */
> +static bool
> +xfs_buf_ioerror_retry(

Might be nice to preserve some of this comment, since I initially
missed that this function both decides whether or not to do the retry
and sets up the buffer to do that.

/*
 * Decide if we're going to retry the write after a failure, and prepare
 * the buffer for retrying the write.
 */

Or, adding some newlines in the outer if body to make the two lines
that modify the bp state stand out would also help.

(TBH I'm struggling right now to make sense of what these new functions
do, though I'm fairly convinced that they at least aren't changing much
of the functionality...)

> +	struct xfs_buf		*bp,
> +	struct xfs_error_cfg	*cfg)
> +{
>  	if (!(bp->b_flags & (XBF_STALE | XBF_WRITE_FAIL)) ||
>  	     bp->b_last_error != bp->b_error) {
>  		bp->b_flags |= (XBF_WRITE | XBF_DONE | XBF_WRITE_FAIL);
> @@ -1031,36 +1025,80 @@ xfs_buf_iodone_callback_error(
>  		if (cfg->retry_timeout != XFS_ERR_RETRY_FOREVER &&
>  		    !bp->b_first_retry_time)
>  			bp->b_first_retry_time = jiffies;
> -
> -		xfs_buf_ioerror(bp, 0);
> -		xfs_buf_submit(bp);
>  		return true;
>  	}
> +	return false;
> +}
>  
> -	/*
> -	 * Repeated failure on an async write. Take action according to the
> -	 * error configuration we have been set up to use.
> -	 */
> +static bool
> +xfs_buf_ioerror_permanent(

/*
 * Account for this latest trip around the retry handler, and decide if
 * we've failed enough times to constitute a permanent failure.
 */

> +	struct xfs_buf		*bp,
> +	struct xfs_error_cfg	*cfg)
> +{
> +	struct xfs_mount	*mp = bp->b_mount;
>  
>  	if (cfg->max_retries != XFS_ERR_RETRY_FOREVER &&
>  	    ++bp->b_retries > cfg->max_retries)
> -			goto permanent_error;
> +			return true;

Might as well fix the indentation while you're at it.


>  	if (cfg->retry_timeout != XFS_ERR_RETRY_FOREVER &&
>  	    time_after(jiffies, cfg->retry_timeout + bp->b_first_retry_time))
> -			goto permanent_error;
> +			return true;
>  
>  	/* At unmount we may treat errors differently */
>  	if ((mp->m_flags & XFS_MOUNT_UNMOUNTING) && mp->m_fail_unmount)
> +		return true;
> +
> +	return false;
> +}
> +
> +/*
> + * On a sync write or shutdown we just want to stale the buffer and let the
> + * caller handle the error in bp->b_error appropriately.
> + *
> + * If the write was asynchronous then no one will be looking for the error.  If
> + * this is the first failure of this type, clear the error state and write the
> + * buffer out again. This means we always retry an async write failure at least
> + * once, but we also need to set the buffer up to behave correctly now for
> + * repeated failures.
> + *
> + * If we get repeated async write failures, then we take action according to the
> + * error configuration we have been set up to use.
> + *
> + * Multi-state return value:
> + *
> + * 0: clear IO error retry state and run callback completions
> + * 1: resubmitted immediately, do not run any completions
> + * 2: transient error, run failure callback completions and then
> + *    release the buffer

Feels odd not to use an enum here, but as this is a static function
maybe it's not a high risk for screwing up in the callers.

--D

> + */
> +static int
> +xfs_buf_iodone_error(
> +	struct xfs_buf		*bp)
> +{
> +	struct xfs_mount	*mp = bp->b_mount;
> +	struct xfs_error_cfg	*cfg;
> +
> +	if (xfs_buf_ioerror_sync(bp))
> +		goto out_stale;
> +
> +	trace_xfs_buf_item_iodone_async(bp, _RET_IP_);
> +
> +	cfg = xfs_error_get_cfg(mp, XFS_ERR_METADATA, bp->b_error);
> +	if (xfs_buf_ioerror_retry(bp, cfg)) {
> +		xfs_buf_ioerror(bp, 0);
> +		xfs_buf_submit(bp);
> +		return 1;
> +	}
> +
> +	if (xfs_buf_ioerror_permanent(bp, cfg))
>  		goto permanent_error;
>  
>  	/*
>  	 * Still a transient error, run IO completion failure callbacks and let
>  	 * the higher layers retry the buffer.
>  	 */
> -	xfs_buf_do_callbacks_fail(bp);
>  	xfs_buf_ioerror(bp, 0);
> -	xfs_buf_relse(bp);
> -	return true;
> +	return 2;
>  
>  	/*
>  	 * Permanent error - we need to trigger a shutdown if we haven't already
> @@ -1072,30 +1110,7 @@ xfs_buf_iodone_callback_error(
>  	xfs_buf_stale(bp);
>  	bp->b_flags |= XBF_DONE;
>  	trace_xfs_buf_error_relse(bp, _RET_IP_);
> -	return false;
> -}
> -
> -static inline bool
> -xfs_buf_had_callback_errors(
> -	struct xfs_buf		*bp)
> -{
> -
> -	/*
> -	 * If there is an error, process it. Some errors require us to run
> -	 * callbacks after failure processing is done so we detect that and take
> -	 * appropriate action.
> -	 */
> -	if (bp->b_error && xfs_buf_iodone_callback_error(bp))
> -		return true;
> -
> -	/*
> -	 * Successful IO or permanent error. Either way, we can clear the
> -	 * retry state here in preparation for the next error that may occur.
> -	 */
> -	bp->b_last_error = 0;
> -	bp->b_retries = 0;
> -	bp->b_first_retry_time = 0;
> -	return false;
> +	return 0;
>  }
>  
>  static void
> @@ -1122,6 +1137,15 @@ xfs_buf_item_done(
>  	xfs_buf_rele(bp);
>  }
>  
> +static inline void
> +xfs_buf_clear_ioerror_retry_state(
> +	struct xfs_buf		*bp)
> +{
> +	bp->b_last_error = 0;
> +	bp->b_retries = 0;
> +	bp->b_first_retry_time = 0;
> +}
> +
>  /*
>   * Inode buffer iodone callback function.
>   */
> @@ -1129,9 +1153,20 @@ void
>  xfs_buf_inode_iodone(
>  	struct xfs_buf		*bp)
>  {
> -	if (xfs_buf_had_callback_errors(bp))
> +	if (bp->b_error) {
> +		int ret = xfs_buf_iodone_error(bp);
> +		if (!ret)
> +			goto finish_iodone;
> +		if (ret == 1)
> +			return;
> +		ASSERT(ret == 2);
> +		xfs_buf_do_callbacks_fail(bp);
> +		xfs_buf_relse(bp);
>  		return;
> +	}
>  
> +finish_iodone:
> +	xfs_buf_clear_ioerror_retry_state(bp);
>  	xfs_buf_item_done(bp);
>  	xfs_iflush_done(bp);
>  	xfs_buf_ioend_finish(bp);
> @@ -1144,9 +1179,20 @@ void
>  xfs_buf_dquot_iodone(
>  	struct xfs_buf		*bp)
>  {
> -	if (xfs_buf_had_callback_errors(bp))
> +	if (bp->b_error) {
> +		int ret = xfs_buf_iodone_error(bp);
> +		if (!ret)
> +			goto finish_iodone;
> +		if (ret == 1)
> +			return;
> +		ASSERT(ret == 2);
> +		xfs_buf_do_callbacks_fail(bp);
> +		xfs_buf_relse(bp);
>  		return;
> +	}
>  
> +finish_iodone:
> +	xfs_buf_clear_ioerror_retry_state(bp);
>  	/* a newly allocated dquot buffer might have a log item attached */
>  	xfs_buf_item_done(bp);
>  	xfs_dquot_done(bp);
> @@ -1163,9 +1209,20 @@ void
>  xfs_buf_iodone(
>  	struct xfs_buf		*bp)
>  {
> -	if (xfs_buf_had_callback_errors(bp))
> +	if (bp->b_error) {
> +		int ret = xfs_buf_iodone_error(bp);
> +		if (!ret)
> +			goto finish_iodone;
> +		if (ret == 1)
> +			return;
> +		ASSERT(ret == 2);
> +		xfs_buf_do_callbacks_fail(bp);
> +		xfs_buf_relse(bp);
>  		return;
> +	}
>  
> +finish_iodone:
> +	xfs_buf_clear_ioerror_retry_state(bp);
>  	xfs_buf_item_done(bp);
>  	xfs_buf_ioend_finish(bp);
>  }
> -- 
> 2.26.2.761.g0e0b3e54be
> 

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 14/30] xfs: unwind log item error flagging
  2020-06-01 21:42 ` [PATCH 14/30] xfs: unwind log item error flagging Dave Chinner
@ 2020-06-02 20:45   ` Darrick J. Wong
  2020-06-03 15:02   ` Brian Foster
  1 sibling, 0 replies; 80+ messages in thread
From: Darrick J. Wong @ 2020-06-02 20:45 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Jun 02, 2020 at 07:42:35AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When a buffer IO error occurs, we want to mark all
> the log items attached to the buffer as failed. Open code
> the error handling loop so that we can modify the flagging for the
> different types of objects directly and independently of each other.
> 
> This also allows us to remove the ->iop_error method from the log
> item operations.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_buf_item.c   | 48 ++++++++++++-----------------------------
>  fs/xfs/xfs_dquot_item.c | 18 ----------------
>  fs/xfs/xfs_inode_item.c | 18 ----------------
>  fs/xfs/xfs_trans.h      |  1 -
>  4 files changed, 14 insertions(+), 71 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index b6995719e877b..2364a9aa2d71a 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
> @@ -12,6 +12,7 @@
>  #include "xfs_bit.h"
>  #include "xfs_mount.h"
>  #include "xfs_trans.h"
> +#include "xfs_trans_priv.h"
>  #include "xfs_buf_item.h"
>  #include "xfs_inode.h"
>  #include "xfs_inode_item.h"
> @@ -955,37 +956,6 @@ xfs_buf_item_relse(
>  	xfs_buf_item_free(bip);
>  }
>  
> -/*
> - * Invoke the error state callback for each log item affected by the failed I/O.
> - *
> - * If a metadata buffer write fails with a non-permanent error, the buffer is
> - * eventually resubmitted and so the completion callbacks are not run. The error
> - * state may need to be propagated to the log items attached to the buffer,
> - * however, so the next AIL push of the item knows hot to handle it correctly.
> - */
> -STATIC void
> -xfs_buf_do_callbacks_fail(
> -	struct xfs_buf		*bp)
> -{
> -	struct xfs_ail		*ailp = bp->b_mount->m_ail;
> -	struct xfs_log_item	*lip;
> -
> -	/*
> -	 * Buffer log item errors are handled directly by xfs_buf_item_push()
> -	 * and xfs_buf_iodone_callback_error, and they have no IO error
> -	 * callbacks. Check only for items in b_li_list.
> -	 */
> -	if (list_empty(&bp->b_li_list))
> -		return;
> -
> -	spin_lock(&ailp->ail_lock);
> -	list_for_each_entry(lip, &bp->b_li_list, li_bio_list) {
> -		if (lip->li_ops->iop_error)
> -			lip->li_ops->iop_error(lip, bp);
> -	}
> -	spin_unlock(&ailp->ail_lock);
> -}
> -
>  static bool
>  xfs_buf_ioerror_sync(
>  	struct xfs_buf		*bp)
> @@ -1154,13 +1124,18 @@ xfs_buf_inode_iodone(
>  	struct xfs_buf		*bp)
>  {
>  	if (bp->b_error) {
> +		struct xfs_log_item *lip;
>  		int ret = xfs_buf_iodone_error(bp);
>  		if (!ret)

Hmm we probably need a blank line between these declarations and the
start of the if statements, right?  Granted, I should've put this
complaint in the previous patch.

Otherwise this looks fine,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

>  			goto finish_iodone;
>  		if (ret == 1)
>  			return;
>  		ASSERT(ret == 2);
> -		xfs_buf_do_callbacks_fail(bp);
> +		spin_lock(&bp->b_mount->m_ail->ail_lock);
> +		list_for_each_entry(lip, &bp->b_li_list, li_bio_list) {
> +			xfs_set_li_failed(lip, bp);
> +		}
> +		spin_unlock(&bp->b_mount->m_ail->ail_lock);
>  		xfs_buf_relse(bp);
>  		return;
>  	}
> @@ -1180,13 +1155,18 @@ xfs_buf_dquot_iodone(
>  	struct xfs_buf		*bp)
>  {
>  	if (bp->b_error) {
> +		struct xfs_log_item *lip;
>  		int ret = xfs_buf_iodone_error(bp);
>  		if (!ret)
>  			goto finish_iodone;
>  		if (ret == 1)
>  			return;
>  		ASSERT(ret == 2);
> -		xfs_buf_do_callbacks_fail(bp);
> +		spin_lock(&bp->b_mount->m_ail->ail_lock);
> +		list_for_each_entry(lip, &bp->b_li_list, li_bio_list) {
> +			xfs_set_li_failed(lip, bp);
> +		}
> +		spin_unlock(&bp->b_mount->m_ail->ail_lock);
>  		xfs_buf_relse(bp);
>  		return;
>  	}
> @@ -1216,7 +1196,7 @@ xfs_buf_iodone(
>  		if (ret == 1)
>  			return;
>  		ASSERT(ret == 2);
> -		xfs_buf_do_callbacks_fail(bp);
> +		ASSERT(list_empty(&bp->b_li_list));
>  		xfs_buf_relse(bp);
>  		return;
>  	}
> diff --git a/fs/xfs/xfs_dquot_item.c b/fs/xfs/xfs_dquot_item.c
> index 349c92d26570c..d7e4de7151d7f 100644
> --- a/fs/xfs/xfs_dquot_item.c
> +++ b/fs/xfs/xfs_dquot_item.c
> @@ -113,23 +113,6 @@ xfs_qm_dqunpin_wait(
>  	wait_event(dqp->q_pinwait, (atomic_read(&dqp->q_pincount) == 0));
>  }
>  
> -/*
> - * Callback used to mark a buffer with XFS_LI_FAILED when items in the buffer
> - * have been failed during writeback
> - *
> - * this informs the AIL that the dquot is already flush locked on the next push,
> - * and acquires a hold on the buffer to ensure that it isn't reclaimed before
> - * dirty data makes it to disk.
> - */
> -STATIC void
> -xfs_dquot_item_error(
> -	struct xfs_log_item	*lip,
> -	struct xfs_buf		*bp)
> -{
> -	ASSERT(!completion_done(&DQUOT_ITEM(lip)->qli_dquot->q_flush));
> -	xfs_set_li_failed(lip, bp);
> -}
> -
>  STATIC uint
>  xfs_qm_dquot_logitem_push(
>  	struct xfs_log_item	*lip,
> @@ -216,7 +199,6 @@ static const struct xfs_item_ops xfs_dquot_item_ops = {
>  	.iop_release	= xfs_qm_dquot_logitem_release,
>  	.iop_committing	= xfs_qm_dquot_logitem_committing,
>  	.iop_push	= xfs_qm_dquot_logitem_push,
> -	.iop_error	= xfs_dquot_item_error
>  };
>  
>  /*
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index 7049f2ae8d186..86c783dec2bac 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -464,23 +464,6 @@ xfs_inode_item_unpin(
>  		wake_up_bit(&ip->i_flags, __XFS_IPINNED_BIT);
>  }
>  
> -/*
> - * Callback used to mark a buffer with XFS_LI_FAILED when items in the buffer
> - * have been failed during writeback
> - *
> - * This informs the AIL that the inode is already flush locked on the next push,
> - * and acquires a hold on the buffer to ensure that it isn't reclaimed before
> - * dirty data makes it to disk.
> - */
> -STATIC void
> -xfs_inode_item_error(
> -	struct xfs_log_item	*lip,
> -	struct xfs_buf		*bp)
> -{
> -	ASSERT(xfs_isiflocked(INODE_ITEM(lip)->ili_inode));
> -	xfs_set_li_failed(lip, bp);
> -}
> -
>  STATIC uint
>  xfs_inode_item_push(
>  	struct xfs_log_item	*lip,
> @@ -619,7 +602,6 @@ static const struct xfs_item_ops xfs_inode_item_ops = {
>  	.iop_committed	= xfs_inode_item_committed,
>  	.iop_push	= xfs_inode_item_push,
>  	.iop_committing	= xfs_inode_item_committing,
> -	.iop_error	= xfs_inode_item_error
>  };
>  
>  
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index 99a9ab9cab25b..b752501818d25 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -74,7 +74,6 @@ struct xfs_item_ops {
>  	void (*iop_committing)(struct xfs_log_item *, xfs_lsn_t commit_lsn);
>  	void (*iop_release)(struct xfs_log_item *);
>  	xfs_lsn_t (*iop_committed)(struct xfs_log_item *, xfs_lsn_t);
> -	void (*iop_error)(struct xfs_log_item *, xfs_buf_t *);
>  	int (*iop_recover)(struct xfs_log_item *lip, struct xfs_trans *tp);
>  	bool (*iop_match)(struct xfs_log_item *item, uint64_t id);
>  };
> -- 
> 2.26.2.761.g0e0b3e54be
> 

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 15/30] xfs: move xfs_clear_li_failed out of xfs_ail_delete_one()
  2020-06-01 21:42 ` [PATCH 15/30] xfs: move xfs_clear_li_failed out of xfs_ail_delete_one() Dave Chinner
@ 2020-06-02 20:47   ` Darrick J. Wong
  2020-06-03 15:02   ` Brian Foster
  1 sibling, 0 replies; 80+ messages in thread
From: Darrick J. Wong @ 2020-06-02 20:47 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Jun 02, 2020 at 07:42:36AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> xfs_ail_delete_one() is called directly from dquot and inode IO
> completion, as well as from the generic xfs_trans_ail_delete()
> function. Inodes are about to have their own failure handling, and
> dquots will in future, too. Pull the clearing of the LI_FAILED flag
> up into the callers so we can customise the code appropriately.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Looks ok,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/xfs/xfs_dquot.c      | 6 +-----
>  fs/xfs/xfs_inode_item.c | 3 +--
>  fs/xfs/xfs_trans_ail.c  | 2 +-
>  3 files changed, 3 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
> index d5984a926d1d0..76353c9a723ee 100644
> --- a/fs/xfs/xfs_dquot.c
> +++ b/fs/xfs/xfs_dquot.c
> @@ -1070,16 +1070,12 @@ xfs_qm_dqflush_done(
>  	     test_bit(XFS_LI_FAILED, &lip->li_flags))) {
>  
>  		spin_lock(&ailp->ail_lock);
> +		xfs_clear_li_failed(lip);
>  		if (lip->li_lsn == qip->qli_flush_lsn) {
>  			/* xfs_ail_update_finish() drops the AIL lock */
>  			tail_lsn = xfs_ail_delete_one(ailp, lip);
>  			xfs_ail_update_finish(ailp, tail_lsn);
>  		} else {
> -			/*
> -			 * Clear the failed state since we are about to drop the
> -			 * flush lock
> -			 */
> -			xfs_clear_li_failed(lip);
>  			spin_unlock(&ailp->ail_lock);
>  		}
>  	}
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index 86c783dec2bac..0ba75764a8dc5 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -690,12 +690,11 @@ xfs_iflush_done(
>  		/* this is an opencoded batch version of xfs_trans_ail_delete */
>  		spin_lock(&ailp->ail_lock);
>  		list_for_each_entry(lip, &tmp, li_bio_list) {
> +			xfs_clear_li_failed(lip);
>  			if (lip->li_lsn == INODE_ITEM(lip)->ili_flush_lsn) {
>  				xfs_lsn_t lsn = xfs_ail_delete_one(ailp, lip);
>  				if (!tail_lsn && lsn)
>  					tail_lsn = lsn;
> -			} else {
> -				xfs_clear_li_failed(lip);
>  			}
>  		}
>  		xfs_ail_update_finish(ailp, tail_lsn);
> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> index ac5019361a139..ac33f6393f99c 100644
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -843,7 +843,6 @@ xfs_ail_delete_one(
>  
>  	trace_xfs_ail_delete(lip, mlip->li_lsn, lip->li_lsn);
>  	xfs_ail_delete(ailp, lip);
> -	xfs_clear_li_failed(lip);
>  	clear_bit(XFS_LI_IN_AIL, &lip->li_flags);
>  	lip->li_lsn = 0;
>  
> @@ -874,6 +873,7 @@ xfs_trans_ail_delete(
>  	}
>  
>  	/* xfs_ail_update_finish() drops the AIL lock */
> +	xfs_clear_li_failed(lip);
>  	tail_lsn = xfs_ail_delete_one(ailp, lip);
>  	xfs_ail_update_finish(ailp, tail_lsn);
>  }
> -- 
> 2.26.2.761.g0e0b3e54be
> 

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 04/30] xfs: mark inode buffers in cache
  2020-06-02 16:45   ` Brian Foster
  2020-06-02 19:22     ` Darrick J. Wong
@ 2020-06-02 21:29     ` Dave Chinner
  2020-06-03 14:57       ` Brian Foster
  1 sibling, 1 reply; 80+ messages in thread
From: Dave Chinner @ 2020-06-02 21:29 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Tue, Jun 02, 2020 at 12:45:35PM -0400, Brian Foster wrote:
> On Tue, Jun 02, 2020 at 07:42:25AM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Inode buffers always have write IO callbacks, so by marking them
> > directly we can avoid needing to attach ->b_iodone functions to
> > them. This avoids an indirect call, and makes future modifications
> > much simpler.
> > 
> > This is largely a rearrangement of the code at this point - no IO
> > completion functionality changes, just how the code is run is
> > modified.
> > 
> 
> Ok, I was initially thinking this patch looked incomplete in that we
> continue to set ->b_iodone() on inode buffers even though we'd never
> call it. Looking ahead, I see that the next few patches continue to
> clean that up to eventually remove ->b_iodone(), so that addresses that.
> 
> My only other curiosity is that while there may not be any functional
> difference, this technically changes callback behavior in that we set
> the new flag in some contexts that don't currently attach anything to
> the buffer, right? E.g., xfs_trans_inode_alloc_buf() sets the flag on
> inode chunk init, which means we can write out an inode buffer without
> any attached/flushed inodes.

Yes, it can happen, and it happens before this patch, too, because
the AIL can push the buffer log item directly and that does not
flush dirty inodes to the buffer before it writes back(*).

As it is, xfs_buf_inode_iodone() on a buffer with no inode attached
is functionally identical to the existing xfs_buf_iodone() callback
that would otherwise be done. i.e. it just runs the buffer log item
completion callback. Hence the change here rearranges code, but it
does not change behaviour at all.

(*) this is a double-write bug that this patch set does not address.
i.e. the buffer log item flushes the buffer without flushing inodes, IO
completes, then inodes are flushed to the buffer and we do another IO to
clean them.  This is addressed by a follow-on patchset that tracks
dirty inodes via ordered cluster buffers, such that pushing the
buffer always triggers xfs_iflush_cluster() on buffers tagged
_XBF_INODES...

> Is the intent of that to support future
> changes? If so, a note about that in the commit log would be helpful.

That's part of it, as you can see from the (*) above. But the commit
log already says "..., and makes future modifications much simpler."
Was that insufficient to indicate that it will be used later on?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 07/30] xfs: call xfs_buf_iodone directly
  2020-06-02 16:47   ` Brian Foster
@ 2020-06-02 21:38     ` Dave Chinner
  2020-06-03 14:58       ` Brian Foster
  0 siblings, 1 reply; 80+ messages in thread
From: Dave Chinner @ 2020-06-02 21:38 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Tue, Jun 02, 2020 at 12:47:42PM -0400, Brian Foster wrote:
> On Tue, Jun 02, 2020 at 07:42:28AM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > All unmarked dirty buffers should be in the AIL and have log items
> > attached to them. Hence when they are written, we will run a
> > callback to remove the item from the AIL if appropriate. Now that
> > we've handled inode and dquot buffers, all remaining calls are to
> > xfs_buf_iodone() and so we can hard code this rather than use an
> > indirect call.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
> > Reviewed-by: Amir Goldstein <amir73il@gmail.com>
> > ---
> >  fs/xfs/xfs_buf.c       | 24 ++++++++----------------
> >  fs/xfs/xfs_buf.h       |  6 +-----
> >  fs/xfs/xfs_buf_item.c  | 40 ++++++++++------------------------------
> >  fs/xfs/xfs_buf_item.h  |  4 ++--
> >  fs/xfs/xfs_trans_buf.c | 13 +++----------
> >  5 files changed, 24 insertions(+), 63 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > index 0a69de674af9d..d7695b638e994 100644
> > --- a/fs/xfs/xfs_buf.c
> > +++ b/fs/xfs/xfs_buf.c
> ...
> > @@ -1226,14 +1225,7 @@ xfs_buf_ioend(
> >  		xfs_buf_dquot_iodone(bp);
> >  		return;
> >  	}
> > -
> > -	if (bp->b_iodone) {
> > -		(*(bp->b_iodone))(bp);
> > -		return;
> > -	}
> > -
> > -out_finish:
> > -	xfs_buf_ioend_finish(bp);
> > +	xfs_buf_iodone(bp);
> 
> The way this function ends up would probably look nicer as an if/else
> chain rather than a sequence of internal return statements.

I've kinda avoided refactoring these early patches because they
cascade into non-trivial conflicts with later patches in the series.
I've spent too much time chasing bugs introduced in the later
patches because of conflict resolution not being quite right. Hence
I want to leave cleanup and refactoring to a series after this whole
line of development is complete and the problems are solved.

> BTW, is there a longer term need to have three separate iodone functions
> here that do the same thing?

The inode iodone function changes almost immediately. I did it this
way so that the process of changing the inode buffer completion
functionality did not, in any way, impact on other types of buffers.
We need to go through the same process with dquot buffers, and then
once that is done, we can look to refactor all this into a more
integrated solution that largely sits in xfs_buf.c.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 13/30] xfs: handle buffer log item IO errors directly
  2020-06-02 20:39   ` Darrick J. Wong
@ 2020-06-02 22:17     ` Dave Chinner
  0 siblings, 0 replies; 80+ messages in thread
From: Dave Chinner @ 2020-06-02 22:17 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Jun 02, 2020 at 01:39:51PM -0700, Darrick J. Wong wrote:
> On Tue, Jun 02, 2020 at 07:42:34AM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Currently when a buffer with attached log items has an IO error
> > it calls ->iop_error for each attached log item. These all call
> > xfs_set_li_failed() to handle the error, but we are about to change
> > the way log items manage buffers. Hence we first need to remove the
> > per-item dependency on buffer handling done by xfs_set_li_failed().
> > 
> > We already have specific buffer type IO completion routines, so move
> > the log item error handling out of the generic error handling and
> > into the log item specific functions so we can implement per-type
> > error handling easily.
> > 
> > This requires a more complex return value from the error handling
> > code so that we can take the correct action the failure handling
> > requires.  This results in some repeated boilerplate in the
> > functions, but that can be cleaned up later once all the changes
> > cascade through this code.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_buf_item.c | 167 ++++++++++++++++++++++++++++--------------
> >  1 file changed, 112 insertions(+), 55 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> > index 09bfe9c52dbdb..b6995719e877b 100644
> > --- a/fs/xfs/xfs_buf_item.c
> > +++ b/fs/xfs/xfs_buf_item.c
> > @@ -987,20 +987,18 @@ xfs_buf_do_callbacks_fail(
> >  }
> >  
> >  static bool
> > -xfs_buf_iodone_callback_error(
> > +xfs_buf_ioerror_sync(
> >  	struct xfs_buf		*bp)
> >  {
> >  	struct xfs_mount	*mp = bp->b_mount;
> >  	static ulong		lasttime;
> >  	static xfs_buftarg_t	*lasttarg;
> > -	struct xfs_error_cfg	*cfg;
> > -
> >  	/*
> 
> This should preserve the blank line between the declarations and the
> start of the code.
> 
> >  	 * If we've already decided to shutdown the filesystem because of
> >  	 * I/O errors, there's no point in giving this a retry.
> >  	 */
> >  	if (XFS_FORCED_SHUTDOWN(mp))
> > -		goto out_stale;
> > +		return true;
> >  
> >  	if (bp->b_target != lasttarg ||
> >  	    time_after(jiffies, (lasttime + 5*HZ))) {
> > @@ -1011,19 +1009,15 @@ xfs_buf_iodone_callback_error(
> >  
> >  	/* synchronous writes will have callers process the error */
> >  	if (!(bp->b_flags & XBF_ASYNC))
> > -		goto out_stale;
> > -
> > -	trace_xfs_buf_item_iodone_async(bp, _RET_IP_);
> > -
> > -	cfg = xfs_error_get_cfg(mp, XFS_ERR_METADATA, bp->b_error);
> > +		return true;
> > +	return false;
> 
> What does the return value mean here?  true means "let the caller deal
> with the error", false means "attempt a retry, if desired?  So this
> function decides if we're going to fail immediately or not?

Effectively, yes.

> 
> 	if (xfs_buf_ioerr_fail_immediately(bp))
> 		goto out_stale;
> 
> That's a lengthy name though.  On second inspection, I guess this
> function decides if the buffer is going to be sent through the io retry
> mechanism, and the next two functions advance it through the retry
> states until either the write succeeds or we declare permanent failure?

Pretty much. I had some difficulty working out how to break this
large function up sanely because of the 3-4 conditional functions
it performed for error handling. I originally named the function for
handling sync IO errors vs async IO errors, which may require
retries.

So, yeah, "fail_immediately" is probably a better description, or
"fail_no_retry" sounds like a better name.

> 
> > +}
> >  
> > -	/*
> > -	 * If the write was asynchronous then no one will be looking for the
> > -	 * error.  If this is the first failure of this type, clear the error
> > -	 * state and write the buffer out again. This means we always retry an
> > -	 * async write failure at least once, but we also need to set the buffer
> > -	 * up to behave correctly now for repeated failures.
> > -	 */
> > +static bool
> > +xfs_buf_ioerror_retry(
> 
> Might be nice to preserve some of this comment, since I initially
> missed that this function both decides whether or not to do the retry
> and sets up the buffer to do that.

I thought I preserved it somewhere... yeah, it's above the
xfs_buf_iodone_error() function now.

> 
> /*
>  * Decide if we're going to retry the write after a failure, and prepare
>  * the buffer for retrying the write.
>  */
> 
> Or, adding some newlines in the outer if body to make the two lines
> that modify the bp state stand out would also help.
> 
> (TBH I'm struggling right now to make sense of what these new functions
> do, though I'm fairly convinced that they at least aren't changing much
> of the functionality...)

I had to break up the IO error handling because the log item error
callbacks for the items attached to the buffer needed to be called
only if we want the higher level to issue retries. Later in this
series we end up with different retry error marking for each type of
buffer, but we only want to do that when the error handling code
itself hasn't done an immediate retry or marked it as a permanent
error.

So I had to break up the function in separate parts so that the
caller could tell exactly what action it needed to take on a
failure.

> > +/*
> > + * On a sync write or shutdown we just want to stale the buffer and let the
> > + * caller handle the error in bp->b_error appropriately.
> > + *
> > + * If the write was asynchronous then no one will be looking for the error.  If
> > + * this is the first failure of this type, clear the error state and write the
> > + * buffer out again. This means we always retry an async write failure at least
> > + * once, but we also need to set the buffer up to behave correctly now for
> > + * repeated failures.
> > + *
> > + * If we get repeated async write failures, then we take action according to the
> > + * error configuration we have been set up to use.
> > + *
> > + * Multi-state return value:
> > + *
> > + * 0: clear IO error retry state and run callback completions
> > + * 1: resubmitted immediately, do not run any completions
> > + * 2: transient error, run failure callback completions and then
> > + *    release the buffer
> 
> Feels odd not to use an enum here, but as this is a static function
> maybe it's not a high risk for screwing up in the callers.

I can change it to use an enum. I wrote this expecting that this
code will get further factored and moved to xfs_buf.c once all the
mods have been made and everything settles down. That's about 3-4
patch series down the road at this point, though, so <shrug>. At
least changes in this patch largely don't affect the rest of this
patchset....
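
As a rough sketch of that direction (the names below are purely
illustrative, not from any posted patch), the bare 0/1/2 values
could become something like:

enum xfs_buf_ioerror_disposition {
	XBF_IOERROR_FINISH,		/* clear retry state, run completions */
	XBF_IOERROR_RESUBMITTED,	/* resubmitted, run no completions */
	XBF_IOERROR_FAIL,		/* transient, run failure callbacks */
};

and the iodone callbacks would then switch on the disposition rather
than comparing against magic integers.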

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 16/30] xfs: pin inode backing buffer to the inode log item
  2020-06-01 21:42 ` [PATCH 16/30] xfs: pin inode backing buffer to the inode log item Dave Chinner
@ 2020-06-02 22:30   ` Darrick J. Wong
  2020-06-02 22:53     ` Dave Chinner
  2020-06-03 18:58   ` Brian Foster
  1 sibling, 1 reply; 80+ messages in thread
From: Darrick J. Wong @ 2020-06-02 22:30 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Jun 02, 2020 at 07:42:37AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When we dirty an inode, we are going to have to write it to disk at
> some point in the near future. This requires the inode cluster
> backing buffer to be present in memory. Unfortunately, under severe
> memory pressure we can reclaim the inode backing buffer while the
> inode is dirty in memory, resulting in stalling the AIL pushing
> because it has to do a read-modify-write cycle on the cluster
> buffer.
> 
> When we have no memory available, the read of the cluster buffer
> blocks the AIL pushing process, and this causes all sorts of issues
> for memory reclaim as it requires inode writeback to make forwards
> progress. Allocating a cluster buffer causes more memory pressure,
> and results in more cluster buffers being reclaimed, resulting in
> more RMW cycles being done in the AIL context and everything then
> backs up on AIL progress. Only the synchronous inode cluster
> writeback in the inode reclaim code provides some level of
> forwards progress guarantees that prevent OOM-killer rampages in
> this situation.
> 
> Fix this by pinning the inode backing buffer to the inode log item
> when the inode is first dirtied (i.e. in xfs_trans_log_inode()).
> This means the first modification of an inode that has been held
> in cache for a long time may block on a cluster buffer read, but
> we can do that in transaction context and block safely until the
> buffer has been allocated and read.
> 
> Once we have the cluster buffer, the inode log item takes a
> reference to it, pinning it in memory, and attaches it to the log
> item for future reference. This means we can always grab the cluster
> buffer from the inode log item when we need it.
> 
> When the inode is finally cleaned and removed from the AIL, we can
> drop the reference the inode log item holds on the cluster buffer.
> Once all inodes on the cluster buffer are clean, the cluster buffer
> will be unpinned and it will be available for memory reclaim to
> reclaim again.
> 
> This avoids the issues with needing to do RMW cycles in the AIL
> pushing context, and hence allows complete non-blocking inode
> flushing to be performed by the AIL pushing context.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_inode_buf.c   |  3 +-
>  fs/xfs/libxfs/xfs_trans_inode.c | 53 +++++++++++++++++++++---
>  fs/xfs/xfs_buf_item.c           |  4 +-
>  fs/xfs/xfs_inode_item.c         | 73 +++++++++++++++++++++++++++------
>  fs/xfs/xfs_trans_ail.c          |  8 +++-
>  5 files changed, 117 insertions(+), 24 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> index 6f84ea85fdd83..1af97235785c8 100644
> --- a/fs/xfs/libxfs/xfs_inode_buf.c
> +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> @@ -176,7 +176,8 @@ xfs_imap_to_bp(
>  	}
>  
>  	*bpp = bp;
> -	*dipp = xfs_buf_offset(bp, imap->im_boffset);
> +	if (dipp)
> +		*dipp = xfs_buf_offset(bp, imap->im_boffset);
>  	return 0;
>  }
>  
> diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
> index fe6c2e39be85d..1e7147b90725e 100644
> --- a/fs/xfs/libxfs/xfs_trans_inode.c
> +++ b/fs/xfs/libxfs/xfs_trans_inode.c
> @@ -8,6 +8,8 @@
>  #include "xfs_shared.h"
>  #include "xfs_format.h"
>  #include "xfs_log_format.h"
> +#include "xfs_trans_resv.h"
> +#include "xfs_mount.h"
>  #include "xfs_inode.h"
>  #include "xfs_trans.h"
>  #include "xfs_trans_priv.h"
> @@ -72,13 +74,19 @@ xfs_trans_ichgtime(
>  }
>  
>  /*
> - * This is called to mark the fields indicated in fieldmask as needing
> - * to be logged when the transaction is committed.  The inode must
> - * already be associated with the given transaction.
> + * This is called to mark the fields indicated in fieldmask as needing to be
> + * logged when the transaction is committed.  The inode must already be
> + * associated with the given transaction.
>   *
> - * The values for fieldmask are defined in xfs_inode_item.h.  We always
> - * log all of the core inode if any of it has changed, and we always log
> - * all of the inline data/extents/b-tree root if any of them has changed.
> + * The values for fieldmask are defined in xfs_inode_item.h.  We always log all
> + * of the core inode if any of it has changed, and we always log all of the
> + * inline data/extents/b-tree root if any of them has changed.
> + *
> + * Grab and pin the cluster buffer associated with this inode to avoid RMW
> + * cycles at inode writeback time. Avoid the need to add error handling to every
> + * xfs_trans_log_inode() call by shutting down on read error.  This will cause
> + * transactions to fail and everything to error out, just like if we return a
> + * read error in a dirty transaction and cancel it.
>   */
>  void
>  xfs_trans_log_inode(
> @@ -132,6 +140,39 @@ xfs_trans_log_inode(
>  	spin_lock(&iip->ili_lock);
>  	iip->ili_fsync_fields |= flags;
>  
> +	if (!iip->ili_item.li_buf) {
> +		struct xfs_buf	*bp;
> +		int		error;
> +
> +		/*
> +		 * We hold the ILOCK here, so this inode is not going to be
> +		 * flushed while we are here. Further, because there is no
> +		 * buffer attached to the item, we know that there is no IO in
> +		 * progress, so nothing will clear the ili_fields while we read
> +		 * in the buffer. Hence we can safely drop the spin lock and
> +		 * read the buffer knowing that the state will not change from
> +		 * here.
> +		 */
> +		spin_unlock(&iip->ili_lock);
> +		error = xfs_imap_to_bp(ip->i_mount, tp, &ip->i_imap, NULL,
> +					&bp, 0);
> +		if (error) {
> +			xfs_force_shutdown(ip->i_mount, SHUTDOWN_META_IO_ERROR);
> +			return;
> +		}
> +
> +		/*
> +		 * We need an explicit buffer reference for the log item but
> +		 * don't want the buffer to remain attached to the transaction.
> +		 * Hold the buffer but release the transaction reference.
> +		 */
> +		xfs_buf_hold(bp);
> +		xfs_trans_brelse(tp, bp);
> +
> +		spin_lock(&iip->ili_lock);
> +		iip->ili_item.li_buf = bp;
> +	}
> +
>  	/*
>  	 * Always OR in the bits from the ili_last_fields field.  This is to
>  	 * coordinate with the xfs_iflush() and xfs_iflush_done() routines in
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index 2364a9aa2d71a..9739d64a46443 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
> @@ -1131,11 +1131,9 @@ xfs_buf_inode_iodone(
>  		if (ret == 1)
>  			return;
>  		ASSERT(ret == 2);
> -		spin_lock(&bp->b_mount->m_ail->ail_lock);
>  		list_for_each_entry(lip, &bp->b_li_list, li_bio_list) {
> -			xfs_set_li_failed(lip, bp);
> +			set_bit(XFS_LI_FAILED, &lip->li_flags);

Hm.  So if I read this right, for inode buffers we set/clear LI_FAILED
directly (i.e. without messing with li_buf) because for inodes we want
to manage the pointer directly without LI_FAILED messing with it.  That
way we can attach the buffer to the item when we dirty the inode, and
release it when iflush is finished (or aborts).  Dquots retain the old
behavior (grab the buffer only while we're checkpointing a dquot item)
which is why the v1 series crashed in xfs/438, so we have to leave
xfs_set/clear_li_failed alone for now.  Right?

If so,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D
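
The asymmetry described above is also visible further down in this same
patch, where xfsaild_resubmit_item() now has to clear the failed state
differently depending on the buffer type (condensed from the
xfs_trans_ail.c hunk below):

	list_for_each_entry(lip, &bp->b_li_list, li_bio_list) {
		if (bp->b_flags & _XBF_INODES)
			/* inode items manage li_buf themselves */
			clear_bit(XFS_LI_FAILED, &lip->li_flags);
		else
			/* dquots: helper still manages li_buf */
			xfs_clear_li_failed(lip);
	}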


>  		}
> -		spin_unlock(&bp->b_mount->m_ail->ail_lock);
>  		xfs_buf_relse(bp);
>  		return;
>  	}
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index 0ba75764a8dc5..0a7720b7a821a 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -130,6 +130,8 @@ xfs_inode_item_size(
>  	xfs_inode_item_data_fork_size(iip, nvecs, nbytes);
>  	if (XFS_IFORK_Q(ip))
>  		xfs_inode_item_attr_fork_size(iip, nvecs, nbytes);
> +
> +	ASSERT(iip->ili_item.li_buf);
>  }
>  
>  STATIC void
> @@ -439,6 +441,7 @@ xfs_inode_item_pin(
>  	struct xfs_inode	*ip = INODE_ITEM(lip)->ili_inode;
>  
>  	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
> +	ASSERT(lip->li_buf);
>  
>  	trace_xfs_inode_pin(ip, _RET_IP_);
>  	atomic_inc(&ip->i_pincount);
> @@ -450,6 +453,12 @@ xfs_inode_item_pin(
>   * item which was previously pinned with a call to xfs_inode_item_pin().
>   *
>   * Also wake up anyone in xfs_iunpin_wait() if the count goes to 0.
> + *
> + * Note that unpin can race with inode cluster buffer freeing marking the buffer
> + * stale. In that case, flush completions are run from the buffer unpin call,
> + * which may happen before the inode is unpinned. If we lose the race, there
> + * will be no buffer attached to the log item, but the inode will be marked
> + * XFS_ISTALE.
>   */
>  STATIC void
>  xfs_inode_item_unpin(
> @@ -459,6 +468,7 @@ xfs_inode_item_unpin(
>  	struct xfs_inode	*ip = INODE_ITEM(lip)->ili_inode;
>  
>  	trace_xfs_inode_unpin(ip, _RET_IP_);
> +	ASSERT(lip->li_buf || xfs_iflags_test(ip, XFS_ISTALE));
>  	ASSERT(atomic_read(&ip->i_pincount) > 0);
>  	if (atomic_dec_and_test(&ip->i_pincount))
>  		wake_up_bit(&ip->i_flags, __XFS_IPINNED_BIT);
> @@ -629,10 +639,15 @@ xfs_inode_item_init(
>   */
>  void
>  xfs_inode_item_destroy(
> -	xfs_inode_t	*ip)
> +	struct xfs_inode	*ip)
>  {
> -	kmem_free(ip->i_itemp->ili_item.li_lv_shadow);
> -	kmem_cache_free(xfs_ili_zone, ip->i_itemp);
> +	struct xfs_inode_log_item *iip = ip->i_itemp;
> +
> +	ASSERT(iip->ili_item.li_buf == NULL);
> +
> +	ip->i_itemp = NULL;
> +	kmem_free(iip->ili_item.li_lv_shadow);
> +	kmem_cache_free(xfs_ili_zone, iip);
>  }
>  
>  
> @@ -647,6 +662,13 @@ xfs_inode_item_destroy(
>   * list for other inodes that will run this function. We remove them from the
>   * buffer list so we can process all the inode IO completions in one AIL lock
>   * traversal.
> + *
> + * Note: Now that we attach the log item to the buffer when we first log the
> + * inode in memory, we can have unflushed inodes on the buffer list here. These
> + * inodes will have a zero ili_last_fields, so skip over them here. We do
> + * this check -after- we've checked for stale inodes, because we're guaranteed
> + * to have XFS_ISTALE set in the case that dirty inodes are in the CIL and have
> + * not yet had their dirtying transactions committed to disk.
>   */
>  void
>  xfs_iflush_done(
> @@ -670,14 +692,16 @@ xfs_iflush_done(
>  			continue;
>  		}
>  
> +		if (!iip->ili_last_fields)
> +			continue;
> +
>  		list_move_tail(&lip->li_bio_list, &tmp);
>  
>  		/* Do an unlocked check for needing the AIL lock. */
> -		if (lip->li_lsn == iip->ili_flush_lsn ||
> +		if (iip->ili_flush_lsn == lip->li_lsn ||
>  		    test_bit(XFS_LI_FAILED, &lip->li_flags))
>  			need_ail++;
>  	}
> -	ASSERT(list_empty(&bp->b_li_list));
>  
>  	/*
>  	 * We only want to pull the item from the AIL if it is actually there
> @@ -690,7 +714,7 @@ xfs_iflush_done(
>  		/* this is an opencoded batch version of xfs_trans_ail_delete */
>  		spin_lock(&ailp->ail_lock);
>  		list_for_each_entry(lip, &tmp, li_bio_list) {
> -			xfs_clear_li_failed(lip);
> +			clear_bit(XFS_LI_FAILED, &lip->li_flags);
>  			if (lip->li_lsn == INODE_ITEM(lip)->ili_flush_lsn) {
>  				xfs_lsn_t lsn = xfs_ail_delete_one(ailp, lip);
>  				if (!tail_lsn && lsn)
> @@ -706,14 +730,29 @@ xfs_iflush_done(
>  	 * them is safely on disk.
>  	 */
>  	list_for_each_entry_safe(lip, n, &tmp, li_bio_list) {
> +		bool	drop_buffer = false;
> +
>  		list_del_init(&lip->li_bio_list);
>  		iip = INODE_ITEM(lip);
>  
>  		spin_lock(&iip->ili_lock);
> +
> +		/*
> +		 * Remove the reference to the cluster buffer if the inode is
> +		 * clean in memory. Drop the buffer reference once we've dropped
> +		 * the locks we hold.
> +		 */
> +		ASSERT(iip->ili_item.li_buf == bp);
> +		if (!iip->ili_fields) {
> +			iip->ili_item.li_buf = NULL;
> +			drop_buffer = true;
> +		}
>  		iip->ili_last_fields = 0;
> +		iip->ili_flush_lsn = 0;
>  		spin_unlock(&iip->ili_lock);
> -
>  		xfs_ifunlock(iip->ili_inode);
> +		if (drop_buffer)
> +			xfs_buf_rele(bp);
>  	}
>  }
>  
> @@ -725,12 +764,20 @@ xfs_iflush_done(
>   */
>  void
>  xfs_iflush_abort(
> -	struct xfs_inode		*ip)
> +	struct xfs_inode	*ip)
>  {
> -	struct xfs_inode_log_item	*iip = ip->i_itemp;
> +	struct xfs_inode_log_item *iip = ip->i_itemp;
> +	struct xfs_buf		*bp = NULL;
>  
>  	if (iip) {
> +		/*
> +		 * Clear the failed bit before removing the item from the AIL so
> +		 * xfs_trans_ail_delete() doesn't try to clear and release the
> +		 * buffer attached to the log item before we are done with it.
> +		 */
> +		clear_bit(XFS_LI_FAILED, &iip->ili_item.li_flags);
>  		xfs_trans_ail_delete(&iip->ili_item, 0);
> +
>  		/*
>  		 * Clear the inode logging fields so no more flushes are
>  		 * attempted.
> @@ -739,12 +786,14 @@ xfs_iflush_abort(
>  		iip->ili_last_fields = 0;
>  		iip->ili_fields = 0;
>  		iip->ili_fsync_fields = 0;
> +		iip->ili_flush_lsn = 0;
> +		bp = iip->ili_item.li_buf;
> +		iip->ili_item.li_buf = NULL;
>  		spin_unlock(&iip->ili_lock);
>  	}
> -	/*
> -	 * Release the inode's flush lock since we're done with it.
> -	 */
>  	xfs_ifunlock(ip);
> +	if (bp)
> +		xfs_buf_rele(bp);
>  }
>  
>  /*
> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> index ac33f6393f99c..c3be6e4401343 100644
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -377,8 +377,12 @@ xfsaild_resubmit_item(
>  	}
>  
>  	/* protected by ail_lock */
> -	list_for_each_entry(lip, &bp->b_li_list, li_bio_list)
> -		xfs_clear_li_failed(lip);
> +	list_for_each_entry(lip, &bp->b_li_list, li_bio_list) {
> +		if (bp->b_flags & _XBF_INODES)
> +			clear_bit(XFS_LI_FAILED, &lip->li_flags);
> +		else
> +			xfs_clear_li_failed(lip);
> +	}
>  
>  	xfs_buf_unlock(bp);
>  	return XFS_ITEM_SUCCESS;
> -- 
> 2.26.2.761.g0e0b3e54be
> 


* Re: [PATCH 18/30] xfs: remove IO submission from xfs_reclaim_inode()
  2020-06-01 21:42 ` [PATCH 18/30] xfs: remove IO submission from xfs_reclaim_inode() Dave Chinner
@ 2020-06-02 22:36   ` Darrick J. Wong
  0 siblings, 0 replies; 80+ messages in thread
From: Darrick J. Wong @ 2020-06-02 22:36 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Jun 02, 2020 at 07:42:39AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> We no longer need to issue IO from shrinker based inode reclaim to
> prevent spurious OOM killer invocation. This leaves only the global
> filesystem management operations such as unmount needing to
> write back dirty inodes and reclaim them.
> 
> Instead of using the reclaim pass to write dirty inodes before
> reclaiming them, use the AIL to push all the dirty inodes before we
> try to reclaim them. This allows us to remove all the conditional
> SYNC_WAIT locking and the writeback code from xfs_reclaim_inode()
> and greatly simplify the checks we need to do to reclaim an inode.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Looks good this time around,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D
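
With IO submission gone from the per-inode path, the blocking behaviour
callers relied on moves up into xfs_reclaim_inodes() itself: push the
AIL to clean everything, then rescan until no inodes are skipped.
Condensed from the hunk below:

	int	nr_to_scan = INT_MAX;
	int	skipped;

	xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan);
	if (!(mode & SYNC_WAIT))
		return 0;

	do {
		/* clean dirty inodes via AIL pushing, then rescan */
		xfs_ail_push_all_sync(mp->m_ail);
		skipped = xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan);
	} while (skipped > 0);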

> ---
>  fs/xfs/xfs_icache.c | 117 ++++++++++++--------------------------------
>  1 file changed, 31 insertions(+), 86 deletions(-)
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index a6780942034fc..74032316ce5cc 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -1111,24 +1111,17 @@ xfs_reclaim_inode_grab(
>   *	dirty, async	=> requeue
>   *	dirty, sync	=> flush, wait and reclaim
>   */
> -STATIC int
> +static bool
>  xfs_reclaim_inode(
>  	struct xfs_inode	*ip,
>  	struct xfs_perag	*pag,
>  	int			sync_mode)
>  {
> -	struct xfs_buf		*bp = NULL;
>  	xfs_ino_t		ino = ip->i_ino; /* for radix_tree_delete */
> -	int			error;
>  
> -restart:
> -	error = 0;
>  	xfs_ilock(ip, XFS_ILOCK_EXCL);
> -	if (!xfs_iflock_nowait(ip)) {
> -		if (!(sync_mode & SYNC_WAIT))
> -			goto out;
> -		xfs_iflock(ip);
> -	}
> +	if (!xfs_iflock_nowait(ip))
> +		goto out;
>  
>  	if (XFS_FORCED_SHUTDOWN(ip->i_mount)) {
>  		xfs_iunpin_wait(ip);
> @@ -1136,52 +1129,12 @@ xfs_reclaim_inode(
>  		xfs_iflush_abort(ip);
>  		goto reclaim;
>  	}
> -	if (xfs_ipincount(ip)) {
> -		if (!(sync_mode & SYNC_WAIT))
> -			goto out_ifunlock;
> -		xfs_iunpin_wait(ip);
> -	}
> -	if (xfs_inode_clean(ip)) {
> -		xfs_ifunlock(ip);
> -		goto reclaim;
> -	}
> -
> -	/*
> -	 * Never flush out dirty data during non-blocking reclaim, as it would
> -	 * just contend with AIL pushing trying to do the same job.
> -	 */
> -	if (!(sync_mode & SYNC_WAIT))
> +	if (xfs_ipincount(ip))
> +		goto out_ifunlock;
> +	if (!xfs_inode_clean(ip))
>  		goto out_ifunlock;
>  
> -	/*
> -	 * Now we have an inode that needs flushing.
> -	 *
> -	 * Note that xfs_iflush will never block on the inode buffer lock, as
> -	 * xfs_ifree_cluster() can lock the inode buffer before it locks the
> -	 * ip->i_lock, and we are doing the exact opposite here.  As a result,
> -	 * doing a blocking xfs_imap_to_bp() to get the cluster buffer would
> -	 * result in an ABBA deadlock with xfs_ifree_cluster().
> -	 *
> -	 * As xfs_ifree_cluser() must gather all inodes that are active in the
> -	 * cache to mark them stale, if we hit this case we don't actually want
> -	 * to do IO here - we want the inode marked stale so we can simply
> -	 * reclaim it.  Hence if we get an EAGAIN error here,  just unlock the
> -	 * inode, back off and try again.  Hopefully the next pass through will
> -	 * see the stale flag set on the inode.
> -	 */
> -	error = xfs_iflush(ip, &bp);
> -	if (error == -EAGAIN) {
> -		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> -		/* backoff longer than in xfs_ifree_cluster */
> -		delay(2);
> -		goto restart;
> -	}
> -
> -	if (!error) {
> -		error = xfs_bwrite(bp);
> -		xfs_buf_relse(bp);
> -	}
> -
> +	xfs_ifunlock(ip);
>  reclaim:
>  	ASSERT(!xfs_isiflocked(ip));
>  
> @@ -1231,21 +1184,14 @@ xfs_reclaim_inode(
>  	ASSERT(xfs_inode_clean(ip));
>  
>  	__xfs_inode_free(ip);
> -	return error;
> +	return true;
>  
>  out_ifunlock:
>  	xfs_ifunlock(ip);
>  out:
> -	xfs_iflags_clear(ip, XFS_IRECLAIM);
>  	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> -	/*
> -	 * We could return -EAGAIN here to make reclaim rescan the inode tree in
> -	 * a short while. However, this just burns CPU time scanning the tree
> -	 * waiting for IO to complete and the reclaim work never goes back to
> -	 * the idle state. Instead, return 0 to let the next scheduled
> -	 * background reclaim attempt to reclaim the inode again.
> -	 */
> -	return 0;
> +	xfs_iflags_clear(ip, XFS_IRECLAIM);
> +	return false;
>  }
>  
>  /*
> @@ -1253,21 +1199,22 @@ xfs_reclaim_inode(
>   * corrupted, we still want to try to reclaim all the inodes. If we don't,
>   * then a shut down during filesystem unmount reclaim walk leak all the
>   * unreclaimed inodes.
> + *
> + * Returns non-zero if any AGs or inodes were skipped in the reclaim pass
> + * so that callers that want to block until all dirty inodes are written back
> + * and reclaimed can sanely loop.
>   */
> -STATIC int
> +static int
>  xfs_reclaim_inodes_ag(
>  	struct xfs_mount	*mp,
>  	int			flags,
>  	int			*nr_to_scan)
>  {
>  	struct xfs_perag	*pag;
> -	int			error = 0;
> -	int			last_error = 0;
>  	xfs_agnumber_t		ag;
>  	int			trylock = flags & SYNC_TRYLOCK;
>  	int			skipped;
>  
> -restart:
>  	ag = 0;
>  	skipped = 0;
>  	while ((pag = xfs_perag_get_tag(mp, ag, XFS_ICI_RECLAIM_TAG))) {
> @@ -1341,9 +1288,8 @@ xfs_reclaim_inodes_ag(
>  			for (i = 0; i < nr_found; i++) {
>  				if (!batch[i])
>  					continue;
> -				error = xfs_reclaim_inode(batch[i], pag, flags);
> -				if (error && last_error != -EFSCORRUPTED)
> -					last_error = error;
> +				if (!xfs_reclaim_inode(batch[i], pag, flags))
> +					skipped++;
>  			}
>  
>  			*nr_to_scan -= XFS_LOOKUP_BATCH;
> @@ -1359,19 +1305,7 @@ xfs_reclaim_inodes_ag(
>  		mutex_unlock(&pag->pag_ici_reclaim_lock);
>  		xfs_perag_put(pag);
>  	}
> -
> -	/*
> -	 * if we skipped any AG, and we still have scan count remaining, do
> -	 * another pass this time using blocking reclaim semantics (i.e
> -	 * waiting on the reclaim locks and ignoring the reclaim cursors). This
> -	 * ensure that when we get more reclaimers than AGs we block rather
> -	 * than spin trying to execute reclaim.
> -	 */
> -	if (skipped && (flags & SYNC_WAIT) && *nr_to_scan > 0) {
> -		trylock = 0;
> -		goto restart;
> -	}
> -	return last_error;
> +	return skipped;
>  }
>  
>  int
> @@ -1380,8 +1314,18 @@ xfs_reclaim_inodes(
>  	int		mode)
>  {
>  	int		nr_to_scan = INT_MAX;
> +	int		skipped;
>  
> -	return xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan);
> +	xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan);
> +	if (!(mode & SYNC_WAIT))
> +		return 0;
> +
> +	do {
> +		xfs_ail_push_all_sync(mp->m_ail);
> +		skipped = xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan);
> +	} while (skipped > 0);
> +
> +	return 0;
>  }
>  
>  /*
> @@ -1402,7 +1346,8 @@ xfs_reclaim_inodes_nr(
>  	xfs_reclaim_work_queue(mp);
>  	xfs_ail_push_all(mp->m_ail);
>  
> -	return xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK, &nr_to_scan);
> +	xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK, &nr_to_scan);
> +	return 0;
>  }
>  
>  /*
> -- 
> 2.26.2.761.g0e0b3e54be
> 


* Re: [PATCH 22/30] xfs: remove SYNC_WAIT from xfs_reclaim_inodes()
  2020-06-01 21:42 ` [PATCH 22/30] xfs: remove SYNC_WAIT from xfs_reclaim_inodes() Dave Chinner
@ 2020-06-02 22:43   ` Darrick J. Wong
  0 siblings, 0 replies; 80+ messages in thread
From: Darrick J. Wong @ 2020-06-02 22:43 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Jun 02, 2020 at 07:42:43AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Clean up xfs_reclaim_inodes() callers. Most callers want blocking
> behaviour, so just make the existing SYNC_WAIT behaviour the
> default.
> 
> For the xfs_reclaim_worker(), just call xfs_reclaim_inodes_ag()
> directly because we just want optimistic clean inode reclaim to be
> done in the background.
> 
> For xfs_quiesce_attr() we can just remove the inode reclaim calls as
> they are a historic relic that was required to flush dirty inodes
> that contained unlogged changes. We now log all changes to the
> inodes, so the sync AIL push from xfs_log_quiesce() called by
> xfs_quiesce_attr() will do all the required inode writeback for
> freeze.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Heh, neat,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D
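
The resulting split is easiest to see in the relocated worker, which
now bypasses xfs_reclaim_inodes() entirely and calls the non-blocking
per-AG scan directly (condensed from the hunk below):

	void
	xfs_reclaim_worker(
		struct work_struct *work)
	{
		struct xfs_mount *mp = container_of(to_delayed_work(work),
						struct xfs_mount, m_reclaim_work);
		int		nr_to_scan = INT_MAX;

		/* optimistic, non-blocking background reclaim */
		xfs_reclaim_inodes_ag(mp, &nr_to_scan);
		xfs_reclaim_work_queue(mp);
	}

Everything else calls xfs_reclaim_inodes(mp), which now always blocks
until reclaim completes.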

> ---
>  fs/xfs/xfs_icache.c | 48 ++++++++++++++++++++-------------------------
>  fs/xfs/xfs_icache.h |  2 +-
>  fs/xfs/xfs_mount.c  | 11 +++++------
>  fs/xfs/xfs_super.c  |  3 ---
>  4 files changed, 27 insertions(+), 37 deletions(-)
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index ebe55124d6cb8..a27470fc201ff 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -160,24 +160,6 @@ xfs_reclaim_work_queue(
>  	rcu_read_unlock();
>  }
>  
> -/*
> - * This is a fast pass over the inode cache to try to get reclaim moving on as
> - * many inodes as possible in a short period of time. It kicks itself every few
> - * seconds, as well as being kicked by the inode cache shrinker when memory
> - * goes low. It scans as quickly as possible avoiding locked inodes or those
> - * already being flushed, and once done schedules a future pass.
> - */
> -void
> -xfs_reclaim_worker(
> -	struct work_struct *work)
> -{
> -	struct xfs_mount *mp = container_of(to_delayed_work(work),
> -					struct xfs_mount, m_reclaim_work);
> -
> -	xfs_reclaim_inodes(mp, 0);
> -	xfs_reclaim_work_queue(mp);
> -}
> -
>  static void
>  xfs_perag_set_reclaim_tag(
>  	struct xfs_perag	*pag)
> @@ -1298,24 +1280,17 @@ xfs_reclaim_inodes_ag(
>  	return skipped;
>  }
>  
> -int
> +void
>  xfs_reclaim_inodes(
> -	xfs_mount_t	*mp,
> -	int		mode)
> +	struct xfs_mount	*mp)
>  {
>  	int		nr_to_scan = INT_MAX;
>  	int		skipped;
>  
> -	xfs_reclaim_inodes_ag(mp, &nr_to_scan);
> -	if (!(mode & SYNC_WAIT))
> -		return 0;
> -
>  	do {
>  		xfs_ail_push_all_sync(mp->m_ail);
>  		skipped = xfs_reclaim_inodes_ag(mp, &nr_to_scan);
>  	} while (skipped > 0);
> -
> -	return 0;
>  }
>  
>  /*
> @@ -1434,6 +1409,25 @@ xfs_inode_matches_eofb(
>  	return true;
>  }
>  
> +/*
> + * This is a fast pass over the inode cache to try to get reclaim moving on as
> + * many inodes as possible in a short period of time. It kicks itself every few
> + * seconds, as well as being kicked by the inode cache shrinker when memory
> + * goes low. It scans as quickly as possible avoiding locked inodes or those
> + * already being flushed, and once done schedules a future pass.
> + */
> +void
> +xfs_reclaim_worker(
> +	struct work_struct *work)
> +{
> +	struct xfs_mount *mp = container_of(to_delayed_work(work),
> +					struct xfs_mount, m_reclaim_work);
> +	int		nr_to_scan = INT_MAX;
> +
> +	xfs_reclaim_inodes_ag(mp, &nr_to_scan);
> +	xfs_reclaim_work_queue(mp);
> +}
> +
>  STATIC int
>  xfs_inode_free_eofblocks(
>  	struct xfs_inode	*ip,
> diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
> index 93b54e7d55f0d..ae92ca53de423 100644
> --- a/fs/xfs/xfs_icache.h
> +++ b/fs/xfs/xfs_icache.h
> @@ -51,7 +51,7 @@ void xfs_inode_free(struct xfs_inode *ip);
>  
>  void xfs_reclaim_worker(struct work_struct *work);
>  
> -int xfs_reclaim_inodes(struct xfs_mount *mp, int mode);
> +void xfs_reclaim_inodes(struct xfs_mount *mp);
>  int xfs_reclaim_inodes_count(struct xfs_mount *mp);
>  long xfs_reclaim_inodes_nr(struct xfs_mount *mp, int nr_to_scan);
>  
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index 03158b42a1943..c8ae49a1e99c3 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -1011,7 +1011,7 @@ xfs_mountfs(
>  	 * quota inodes.
>  	 */
>  	cancel_delayed_work_sync(&mp->m_reclaim_work);
> -	xfs_reclaim_inodes(mp, SYNC_WAIT);
> +	xfs_reclaim_inodes(mp);
>  	xfs_health_unmount(mp);
>   out_log_dealloc:
>  	mp->m_flags |= XFS_MOUNT_UNMOUNTING;
> @@ -1088,13 +1088,12 @@ xfs_unmountfs(
>  	xfs_ail_push_all_sync(mp->m_ail);
>  
>  	/*
> -	 * And reclaim all inodes.  At this point there should be no dirty
> -	 * inodes and none should be pinned or locked, but use synchronous
> -	 * reclaim just to be sure. We can stop background inode reclaim
> -	 * here as well if it is still running.
> +	 * Reclaim all inodes. At this point there should be no dirty inodes and
> +	 * none should be pinned or locked. Stop background inode reclaim here
> +	 * if it is still running.
>  	 */
>  	cancel_delayed_work_sync(&mp->m_reclaim_work);
> -	xfs_reclaim_inodes(mp, SYNC_WAIT);
> +	xfs_reclaim_inodes(mp);
>  	xfs_health_unmount(mp);
>  
>  	xfs_qm_unmount(mp);
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index fa58cb07c8fdf..9b03ea43f4fe7 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -890,9 +890,6 @@ xfs_quiesce_attr(
>  	/* force the log to unpin objects from the now complete transactions */
>  	xfs_log_force(mp, XFS_LOG_SYNC);
>  
> -	/* reclaim inodes to do any IO before the freeze completes */
> -	xfs_reclaim_inodes(mp, 0);
> -	xfs_reclaim_inodes(mp, SYNC_WAIT);
>  
>  	/* Push the superblock and write an unmount record */
>  	error = xfs_log_sbcount(mp);
> -- 
> 2.26.2.761.g0e0b3e54be
> 


* Re: [PATCH 23/30] xfs: clean up inode reclaim comments
  2020-06-01 21:42 ` [PATCH 23/30] xfs: clean up inode reclaim comments Dave Chinner
@ 2020-06-02 22:45   ` Darrick J. Wong
  0 siblings, 0 replies; 80+ messages in thread
From: Darrick J. Wong @ 2020-06-02 22:45 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Jun 02, 2020 at 07:42:44AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Inode reclaim is quite different now to the way described in various
> comments, so update all the comments explaining what it does and how
> it works.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Looks ok,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/xfs/xfs_icache.c | 128 ++++++++++++--------------------------------
>  1 file changed, 35 insertions(+), 93 deletions(-)
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index a27470fc201ff..4fe6f250e8448 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -141,11 +141,8 @@ xfs_inode_free(
>  }
>  
>  /*
> - * Queue a new inode reclaim pass if there are reclaimable inodes and there
> - * isn't a reclaim pass already in progress. By default it runs every 5s based
> - * on the xfs periodic sync default of 30s. Perhaps this should have it's own
> - * tunable, but that can be done if this method proves to be ineffective or too
> - * aggressive.
> + * Queue background inode reclaim work if there are reclaimable inodes and there
> + * isn't reclaim work already scheduled or in progress.
>   */
>  static void
>  xfs_reclaim_work_queue(
> @@ -600,48 +597,31 @@ xfs_iget_cache_miss(
>  }
>  
>  /*
> - * Look up an inode by number in the given file system.
> - * The inode is looked up in the cache held in each AG.
> - * If the inode is found in the cache, initialise the vfs inode
> - * if necessary.
> + * Look up an inode by number in the given file system.  The inode is looked up
> + * in the cache held in each AG.  If the inode is found in the cache, initialise
> + * the vfs inode if necessary.
>   *
> - * If it is not in core, read it in from the file system's device,
> - * add it to the cache and initialise the vfs inode.
> + * If it is not in core, read it in from the file system's device, add it to the
> + * cache and initialise the vfs inode.
>   *
>   * The inode is locked according to the value of the lock_flags parameter.
> - * This flag parameter indicates how and if the inode's IO lock and inode lock
> - * should be taken.
> - *
> - * mp -- the mount point structure for the current file system.  It points
> - *       to the inode hash table.
> - * tp -- a pointer to the current transaction if there is one.  This is
> - *       simply passed through to the xfs_iread() call.
> - * ino -- the number of the inode desired.  This is the unique identifier
> - *        within the file system for the inode being requested.
> - * lock_flags -- flags indicating how to lock the inode.  See the comment
> - *		 for xfs_ilock() for a list of valid values.
> + * Inode lookup is only done during metadata operations and not as part of the
> + * data IO path. Hence we only allow locking of the XFS_ILOCK during lookup.
>   */
>  int
>  xfs_iget(
> -	xfs_mount_t	*mp,
> -	xfs_trans_t	*tp,
> -	xfs_ino_t	ino,
> -	uint		flags,
> -	uint		lock_flags,
> -	xfs_inode_t	**ipp)
> +	struct xfs_mount	*mp,
> +	struct xfs_trans	*tp,
> +	xfs_ino_t		ino,
> +	uint			flags,
> +	uint			lock_flags,
> +	struct xfs_inode	**ipp)
>  {
> -	xfs_inode_t	*ip;
> -	int		error;
> -	xfs_perag_t	*pag;
> -	xfs_agino_t	agino;
> +	struct xfs_inode	*ip;
> +	struct xfs_perag	*pag;
> +	xfs_agino_t		agino;
> +	int			error;
>  
> -	/*
> -	 * xfs_reclaim_inode() uses the ILOCK to ensure an inode
> -	 * doesn't get freed while it's being referenced during a
> -	 * radix tree traversal here.  It assumes this function
> -	 * aqcuires only the ILOCK (and therefore it has no need to
> -	 * involve the IOLOCK in this synchronization).
> -	 */
>  	ASSERT((lock_flags & (XFS_IOLOCK_EXCL | XFS_IOLOCK_SHARED)) == 0);
>  
>  	/* reject inode numbers outside existing AGs */
> @@ -758,15 +738,7 @@ xfs_inode_walk_ag_grab(
>  
>  	ASSERT(rcu_read_lock_held());
>  
> -	/*
> -	 * check for stale RCU freed inode
> -	 *
> -	 * If the inode has been reallocated, it doesn't matter if it's not in
> -	 * the AG we are walking - we are walking for writeback, so if it
> -	 * passes all the "valid inode" checks and is dirty, then we'll write
> -	 * it back anyway.  If it has been reallocated and still being
> -	 * initialised, the XFS_INEW check below will catch it.
> -	 */
> +	/* Check for stale RCU freed inode */
>  	spin_lock(&ip->i_flags_lock);
>  	if (!ip->i_ino)
>  		goto out_unlock_noent;
> @@ -1052,43 +1024,16 @@ xfs_reclaim_inode_grab(
>  }
>  
>  /*
> - * Inodes in different states need to be treated differently. The following
> - * table lists the inode states and the reclaim actions necessary:
> - *
> - *	inode state	     iflush ret		required action
> - *      ---------------      ----------         ---------------
> - *	bad			-		reclaim
> - *	shutdown		EIO		unpin and reclaim
> - *	clean, unpinned		0		reclaim
> - *	stale, unpinned		0		reclaim
> - *	clean, pinned(*)	0		requeue
> - *	stale, pinned		EAGAIN		requeue
> - *	dirty, async		-		requeue
> - *	dirty, sync		0		reclaim
> + * Inode reclaim is non-blocking, so the default action if progress cannot be
> + * made is to "requeue" the inode for reclaim by unlocking it and clearing the
> + * XFS_IRECLAIM flag.  If we are in a shutdown state, we don't care about
> + * blocking anymore and hence we can wait for the inode to be able to reclaim
> + * it.
>   *
> - * (*) dgc: I don't think the clean, pinned state is possible but it gets
> - * handled anyway given the order of checks implemented.
> - *
> - * Also, because we get the flush lock first, we know that any inode that has
> - * been flushed delwri has had the flush completed by the time we check that
> - * the inode is clean.
> - *
> - * Note that because the inode is flushed delayed write by AIL pushing, the
> - * flush lock may already be held here and waiting on it can result in very
> - * long latencies.  Hence for sync reclaims, where we wait on the flush lock,
> - * the caller should push the AIL first before trying to reclaim inodes to
> - * minimise the amount of time spent waiting.  For background relaim, we only
> - * bother to reclaim clean inodes anyway.
> - *
> - * Hence the order of actions after gaining the locks should be:
> - *	bad		=> reclaim
> - *	shutdown	=> unpin and reclaim
> - *	pinned, async	=> requeue
> - *	pinned, sync	=> unpin
> - *	stale		=> reclaim
> - *	clean		=> reclaim
> - *	dirty, async	=> requeue
> - *	dirty, sync	=> flush, wait and reclaim
> + * We do no IO here - if callers require inodes to be cleaned they must push the
> + * AIL first to trigger writeback of dirty inodes.  This enables writeback to be
> + * done in the background in a non-blocking manner, and enables memory reclaim
> + * to make progress without blocking.
>   */
>  static bool
>  xfs_reclaim_inode(
> @@ -1294,13 +1239,11 @@ xfs_reclaim_inodes(
>  }
>  
>  /*
> - * Scan a certain number of inodes for reclaim.
> - *
> - * When called we make sure that there is a background (fast) inode reclaim in
> - * progress, while we will throttle the speed of reclaim via doing synchronous
> - * reclaim of inodes. That means if we come across dirty inodes, we wait for
> - * them to be cleaned, which we hope will not be very long due to the
> - * background walker having already kicked the IO off on those dirty inodes.
> + * The shrinker infrastructure determines how many inodes we should scan for
> + * reclaim. We want as many clean inodes ready to reclaim as possible, so we
> + * push the AIL here. We also want to proactively free up memory if we can to
> + * minimise the amount of work memory reclaim has to do so we kick the
> + * background reclaim if it isn't already scheduled.
>   */
>  long
>  xfs_reclaim_inodes_nr(
> @@ -1413,8 +1356,7 @@ xfs_inode_matches_eofb(
>   * This is a fast pass over the inode cache to try to get reclaim moving on as
>   * many inodes as possible in a short period of time. It kicks itself every few
>   * seconds, as well as being kicked by the inode cache shrinker when memory
> - * goes low. It scans as quickly as possible avoiding locked inodes or those
> - * already being flushed, and once done schedules a future pass.
> + * goes low.
>   */
>  void
>  xfs_reclaim_worker(
> -- 
> 2.26.2.761.g0e0b3e54be
> 


* Re: [PATCH 16/30] xfs: pin inode backing buffer to the inode log item
  2020-06-02 22:30   ` Darrick J. Wong
@ 2020-06-02 22:53     ` Dave Chinner
  0 siblings, 0 replies; 80+ messages in thread
From: Dave Chinner @ 2020-06-02 22:53 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Jun 02, 2020 at 03:30:52PM -0700, Darrick J. Wong wrote:
> On Tue, Jun 02, 2020 at 07:42:37AM +1000, Dave Chinner wrote:
> > diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> > index 2364a9aa2d71a..9739d64a46443 100644
> > --- a/fs/xfs/xfs_buf_item.c
> > +++ b/fs/xfs/xfs_buf_item.c
> > @@ -1131,11 +1131,9 @@ xfs_buf_inode_iodone(
> >  		if (ret == 1)
> >  			return;
> >  		ASSERT(ret == 2);
> > -		spin_lock(&bp->b_mount->m_ail->ail_lock);
> >  		list_for_each_entry(lip, &bp->b_li_list, li_bio_list) {
> > -			xfs_set_li_failed(lip, bp);
> > +			set_bit(XFS_LI_FAILED, &lip->li_flags);
> 
> Hm.  So if I read this right, for inode buffers we set/clear LI_FAILED
> directly (i.e. without messing with li_buf) because for inodes we want
> to manage the pointer directly without LI_FAILED messing with it.  That
> way we can attach the buffer to the item when we dirty the inode, and
> release it when iflush is finished (or aborts).  Dquots retain the old
> behavior (grab the buffer only while we're checkpointing a dquot item)
> which is why the v1 series crashed in xfs/438, so we have to leave
> xfs_set/clear_li_failed alone for now.  Right?

Correct. The lip->li_buf pointer is now owned by the inode log item
for inodes, it's not a field that exists purely for buffer error
handling. Any time an inode is attached to the buffer, lip->li_buf
points to the buffer, and hence we no longer need to attach the
buffer to the log item when IO fails to be able to trigger retries.
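
Condensed from the xfs_trans_log_inode() and xfs_iflush_done() hunks
earlier in the thread, the two ends of that reference's life look like
this (locking and error handling elided):

	/* xfs_trans_log_inode(): first dirtying attaches the buffer */
	if (!iip->ili_item.li_buf) {
		error = xfs_imap_to_bp(ip->i_mount, tp, &ip->i_imap,
					NULL, &bp, 0);
		/* read errors shut the filesystem down */
		xfs_buf_hold(bp);		/* reference for the log item */
		xfs_trans_brelse(tp, bp);	/* drop the transaction ref */
		iip->ili_item.li_buf = bp;
	}

	/* xfs_iflush_done(): inode is clean in memory, drop the buffer */
	if (!iip->ili_fields) {
		iip->ili_item.li_buf = NULL;
		drop_buffer = true;		/* xfs_buf_rele(bp) after unlock */
	}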

> If so,
> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

Thanks!

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 24/30] xfs: rework stale inodes in xfs_ifree_cluster
  2020-06-01 21:42 ` [PATCH 24/30] xfs: rework stale inodes in xfs_ifree_cluster Dave Chinner
@ 2020-06-02 23:01   ` Darrick J. Wong
  0 siblings, 0 replies; 80+ messages in thread
From: Darrick J. Wong @ 2020-06-02 23:01 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Jun 02, 2020 at 07:42:45AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Once we have inodes pinning the cluster buffer and attached whenever
> they are dirty, we no longer have a guarantee that the items are
> flush locked when we lock the cluster buffer. Hence we cannot just
> walk the buffer log item list and modify the attached inodes.
> 
> If the inode is not flush locked, we have to ILOCK it first and
> the flush lock it and do all the prerequisite checks needed to avoid

"...and then flush lock it..."

> races with other code. This is already handled by
> xfs_ifree_get_one_inode(), so rework the inode iteration loop and
> function to update all inodes in cache whether they are attached to
> the buffer or not.
> 
> Note: we also remove the copying of the log item lsn to the
> ili_flush_lsn as xfs_iflush_done() now uses the XFS_ISTALE flag to
> trigger aborts and so flush lsn matching is not needed in IO
> completion for processing freed inodes.

Ok.  Thanks for breaking this up a bit since the previous patch.  That
makes it easier to figure out what's going on.

> Signed-off-by: Dave Chinner <dchinner@redhat.com>

With that fixed,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D
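
The key branch in the new xfs_ifree_mark_inode_stale() is the flush
lock trylock: failure means the inode is already attached to the buffer
and the XFS_ISTALE flag alone is enough, while dirty unattached inodes
get their flush state set up and are added to the buffer list
(condensed from the hunk below):

	ip->i_flags |= XFS_ISTALE;
	spin_unlock(&ip->i_flags_lock);
	rcu_read_unlock();

	iip = ip->i_itemp;
	if (!xfs_iflock_nowait(ip)) {
		/* already attached; buffer IO completion will abort it */
		goto out_iunlock;
	}

	if (xfs_inode_clean(ip)) {
		xfs_ifunlock(ip);	/* nothing to track */
		goto out_iunlock;
	}

	/* dirty in memory, not yet flushed: attach to the buffer */
	spin_lock(&iip->ili_lock);
	iip->ili_last_fields = iip->ili_fields;
	iip->ili_fields = 0;
	iip->ili_fsync_fields = 0;
	spin_unlock(&iip->ili_lock);
	list_add_tail(&iip->ili_item.li_bio_list, &bp->b_li_list);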

> ---
>  fs/xfs/xfs_inode.c | 158 ++++++++++++++++++---------------------------
>  1 file changed, 62 insertions(+), 96 deletions(-)
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 272b54cf97000..fb4c614c64fda 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -2517,17 +2517,19 @@ xfs_iunlink_remove(
>  }
>  
>  /*
> - * Look up the inode number specified and mark it stale if it is found. If it is
> - * dirty, return the inode so it can be attached to the cluster buffer so it can
> - * be processed appropriately when the cluster free transaction completes.
> + * Look up the inode number specified and if it is not already marked XFS_ISTALE
> + * mark it stale. We should only find clean inodes in this lookup that aren't
> + * already stale.
>   */
> -static struct xfs_inode *
> -xfs_ifree_get_one_inode(
> -	struct xfs_perag	*pag,
> +static void
> +xfs_ifree_mark_inode_stale(
> +	struct xfs_buf		*bp,
>  	struct xfs_inode	*free_ip,
>  	xfs_ino_t		inum)
>  {
> -	struct xfs_mount	*mp = pag->pag_mount;
> +	struct xfs_mount	*mp = bp->b_mount;
> +	struct xfs_perag	*pag = bp->b_pag;
> +	struct xfs_inode_log_item *iip;
>  	struct xfs_inode	*ip;
>  
>  retry:
> @@ -2535,8 +2537,10 @@ xfs_ifree_get_one_inode(
>  	ip = radix_tree_lookup(&pag->pag_ici_root, XFS_INO_TO_AGINO(mp, inum));
>  
>  	/* Inode not in memory, nothing to do */
> -	if (!ip)
> -		goto out_rcu_unlock;
> +	if (!ip) {
> +		rcu_read_unlock();
> +		return;
> +	}
>  
>  	/*
>  	 * because this is an RCU protected lookup, we could find a recently
> @@ -2547,9 +2551,9 @@ xfs_ifree_get_one_inode(
>  	spin_lock(&ip->i_flags_lock);
>  	if (ip->i_ino != inum || __xfs_iflags_test(ip, XFS_ISTALE)) {
>  		spin_unlock(&ip->i_flags_lock);
> -		goto out_rcu_unlock;
> +		rcu_read_unlock();
> +		return;
>  	}
> -	spin_unlock(&ip->i_flags_lock);
>  
>  	/*
>  	 * Don't try to lock/unlock the current inode, but we _cannot_ skip the
> @@ -2559,43 +2563,53 @@ xfs_ifree_get_one_inode(
>  	 */
>  	if (ip != free_ip) {
>  		if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL)) {
> +			spin_unlock(&ip->i_flags_lock);
>  			rcu_read_unlock();
>  			delay(1);
>  			goto retry;
>  		}
> -
> -		/*
> -		 * Check the inode number again in case we're racing with
> -		 * freeing in xfs_reclaim_inode().  See the comments in that
> -		 * function for more information as to why the initial check is
> -		 * not sufficient.
> -		 */
> -		if (ip->i_ino != inum) {
> -			xfs_iunlock(ip, XFS_ILOCK_EXCL);
> -			goto out_rcu_unlock;
> -		}
>  	}
> +	ip->i_flags |= XFS_ISTALE;
> +	spin_unlock(&ip->i_flags_lock);
>  	rcu_read_unlock();
>  
> -	xfs_iflock(ip);
> -	xfs_iflags_set(ip, XFS_ISTALE);
> +	/*
> +	 * If we can't get the flush lock, the inode is already attached.  All
> +	 * we needed to do here is mark the inode stale so buffer IO completion
> +	 * will remove it from the AIL.
> +	 */
> +	iip = ip->i_itemp;
> +	if (!xfs_iflock_nowait(ip)) {
> +		ASSERT(!list_empty(&iip->ili_item.li_bio_list));
> +		ASSERT(iip->ili_last_fields);
> +		goto out_iunlock;
> +	}
> +	ASSERT(!iip || list_empty(&iip->ili_item.li_bio_list));
>  
>  	/*
> -	 * We don't need to attach clean inodes or those only with unlogged
> -	 * changes (which we throw away, anyway).
> +	 * Clean inodes can be released immediately.  Everything else has to go
> +	 * through xfs_iflush_abort() on journal commit as the flock
> +	 * synchronises removal of the inode from the cluster buffer against
> +	 * inode reclaim.
>  	 */
> -	if (!ip->i_itemp || xfs_inode_clean(ip)) {
> -		ASSERT(ip != free_ip);
> +	if (xfs_inode_clean(ip)) {
>  		xfs_ifunlock(ip);
> -		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> -		goto out_no_inode;
> +		goto out_iunlock;
>  	}
> -	return ip;
>  
> -out_rcu_unlock:
> -	rcu_read_unlock();
> -out_no_inode:
> -	return NULL;
> +	/* we have a dirty inode in memory that has not yet been flushed. */
> +	ASSERT(iip->ili_fields);
> +	spin_lock(&iip->ili_lock);
> +	iip->ili_last_fields = iip->ili_fields;
> +	iip->ili_fields = 0;
> +	iip->ili_fsync_fields = 0;
> +	spin_unlock(&iip->ili_lock);
> +	list_add_tail(&iip->ili_item.li_bio_list, &bp->b_li_list);
> +	ASSERT(iip->ili_last_fields);
> +
> +out_iunlock:
> +	if (ip != free_ip)
> +		xfs_iunlock(ip, XFS_ILOCK_EXCL);
>  }
>  
>  /*
> @@ -2605,26 +2619,20 @@ xfs_ifree_get_one_inode(
>   */
>  STATIC int
>  xfs_ifree_cluster(
> -	xfs_inode_t		*free_ip,
> -	xfs_trans_t		*tp,
> +	struct xfs_inode	*free_ip,
> +	struct xfs_trans	*tp,
>  	struct xfs_icluster	*xic)
>  {
> -	xfs_mount_t		*mp = free_ip->i_mount;
> +	struct xfs_mount	*mp = free_ip->i_mount;
> +	struct xfs_ino_geometry	*igeo = M_IGEO(mp);
> +	struct xfs_buf		*bp;
> +	xfs_daddr_t		blkno;
> +	xfs_ino_t		inum = xic->first_ino;
>  	int			nbufs;
>  	int			i, j;
>  	int			ioffset;
> -	xfs_daddr_t		blkno;
> -	xfs_buf_t		*bp;
> -	xfs_inode_t		*ip;
> -	struct xfs_inode_log_item *iip;
> -	struct xfs_log_item	*lip;
> -	struct xfs_perag	*pag;
> -	struct xfs_ino_geometry	*igeo = M_IGEO(mp);
> -	xfs_ino_t		inum;
>  	int			error;
>  
> -	inum = xic->first_ino;
> -	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, inum));
>  	nbufs = igeo->ialloc_blks / igeo->blocks_per_cluster;
>  
>  	for (j = 0; j < nbufs; j++, inum += igeo->inodes_per_cluster) {
> @@ -2668,59 +2676,16 @@ xfs_ifree_cluster(
>  		bp->b_ops = &xfs_inode_buf_ops;
>  
>  		/*
> -		 * Walk the inodes already attached to the buffer and mark them
> -		 * stale. These will all have the flush locks held, so an
> -		 * in-memory inode walk can't lock them. By marking them all
> -		 * stale first, we will not attempt to lock them in the loop
> -		 * below as the XFS_ISTALE flag will be set.
> -		 */
> -		list_for_each_entry(lip, &bp->b_li_list, li_bio_list) {
> -			if (lip->li_type == XFS_LI_INODE) {
> -				iip = (struct xfs_inode_log_item *)lip;
> -				xfs_trans_ail_copy_lsn(mp->m_ail,
> -							&iip->ili_flush_lsn,
> -							&iip->ili_item.li_lsn);
> -				xfs_iflags_set(iip->ili_inode, XFS_ISTALE);
> -			}
> -		}
> -
> -
> -		/*
> -		 * For each inode in memory attempt to add it to the inode
> -		 * buffer and set it up for being staled on buffer IO
> -		 * completion.  This is safe as we've locked out tail pushing
> -		 * and flushing by locking the buffer.
> -		 *
> -		 * We have already marked every inode that was part of a
> -		 * transaction stale above, which means there is no point in
> -		 * even trying to lock them.
> +		 * Now we need to set all the cached clean inodes as XFS_ISTALE,
> +		 * too. This requires lookups, and will skip inodes that we've
> +		 * already marked XFS_ISTALE.
>  		 */
> -		for (i = 0; i < igeo->inodes_per_cluster; i++) {
> -			ip = xfs_ifree_get_one_inode(pag, free_ip, inum + i);
> -			if (!ip)
> -				continue;
> -
> -			iip = ip->i_itemp;
> -			spin_lock(&iip->ili_lock);
> -			iip->ili_last_fields = iip->ili_fields;
> -			iip->ili_fields = 0;
> -			iip->ili_fsync_fields = 0;
> -			spin_unlock(&iip->ili_lock);
> -			xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,
> -						&iip->ili_item.li_lsn);
> -
> -			list_add_tail(&iip->ili_item.li_bio_list,
> -						&bp->b_li_list);
> -
> -			if (ip != free_ip)
> -				xfs_iunlock(ip, XFS_ILOCK_EXCL);
> -		}
> +		for (i = 0; i < igeo->inodes_per_cluster; i++)
> +			xfs_ifree_mark_inode_stale(bp, free_ip, inum + i);
>  
>  		xfs_trans_stale_inode_buf(tp, bp);
>  		xfs_trans_binval(tp, bp);
>  	}
> -
> -	xfs_perag_put(pag);
>  	return 0;
>  }
>  
> @@ -3845,6 +3810,7 @@ xfs_iflush_int(
>  	iip->ili_fields = 0;
>  	iip->ili_fsync_fields = 0;
>  	spin_unlock(&iip->ili_lock);
> +	ASSERT(iip->ili_last_fields);
>  
>  	/*
>  	 * Store the current LSN of the inode so that we can tell whether the
> -- 
> 2.26.2.761.g0e0b3e54be
> 


* Re: [PATCH 25/30] xfs: attach inodes to the cluster buffer when dirtied
  2020-06-01 21:42 ` [PATCH 25/30] xfs: attach inodes to the cluster buffer when dirtied Dave Chinner
@ 2020-06-02 23:03   ` Darrick J. Wong
  0 siblings, 0 replies; 80+ messages in thread
From: Darrick J. Wong @ 2020-06-02 23:03 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Jun 02, 2020 at 07:42:46AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Rather than attach inodes to the cluster buffer just when we are
> doing IO, attach the inodes to the cluster buffer when they are
> dirtied. This means the buffer always carries a list of dirty inodes
> that reference it, and we can use that list to make more fundamental
> changes to inode writeback that aren't otherwise possible.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Looks straightforward.
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D
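
The mechanical change is small: the list attachment and the
_XBF_INODES flag move out of xfs_iflush_int() and into the
first-dirtying path in xfs_trans_log_inode() (condensed from the hunk
below):

	xfs_buf_hold(bp);			/* reference for the log item */
	spin_lock(&iip->ili_lock);
	iip->ili_item.li_buf = bp;
	bp->b_flags |= _XBF_INODES;
	list_add_tail(&iip->ili_item.li_bio_list, &bp->b_li_list);
	xfs_trans_brelse(tp, bp);		/* drop the transaction ref */

xfs_iflush_done() then puts inodes that are still dirty in memory back
on the buffer list for another pass through the flush machinery.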

> ---
>  fs/xfs/libxfs/xfs_trans_inode.c |  9 ++++++---
>  fs/xfs/xfs_buf_item.c           |  1 +
>  fs/xfs/xfs_icache.c             |  1 +
>  fs/xfs/xfs_inode.c              | 24 +++++-------------------
>  fs/xfs/xfs_inode_item.c         | 14 ++++++++------
>  5 files changed, 21 insertions(+), 28 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
> index 1e7147b90725e..5e7634c13ce78 100644
> --- a/fs/xfs/libxfs/xfs_trans_inode.c
> +++ b/fs/xfs/libxfs/xfs_trans_inode.c
> @@ -164,13 +164,16 @@ xfs_trans_log_inode(
>  		/*
>  		 * We need an explicit buffer reference for the log item but
>  		 * don't want the buffer to remain attached to the transaction.
> -		 * Hold the buffer but release the transaction reference.
> +		 * Hold the buffer but release the transaction reference once
> +		 * we've attached the inode log item to the buffer log item
> +		 * list.
>  		 */
>  		xfs_buf_hold(bp);
> -		xfs_trans_brelse(tp, bp);
> -
>  		spin_lock(&iip->ili_lock);
>  		iip->ili_item.li_buf = bp;
> +		bp->b_flags |= _XBF_INODES;
> +		list_add_tail(&iip->ili_item.li_bio_list, &bp->b_li_list);
> +		xfs_trans_brelse(tp, bp);
>  	}
>  
>  	/*
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index 9739d64a46443..6e7a2d460a675 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
> @@ -465,6 +465,7 @@ xfs_buf_item_unpin(
>  		if (bip->bli_flags & XFS_BLI_STALE_INODE) {
>  			xfs_buf_item_done(bp);
>  			xfs_iflush_done(bp);
> +			ASSERT(list_empty(&bp->b_li_list));
>  		} else {
>  			xfs_trans_ail_delete(lip, SHUTDOWN_LOG_IO_ERROR);
>  			xfs_buf_item_relse(bp);
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 4fe6f250e8448..ed386bc930977 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -115,6 +115,7 @@ __xfs_inode_free(
>  {
>  	/* asserts to verify all state is correct here */
>  	ASSERT(atomic_read(&ip->i_pincount) == 0);
> +	ASSERT(!ip->i_itemp || list_empty(&ip->i_itemp->ili_item.li_bio_list));
>  	XFS_STATS_DEC(ip->i_mount, vn_active);
>  
>  	call_rcu(&VFS_I(ip)->i_rcu, xfs_inode_free_callback);
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index fb4c614c64fda..af65acd24ec4e 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -2584,27 +2584,24 @@ xfs_ifree_mark_inode_stale(
>  		ASSERT(iip->ili_last_fields);
>  		goto out_iunlock;
>  	}
> -	ASSERT(!iip || list_empty(&iip->ili_item.li_bio_list));
>  
>  	/*
> -	 * Clean inodes can be released immediately.  Everything else has to go
> -	 * through xfs_iflush_abort() on journal commit as the flock
> -	 * synchronises removal of the inode from the cluster buffer against
> -	 * inode reclaim.
> +	 * Inodes not attached to the buffer can be released immediately.
> +	 * Everything else has to go through xfs_iflush_abort() on journal
> +	 * commit as the flock synchronises removal of the inode from the
> +	 * cluster buffer against inode reclaim.
>  	 */
> -	if (xfs_inode_clean(ip)) {
> +	if (!iip || list_empty(&iip->ili_item.li_bio_list)) {
>  		xfs_ifunlock(ip);
>  		goto out_iunlock;
>  	}
>  
>  	/* we have a dirty inode in memory that has not yet been flushed. */
> -	ASSERT(iip->ili_fields);
>  	spin_lock(&iip->ili_lock);
>  	iip->ili_last_fields = iip->ili_fields;
>  	iip->ili_fields = 0;
>  	iip->ili_fsync_fields = 0;
>  	spin_unlock(&iip->ili_lock);
> -	list_add_tail(&iip->ili_item.li_bio_list, &bp->b_li_list);
>  	ASSERT(iip->ili_last_fields);
>  
>  out_iunlock:
> @@ -3819,19 +3816,8 @@ xfs_iflush_int(
>  	xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,
>  				&iip->ili_item.li_lsn);
>  
> -	/*
> -	 * Attach the inode item callback to the buffer whether the flush
> -	 * succeeded or not. If not, the caller will shut down and fail I/O
> -	 * completion on the buffer to remove the inode from the AIL and release
> -	 * the flush lock.
> -	 */
> -	bp->b_flags |= _XBF_INODES;
> -	list_add_tail(&iip->ili_item.li_bio_list, &bp->b_li_list);
> -
>  	/* generate the checksum. */
>  	xfs_dinode_calc_crc(mp, dip);
> -
> -	ASSERT(!list_empty(&bp->b_li_list));
>  	return error;
>  }
>  
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index 0a7720b7a821a..66675b75de3ec 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -665,10 +665,7 @@ xfs_inode_item_destroy(
>   *
>   * Note: Now that we attach the log item to the buffer when we first log the
>   * inode in memory, we can have unflushed inodes on the buffer list here. These
> - * inodes will have a zero ili_last_fields, so skip over them here. We do
> - * this check -after- we've checked for stale inodes, because we're guaranteed
> - * to have XFS_ISTALE set in the case that dirty inodes are in the CIL and have
> - * not yet had their dirtying transactions committed to disk.
> + * inodes will have a zero ili_last_fields, so skip over them here.
>   */
>  void
>  xfs_iflush_done(
> @@ -686,8 +683,8 @@ xfs_iflush_done(
>  	 */
>  	list_for_each_entry_safe(lip, n, &bp->b_li_list, li_bio_list) {
>  		iip = INODE_ITEM(lip);
> +
>  		if (xfs_iflags_test(iip->ili_inode, XFS_ISTALE)) {
> -			list_del_init(&lip->li_bio_list);
>  			xfs_iflush_abort(iip->ili_inode);
>  			continue;
>  		}
> @@ -740,12 +737,16 @@ xfs_iflush_done(
>  		/*
>  		 * Remove the reference to the cluster buffer if the inode is
>  		 * clean in memory. Drop the buffer reference once we've dropped
> -		 * the locks we hold.
> +		 * the locks we hold. If the inode is dirty in memory, we need
> +		 * to put the inode item back on the buffer list for another
> +		 * pass through the flush machinery.
>  		 */
>  		ASSERT(iip->ili_item.li_buf == bp);
>  		if (!iip->ili_fields) {
>  			iip->ili_item.li_buf = NULL;
>  			drop_buffer = true;
> +		} else {
> +			list_add(&lip->li_bio_list, &bp->b_li_list);
>  		}
>  		iip->ili_last_fields = 0;
>  		iip->ili_flush_lsn = 0;
> @@ -789,6 +790,7 @@ xfs_iflush_abort(
>  		iip->ili_flush_lsn = 0;
>  		bp = iip->ili_item.li_buf;
>  		iip->ili_item.li_buf = NULL;
> +		list_del_init(&iip->ili_item.li_bio_list);
>  		spin_unlock(&iip->ili_lock);
>  	}
>  	xfs_ifunlock(ip);
> -- 
> 2.26.2.761.g0e0b3e54be
> 


* Re: [PATCH 28/30] xfs: rework xfs_iflush_cluster() dirty inode iteration
  2020-06-01 21:42 ` [PATCH 28/30] xfs: rework xfs_iflush_cluster() dirty inode iteration Dave Chinner
@ 2020-06-02 23:23   ` Darrick J. Wong
  0 siblings, 0 replies; 80+ messages in thread
From: Darrick J. Wong @ 2020-06-02 23:23 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Jun 02, 2020 at 07:42:49AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Now that we have all the dirty inodes attached to the cluster
> buffer, we don't actually have to do radix tree lookups to find
> them. Sure, the radix tree is efficient, but walking a linked list
> of just the dirty inodes attached to the buffer is much better.
> 
> We are also no longer dependent on having a locked inode passed into
> the function to determine where to start the lookup. This means we
> can drop it from the function call and treat all inodes the same.
> 
> We also make xfs_iflush_cluster skip inodes marked with
> XFS_IRECLAIM. This way we avoid races with inodes that reclaim is
> actively referencing or are being re-initialised by inode lookup. If
> they are actually dirty, they'll get written by a future cluster
> flush....
> 
> We also add a shutdown check after obtaining the flush lock so that
> we catch inodes that are dirty in memory and may have inconsistent
> state due to the shutdown in progress. We abort these inodes
> directly and so they remove themselves directly from the buffer list
> and the AIL rather than having to wait for the buffer to be failed
> and callbacks run to be processed correctly.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Looks ok,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D
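
A condensed skeleton of the new iteration (the rechecks under the
i_flags_lock and the shutdown abort path are elided here; see the hunk
below for the full version):

	list_for_each_entry_safe(lip, n, &bp->b_li_list, li_bio_list) {
		iip = (struct xfs_inode_log_item *)lip;
		ip = iip->ili_inode;

		/* unlocked fast path checks */
		if (__xfs_iflags_test(ip, XFS_IRECLAIM | XFS_IFLOCK))
			continue;
		if (xfs_ipincount(ip))
			continue;

		/* ILOCK under the i_flags_lock pins the inode vs reclaim */
		if (!xfs_ilock_nowait(ip, XFS_ILOCK_SHARED))
			continue;

		/* flush locked means already written to this buffer */
		if (!xfs_iflock_nowait(ip)) {
			xfs_iunlock(ip, XFS_ILOCK_SHARED);
			continue;
		}

		if (!xfs_inode_clean(ip))
			error = xfs_iflush(ip, bp);
		else
			xfs_ifunlock(ip);
		xfs_iunlock(ip, XFS_ILOCK_SHARED);
		if (error)
			break;
		clcount++;
	}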

> ---
>  fs/xfs/xfs_inode.c      | 148 ++++++++++++++++------------------------
>  fs/xfs/xfs_inode.h      |   2 +-
>  fs/xfs/xfs_inode_item.c |   2 +-
>  3 files changed, 62 insertions(+), 90 deletions(-)
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 8566bd0f4334d..931a483d5b316 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -3611,117 +3611,94 @@ xfs_iflush(
>   */
>  int
>  xfs_iflush_cluster(
> -	struct xfs_inode	*ip,
>  	struct xfs_buf		*bp)
>  {
> -	struct xfs_mount	*mp = ip->i_mount;
> -	struct xfs_perag	*pag;
> -	unsigned long		first_index, mask;
> -	int			cilist_size;
> -	struct xfs_inode	**cilist;
> -	struct xfs_inode	*cip;
> -	struct xfs_ino_geometry	*igeo = M_IGEO(mp);
> -	int			error = 0;
> -	int			nr_found;
> +	struct xfs_mount	*mp = bp->b_mount;
> +	struct xfs_log_item	*lip, *n;
> +	struct xfs_inode	*ip;
> +	struct xfs_inode_log_item *iip;
>  	int			clcount = 0;
> -	int			i;
> -
> -	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
> -
> -	cilist_size = igeo->inodes_per_cluster * sizeof(struct xfs_inode *);
> -	cilist = kmem_alloc(cilist_size, KM_MAYFAIL|KM_NOFS);
> -	if (!cilist)
> -		goto out_put;
> -
> -	mask = ~(igeo->inodes_per_cluster - 1);
> -	first_index = XFS_INO_TO_AGINO(mp, ip->i_ino) & mask;
> -	rcu_read_lock();
> -	/* really need a gang lookup range call here */
> -	nr_found = radix_tree_gang_lookup(&pag->pag_ici_root, (void**)cilist,
> -					first_index, igeo->inodes_per_cluster);
> -	if (nr_found == 0)
> -		goto out_free;
> +	int			error = 0;
>  
> -	for (i = 0; i < nr_found; i++) {
> -		cip = cilist[i];
> +	/*
> +	 * We must use the safe variant here as on shutdown xfs_iflush_abort()
> +	 * can remove itself from the list.
> +	 */
> +	list_for_each_entry_safe(lip, n, &bp->b_li_list, li_bio_list) {
> +		iip = (struct xfs_inode_log_item *)lip;
> +		ip = iip->ili_inode;
>  
>  		/*
> -		 * because this is an RCU protected lookup, we could find a
> -		 * recently freed or even reallocated inode during the lookup.
> -		 * We need to check under the i_flags_lock for a valid inode
> -		 * here. Skip it if it is not valid or the wrong inode.
> +		 * Quick and dirty check to avoid locks if possible.
>  		 */
> -		spin_lock(&cip->i_flags_lock);
> -		if (!cip->i_ino ||
> -		    __xfs_iflags_test(cip, XFS_ISTALE)) {
> -			spin_unlock(&cip->i_flags_lock);
> +		if (__xfs_iflags_test(ip, XFS_IRECLAIM | XFS_IFLOCK))
> +			continue;
> +		if (xfs_ipincount(ip))
>  			continue;
> -		}
>  
>  		/*
> -		 * Once we fall off the end of the cluster, no point checking
> -		 * any more inodes in the list because they will also all be
> -		 * outside the cluster.
> +		 * The inode is still attached to the buffer, which means it is
> +		 * dirty but reclaim might try to grab it. Check carefully for
> +		 * that, and grab the ilock while still holding the i_flags_lock
> +		 * to guarantee reclaim will not be able to reclaim this inode
> +		 * once we drop the i_flags_lock.
>  		 */
> -		if ((XFS_INO_TO_AGINO(mp, cip->i_ino) & mask) != first_index) {
> -			spin_unlock(&cip->i_flags_lock);
> -			break;
> +		spin_lock(&ip->i_flags_lock);
> +		ASSERT(!__xfs_iflags_test(ip, XFS_ISTALE));
> +		if (__xfs_iflags_test(ip, XFS_IRECLAIM | XFS_IFLOCK)) {
> +			spin_unlock(&ip->i_flags_lock);
> +			continue;
>  		}
> -		spin_unlock(&cip->i_flags_lock);
>  
>  		/*
> -		 * Do an un-protected check to see if the inode is dirty and
> -		 * is a candidate for flushing.  These checks will be repeated
> -		 * later after the appropriate locks are acquired.
> +		 * ILOCK will pin the inode against reclaim and prevent
> +		 * concurrent transactions modifying the inode while we are
> +		 * flushing the inode.
>  		 */
> -		if (xfs_inode_clean(cip) && xfs_ipincount(cip) == 0)
> +		if (!xfs_ilock_nowait(ip, XFS_ILOCK_SHARED)) {
> +			spin_unlock(&ip->i_flags_lock);
>  			continue;
> +		}
> +		spin_unlock(&ip->i_flags_lock);
>  
>  		/*
> -		 * Try to get locks.  If any are unavailable or it is pinned,
> -		 * then this inode cannot be flushed and is skipped.
> +		 * Skip inodes that are already flush locked as they have
> +		 * already been written to the buffer.
>  		 */
> -
> -		if (!xfs_ilock_nowait(cip, XFS_ILOCK_SHARED))
> -			continue;
> -		if (!xfs_iflock_nowait(cip)) {
> -			xfs_iunlock(cip, XFS_ILOCK_SHARED);
> -			continue;
> -		}
> -		if (xfs_ipincount(cip)) {
> -			xfs_ifunlock(cip);
> -			xfs_iunlock(cip, XFS_ILOCK_SHARED);
> +		if (!xfs_iflock_nowait(ip)) {
> +			xfs_iunlock(ip, XFS_ILOCK_SHARED);
>  			continue;
>  		}
>  
> -
>  		/*
> -		 * Check the inode number again, just to be certain we are not
> -		 * racing with freeing in xfs_reclaim_inode(). See the comments
> -		 * in that function for more information as to why the initial
> -		 * check is not sufficient.
> +		 * If we are shut down, unpin and abort the inode now as there
> +		 * is no point in flushing it to the buffer just to get an IO
> +		 * completion to abort the buffer and remove it from the AIL.
>  		 */
> -		if (!cip->i_ino) {
> -			xfs_ifunlock(cip);
> -			xfs_iunlock(cip, XFS_ILOCK_SHARED);
> +		if (XFS_FORCED_SHUTDOWN(mp)) {
> +			xfs_iunpin_wait(ip);
> +			/* xfs_iflush_abort() drops the flush lock */
> +			xfs_iflush_abort(ip);
> +			xfs_iunlock(ip, XFS_ILOCK_SHARED);
> +			error = -EIO;
>  			continue;
>  		}
>  
> -		/*
> -		 * arriving here means that this inode can be flushed.  First
> -		 * re-check that it's dirty before flushing.
> -		 */
> -		if (!xfs_inode_clean(cip)) {
> -			error = xfs_iflush(cip, bp);
> -			if (error) {
> -				xfs_iunlock(cip, XFS_ILOCK_SHARED);
> -				goto out_free;
> -			}
> -			clcount++;
> -		} else {
> -			xfs_ifunlock(cip);
> +		/* don't block waiting on a log force to unpin dirty inodes */
> +		if (xfs_ipincount(ip)) {
> +			xfs_ifunlock(ip);
> +			xfs_iunlock(ip, XFS_ILOCK_SHARED);
> +			continue;
>  		}
> -		xfs_iunlock(cip, XFS_ILOCK_SHARED);
> +
> +		if (!xfs_inode_clean(ip))
> +			error = xfs_iflush(ip, bp);
> +		else
> +			xfs_ifunlock(ip);
> +		xfs_iunlock(ip, XFS_ILOCK_SHARED);
> +		if (error)
> +			break;
> +		clcount++;
>  	}
>  
>  	if (clcount) {
> @@ -3729,11 +3706,6 @@ xfs_iflush_cluster(
>  		XFS_STATS_ADD(mp, xs_icluster_flushinode, clcount);
>  	}
>  
> -out_free:
> -	rcu_read_unlock();
> -	kmem_free(cilist);
> -out_put:
> -	xfs_perag_put(pag);
>  	if (error) {
>  		bp->b_flags |= XBF_ASYNC;
>  		xfs_buf_ioend_fail(bp);
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index d1109eb13ba2e..b93cf9076df8a 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -427,7 +427,7 @@ int		xfs_log_force_inode(struct xfs_inode *ip);
>  void		xfs_iunpin_wait(xfs_inode_t *);
>  #define xfs_ipincount(ip)	((unsigned int) atomic_read(&ip->i_pincount))
>  
> -int		xfs_iflush_cluster(struct xfs_inode *, struct xfs_buf *);
> +int		xfs_iflush_cluster(struct xfs_buf *);
>  void		xfs_lock_two_inodes(struct xfs_inode *ip0, uint ip0_mode,
>  				struct xfs_inode *ip1, uint ip1_mode);
>  
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index e679fac944725..a3a8ae5e39e12 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -513,7 +513,7 @@ xfs_inode_item_push(
>  	 * reference for IO until we queue the buffer for delwri submission.
>  	 */
>  	xfs_buf_hold(bp);
> -	error = xfs_iflush_cluster(ip, bp);
> +	error = xfs_iflush_cluster(bp);
>  	if (!error) {
>  		if (!xfs_buf_delwri_queue(bp, buffer_list))
>  			rval = XFS_ITEM_FLUSHING;
> -- 
> 2.26.2.761.g0e0b3e54be
> 


* Re: [PATCH 04/30] xfs: mark inode buffers in cache
  2020-06-02 21:29     ` Dave Chinner
@ 2020-06-03 14:57       ` Brian Foster
  2020-06-03 21:21         ` Dave Chinner
  0 siblings, 1 reply; 80+ messages in thread
From: Brian Foster @ 2020-06-03 14:57 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Jun 03, 2020 at 07:29:18AM +1000, Dave Chinner wrote:
> On Tue, Jun 02, 2020 at 12:45:35PM -0400, Brian Foster wrote:
> > On Tue, Jun 02, 2020 at 07:42:25AM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > Inode buffers always have write IO callbacks, so by marking them
> > > directly we can avoid needing to attach ->b_iodone functions to
> > > them. This avoids an indirect call, and makes future modifications
> > > much simpler.
> > > 
> > > This is largely a rearrangement of the code at this point - no IO
> > > completion functionality changes, just how the code is run is
> > > modified.
> > > 
> > 
> > Ok, I was initially thinking this patch looked incomplete in that we
> > continue to set ->b_iodone() on inode buffers even though we'd never
> > call it. Looking ahead, I see that the next few patches continue to
> > clean that up to eventually remove ->b_iodone(), so that addresses that.
> > 
> > My only other curiosity is that while there may not be any functional
> > difference, this technically changes callback behavior in that we set
> > the new flag in some contexts that don't currently attach anything to
> > the buffer, right? E.g., xfs_trans_inode_alloc_buf() sets the flag on
> > inode chunk init, which means we can write out an inode buffer without
> > any attached/flushed inodes.
> 
> Yes, it can happen, and it happens before this patch, too, because
> the AIL can push the buffer log item directly and that does not
> flush dirty inodes to the buffer before it writes back(*).
> 

I was thinking more about cases where there are actually no inodes
attached.

> As it is, xfs_buf_inode_iodone() on a buffer with no inode attached
> is functionally identical to the existing xfs_buf_iodone() callback
> that would otherwise be done. i.e. it just runs the buffer log item
> completion callback. Hence the change here rearranges code, but it
> does not change behaviour at all.
> 

Right. That's evident from the code, but it doesn't help me understand
why the change was made. That's all I'm asking for...

> (*) this is a double-write bug that this patch set does not address.
> i.e. the buffer log item flushes the buffer without flushing inodes,
> IO completes, then inodes are flushed to the buffer and we do another
> IO to clean them.  This is addressed by a follow-on patchset that tracks
> dirty inodes via ordered cluster buffers, such that pushing the
> buffer always triggers xfs_iflush_cluster() on buffers tagged
> _XBF_INODES...
> 

Ok, interesting (but seems beyond the scope of this series).

> > Is the intent of that to support future
> > changes? If so, a note about that in the commit log would be helpful.
> 
> That's part of it, as you can see from the (*) above. But the commit
> log already says "..., and makes future modifications much simpler."
> Was that insufficient to indicate that it will be used later on?
> 

That's a rather vague hint. ;P I was more hoping for something like:
"While this is largely a refactor of existing functionality, broaden the
scope of the flag to beyond where inodes are explicitly attached because
<some actual reason>. This has the effect of possibly invoking the
callback in cases where it wouldn't have been previously, but this is
not a functional change because the callback is effectively a no-op when
inodes are not attached."

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 



* Re: [PATCH 07/30] xfs: call xfs_buf_iodone directly
  2020-06-02 21:38     ` Dave Chinner
@ 2020-06-03 14:58       ` Brian Foster
  0 siblings, 0 replies; 80+ messages in thread
From: Brian Foster @ 2020-06-03 14:58 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Jun 03, 2020 at 07:38:09AM +1000, Dave Chinner wrote:
> On Tue, Jun 02, 2020 at 12:47:42PM -0400, Brian Foster wrote:
> > On Tue, Jun 02, 2020 at 07:42:28AM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > All unmarked dirty buffers should be in the AIL and have log items
> > > attached to them. Hence when they are written, we will run a
> > > callback to remove the item from the AIL if appropriate. Now that
> > > we've handled inode and dquot buffers, all remaining calls are to
> > > xfs_buf_iodone() and so we can hard code this rather than use an
> > > indirect call.
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > Reviewed-by: Amir Goldstein <amir73il@gmail.com>
> > > ---
> > >  fs/xfs/xfs_buf.c       | 24 ++++++++----------------
> > >  fs/xfs/xfs_buf.h       |  6 +-----
> > >  fs/xfs/xfs_buf_item.c  | 40 ++++++++++------------------------------
> > >  fs/xfs/xfs_buf_item.h  |  4 ++--
> > >  fs/xfs/xfs_trans_buf.c | 13 +++----------
> > >  5 files changed, 24 insertions(+), 63 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > > index 0a69de674af9d..d7695b638e994 100644
> > > --- a/fs/xfs/xfs_buf.c
> > > +++ b/fs/xfs/xfs_buf.c
> > ...
> > > @@ -1226,14 +1225,7 @@ xfs_buf_ioend(
> > >  		xfs_buf_dquot_iodone(bp);
> > >  		return;
> > >  	}
> > > -
> > > -	if (bp->b_iodone) {
> > > -		(*(bp->b_iodone))(bp);
> > > -		return;
> > > -	}
> > > -
> > > -out_finish:
> > > -	xfs_buf_ioend_finish(bp);
> > > +	xfs_buf_iodone(bp);
> > 
> > The way this function ends up would probably look nicer as an if/else
> > chain rather than a sequence of internal return statements.
> 
> I've kinda avoided refactoring these early patches because they
> cascade into non-trivial conflicts with later patches in the series.
> I've spent too much time chasing bugs introduced in the later
> patches because of conflict resolution not being quite right. Hence
> I want to leave cleanup and refactoring to a series after this whole
> line of development is complete and the problems are solved.
> 
> > BTW, is there a longer term need to have three separate iodone functions
> > here that do the same thing?
> 
> The inode iodone function changes almost immediately. I did it this
> way so that the process of changing the inode buffer completion
> functionality did not, in any way, impact on other types of buffers.
> We need to go through the same process with dquot buffers, and then
> once that is done, we can look to refactor all this into a more
> integrated solution that largely sits in xfs_buf.c.
> 

Seems reasonable enough to me:

Reviewed-by: Brian Foster <bfoster@redhat.com>

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 



* Re: [PATCH 09/30] xfs: make inode IO completion buffer centric
  2020-06-01 21:42 ` [PATCH 09/30] xfs: make inode IO completion buffer centric Dave Chinner
@ 2020-06-03 14:58   ` Brian Foster
  0 siblings, 0 replies; 80+ messages in thread
From: Brian Foster @ 2020-06-03 14:58 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Jun 02, 2020 at 07:42:30AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Having different io completion callbacks for different inode states
> makes things complex. We can detect if the inode is stale via the
> XFS_ISTALE flag in IO completion, so we don't need a special
> callback just for this.
> 
> This means inodes only have a single iodone callback, and inode IO
> completion is entirely buffer centric at this point. Hence we no
> longer need to use a log item callback at all as we can just call
> xfs_iflush_done() directly from the buffer completions and walk the
> buffer log item list to complete all the inodes under IO.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---

Probably not worth changing now, but I think this would have been
cleaner if the elimination of xfs_istale_done() was factored into a
separate patch. Otherwise LGTM:

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_buf_item.c   | 35 ++++++++++++++++++----
>  fs/xfs/xfs_inode.c      |  6 ++--
>  fs/xfs/xfs_inode_item.c | 65 ++++++++++++++---------------------------
>  fs/xfs/xfs_inode_item.h |  5 ++--
>  4 files changed, 56 insertions(+), 55 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index 5b3cd5e90947c..a4e416af5c614 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
> @@ -13,6 +13,8 @@
>  #include "xfs_mount.h"
>  #include "xfs_trans.h"
>  #include "xfs_buf_item.h"
> +#include "xfs_inode.h"
> +#include "xfs_inode_item.h"
>  #include "xfs_trans_priv.h"
>  #include "xfs_trace.h"
>  #include "xfs_log.h"
> @@ -457,7 +459,8 @@ xfs_buf_item_unpin(
>  		 * the AIL lock.
>  		 */
>  		if (bip->bli_flags & XFS_BLI_STALE_INODE) {
> -			xfs_buf_do_callbacks(bp);
> +			lip->li_cb(bp, lip);
> +			xfs_iflush_done(bp);
>  			bp->b_log_item = NULL;
>  		} else {
>  			xfs_trans_ail_delete(lip, SHUTDOWN_LOG_IO_ERROR);
> @@ -1141,8 +1144,8 @@ xfs_buf_iodone_callback_error(
>  	return false;
>  }
>  
> -static void
> -xfs_buf_run_callbacks(
> +static inline bool
> +xfs_buf_had_callback_errors(
>  	struct xfs_buf		*bp)
>  {
>  
> @@ -1152,7 +1155,7 @@ xfs_buf_run_callbacks(
>  	 * appropriate action.
>  	 */
>  	if (bp->b_error && xfs_buf_iodone_callback_error(bp))
> -		return;
> +		return true;
>  
>  	/*
>  	 * Successful IO or permanent error. Either way, we can clear the
> @@ -1161,7 +1164,16 @@ xfs_buf_run_callbacks(
>  	bp->b_last_error = 0;
>  	bp->b_retries = 0;
>  	bp->b_first_retry_time = 0;
> +	return false;
> +}
>  
> +static void
> +xfs_buf_run_callbacks(
> +	struct xfs_buf		*bp)
> +{
> +
> +	if (xfs_buf_had_callback_errors(bp))
> +		return;
>  	xfs_buf_do_callbacks(bp);
>  	bp->b_log_item = NULL;
>  }
> @@ -1173,7 +1185,20 @@ void
>  xfs_buf_inode_iodone(
>  	struct xfs_buf		*bp)
>  {
> -	xfs_buf_run_callbacks(bp);
> +	struct xfs_buf_log_item *blip = bp->b_log_item;
> +	struct xfs_log_item	*lip;
> +
> +	if (xfs_buf_had_callback_errors(bp))
> +		return;
> +
> +	/* If there is a buf_log_item attached, run its callback */
> +	if (blip) {
> +		lip = &blip->bli_item;
> +		lip->li_cb(bp, lip);
> +		bp->b_log_item = NULL;
> +	}
> +
> +	xfs_iflush_done(bp);
>  	xfs_buf_ioend_finish(bp);
>  }
>  
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index d5dee57f914a9..1b4e8e0bb0cf0 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -2677,7 +2677,6 @@ xfs_ifree_cluster(
>  		list_for_each_entry(lip, &bp->b_li_list, li_bio_list) {
>  			if (lip->li_type == XFS_LI_INODE) {
>  				iip = (struct xfs_inode_log_item *)lip;
> -				lip->li_cb = xfs_istale_done;
>  				xfs_trans_ail_copy_lsn(mp->m_ail,
>  							&iip->ili_flush_lsn,
>  							&iip->ili_item.li_lsn);
> @@ -2710,8 +2709,7 @@ xfs_ifree_cluster(
>  			xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,
>  						&iip->ili_item.li_lsn);
>  
> -			xfs_buf_attach_iodone(bp, xfs_istale_done,
> -						  &iip->ili_item);
> +			xfs_buf_attach_iodone(bp, NULL, &iip->ili_item);
>  
>  			if (ip != free_ip)
>  				xfs_iunlock(ip, XFS_ILOCK_EXCL);
> @@ -3861,7 +3859,7 @@ xfs_iflush_int(
>  	 * the flush lock.
>  	 */
>  	bp->b_flags |= _XBF_INODES;
> -	xfs_buf_attach_iodone(bp, xfs_iflush_done, &iip->ili_item);
> +	xfs_buf_attach_iodone(bp, NULL, &iip->ili_item);
>  
>  	/* generate the checksum. */
>  	xfs_dinode_calc_crc(mp, dip);
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index 6ef9cbcfc94a7..7049f2ae8d186 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -668,40 +668,34 @@ xfs_inode_item_destroy(
>   */
>  void
>  xfs_iflush_done(
> -	struct xfs_buf		*bp,
> -	struct xfs_log_item	*lip)
> +	struct xfs_buf		*bp)
>  {
>  	struct xfs_inode_log_item *iip;
> -	struct xfs_log_item	*blip, *n;
> -	struct xfs_ail		*ailp = lip->li_ailp;
> +	struct xfs_log_item	*lip, *n;
> +	struct xfs_ail		*ailp = bp->b_mount->m_ail;
>  	int			need_ail = 0;
>  	LIST_HEAD(tmp);
>  
>  	/*
> -	 * Scan the buffer IO completions for other inodes being completed and
> -	 * attach them to the current inode log item.
> +	 * Pull the attached inodes from the buffer one at a time and take the
> +	 * appropriate action on them.
>  	 */
> -
> -	list_add_tail(&lip->li_bio_list, &tmp);
> -
> -	list_for_each_entry_safe(blip, n, &bp->b_li_list, li_bio_list) {
> -		if (lip->li_cb != xfs_iflush_done)
> +	list_for_each_entry_safe(lip, n, &bp->b_li_list, li_bio_list) {
> +		iip = INODE_ITEM(lip);
> +		if (xfs_iflags_test(iip->ili_inode, XFS_ISTALE)) {
> +			list_del_init(&lip->li_bio_list);
> +			xfs_iflush_abort(iip->ili_inode);
>  			continue;
> +		}
>  
> -		list_move_tail(&blip->li_bio_list, &tmp);
> +		list_move_tail(&lip->li_bio_list, &tmp);
>  
>  		/* Do an unlocked check for needing the AIL lock. */
> -		iip = INODE_ITEM(blip);
> -		if (blip->li_lsn == iip->ili_flush_lsn ||
> -		    test_bit(XFS_LI_FAILED, &blip->li_flags))
> +		if (lip->li_lsn == iip->ili_flush_lsn ||
> +		    test_bit(XFS_LI_FAILED, &lip->li_flags))
>  			need_ail++;
>  	}
> -
> -	/* make sure we capture the state of the initial inode. */
> -	iip = INODE_ITEM(lip);
> -	if (lip->li_lsn == iip->ili_flush_lsn ||
> -	    test_bit(XFS_LI_FAILED, &lip->li_flags))
> -		need_ail++;
> +	ASSERT(list_empty(&bp->b_li_list));
>  
>  	/*
>  	 * We only want to pull the item from the AIL if it is actually there
> @@ -713,19 +707,13 @@ xfs_iflush_done(
>  
>  		/* this is an opencoded batch version of xfs_trans_ail_delete */
>  		spin_lock(&ailp->ail_lock);
> -		list_for_each_entry(blip, &tmp, li_bio_list) {
> -			if (blip->li_lsn == INODE_ITEM(blip)->ili_flush_lsn) {
> -				/*
> -				 * xfs_ail_update_finish() only cares about the
> -				 * lsn of the first tail item removed, any
> -				 * others will be at the same or higher lsn so
> -				 * we just ignore them.
> -				 */
> -				xfs_lsn_t lsn = xfs_ail_delete_one(ailp, blip);
> +		list_for_each_entry(lip, &tmp, li_bio_list) {
> +			if (lip->li_lsn == INODE_ITEM(lip)->ili_flush_lsn) {
> +				xfs_lsn_t lsn = xfs_ail_delete_one(ailp, lip);
>  				if (!tail_lsn && lsn)
>  					tail_lsn = lsn;
>  			} else {
> -				xfs_clear_li_failed(blip);
> +				xfs_clear_li_failed(lip);
>  			}
>  		}
>  		xfs_ail_update_finish(ailp, tail_lsn);
> @@ -736,9 +724,9 @@ xfs_iflush_done(
>  	 * ili_last_fields bits now that we know that the data corresponding to
>  	 * them is safely on disk.
>  	 */
> -	list_for_each_entry_safe(blip, n, &tmp, li_bio_list) {
> -		list_del_init(&blip->li_bio_list);
> -		iip = INODE_ITEM(blip);
> +	list_for_each_entry_safe(lip, n, &tmp, li_bio_list) {
> +		list_del_init(&lip->li_bio_list);
> +		iip = INODE_ITEM(lip);
>  
>  		spin_lock(&iip->ili_lock);
>  		iip->ili_last_fields = 0;
> @@ -746,7 +734,6 @@ xfs_iflush_done(
>  
>  		xfs_ifunlock(iip->ili_inode);
>  	}
> -	list_del(&tmp);
>  }
>  
>  /*
> @@ -779,14 +766,6 @@ xfs_iflush_abort(
>  	xfs_ifunlock(ip);
>  }
>  
> -void
> -xfs_istale_done(
> -	struct xfs_buf		*bp,
> -	struct xfs_log_item	*lip)
> -{
> -	xfs_iflush_abort(INODE_ITEM(lip)->ili_inode);
> -}
> -
>  /*
>   * convert an xfs_inode_log_format struct from the old 32 bit version
>   * (which can have different field alignments) to the native 64 bit version
> diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h
> index 44c47c08b0b59..1545fccad4eeb 100644
> --- a/fs/xfs/xfs_inode_item.h
> +++ b/fs/xfs/xfs_inode_item.h
> @@ -36,15 +36,14 @@ struct xfs_inode_log_item {
>  	xfs_lsn_t		ili_last_lsn;	   /* lsn at last transaction */
>  };
>  
> -static inline int xfs_inode_clean(xfs_inode_t *ip)
> +static inline int xfs_inode_clean(struct xfs_inode *ip)
>  {
>  	return !ip->i_itemp || !(ip->i_itemp->ili_fields & XFS_ILOG_ALL);
>  }
>  
>  extern void xfs_inode_item_init(struct xfs_inode *, struct xfs_mount *);
>  extern void xfs_inode_item_destroy(struct xfs_inode *);
> -extern void xfs_iflush_done(struct xfs_buf *, struct xfs_log_item *);
> -extern void xfs_istale_done(struct xfs_buf *, struct xfs_log_item *);
> +extern void xfs_iflush_done(struct xfs_buf *);
>  extern void xfs_iflush_abort(struct xfs_inode *);
>  extern int xfs_inode_item_format_convert(xfs_log_iovec_t *,
>  					 struct xfs_inode_log_format *);
> -- 
> 2.26.2.761.g0e0b3e54be
> 



* Re: [PATCH 10/30] xfs: use direct calls for dquot IO completion
  2020-06-01 21:42 ` [PATCH 10/30] xfs: use direct calls for dquot IO completion Dave Chinner
  2020-06-02 19:25   ` Darrick J. Wong
@ 2020-06-03 14:58   ` Brian Foster
  1 sibling, 0 replies; 80+ messages in thread
From: Brian Foster @ 2020-06-03 14:58 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Jun 02, 2020 at 07:42:31AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Similar to inodes, we can call the dquot IO completion functions
> directly from the buffer completion code, removing another user of
> log item callbacks for IO completion processing.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_buf_item.c | 18 +++++++++++++++++-
>  fs/xfs/xfs_dquot.c    | 18 ++++++++++++++----
>  fs/xfs/xfs_dquot.h    |  1 +
>  3 files changed, 32 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index a4e416af5c614..f46e5ec28111c 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
> @@ -15,6 +15,9 @@
>  #include "xfs_buf_item.h"
>  #include "xfs_inode.h"
>  #include "xfs_inode_item.h"
> +#include "xfs_quota.h"
> +#include "xfs_dquot_item.h"
> +#include "xfs_dquot.h"
>  #include "xfs_trans_priv.h"
>  #include "xfs_trace.h"
>  #include "xfs_log.h"
> @@ -1209,7 +1212,20 @@ void
>  xfs_buf_dquot_iodone(
>  	struct xfs_buf		*bp)
>  {
> -	xfs_buf_run_callbacks(bp);
> +	struct xfs_buf_log_item *blip = bp->b_log_item;
> +	struct xfs_log_item	*lip;
> +
> +	if (xfs_buf_had_callback_errors(bp))
> +		return;
> +
> +	/* a newly allocated dquot buffer might have a log item attached */
> +	if (blip) {
> +		lip = &blip->bli_item;
> +		lip->li_cb(bp, lip);
> +		bp->b_log_item = NULL;
> +	}
> +
> +	xfs_dquot_done(bp);
>  	xfs_buf_ioend_finish(bp);
>  }
>  
> diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
> index 2e2146fa0914c..403bc4e9f21ff 100644
> --- a/fs/xfs/xfs_dquot.c
> +++ b/fs/xfs/xfs_dquot.c
> @@ -1048,9 +1048,8 @@ xfs_qm_dqrele(
>   * from the AIL if it has not been re-logged, and unlocking the dquot's
>   * flush lock. This behavior is very similar to that of inodes..
>   */
> -STATIC void
> +static void
>  xfs_qm_dqflush_done(
> -	struct xfs_buf		*bp,
>  	struct xfs_log_item	*lip)
>  {
>  	struct xfs_dq_logitem	*qip = (struct xfs_dq_logitem *)lip;
> @@ -1091,6 +1090,18 @@ xfs_qm_dqflush_done(
>  	xfs_dqfunlock(dqp);
>  }
>  
> +void
> +xfs_dquot_done(
> +	struct xfs_buf		*bp)
> +{
> +	struct xfs_log_item	*lip, *n;
> +
> +	list_for_each_entry_safe(lip, n, &bp->b_li_list, li_bio_list) {
> +		list_del_init(&lip->li_bio_list);
> +		xfs_qm_dqflush_done(lip);
> +	}
> +}
> +
>  /*
>   * Write a modified dquot to disk.
>   * The dquot must be locked and the flush lock too taken by caller.
> @@ -1180,8 +1191,7 @@ xfs_qm_dqflush(
>  	 * AIL and release the flush lock once the dquot is synced to disk.
>  	 */
>  	bp->b_flags |= _XBF_DQUOTS;
> -	xfs_buf_attach_iodone(bp, xfs_qm_dqflush_done,
> -				  &dqp->q_logitem.qli_item);
> +	xfs_buf_attach_iodone(bp, NULL, &dqp->q_logitem.qli_item);
>  
>  	/*
>  	 * If the buffer is pinned then push on the log so we won't
> diff --git a/fs/xfs/xfs_dquot.h b/fs/xfs/xfs_dquot.h
> index 71e36c85e20b6..fe9cc3e08ed6d 100644
> --- a/fs/xfs/xfs_dquot.h
> +++ b/fs/xfs/xfs_dquot.h
> @@ -174,6 +174,7 @@ void		xfs_qm_dqput(struct xfs_dquot *dqp);
>  void		xfs_dqlock2(struct xfs_dquot *, struct xfs_dquot *);
>  
>  void		xfs_dquot_set_prealloc_limits(struct xfs_dquot *);
> +void		xfs_dquot_done(struct xfs_buf *);
>  
>  static inline struct xfs_dquot *xfs_qm_dqhold(struct xfs_dquot *dqp)
>  {
> -- 
> 2.26.2.761.g0e0b3e54be
> 



* Re: [PATCH 11/30] xfs: clean up the buffer iodone callback functions
  2020-06-01 21:42 ` [PATCH 11/30] xfs: clean up the buffer iodone callback functions Dave Chinner
@ 2020-06-03 14:58   ` Brian Foster
  0 siblings, 0 replies; 80+ messages in thread
From: Brian Foster @ 2020-06-03 14:58 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Jun 02, 2020 at 07:42:32AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Now that we've sorted inode and dquot buffers, we can apply the same
> cleanups to dirty buffers with buffer log items. They only have one
> callback, too, so we don't need the log item callback. Collapse the
> iodone functions and remove all the now unnecessary infrastructure
> around callback processing.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_buf_item.c  | 140 +++++++++--------------------------------
>  fs/xfs/xfs_buf_item.h  |   1 -
>  fs/xfs/xfs_trans_buf.c |   2 -
>  3 files changed, 29 insertions(+), 114 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index f46e5ec28111c..0ece5de9dd711 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
> @@ -30,7 +30,7 @@ static inline struct xfs_buf_log_item *BUF_ITEM(struct xfs_log_item *lip)
>  	return container_of(lip, struct xfs_buf_log_item, bli_item);
>  }
>  
> -STATIC void	xfs_buf_do_callbacks(struct xfs_buf *bp);
> +static void xfs_buf_item_done(struct xfs_buf *bp);
>  
>  /* Is this log iovec plausibly large enough to contain the buffer log format? */
>  bool
> @@ -462,9 +462,8 @@ xfs_buf_item_unpin(
>  		 * the AIL lock.
>  		 */
>  		if (bip->bli_flags & XFS_BLI_STALE_INODE) {
> -			lip->li_cb(bp, lip);
> +			xfs_buf_item_done(bp);
>  			xfs_iflush_done(bp);
> -			bp->b_log_item = NULL;
>  		} else {
>  			xfs_trans_ail_delete(lip, SHUTDOWN_LOG_IO_ERROR);
>  			xfs_buf_item_relse(bp);
> @@ -973,46 +972,6 @@ xfs_buf_attach_iodone(
>  	list_add_tail(&lip->li_bio_list, &bp->b_li_list);
>  }
>  
> -/*
> - * We can have many callbacks on a buffer. Running the callbacks individually
> - * can cause a lot of contention on the AIL lock, so we allow for a single
> - * callback to be able to scan the remaining items in bp->b_li_list for other
> - * items of the same type and callback to be processed in the first call.
> - *
> - * As a result, the loop walking the callback list below will also modify the
> - * list. it removes the first item from the list and then runs the callback.
> - * The loop then restarts from the new first item int the list. This allows the
> - * callback to scan and modify the list attached to the buffer and we don't
> - * have to care about maintaining a next item pointer.
> - */
> -STATIC void
> -xfs_buf_do_callbacks(
> -	struct xfs_buf		*bp)
> -{
> -	struct xfs_buf_log_item *blip = bp->b_log_item;
> -	struct xfs_log_item	*lip;
> -
> -	/* If there is a buf_log_item attached, run its callback */
> -	if (blip) {
> -		lip = &blip->bli_item;
> -		lip->li_cb(bp, lip);
> -	}
> -
> -	while (!list_empty(&bp->b_li_list)) {
> -		lip = list_first_entry(&bp->b_li_list, struct xfs_log_item,
> -				       li_bio_list);
> -
> -		/*
> -		 * Remove the item from the list, so we don't have any
> -		 * confusion if the item is added to another buf.
> -		 * Don't touch the log item after calling its
> -		 * callback, because it could have freed itself.
> -		 */
> -		list_del_init(&lip->li_bio_list);
> -		lip->li_cb(bp, lip);
> -	}
> -}
> -
>  /*
>   * Invoke the error state callback for each log item affected by the failed I/O.
>   *
> @@ -1025,8 +984,8 @@ STATIC void
>  xfs_buf_do_callbacks_fail(
>  	struct xfs_buf		*bp)
>  {
> +	struct xfs_ail		*ailp = bp->b_mount->m_ail;
>  	struct xfs_log_item	*lip;
> -	struct xfs_ail		*ailp;
>  
>  	/*
>  	 * Buffer log item errors are handled directly by xfs_buf_item_push()
> @@ -1036,9 +995,6 @@ xfs_buf_do_callbacks_fail(
>  	if (list_empty(&bp->b_li_list))
>  		return;
>  
> -	lip = list_first_entry(&bp->b_li_list, struct xfs_log_item,
> -			li_bio_list);
> -	ailp = lip->li_ailp;
>  	spin_lock(&ailp->ail_lock);
>  	list_for_each_entry(lip, &bp->b_li_list, li_bio_list) {
>  		if (lip->li_ops->iop_error)
> @@ -1051,22 +1007,11 @@ static bool
>  xfs_buf_iodone_callback_error(
>  	struct xfs_buf		*bp)
>  {
> -	struct xfs_buf_log_item	*bip = bp->b_log_item;
> -	struct xfs_log_item	*lip;
> -	struct xfs_mount	*mp;
> +	struct xfs_mount	*mp = bp->b_mount;
>  	static ulong		lasttime;
>  	static xfs_buftarg_t	*lasttarg;
>  	struct xfs_error_cfg	*cfg;
>  
> -	/*
> -	 * The failed buffer might not have a buf_log_item attached or the
> -	 * log_item list might be empty. Get the mp from the available
> -	 * xfs_log_item
> -	 */
> -	lip = list_first_entry_or_null(&bp->b_li_list, struct xfs_log_item,
> -				       li_bio_list);
> -	mp = lip ? lip->li_mountp : bip->bli_item.li_mountp;
> -
>  	/*
>  	 * If we've already decided to shutdown the filesystem because of
>  	 * I/O errors, there's no point in giving this a retry.
> @@ -1171,14 +1116,27 @@ xfs_buf_had_callback_errors(
>  }
>  
>  static void
> -xfs_buf_run_callbacks(
> +xfs_buf_item_done(
>  	struct xfs_buf		*bp)
>  {
> +	struct xfs_buf_log_item	*bip = bp->b_log_item;
>  
> -	if (xfs_buf_had_callback_errors(bp))
> +	if (!bip)
>  		return;
> -	xfs_buf_do_callbacks(bp);
> +
> +	/*
> +	 * If we are forcibly shutting down, this may well be off the AIL
> +	 * already. That's because we simulate the log-committed callbacks to
> +	 * unpin these buffers. Or we may never have put this item on AIL
> +	 * because of the transaction was aborted forcibly.
> +	 * xfs_trans_ail_delete() takes care of these.
> +	 *
> +	 * Either way, AIL is useless if we're forcing a shutdown.
> +	 */
> +	xfs_trans_ail_delete(&bip->bli_item, SHUTDOWN_CORRUPT_INCORE);
>  	bp->b_log_item = NULL;
> +	xfs_buf_item_free(bip);
> +	xfs_buf_rele(bp);
>  }
>  
>  /*
> @@ -1188,19 +1146,10 @@ void
>  xfs_buf_inode_iodone(
>  	struct xfs_buf		*bp)
>  {
> -	struct xfs_buf_log_item *blip = bp->b_log_item;
> -	struct xfs_log_item	*lip;
> -
>  	if (xfs_buf_had_callback_errors(bp))
>  		return;
>  
> -	/* If there is a buf_log_item attached, run its callback */
> -	if (blip) {
> -		lip = &blip->bli_item;
> -		lip->li_cb(bp, lip);
> -		bp->b_log_item = NULL;
> -	}
> -
> +	xfs_buf_item_done(bp);
>  	xfs_iflush_done(bp);
>  	xfs_buf_ioend_finish(bp);
>  }
> @@ -1212,59 +1161,28 @@ void
>  xfs_buf_dquot_iodone(
>  	struct xfs_buf		*bp)
>  {
> -	struct xfs_buf_log_item *blip = bp->b_log_item;
> -	struct xfs_log_item	*lip;
> -
>  	if (xfs_buf_had_callback_errors(bp))
>  		return;
>  
>  	/* a newly allocated dquot buffer might have a log item attached */
> -	if (blip) {
> -		lip = &blip->bli_item;
> -		lip->li_cb(bp, lip);
> -		bp->b_log_item = NULL;
> -	}
> -
> +	xfs_buf_item_done(bp);
>  	xfs_dquot_done(bp);
>  	xfs_buf_ioend_finish(bp);
>  }
>  
>  /*
>   * Dirty buffer iodone callback function.
> + *
> + * Note that for things like remote attribute buffers, there may not be a buffer
> + * log item here, so processing the buffer log item must remain optional.
>   */
>  void
>  xfs_buf_iodone(
>  	struct xfs_buf		*bp)
>  {
> -	xfs_buf_run_callbacks(bp);
> -	xfs_buf_ioend_finish(bp);
> -}
> -
> -/*
> - * This is the iodone() function for buffers which have been
> - * logged.  It is called when they are eventually flushed out.
> - * It should remove the buf item from the AIL, and free the buf item.
> - * It is called by xfs_buf_iodone_callbacks() above which will take
> - * care of cleaning up the buffer itself.
> - */
> -void
> -xfs_buf_item_iodone(
> -	struct xfs_buf		*bp,
> -	struct xfs_log_item	*lip)
> -{
> -	ASSERT(BUF_ITEM(lip)->bli_buf == bp);
> -
> -	xfs_buf_rele(bp);
> +	if (xfs_buf_had_callback_errors(bp))
> +		return;
>  
> -	/*
> -	 * If we are forcibly shutting down, this may well be off the AIL
> -	 * already. That's because we simulate the log-committed callbacks to
> -	 * unpin these buffers. Or we may never have put this item on AIL
> -	 * because of the transaction was aborted forcibly.
> -	 * xfs_trans_ail_delete() takes care of these.
> -	 *
> -	 * Either way, AIL is useless if we're forcing a shutdown.
> -	 */
> -	xfs_trans_ail_delete(lip, SHUTDOWN_CORRUPT_INCORE);
> -	xfs_buf_item_free(BUF_ITEM(lip));
> +	xfs_buf_item_done(bp);
> +	xfs_buf_ioend_finish(bp);
>  }
> diff --git a/fs/xfs/xfs_buf_item.h b/fs/xfs/xfs_buf_item.h
> index 610cd00193289..7c0bd2a210aff 100644
> --- a/fs/xfs/xfs_buf_item.h
> +++ b/fs/xfs/xfs_buf_item.h
> @@ -57,7 +57,6 @@ bool	xfs_buf_item_dirty_format(struct xfs_buf_log_item *);
>  void	xfs_buf_attach_iodone(struct xfs_buf *,
>  			      void(*)(struct xfs_buf *, struct xfs_log_item *),
>  			      struct xfs_log_item *);
> -void	xfs_buf_item_iodone(struct xfs_buf *, struct xfs_log_item *);
>  void	xfs_buf_inode_iodone(struct xfs_buf *);
>  void	xfs_buf_dquot_iodone(struct xfs_buf *);
>  void	xfs_buf_iodone(struct xfs_buf *);
> diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c
> index 6752676b94fe7..11cd666cd99a6 100644
> --- a/fs/xfs/xfs_trans_buf.c
> +++ b/fs/xfs/xfs_trans_buf.c
> @@ -475,7 +475,6 @@ xfs_trans_dirty_buf(
>  	bp->b_flags |= XBF_DONE;
>  
>  	ASSERT(atomic_read(&bip->bli_refcount) > 0);
> -	bip->bli_item.li_cb = xfs_buf_item_iodone;
>  
>  	/*
>  	 * If we invalidated the buffer within this transaction, then
> @@ -644,7 +643,6 @@ xfs_trans_stale_inode_buf(
>  	ASSERT(atomic_read(&bip->bli_refcount) > 0);
>  
>  	bip->bli_flags |= XFS_BLI_STALE_INODE;
> -	bip->bli_item.li_cb = xfs_buf_item_iodone;
>  	bp->b_flags |= _XBF_INODES;
>  	xfs_trans_buf_set_type(tp, bp, XFS_BLFT_DINO_BUF);
>  }
> -- 
> 2.26.2.761.g0e0b3e54be
> 



* Re: [PATCH 12/30] xfs: get rid of log item callbacks
  2020-06-01 21:42 ` [PATCH 12/30] xfs: get rid of log item callbacks Dave Chinner
@ 2020-06-03 14:58   ` Brian Foster
  0 siblings, 0 replies; 80+ messages in thread
From: Brian Foster @ 2020-06-03 14:58 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Jun 02, 2020 at 07:42:33AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> They are not used anymore, so remove them from the log item and the
> buffer iodone attachment interfaces.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_buf_item.c | 17 -----------------
>  fs/xfs/xfs_buf_item.h |  3 ---
>  fs/xfs/xfs_dquot.c    |  6 +++---
>  fs/xfs/xfs_inode.c    |  5 +++--
>  fs/xfs/xfs_trans.h    |  4 ----
>  5 files changed, 6 insertions(+), 29 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index 0ece5de9dd711..09bfe9c52dbdb 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
> @@ -955,23 +955,6 @@ xfs_buf_item_relse(
>  	xfs_buf_item_free(bip);
>  }
>  
> -
> -/*
> - * Add the given log item with its callback to the list of callbacks
> - * to be called when the buffer's I/O completes.
> - */
> -void
> -xfs_buf_attach_iodone(
> -	struct xfs_buf		*bp,
> -	void			(*cb)(struct xfs_buf *, struct xfs_log_item *),
> -	struct xfs_log_item	*lip)
> -{
> -	ASSERT(xfs_buf_islocked(bp));
> -
> -	lip->li_cb = cb;
> -	list_add_tail(&lip->li_bio_list, &bp->b_li_list);
> -}
> -
>  /*
>   * Invoke the error state callback for each log item affected by the failed I/O.
>   *
> diff --git a/fs/xfs/xfs_buf_item.h b/fs/xfs/xfs_buf_item.h
> index 7c0bd2a210aff..23507cbb4c413 100644
> --- a/fs/xfs/xfs_buf_item.h
> +++ b/fs/xfs/xfs_buf_item.h
> @@ -54,9 +54,6 @@ void	xfs_buf_item_relse(struct xfs_buf *);
>  bool	xfs_buf_item_put(struct xfs_buf_log_item *);
>  void	xfs_buf_item_log(struct xfs_buf_log_item *, uint, uint);
>  bool	xfs_buf_item_dirty_format(struct xfs_buf_log_item *);
> -void	xfs_buf_attach_iodone(struct xfs_buf *,
> -			      void(*)(struct xfs_buf *, struct xfs_log_item *),
> -			      struct xfs_log_item *);
>  void	xfs_buf_inode_iodone(struct xfs_buf *);
>  void	xfs_buf_dquot_iodone(struct xfs_buf *);
>  void	xfs_buf_iodone(struct xfs_buf *);
> diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
> index 403bc4e9f21ff..d5984a926d1d0 100644
> --- a/fs/xfs/xfs_dquot.c
> +++ b/fs/xfs/xfs_dquot.c
> @@ -1187,11 +1187,11 @@ xfs_qm_dqflush(
>  	}
>  
>  	/*
> -	 * Attach an iodone routine so that we can remove this dquot from the
> -	 * AIL and release the flush lock once the dquot is synced to disk.
> +	 * Attach the dquot to the buffer so that we can remove this dquot from
> +	 * the AIL and release the flush lock once the dquot is synced to disk.
>  	 */
>  	bp->b_flags |= _XBF_DQUOTS;
> -	xfs_buf_attach_iodone(bp, NULL, &dqp->q_logitem.qli_item);
> +	list_add_tail(&dqp->q_logitem.qli_item.li_bio_list, &bp->b_li_list);
>  
>  	/*
>  	 * If the buffer is pinned then push on the log so we won't
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 1b4e8e0bb0cf0..272b54cf97000 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -2709,7 +2709,8 @@ xfs_ifree_cluster(
>  			xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,
>  						&iip->ili_item.li_lsn);
>  
> -			xfs_buf_attach_iodone(bp, NULL, &iip->ili_item);
> +			list_add_tail(&iip->ili_item.li_bio_list,
> +						&bp->b_li_list);
>  
>  			if (ip != free_ip)
>  				xfs_iunlock(ip, XFS_ILOCK_EXCL);
> @@ -3859,7 +3860,7 @@ xfs_iflush_int(
>  	 * the flush lock.
>  	 */
>  	bp->b_flags |= _XBF_INODES;
> -	xfs_buf_attach_iodone(bp, NULL, &iip->ili_item);
> +	list_add_tail(&iip->ili_item.li_bio_list, &bp->b_li_list);
>  
>  	/* generate the checksum. */
>  	xfs_dinode_calc_crc(mp, dip);
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index 8308bf6d7e404..99a9ab9cab25b 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -37,10 +37,6 @@ struct xfs_log_item {
>  	unsigned long			li_flags;	/* misc flags */
>  	struct xfs_buf			*li_buf;	/* real buffer pointer */
>  	struct list_head		li_bio_list;	/* buffer item list */
> -	void				(*li_cb)(struct xfs_buf *,
> -						 struct xfs_log_item *);
> -							/* buffer item iodone */
> -							/* callback func */
>  	const struct xfs_item_ops	*li_ops;	/* function list */
>  
>  	/* delayed logging */
> -- 
> 2.26.2.761.g0e0b3e54be
> 



* Re: [PATCH 13/30] xfs: handle buffer log item IO errors directly
  2020-06-01 21:42 ` [PATCH 13/30] xfs: handle buffer log item IO errors directly Dave Chinner
  2020-06-02 20:39   ` Darrick J. Wong
@ 2020-06-03 15:02   ` Brian Foster
  2020-06-03 21:34     ` Dave Chinner
  1 sibling, 1 reply; 80+ messages in thread
From: Brian Foster @ 2020-06-03 15:02 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Jun 02, 2020 at 07:42:34AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Currently when a buffer with attached log items has an IO error
> it calls ->iop_error for each attached log item. These all call
> xfs_set_li_failed() to handle the error, but we are about to change
> the way log items manage buffers. Hence we first need to remove the
> per-item dependency on buffer handling done by xfs_set_li_failed().
> 
> We already have specific buffer type IO completion routines, so move
> the log item error handling out of the generic error handling and
> into the log item specific functions so we can implement per-type
> error handling easily.
> 
> This requires a more complex return value from the error handling
> code so that we can take the correct action the failure handling
> requires.  This results in some repeated boilerplate in the
> functions, but that can be cleaned up later once all the changes
> cascade through this code.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

I reiterate some of Darrick's comments in that it's slightly annoying to
see refactoring squashed together that looks like it could be done in a
couple of smaller, simpler patches. That aside, the only thing that
kind of bothers me is...

>  fs/xfs/xfs_buf_item.c | 167 ++++++++++++++++++++++++++++--------------
>  1 file changed, 112 insertions(+), 55 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index 09bfe9c52dbdb..b6995719e877b 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
...
> @@ -1031,36 +1025,80 @@ xfs_buf_iodone_callback_error(
...
> +static int
> +xfs_buf_iodone_error(
> +	struct xfs_buf		*bp)
> +{
> +	struct xfs_mount	*mp = bp->b_mount;
> +	struct xfs_error_cfg	*cfg;
> +
> +	if (xfs_buf_ioerror_sync(bp))
> +		goto out_stale;
> +
> +	trace_xfs_buf_item_iodone_async(bp, _RET_IP_);
> +
> +	cfg = xfs_error_get_cfg(mp, XFS_ERR_METADATA, bp->b_error);
> +	if (xfs_buf_ioerror_retry(bp, cfg)) {
> +		xfs_buf_ioerror(bp, 0);
> +		xfs_buf_submit(bp);
> +		return 1;
> +	}
> +
> +	if (xfs_buf_ioerror_permanent(bp, cfg))
>  		goto permanent_error;
>  
>  	/*
>  	 * Still a transient error, run IO completion failure callbacks and let
>  	 * the higher layers retry the buffer.
>  	 */
> -	xfs_buf_do_callbacks_fail(bp);
>  	xfs_buf_ioerror(bp, 0);
> -	xfs_buf_relse(bp);
> -	return true;
> +	return 2;

... that we now clear the buffer error code before running the failure
callbacks. I know that nothing in the callbacks looks at it right now,
but I think it's subtle and inelegant to split it off this way. Can we
just move this entire block together into the type callbacks?
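
I.e., have xfs_buf_iodone_error() leave bp->b_error set in the
transient error case and keep the whole block together in each type
callback, preserving the old ordering (rough sketch of the caller
side only):

	ASSERT(ret == 2);
	xfs_buf_do_callbacks_fail(bp);
	xfs_buf_ioerror(bp, 0);
	xfs_buf_relse(bp);
	return;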

Brian

>  
>  	/*
>  	 * Permanent error - we need to trigger a shutdown if we haven't already
> @@ -1072,30 +1110,7 @@ xfs_buf_iodone_callback_error(
>  	xfs_buf_stale(bp);
>  	bp->b_flags |= XBF_DONE;
>  	trace_xfs_buf_error_relse(bp, _RET_IP_);
> -	return false;
> -}
> -
> -static inline bool
> -xfs_buf_had_callback_errors(
> -	struct xfs_buf		*bp)
> -{
> -
> -	/*
> -	 * If there is an error, process it. Some errors require us to run
> -	 * callbacks after failure processing is done so we detect that and take
> -	 * appropriate action.
> -	 */
> -	if (bp->b_error && xfs_buf_iodone_callback_error(bp))
> -		return true;
> -
> -	/*
> -	 * Successful IO or permanent error. Either way, we can clear the
> -	 * retry state here in preparation for the next error that may occur.
> -	 */
> -	bp->b_last_error = 0;
> -	bp->b_retries = 0;
> -	bp->b_first_retry_time = 0;
> -	return false;
> +	return 0;
>  }
>  
>  static void
> @@ -1122,6 +1137,15 @@ xfs_buf_item_done(
>  	xfs_buf_rele(bp);
>  }
>  
> +static inline void
> +xfs_buf_clear_ioerror_retry_state(
> +	struct xfs_buf		*bp)
> +{
> +	bp->b_last_error = 0;
> +	bp->b_retries = 0;
> +	bp->b_first_retry_time = 0;
> +}
> +
>  /*
>   * Inode buffer iodone callback function.
>   */
> @@ -1129,9 +1153,20 @@ void
>  xfs_buf_inode_iodone(
>  	struct xfs_buf		*bp)
>  {
> -	if (xfs_buf_had_callback_errors(bp))
> +	if (bp->b_error) {
> +		int ret = xfs_buf_iodone_error(bp);
> +		if (!ret)
> +			goto finish_iodone;
> +		if (ret == 1)
> +			return;
> +		ASSERT(ret == 2);
> +		xfs_buf_do_callbacks_fail(bp);
> +		xfs_buf_relse(bp);
>  		return;
> +	}
>  
> +finish_iodone:
> +	xfs_buf_clear_ioerror_retry_state(bp);
>  	xfs_buf_item_done(bp);
>  	xfs_iflush_done(bp);
>  	xfs_buf_ioend_finish(bp);
> @@ -1144,9 +1179,20 @@ void
>  xfs_buf_dquot_iodone(
>  	struct xfs_buf		*bp)
>  {
> -	if (xfs_buf_had_callback_errors(bp))
> +	if (bp->b_error) {
> +		int ret = xfs_buf_iodone_error(bp);
> +		if (!ret)
> +			goto finish_iodone;
> +		if (ret == 1)
> +			return;
> +		ASSERT(ret == 2);
> +		xfs_buf_do_callbacks_fail(bp);
> +		xfs_buf_relse(bp);
>  		return;
> +	}
>  
> +finish_iodone:
> +	xfs_buf_clear_ioerror_retry_state(bp);
>  	/* a newly allocated dquot buffer might have a log item attached */
>  	xfs_buf_item_done(bp);
>  	xfs_dquot_done(bp);
> @@ -1163,9 +1209,20 @@ void
>  xfs_buf_iodone(
>  	struct xfs_buf		*bp)
>  {
> -	if (xfs_buf_had_callback_errors(bp))
> +	if (bp->b_error) {
> +		int ret = xfs_buf_iodone_error(bp);
> +		if (!ret)
> +			goto finish_iodone;
> +		if (ret == 1)
> +			return;
> +		ASSERT(ret == 2);
> +		xfs_buf_do_callbacks_fail(bp);
> +		xfs_buf_relse(bp);
>  		return;
> +	}
>  
> +finish_iodone:
> +	xfs_buf_clear_ioerror_retry_state(bp);
>  	xfs_buf_item_done(bp);
>  	xfs_buf_ioend_finish(bp);
>  }
> -- 
> 2.26.2.761.g0e0b3e54be
> 



* Re: [PATCH 14/30] xfs: unwind log item error flagging
  2020-06-01 21:42 ` [PATCH 14/30] xfs: unwind log item error flagging Dave Chinner
  2020-06-02 20:45   ` Darrick J. Wong
@ 2020-06-03 15:02   ` Brian Foster
  1 sibling, 0 replies; 80+ messages in thread
From: Brian Foster @ 2020-06-03 15:02 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Jun 02, 2020 at 07:42:35AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When a buffer IO error occurs, we want to mark all
> the log items attached to the buffer as failed. Open code
> the error handling loop so that we can modify the flagging for the
> different types of objects directly and independently of each other.
> 
> This also allows us to remove the ->iop_error method from the log
> item operations.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_buf_item.c   | 48 ++++++++++++-----------------------------
>  fs/xfs/xfs_dquot_item.c | 18 ----------------
>  fs/xfs/xfs_inode_item.c | 18 ----------------
>  fs/xfs/xfs_trans.h      |  1 -
>  4 files changed, 14 insertions(+), 71 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index b6995719e877b..2364a9aa2d71a 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
> @@ -12,6 +12,7 @@
>  #include "xfs_bit.h"
>  #include "xfs_mount.h"
>  #include "xfs_trans.h"
> +#include "xfs_trans_priv.h"
>  #include "xfs_buf_item.h"
>  #include "xfs_inode.h"
>  #include "xfs_inode_item.h"
> @@ -955,37 +956,6 @@ xfs_buf_item_relse(
>  	xfs_buf_item_free(bip);
>  }
>  
> -/*
> - * Invoke the error state callback for each log item affected by the failed I/O.
> - *
> - * If a metadata buffer write fails with a non-permanent error, the buffer is
> - * eventually resubmitted and so the completion callbacks are not run. The error
> - * state may need to be propagated to the log items attached to the buffer,
> - * however, so the next AIL push of the item knows hot to handle it correctly.
> - */
> -STATIC void
> -xfs_buf_do_callbacks_fail(
> -	struct xfs_buf		*bp)
> -{
> -	struct xfs_ail		*ailp = bp->b_mount->m_ail;
> -	struct xfs_log_item	*lip;
> -
> -	/*
> -	 * Buffer log item errors are handled directly by xfs_buf_item_push()
> -	 * and xfs_buf_iodone_callback_error, and they have no IO error
> -	 * callbacks. Check only for items in b_li_list.
> -	 */
> -	if (list_empty(&bp->b_li_list))
> -		return;
> -
> -	spin_lock(&ailp->ail_lock);
> -	list_for_each_entry(lip, &bp->b_li_list, li_bio_list) {
> -		if (lip->li_ops->iop_error)
> -			lip->li_ops->iop_error(lip, bp);
> -	}
> -	spin_unlock(&ailp->ail_lock);
> -}
> -
>  static bool
>  xfs_buf_ioerror_sync(
>  	struct xfs_buf		*bp)
> @@ -1154,13 +1124,18 @@ xfs_buf_inode_iodone(
>  	struct xfs_buf		*bp)
>  {
>  	if (bp->b_error) {
> +		struct xfs_log_item *lip;
>  		int ret = xfs_buf_iodone_error(bp);
>  		if (!ret)
>  			goto finish_iodone;
>  		if (ret == 1)
>  			return;
>  		ASSERT(ret == 2);
> -		xfs_buf_do_callbacks_fail(bp);
> +		spin_lock(&bp->b_mount->m_ail->ail_lock);
> +		list_for_each_entry(lip, &bp->b_li_list, li_bio_list) {
> +			xfs_set_li_failed(lip, bp);
> +		}
> +		spin_unlock(&bp->b_mount->m_ail->ail_lock);
>  		xfs_buf_relse(bp);
>  		return;
>  	}
> @@ -1180,13 +1155,18 @@ xfs_buf_dquot_iodone(
>  	struct xfs_buf		*bp)
>  {
>  	if (bp->b_error) {
> +		struct xfs_log_item *lip;
>  		int ret = xfs_buf_iodone_error(bp);
>  		if (!ret)
>  			goto finish_iodone;
>  		if (ret == 1)
>  			return;
>  		ASSERT(ret == 2);
> -		xfs_buf_do_callbacks_fail(bp);
> +		spin_lock(&bp->b_mount->m_ail->ail_lock);
> +		list_for_each_entry(lip, &bp->b_li_list, li_bio_list) {
> +			xfs_set_li_failed(lip, bp);
> +		}
> +		spin_unlock(&bp->b_mount->m_ail->ail_lock);
>  		xfs_buf_relse(bp);
>  		return;
>  	}
> @@ -1216,7 +1196,7 @@ xfs_buf_iodone(
>  		if (ret == 1)
>  			return;
>  		ASSERT(ret == 2);
> -		xfs_buf_do_callbacks_fail(bp);
> +		ASSERT(list_empty(&bp->b_li_list));
>  		xfs_buf_relse(bp);
>  		return;
>  	}
> diff --git a/fs/xfs/xfs_dquot_item.c b/fs/xfs/xfs_dquot_item.c
> index 349c92d26570c..d7e4de7151d7f 100644
> --- a/fs/xfs/xfs_dquot_item.c
> +++ b/fs/xfs/xfs_dquot_item.c
> @@ -113,23 +113,6 @@ xfs_qm_dqunpin_wait(
>  	wait_event(dqp->q_pinwait, (atomic_read(&dqp->q_pincount) == 0));
>  }
>  
> -/*
> - * Callback used to mark a buffer with XFS_LI_FAILED when items in the buffer
> - * have been failed during writeback
> - *
> - * this informs the AIL that the dquot is already flush locked on the next push,
> - * and acquires a hold on the buffer to ensure that it isn't reclaimed before
> - * dirty data makes it to disk.
> - */
> -STATIC void
> -xfs_dquot_item_error(
> -	struct xfs_log_item	*lip,
> -	struct xfs_buf		*bp)
> -{
> -	ASSERT(!completion_done(&DQUOT_ITEM(lip)->qli_dquot->q_flush));
> -	xfs_set_li_failed(lip, bp);
> -}
> -
>  STATIC uint
>  xfs_qm_dquot_logitem_push(
>  	struct xfs_log_item	*lip,
> @@ -216,7 +199,6 @@ static const struct xfs_item_ops xfs_dquot_item_ops = {
>  	.iop_release	= xfs_qm_dquot_logitem_release,
>  	.iop_committing	= xfs_qm_dquot_logitem_committing,
>  	.iop_push	= xfs_qm_dquot_logitem_push,
> -	.iop_error	= xfs_dquot_item_error
>  };
>  
>  /*
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index 7049f2ae8d186..86c783dec2bac 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -464,23 +464,6 @@ xfs_inode_item_unpin(
>  		wake_up_bit(&ip->i_flags, __XFS_IPINNED_BIT);
>  }
>  
> -/*
> - * Callback used to mark a buffer with XFS_LI_FAILED when items in the buffer
> - * have been failed during writeback
> - *
> - * This informs the AIL that the inode is already flush locked on the next push,
> - * and acquires a hold on the buffer to ensure that it isn't reclaimed before
> - * dirty data makes it to disk.
> - */
> -STATIC void
> -xfs_inode_item_error(
> -	struct xfs_log_item	*lip,
> -	struct xfs_buf		*bp)
> -{
> -	ASSERT(xfs_isiflocked(INODE_ITEM(lip)->ili_inode));
> -	xfs_set_li_failed(lip, bp);
> -}
> -
>  STATIC uint
>  xfs_inode_item_push(
>  	struct xfs_log_item	*lip,
> @@ -619,7 +602,6 @@ static const struct xfs_item_ops xfs_inode_item_ops = {
>  	.iop_committed	= xfs_inode_item_committed,
>  	.iop_push	= xfs_inode_item_push,
>  	.iop_committing	= xfs_inode_item_committing,
> -	.iop_error	= xfs_inode_item_error
>  };
>  
>  
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index 99a9ab9cab25b..b752501818d25 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -74,7 +74,6 @@ struct xfs_item_ops {
>  	void (*iop_committing)(struct xfs_log_item *, xfs_lsn_t commit_lsn);
>  	void (*iop_release)(struct xfs_log_item *);
>  	xfs_lsn_t (*iop_committed)(struct xfs_log_item *, xfs_lsn_t);
> -	void (*iop_error)(struct xfs_log_item *, xfs_buf_t *);
>  	int (*iop_recover)(struct xfs_log_item *lip, struct xfs_trans *tp);
>  	bool (*iop_match)(struct xfs_log_item *item, uint64_t id);
>  };
> -- 
> 2.26.2.761.g0e0b3e54be
> 



* Re: [PATCH 15/30] xfs: move xfs_clear_li_failed out of xfs_ail_delete_one()
  2020-06-01 21:42 ` [PATCH 15/30] xfs: move xfs_clear_li_failed out of xfs_ail_delete_one() Dave Chinner
  2020-06-02 20:47   ` Darrick J. Wong
@ 2020-06-03 15:02   ` Brian Foster
  1 sibling, 0 replies; 80+ messages in thread
From: Brian Foster @ 2020-06-03 15:02 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Jun 02, 2020 at 07:42:36AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> xfs_ail_delete_one() is called directly from dquot and inode IO
> completion, as well as from the generic xfs_trans_ail_delete()
> function. Inodes are about to have their own failure handling, and
> dquots will in future, too. Pull the clearing of the LI_FAILED flag
> up into the callers so we can customise the code appropriately.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_dquot.c      | 6 +-----
>  fs/xfs/xfs_inode_item.c | 3 +--
>  fs/xfs/xfs_trans_ail.c  | 2 +-
>  3 files changed, 3 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
> index d5984a926d1d0..76353c9a723ee 100644
> --- a/fs/xfs/xfs_dquot.c
> +++ b/fs/xfs/xfs_dquot.c
> @@ -1070,16 +1070,12 @@ xfs_qm_dqflush_done(
>  	     test_bit(XFS_LI_FAILED, &lip->li_flags))) {
>  
>  		spin_lock(&ailp->ail_lock);
> +		xfs_clear_li_failed(lip);
>  		if (lip->li_lsn == qip->qli_flush_lsn) {
>  			/* xfs_ail_update_finish() drops the AIL lock */
>  			tail_lsn = xfs_ail_delete_one(ailp, lip);
>  			xfs_ail_update_finish(ailp, tail_lsn);
>  		} else {
> -			/*
> -			 * Clear the failed state since we are about to drop the
> -			 * flush lock
> -			 */
> -			xfs_clear_li_failed(lip);
>  			spin_unlock(&ailp->ail_lock);
>  		}
>  	}
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index 86c783dec2bac..0ba75764a8dc5 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -690,12 +690,11 @@ xfs_iflush_done(
>  		/* this is an opencoded batch version of xfs_trans_ail_delete */
>  		spin_lock(&ailp->ail_lock);
>  		list_for_each_entry(lip, &tmp, li_bio_list) {
> +			xfs_clear_li_failed(lip);
>  			if (lip->li_lsn == INODE_ITEM(lip)->ili_flush_lsn) {
>  				xfs_lsn_t lsn = xfs_ail_delete_one(ailp, lip);
>  				if (!tail_lsn && lsn)
>  					tail_lsn = lsn;
> -			} else {
> -				xfs_clear_li_failed(lip);
>  			}
>  		}
>  		xfs_ail_update_finish(ailp, tail_lsn);
> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> index ac5019361a139..ac33f6393f99c 100644
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -843,7 +843,6 @@ xfs_ail_delete_one(
>  
>  	trace_xfs_ail_delete(lip, mlip->li_lsn, lip->li_lsn);
>  	xfs_ail_delete(ailp, lip);
> -	xfs_clear_li_failed(lip);
>  	clear_bit(XFS_LI_IN_AIL, &lip->li_flags);
>  	lip->li_lsn = 0;
>  
> @@ -874,6 +873,7 @@ xfs_trans_ail_delete(
>  	}
>  
>  	/* xfs_ail_update_finish() drops the AIL lock */
> +	xfs_clear_li_failed(lip);
>  	tail_lsn = xfs_ail_delete_one(ailp, lip);
>  	xfs_ail_update_finish(ailp, tail_lsn);
>  }
> -- 
> 2.26.2.761.g0e0b3e54be
> 



* Re: [PATCH 16/30] xfs: pin inode backing buffer to the inode log item
  2020-06-01 21:42 ` [PATCH 16/30] xfs: pin inode backing buffer to the inode log item Dave Chinner
  2020-06-02 22:30   ` Darrick J. Wong
@ 2020-06-03 18:58   ` Brian Foster
  2020-06-03 22:15     ` Dave Chinner
  1 sibling, 1 reply; 80+ messages in thread
From: Brian Foster @ 2020-06-03 18:58 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Jun 02, 2020 at 07:42:37AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When we dirty an inode, we are going to have to write it to disk at
> some point in the near future. This requires the inode cluster
> backing buffer to be present in memory. Unfortunately, under severe
> memory pressure we can reclaim the inode backing buffer while the
> inode is dirty in memory, resulting in stalling the AIL pushing
> because it has to do a read-modify-write cycle on the cluster
> buffer.
> 
> When we have no memory available, the read of the cluster buffer
> blocks the AIL pushing process, and this causes all sorts of issues
> for memory reclaim as it requires inode writeback to make forwards
> progress. Allocating a cluster buffer causes more memory pressure,
> and results in more cluster buffers being reclaimed, resulting in
> more RMW cycles being done in the AIL context and everything then
> backs up on AIL progress. Only the synchronous inode cluster
> writeback in the inode reclaim code provides some level of
> forwards progress guarantees that prevent OOM-killer rampages in
> this situation.
> 
> Fix this by pinning the inode backing buffer to the inode log item
> when the inode is first dirtied (i.e. in xfs_trans_log_inode()).
> This may mean the first modification of an inode that has been held
> in cache for a long time may block on a cluster buffer read, but
> we can do that in transaction context and block safely until the
> buffer has been allocated and read.
> 
> Once we have the cluster buffer, the inode log item takes a
> reference to it, pinning it in memory, and attaches it to the log
> item for future reference. This means we can always grab the cluster
> buffer from the inode log item when we need it.
> 
> When the inode is finally cleaned and removed from the AIL, we can
> drop the reference the inode log item holds on the cluster buffer.
> Once all inodes on the cluster buffer are clean, the cluster buffer
> will be unpinned and it will be available for memory reclaim to
> reclaim again.
> 
> This avoids the issues with needing to do RMW cycles in the AIL
> pushing context, and hence allows complete non-blocking inode
> flushing to be performed by the AIL pushing context.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_inode_buf.c   |  3 +-
>  fs/xfs/libxfs/xfs_trans_inode.c | 53 +++++++++++++++++++++---
>  fs/xfs/xfs_buf_item.c           |  4 +-
>  fs/xfs/xfs_inode_item.c         | 73 +++++++++++++++++++++++++++------
>  fs/xfs/xfs_trans_ail.c          |  8 +++-
>  5 files changed, 117 insertions(+), 24 deletions(-)
> 
...
> diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
> index fe6c2e39be85d..1e7147b90725e 100644
> --- a/fs/xfs/libxfs/xfs_trans_inode.c
> +++ b/fs/xfs/libxfs/xfs_trans_inode.c
...
> @@ -132,6 +140,39 @@ xfs_trans_log_inode(
>  	spin_lock(&iip->ili_lock);
>  	iip->ili_fsync_fields |= flags;
>  
> +	if (!iip->ili_item.li_buf) {
> +		struct xfs_buf	*bp;
> +		int		error;
> +
> +		/*
> +		 * We hold the ILOCK here, so this inode is not going to be
> +		 * flushed while we are here. Further, because there is no
> +		 * buffer attached to the item, we know that there is no IO in
> +		 * progress, so nothing will clear the ili_fields while we read
> +		 * in the buffer. Hence we can safely drop the spin lock and
> +		 * read the buffer knowing that the state will not change from
> +		 * here.
> +		 */
> +		spin_unlock(&iip->ili_lock);
> +		error = xfs_imap_to_bp(ip->i_mount, tp, &ip->i_imap, NULL,
> +					&bp, 0);
> +		if (error) {
> +			xfs_force_shutdown(ip->i_mount, SHUTDOWN_META_IO_ERROR);
> +			return;
> +		}

It's slightly unfortunate to shut down on a read error, but I'd guess
many of these cases would have a dirty transaction already. Perhaps
something worth cleaning up later?

> +
> +		/*
> +		 * We need an explicit buffer reference for the log item but
> +		 * don't want the buffer to remain attached to the transaction.
> +		 * Hold the buffer but release the transaction reference.
> +		 */
> +		xfs_buf_hold(bp);
> +		xfs_trans_brelse(tp, bp);
> +
> +		spin_lock(&iip->ili_lock);
> +		iip->ili_item.li_buf = bp;
> +	}
> +
>  	/*
>  	 * Always OR in the bits from the ili_last_fields field.  This is to
>  	 * coordinate with the xfs_iflush() and xfs_iflush_done() routines in
...
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index 0ba75764a8dc5..0a7720b7a821a 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -130,6 +130,8 @@ xfs_inode_item_size(
>  	xfs_inode_item_data_fork_size(iip, nvecs, nbytes);
>  	if (XFS_IFORK_Q(ip))
>  		xfs_inode_item_attr_fork_size(iip, nvecs, nbytes);
> +
> +	ASSERT(iip->ili_item.li_buf);

This assert seems unnecessary since we have one in ->iop_pin() just
below.

>  }
>  
>  STATIC void
> @@ -439,6 +441,7 @@ xfs_inode_item_pin(
>  	struct xfs_inode	*ip = INODE_ITEM(lip)->ili_inode;
>  
>  	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
> +	ASSERT(lip->li_buf);
>  
>  	trace_xfs_inode_pin(ip, _RET_IP_);
>  	atomic_inc(&ip->i_pincount);
> @@ -450,6 +453,12 @@ xfs_inode_item_pin(
>   * item which was previously pinned with a call to xfs_inode_item_pin().
>   *
>   * Also wake up anyone in xfs_iunpin_wait() if the count goes to 0.
> + *
> + * Note that unpin can race with inode cluster buffer freeing marking the buffer
> + * stale. In that case, flush completions are run from the buffer unpin call,
> + * which may happen before the inode is unpinned. If we lose the race, there
> + * will be no buffer attached to the log item, but the inode will be marked
> + * XFS_ISTALE.
>   */
>  STATIC void
>  xfs_inode_item_unpin(
> @@ -459,6 +468,7 @@ xfs_inode_item_unpin(
>  	struct xfs_inode	*ip = INODE_ITEM(lip)->ili_inode;
>  
>  	trace_xfs_inode_unpin(ip, _RET_IP_);
> +	ASSERT(lip->li_buf || xfs_iflags_test(ip, XFS_ISTALE));
>  	ASSERT(atomic_read(&ip->i_pincount) > 0);
>  	if (atomic_dec_and_test(&ip->i_pincount))
>  		wake_up_bit(&ip->i_flags, __XFS_IPINNED_BIT);

So I was wondering what happens to the attached buffer hold if shutdown
occurs after the inode is logged (i.e. transaction aborts or log write
fails). I see there's an assert for the buffer being cleaned up before
the ili is freed, so presumably that case is handled. It looks like we
unconditionally abort a flush on inode reclaim if the fs is shutdown,
regardless of whether the inode is dirty and we drop the buffer from
there..?

> @@ -629,10 +639,15 @@ xfs_inode_item_init(
>   */
>  void
>  xfs_inode_item_destroy(
> -	xfs_inode_t	*ip)
> +	struct xfs_inode	*ip)
>  {
> -	kmem_free(ip->i_itemp->ili_item.li_lv_shadow);
> -	kmem_cache_free(xfs_ili_zone, ip->i_itemp);
> +	struct xfs_inode_log_item *iip = ip->i_itemp;
> +
> +	ASSERT(iip->ili_item.li_buf == NULL);
> +
> +	ip->i_itemp = NULL;
> +	kmem_free(iip->ili_item.li_lv_shadow);
> +	kmem_cache_free(xfs_ili_zone, iip);
>  }
>  
>  
> @@ -647,6 +662,13 @@ xfs_inode_item_destroy(
>   * list for other inodes that will run this function. We remove them from the
>   * buffer list so we can process all the inode IO completions in one AIL lock
>   * traversal.
> + *
> + * Note: Now that we attach the log item to the buffer when we first log the
> + * inode in memory, we can have unflushed inodes on the buffer list here. These
> + * inodes will have a zero ili_last_fields, so skip over them here. We do
> + * this check -after- we've checked for stale inodes, because we're guaranteed
> + * to have XFS_ISTALE set in the case that dirty inodes are in the CIL and have
> + * not yet had their dirtying transactions committed to disk.
>   */
>  void
>  xfs_iflush_done(
> @@ -670,14 +692,16 @@ xfs_iflush_done(
>  			continue;
>  		}
>  
> +		if (!iip->ili_last_fields)
> +			continue;
> +

Hmm.. reading the comment above, do we actually attach the log item to
the buffer any earlier? ISTM we attach the buffer to the log item via a
hold, but that's different from getting the ili on ->b_li_list such that
it's available here. Hm?
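
To make the distinction concrete, a rough sketch using this series'
names (not code from the patch itself):

	/* "hold": take a reference so the buffer can't be reclaimed */
	xfs_buf_hold(bp);
	iip->ili_item.li_buf = bp;

	/* "attach": put the item on the buffer's IO completion list,
	 * which is the bp->b_li_list this function walks */
	list_add_tail(&iip->ili_item.li_bio_list, &bp->b_li_list);

Only the second makes an inode visible to xfs_iflush_done().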

>  		list_move_tail(&lip->li_bio_list, &tmp);
>  
>  		/* Do an unlocked check for needing the AIL lock. */
> -		if (lip->li_lsn == iip->ili_flush_lsn ||
> +		if (iip->ili_flush_lsn == lip->li_lsn ||
>  		    test_bit(XFS_LI_FAILED, &lip->li_flags))
>  			need_ail++;
>  	}
> -	ASSERT(list_empty(&bp->b_li_list));
>  
>  	/*
>  	 * We only want to pull the item from the AIL if it is actually there
...
> @@ -706,14 +730,29 @@ xfs_iflush_done(
>  	 * them is safely on disk.
>  	 */
>  	list_for_each_entry_safe(lip, n, &tmp, li_bio_list) {
> +		bool	drop_buffer = false;
> +
>  		list_del_init(&lip->li_bio_list);
>  		iip = INODE_ITEM(lip);
>  
>  		spin_lock(&iip->ili_lock);
> +
> +		/*
> +		 * Remove the reference to the cluster buffer if the inode is
> +		 * clean in memory. Drop the buffer reference once we've dropped
> +		 * the locks we hold.
> +		 */
> +		ASSERT(iip->ili_item.li_buf == bp);
> +		if (!iip->ili_fields) {
> +			iip->ili_item.li_buf = NULL;
> +			drop_buffer = true;
> +		}
>  		iip->ili_last_fields = 0;
> +		iip->ili_flush_lsn = 0;

This also seems related to the behavior noted in the comment above.
Presumably we have to clear the flush lsn if clean inodes remain
attached to the buffer.. (but does that actually happen yet)?

Brian

>  		spin_unlock(&iip->ili_lock);
> -
>  		xfs_ifunlock(iip->ili_inode);
> +		if (drop_buffer)
> +			xfs_buf_rele(bp);
>  	}
>  }
>  
> @@ -725,12 +764,20 @@ xfs_iflush_done(
>   */
>  void
>  xfs_iflush_abort(
> -	struct xfs_inode		*ip)
> +	struct xfs_inode	*ip)
>  {
> -	struct xfs_inode_log_item	*iip = ip->i_itemp;
> +	struct xfs_inode_log_item *iip = ip->i_itemp;
> +	struct xfs_buf		*bp = NULL;
>  
>  	if (iip) {
> +		/*
> +		 * Clear the failed bit before removing the item from the AIL so
> +		 * xfs_trans_ail_delete() doesn't try to clear and release the
> +		 * buffer attached to the log item before we are done with it.
> +		 */
> +		clear_bit(XFS_LI_FAILED, &iip->ili_item.li_flags);
>  		xfs_trans_ail_delete(&iip->ili_item, 0);
> +
>  		/*
>  		 * Clear the inode logging fields so no more flushes are
>  		 * attempted.
> @@ -739,12 +786,14 @@ xfs_iflush_abort(
>  		iip->ili_last_fields = 0;
>  		iip->ili_fields = 0;
>  		iip->ili_fsync_fields = 0;
> +		iip->ili_flush_lsn = 0;
> +		bp = iip->ili_item.li_buf;
> +		iip->ili_item.li_buf = NULL;
>  		spin_unlock(&iip->ili_lock);
>  	}
> -	/*
> -	 * Release the inode's flush lock since we're done with it.
> -	 */
>  	xfs_ifunlock(ip);
> +	if (bp)
> +		xfs_buf_rele(bp);
>  }
>  
>  /*
> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> index ac33f6393f99c..c3be6e4401343 100644
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -377,8 +377,12 @@ xfsaild_resubmit_item(
>  	}
>  
>  	/* protected by ail_lock */
> -	list_for_each_entry(lip, &bp->b_li_list, li_bio_list)
> -		xfs_clear_li_failed(lip);
> +	list_for_each_entry(lip, &bp->b_li_list, li_bio_list) {
> +		if (bp->b_flags & _XBF_INODES)
> +			clear_bit(XFS_LI_FAILED, &lip->li_flags);
> +		else
> +			xfs_clear_li_failed(lip);
> +	}
>  
>  	xfs_buf_unlock(bp);
>  	return XFS_ITEM_SUCCESS;
> -- 
> 2.26.2.761.g0e0b3e54be
> 



* Re: [PATCH 04/30] xfs: mark inode buffers in cache
  2020-06-03 14:57       ` Brian Foster
@ 2020-06-03 21:21         ` Dave Chinner
  0 siblings, 0 replies; 80+ messages in thread
From: Dave Chinner @ 2020-06-03 21:21 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Wed, Jun 03, 2020 at 10:57:49AM -0400, Brian Foster wrote:
> On Wed, Jun 03, 2020 at 07:29:18AM +1000, Dave Chinner wrote:
> > On Tue, Jun 02, 2020 at 12:45:35PM -0400, Brian Foster wrote:
> > > On Tue, Jun 02, 2020 at 07:42:25AM +1000, Dave Chinner wrote:
> > > > From: Dave Chinner <dchinner@redhat.com>
> > > > 
> > > > Inode buffers always have write IO callbacks, so by marking them
> > > > directly we can avoid needing to attach ->b_iodone functions to
> > > > them. This avoids an indirect call, and makes future modifications
> > > > much simpler.
> > > > 
> > > > This is largely a rearrangement of the code at this point - no IO
> > > > completion functionality changes, just how the code is run is
> > > > modified.
> > > > 
> > > 
> > > Ok, I was initially thinking this patch looked incomplete in that we
> > > continue to set ->b_iodone() on inode buffers even though we'd never
> > > call it. Looking ahead, I see that the next few patches continue to
> > > clean that up to eventually remove ->b_iodone(), so that addresses that.
> > > 
> > > My only other curiosity is that while there may not be any functional
> > > difference, this technically changes callback behavior in that we set
> > > the new flag in some contexts that don't currently attach anything to
> > > the buffer, right? E.g., xfs_trans_inode_alloc_buf() sets the flag on
> > > inode chunk init, which means we can write out an inode buffer without
> > > any attached/flushed inodes.
> > 
> > Yes, it can happen, and it happens before this patch, too, because
> > the AIL can push the buffer log item directly and that does not
> > flush dirty inodes to the buffer before it writes back(*).
> > 
> 
> I was thinking more about cases where there are actually no inodes
> attached.
> 
> > As it is, xfs_buf_inode_iodone() on a buffer with no inode attached
> > is functionally identical to the existing xfs_buf_iodone() callback
> > that would otherwise be done. i.e. it just runs the buffer log item
> > completion callback. Hence the change here rearranges code, but it
> > does not change behaviour at all.
> > 
> 
> Right. That's indicative from the code, but doesn't help me understand
> why the change is made. That's all I'm asking for...
> 
> > (*) this is a double-write bug that this patch set does not address.
> > i.e. buffer log item flushes the buffer without flushing inodes, IO
> > completes, then inodes are flushed to the buffer and we do another IO to
> > clean them.  This is addressed by a follow-on patchset that tracks
> > dirty inodes via ordered cluster buffers, such that pushing the
> > buffer always triggers xfs_iflush_cluster() on buffers tagged
> > _XBF_INODES...
> > 
> 
> Ok, interesting (but seems beyond the scope of this series).

It is used in this series in the ail buffer resubmit code to clear
the LI_FAILED state appropriately, because inode items are treated
differently to dquot items once they track the cluster buffer...
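
That's the xfsaild_resubmit_item() hunk quoted earlier in this
thread; restated with the reasoning as comments (a sketch of the
intent, not new code):

	/* protected by ail_lock */
	list_for_each_entry(lip, &bp->b_li_list, li_bio_list) {
		if (bp->b_flags & _XBF_INODES)
			/* inode items now hold their own reference to
			 * the cluster buffer, so only clear the bit */
			clear_bit(XFS_LI_FAILED, &lip->li_flags);
		else
			/* xfs_clear_li_failed() also drops the li_buf
			 * reference, which dquot items still rely on */
			xfs_clear_li_failed(lip);
	}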

> > > Is the intent of that to support future
> > > changes? If so, a note about that in the commit log would be helpful.
> > 
> > That's part of it, as you can see from the (*) above. But the commit
> > log already says "..., and makes future modifications much simpler."
> > Was that insufficient to indicate that it will be used later on?
> > 
> 
> That's a rather vague hint. ;P I was more hoping for something like:
> "While this is largely a refactor of existing functionality, broaden the
> scope of the flag to beyond where inodes are explicitly attached because
> <some actual reason>. This has the effect of possibly invoking the
> callback in cases where it wouldn't have been previously, but this is
> not a functional change because the callback is effectively a no-op when
> inodes are not attached."

Ok.

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 13/30] xfs: handle buffer log item IO errors directly
  2020-06-03 15:02   ` Brian Foster
@ 2020-06-03 21:34     ` Dave Chinner
  0 siblings, 0 replies; 80+ messages in thread
From: Dave Chinner @ 2020-06-03 21:34 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Wed, Jun 03, 2020 at 11:02:07AM -0400, Brian Foster wrote:
> On Tue, Jun 02, 2020 at 07:42:34AM +1000, Dave Chinner wrote:
> > +	if (xfs_buf_ioerror_sync(bp))
> > +		goto out_stale;
> > +
> > +	trace_xfs_buf_item_iodone_async(bp, _RET_IP_);
> > +
> > +	cfg = xfs_error_get_cfg(mp, XFS_ERR_METADATA, bp->b_error);
> > +	if (xfs_buf_ioerror_retry(bp, cfg)) {
> > +		xfs_buf_ioerror(bp, 0);
> > +		xfs_buf_submit(bp);
> > +		return 1;
> > +	}
> > +
> > +	if (xfs_buf_ioerror_permanent(bp, cfg))
> >  		goto permanent_error;
> >  
> >  	/*
> >  	 * Still a transient error, run IO completion failure callbacks and let
> >  	 * the higher layers retry the buffer.
> >  	 */
> > -	xfs_buf_do_callbacks_fail(bp);
> >  	xfs_buf_ioerror(bp, 0);
> > -	xfs_buf_relse(bp);
> > -	return true;
> > +	return 2;
> 
> ... that we now clear the buffer error code before running the failure
> callbacks. I know that nothing in the callbacks looks at it right now,
> but I think it's subtle and inelegant to split it off this way. Can we
> just move this entire block together into the type callbacks?

Sure. It's largely just deck chair rearrangement, though, because
the whole XFS_LI_FAILED mechanism ends up going away real soon. The next patchset
gets rid of it entirely for inode log items, and when the same
approach is applied to dquots, it no longer will be used by
anything and will be removed entirely.

IOWs, the future isn't "maybe error callbacks will do something
different", the future is "error callbacks don't exist any more".

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 16/30] xfs: pin inode backing buffer to the inode log item
  2020-06-03 18:58   ` Brian Foster
@ 2020-06-03 22:15     ` Dave Chinner
  2020-06-04 14:03       ` Brian Foster
  0 siblings, 1 reply; 80+ messages in thread
From: Dave Chinner @ 2020-06-03 22:15 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Wed, Jun 03, 2020 at 02:58:12PM -0400, Brian Foster wrote:
> On Tue, Jun 02, 2020 at 07:42:37AM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > When we dirty an inode, we are going to have to write it to disk at
> > some point in the near future. This requires the inode cluster
> > backing buffer to be present in memory. Unfortunately, under severe
> > memory pressure we can reclaim the inode backing buffer while the
> > inode is dirty in memory, resulting in stalling the AIL pushing
> > because it has to do a read-modify-write cycle on the cluster
> > buffer.
> > 
> > When we have no memory available, the read of the cluster buffer
> > blocks the AIL pushing process, and this causes all sorts of issues
> > for memory reclaim as it requires inode writeback to make forwards
> > progress. Allocating a cluster buffer causes more memory pressure,
> > and results in more cluster buffers to be reclaimed, resulting in
> > more RMW cycles to be done in the AIL context and everything then
> > backs up on AIL progress. Only the synchronous inode cluster
> > writeback in the inode reclaim code provides some level of
> > forwards progress guarantees that prevent OOM-killer rampages in
> > this situation.
> > 
> > Fix this by pinning the inode backing buffer to the inode log item
> > when the inode is first dirtied (i.e. in xfs_trans_log_inode()).
> > This may mean the first modification of an inode that has been held
> > in cache for a long time may block on a cluster buffer read, but
> > we can do that in transaction context and block safely until the
> > buffer has been allocated and read.
> > 
> > Once we have the cluster buffer, the inode log item takes a
> > reference to it, pinning it in memory, and attaches it to the log
> > item for future reference. This means we can always grab the cluster
> > buffer from the inode log item when we need it.
> > 
> > When the inode is finally cleaned and removed from the AIL, we can
> > drop the reference the inode log item holds on the cluster buffer.
> > Once all inodes on the cluster buffer are clean, the cluster buffer
> > will be unpinned and it will be available for memory reclaim to
> > reclaim again.
> > 
> > This avoids the issues with needing to do RMW cycles in the AIL
> > pushing context, and hence allows complete non-blocking inode
> > flushing to be performed by the AIL pushing context.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/libxfs/xfs_inode_buf.c   |  3 +-
> >  fs/xfs/libxfs/xfs_trans_inode.c | 53 +++++++++++++++++++++---
> >  fs/xfs/xfs_buf_item.c           |  4 +-
> >  fs/xfs/xfs_inode_item.c         | 73 +++++++++++++++++++++++++++------
> >  fs/xfs/xfs_trans_ail.c          |  8 +++-
> >  5 files changed, 117 insertions(+), 24 deletions(-)
> > 
> ...
> > diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
> > index fe6c2e39be85d..1e7147b90725e 100644
> > --- a/fs/xfs/libxfs/xfs_trans_inode.c
> > +++ b/fs/xfs/libxfs/xfs_trans_inode.c
> ...
> > @@ -132,6 +140,39 @@ xfs_trans_log_inode(
> >  	spin_lock(&iip->ili_lock);
> >  	iip->ili_fsync_fields |= flags;
> >  
> > +	if (!iip->ili_item.li_buf) {
> > +		struct xfs_buf	*bp;
> > +		int		error;
> > +
> > +		/*
> > +		 * We hold the ILOCK here, so this inode is not going to be
> > +		 * flushed while we are here. Further, because there is no
> > +		 * buffer attached to the item, we know that there is no IO in
> > +		 * progress, so nothing will clear the ili_fields while we read
> > +		 * in the buffer. Hence we can safely drop the spin lock and
> > +		 * read the buffer knowing that the state will not change from
> > +		 * here.
> > +		 */
> > +		spin_unlock(&iip->ili_lock);
> > +		error = xfs_imap_to_bp(ip->i_mount, tp, &ip->i_imap, NULL,
> > +					&bp, 0);
> > +		if (error) {
> > +			xfs_force_shutdown(ip->i_mount, SHUTDOWN_META_IO_ERROR);
> > +			return;
> > +		}
> 
> It's slightly unfortunate to shutdown on a read error, but I'd guess
> many of these cases would have a dirty transaction already. Perhaps
> something worth cleaning up later..?

All of these cases will have a dirty transaction - the inode has been
modified in memory before it was logged and hence we cannot undo
what has already been done. If we return an error here, we will need
to cancel the transaction, and xfs_trans_cancel() will do a shutdown
anyway. Doing it here just means we don't have to return an error
and add error handling to the ~80 callers of xfs_trans_log_inode()
just to trigger a shutdown correctly.
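
For illustration, the boilerplate every caller would otherwise need -
a hypothetical, since xfs_trans_log_inode() is void and stays that
way:

	error = xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
	if (error) {
		/* tp is already dirty here, so this shuts down anyway */
		xfs_trans_cancel(tp);
		return error;
	}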

> > @@ -450,6 +453,12 @@ xfs_inode_item_pin(
> >   * item which was previously pinned with a call to xfs_inode_item_pin().
> >   *
> >   * Also wake up anyone in xfs_iunpin_wait() if the count goes to 0.
> > + *
> > + * Note that unpin can race with inode cluster buffer freeing marking the buffer
> > + * stale. In that case, flush completions are run from the buffer unpin call,
> > + * which may happen before the inode is unpinned. If we lose the race, there
> > + * will be no buffer attached to the log item, but the inode will be marked
> > + * XFS_ISTALE.
> >   */
> >  STATIC void
> >  xfs_inode_item_unpin(
> > @@ -459,6 +468,7 @@ xfs_inode_item_unpin(
> >  	struct xfs_inode	*ip = INODE_ITEM(lip)->ili_inode;
> >  
> >  	trace_xfs_inode_unpin(ip, _RET_IP_);
> > +	ASSERT(lip->li_buf || xfs_iflags_test(ip, XFS_ISTALE));
> >  	ASSERT(atomic_read(&ip->i_pincount) > 0);
> >  	if (atomic_dec_and_test(&ip->i_pincount))
> >  		wake_up_bit(&ip->i_flags, __XFS_IPINNED_BIT);
> 
> So I was wondering what happens to the attached buffer hold if shutdown
> occurs after the inode is logged (i.e. transaction aborts or log write
> fails).

Hmmm. Good question. 

There may be no other inodes on the buffer, and this inode may not
be in the AIL, so there's no trigger for xfs_iflush_abort() to be
run from buffer IO completion. So, yes, we could leave an inode
attached to the buffer here....


> I see there's an assert for the buffer being cleaned up before
> the ili is freed, so presumably that case is handled. It looks like we
> unconditionally abort a flush on inode reclaim if the fs is shutdown,
> regardless of whether the inode is dirty and we drop the buffer from
> there..?

Yes, that's where this shutdown race condition has always been
handled. i.e. we know that inodes that are dirty in memory can be
left dangling by the shutdown, and if it's a log IO error they may
even still be pinned. Hence reclaim has to ensure that they are
properly aborted before reclaim otherwise various "reclaiming dirty
inode" asserts will fire.

As it is, in the next patchset the cluster buffer is always inserted
into the AIL as an ordered buffer so it is always committed in the
same transaction as the inode. Hence the abort/unpin call on the
buffer runs the inode IO done processing, the inode will get removed
from the list, and we aren't directly reliant on inode reclaim
running a flush abort to do that for us.
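
Something along these lines at commit time (a sketch of the follow-on
change, not code from this series):

	/* commit the cluster buffer in the same transaction as the
	 * inode without logging any of the buffer contents */
	xfs_trans_bjoin(tp, bp);
	xfs_trans_ordered_buf(tp, bp);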

> > @@ -629,10 +639,15 @@ xfs_inode_item_init(
> >   */
> >  void
> >  xfs_inode_item_destroy(
> > -	xfs_inode_t	*ip)
> > +	struct xfs_inode	*ip)
> >  {
> > -	kmem_free(ip->i_itemp->ili_item.li_lv_shadow);
> > -	kmem_cache_free(xfs_ili_zone, ip->i_itemp);
> > +	struct xfs_inode_log_item *iip = ip->i_itemp;
> > +
> > +	ASSERT(iip->ili_item.li_buf == NULL);
> > +
> > +	ip->i_itemp = NULL;
> > +	kmem_free(iip->ili_item.li_lv_shadow);
> > +	kmem_cache_free(xfs_ili_zone, iip);
> >  }
> >  
> >  
> > @@ -647,6 +662,13 @@ xfs_inode_item_destroy(
> >   * list for other inodes that will run this function. We remove them from the
> >   * buffer list so we can process all the inode IO completions in one AIL lock
> >   * traversal.
> > + *
> > + * Note: Now that we attach the log item to the buffer when we first log the
> > + * inode in memory, we can have unflushed inodes on the buffer list here. These
> > + * inodes will have a zero ili_last_fields, so skip over them here. We do
> > + * this check -after- we've checked for stale inodes, because we're guaranteed
> > + * to have XFS_ISTALE set in the case that dirty inodes are in the CIL and have
> > + * not yet had their dirtying transactions committed to disk.
> >   */
> >  void
> >  xfs_iflush_done(
> > @@ -670,14 +692,16 @@ xfs_iflush_done(
> >  			continue;
> >  		}
> >  
> > +		if (!iip->ili_last_fields)
> > +			continue;
> > +
> 
> Hmm.. reading the comment above, do we actually attach the log item to
> the buffer any earlier? ISTM we attach the buffer to the log item via a
> hold, but that's different from getting the ili on ->b_li_list such that
> it's available here. Hm?

I think I've probably just put the comment and this check in the
wrong patch when I split this all up. A later patch in the series
moves the inode attachment to the buffer to the
xfs_trans_log_inode() call, and that's when this situation arises
and the check is needed.


> > @@ -706,14 +730,29 @@ xfs_iflush_done(
> >  	 * them is safely on disk.
> >  	 */
> >  	list_for_each_entry_safe(lip, n, &tmp, li_bio_list) {
> > +		bool	drop_buffer = false;
> > +
> >  		list_del_init(&lip->li_bio_list);
> >  		iip = INODE_ITEM(lip);
> >  
> >  		spin_lock(&iip->ili_lock);
> > +
> > +		/*
> > +		 * Remove the reference to the cluster buffer if the inode is
> > +		 * clean in memory. Drop the buffer reference once we've dropped
> > +		 * the locks we hold.
> > +		 */
> > +		ASSERT(iip->ili_item.li_buf == bp);
> > +		if (!iip->ili_fields) {
> > +			iip->ili_item.li_buf = NULL;
> > +			drop_buffer = true;
> > +		}
> >  		iip->ili_last_fields = 0;
> > +		iip->ili_flush_lsn = 0;
> 
> This also seems related to the behavior noted in the comment above.
> Presumably we have to clear the flush lsn if clean inodes remain
> attached to the buffer.. (but does that actually happen yet)?

I think I added that here when debugging an issue as a mechanism
to check that the flush_lsn was only set while a flush was in
progress. It's all hazy now because I had to rebase the parts of the
patchset that change this section of code soooo many times that I
kinda lost track of the where and why of the little details...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 03/30] xfs: add an inode item lock
  2020-06-02 16:34   ` Brian Foster
@ 2020-06-04  1:54     ` Dave Chinner
  2020-06-04 14:03       ` Brian Foster
  0 siblings, 1 reply; 80+ messages in thread
From: Dave Chinner @ 2020-06-04  1:54 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Tue, Jun 02, 2020 at 12:34:44PM -0400, Brian Foster wrote:
> On Tue, Jun 02, 2020 at 07:42:24AM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> ...
> > @@ -122,23 +117,30 @@ xfs_trans_log_inode(
> >  	 * set however, then go ahead and bump the i_version counter
> >  	 * unconditionally.
> >  	 */
> > -	if (!test_and_set_bit(XFS_LI_DIRTY, &ip->i_itemp->ili_item.li_flags) &&
> > -	    IS_I_VERSION(VFS_I(ip))) {
> > -		if (inode_maybe_inc_iversion(VFS_I(ip), flags & XFS_ILOG_CORE))
> > -			flags |= XFS_ILOG_CORE;
> > +	if (!test_and_set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags)) {
> > +		if (IS_I_VERSION(inode) &&
> > +		    inode_maybe_inc_iversion(inode, flags & XFS_ILOG_CORE))
> > +			iversion_flags = XFS_ILOG_CORE;
> >  	}
> >  
> > -	tp->t_flags |= XFS_TRANS_DIRTY;
> > +	/*
> > +	 * Record the specific change for fdatasync optimisation. This allows
> > +	 * fdatasync to skip log forces for inodes that are only timestamp
> > +	 * dirty. We do this before the change count so that the core being
> > +	 * logged in this case does not impact on fdatasync behaviour.
> > +	 */
> 
> We no longer do this before the change count logic so that part of the
> comment is bogus.

Ugh. Another 6 patch conflicts to resolve coming right up....

> > +	spin_lock(&iip->ili_lock);
> > +	iip->ili_fsync_fields |= flags;
> >  
> >  	/*
> > -	 * Always OR in the bits from the ili_last_fields field.
> > -	 * This is to coordinate with the xfs_iflush() and xfs_iflush_done()
> > -	 * routines in the eventual clearing of the ili_fields bits.
> > -	 * See the big comment in xfs_iflush() for an explanation of
> > -	 * this coordination mechanism.
> > +	 * Always OR in the bits from the ili_last_fields field.  This is to
> > +	 * coordinate with the xfs_iflush() and xfs_iflush_done() routines in
> > +	 * the eventual clearing of the ili_fields bits.  See the big comment in
> > +	 * xfs_iflush() for an explanation of this coordination mechanism.
> >  	 */
> > -	flags |= ip->i_itemp->ili_last_fields;
> > -	ip->i_itemp->ili_fields |= flags;
> > +	iip->ili_fields |= (flags | iip->ili_last_fields |
> > +			    iversion_flags);
> > +	spin_unlock(&iip->ili_lock);
> >  }
> >  
> >  int
> > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > index 403c90309a8ff..0abf770b77498 100644
> > --- a/fs/xfs/xfs_file.c
> > +++ b/fs/xfs/xfs_file.c
> > @@ -94,6 +94,7 @@ xfs_file_fsync(
> >  {
> >  	struct inode		*inode = file->f_mapping->host;
> >  	struct xfs_inode	*ip = XFS_I(inode);
> > +	struct xfs_inode_log_item *iip = ip->i_itemp;
> >  	struct xfs_mount	*mp = ip->i_mount;
> >  	int			error = 0;
> >  	int			log_flushed = 0;
> > @@ -137,13 +138,15 @@ xfs_file_fsync(
> >  	xfs_ilock(ip, XFS_ILOCK_SHARED);
> >  	if (xfs_ipincount(ip)) {
> >  		if (!datasync ||
> > -		    (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
> > -			lsn = ip->i_itemp->ili_last_lsn;
> > +		    (iip->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
> > +			lsn = iip->ili_last_lsn;
> 
> I am still a little confused why the lock is elided in other read cases,
> such as this one or perhaps the similar check in xfs_bmbt_to_iomap()..?

They are all still serialised against those fields changing the same
way they currently are. i.e. they are all under the ILOCK, so races
with changes made during IO submission cannot occur.  Hence the only
thing that we can race with is IO completion clearing the fields, in
which case the subsequent operations turn into no-ops if the item is
now clean.

i.e:
- ILOCK serialises transaction logging vs IO submission.
- iflock serialises IO submission vs IO completion.
- Nothing serialises transaction logging vs IO completion.

The latter is what the ili_lock is intended for; everything else is
still protected by the existing serialisation mechanisms that they
are now. Any races in areas outside xfs_trans_log_inode() vs
xfs_iflush_done/abort() is largely outside the scope of this patch
and this lock...
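
In sketch form, the two contexts the new lock serialises (field names
from this patch, details elided):

	/* transaction logging - ILOCK held */
	spin_lock(&iip->ili_lock);
	iip->ili_fields |= flags;	/* dirty the item */
	spin_unlock(&iip->ili_lock);

	/* IO completion/abort - no ILOCK held, hence the new lock */
	spin_lock(&iip->ili_lock);
	iip->ili_last_fields = 0;	/* clear the flush state */
	spin_unlock(&iip->ili_lock);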

> Similarly, it looks like we set the ili_[flush|last]_lsn fields outside
> of this lock (though last_lsn looks like it's also covered by ilock),
> yet the update to the inode_log_item struct implies they should be
> protected. What's the intent there?

The lsn fields are updated via xfs_trans_ail_lsn_copy(), which on 32
bit systems takes the AIL lock, and I don't think it's a good idea
to put the AIL lock inside the inode item lock.
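
The 32 bit variant looks roughly like this (paraphrasing
fs/xfs/xfs_trans_priv.h from memory, so take the details with a grain
of salt):

	#if BITS_PER_LONG != 64
	static inline void
	xfs_trans_ail_copy_lsn(
		struct xfs_ail	*ailp,
		xfs_lsn_t	*dst,
		xfs_lsn_t	*src)
	{
		/* a 64 bit LSN can't be copied atomically on 32 bit */
		spin_lock(&ailp->ail_lock);
		*dst = *src;
		spin_unlock(&ailp->ail_lock);
	}
	#endif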

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 16/30] xfs: pin inode backing buffer to the inode log item
  2020-06-03 22:15     ` Dave Chinner
@ 2020-06-04 14:03       ` Brian Foster
  0 siblings, 0 replies; 80+ messages in thread
From: Brian Foster @ 2020-06-04 14:03 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Jun 04, 2020 at 08:15:31AM +1000, Dave Chinner wrote:
> On Wed, Jun 03, 2020 at 02:58:12PM -0400, Brian Foster wrote:
> > On Tue, Jun 02, 2020 at 07:42:37AM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > When we dirty an inode, we are going to have to write it to disk at
> > > some point in the near future. This requires the inode cluster
> > > backing buffer to be present in memory. Unfortunately, under severe
> > > memory pressure we can reclaim the inode backing buffer while the
> > > inode is dirty in memory, resulting in stalling the AIL pushing
> > > because it has to do a read-modify-write cycle on the cluster
> > > buffer.
> > > 
> > > When we have no memory available, the read of the cluster buffer
> > > blocks the AIL pushing process, and this causes all sorts of issues
> > > for memory reclaim as it requires inode writeback to make forwards
> > > progress. Allocating a cluster buffer causes more memory pressure,
> > > and results in more cluster buffers to be reclaimed, resulting in
> > > more RMW cycles to be done in the AIL context and everything then
> > > backs up on AIL progress. Only the synchronous inode cluster
> > > writeback in the inode reclaim code provides some level of
> > > forwards progress guarantees that prevent OOM-killer rampages in
> > > this situation.
> > > 
> > > Fix this by pinning the inode backing buffer to the inode log item
> > > when the inode is first dirtied (i.e. in xfs_trans_log_inode()).
> > > This may mean the first modification of an inode that has been held
> > > in cache for a long time may block on a cluster buffer read, but
> > > we can do that in transaction context and block safely until the
> > > buffer has been allocated and read.
> > > 
> > > Once we have the cluster buffer, the inode log item takes a
> > > reference to it, pinning it in memory, and attaches it to the log
> > > item for future reference. This means we can always grab the cluster
> > > buffer from the inode log item when we need it.
> > > 
> > > When the inode is finally cleaned and removed from the AIL, we can
> > > drop the reference the inode log item holds on the cluster buffer.
> > > Once all inodes on the cluster buffer are clean, the cluster buffer
> > > will be unpinned and it will be available for memory reclaim to
> > > reclaim again.
> > > 
> > > This avoids the issues with needing to do RMW cycles in the AIL
> > > pushing context, and hence allows complete non-blocking inode
> > > flushing to be performed by the AIL pushing context.
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >  fs/xfs/libxfs/xfs_inode_buf.c   |  3 +-
> > >  fs/xfs/libxfs/xfs_trans_inode.c | 53 +++++++++++++++++++++---
> > >  fs/xfs/xfs_buf_item.c           |  4 +-
> > >  fs/xfs/xfs_inode_item.c         | 73 +++++++++++++++++++++++++++------
> > >  fs/xfs/xfs_trans_ail.c          |  8 +++-
> > >  5 files changed, 117 insertions(+), 24 deletions(-)
> > > 
> > ...
> > > diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
> > > index fe6c2e39be85d..1e7147b90725e 100644
> > > --- a/fs/xfs/libxfs/xfs_trans_inode.c
> > > +++ b/fs/xfs/libxfs/xfs_trans_inode.c
> > ...
> > > @@ -132,6 +140,39 @@ xfs_trans_log_inode(
> > >  	spin_lock(&iip->ili_lock);
> > >  	iip->ili_fsync_fields |= flags;
> > >  
> > > +	if (!iip->ili_item.li_buf) {
> > > +		struct xfs_buf	*bp;
> > > +		int		error;
> > > +
> > > +		/*
> > > +		 * We hold the ILOCK here, so this inode is not going to be
> > > +		 * flushed while we are here. Further, because there is no
> > > +		 * buffer attached to the item, we know that there is no IO in
> > > +		 * progress, so nothing will clear the ili_fields while we read
> > > +		 * in the buffer. Hence we can safely drop the spin lock and
> > > +		 * read the buffer knowing that the state will not change from
> > > +		 * here.
> > > +		 */
> > > +		spin_unlock(&iip->ili_lock);
> > > +		error = xfs_imap_to_bp(ip->i_mount, tp, &ip->i_imap, NULL,
> > > +					&bp, 0);
> > > +		if (error) {
> > > +			xfs_force_shutdown(ip->i_mount, SHUTDOWN_META_IO_ERROR);
> > > +			return;
> > > +		}
> > 
> > It's slightly unfortunate to shutdown on a read error, but I'd guess
> > many of these cases would have a dirty transaction already. Perhaps
> > something worth cleaning up later..?
> 
> All of these cases will have a dirty transaction - the inode has been
> modified in memory before it was logged and hence we cannot undo
> what has already been done. If we return an error here, we will need
> to cancel the transaction, and xfs_trans_cancel() will do a shutdown
> anyway. Doing it here just means we don't have to return an error
> and add error handling to the ~80 callers of xfs_trans_log_inode()
> just to trigger a shutdown correctly.
> 

Yes, I'm not suggesting doing that. That doesn't change the fact that
historically such a read error didn't translate to a permanent
filesystem error because it wasn't tied to a dirty transaction.
Technically we could get around that by acquiring the cluster buffer
earlier in the transaction, but I'm not suggesting we do that either for
similar reasons around the amount of churn involved in all of the
codepaths that log an inode.

Granted, it could still be the case that the impact of this is minimal
because in the event of persistent I/O errors, we would have had to read
the buffer to get the in-core inode at some point and likely would have
failed more gracefully earlier anyways. Either way, I'm just positing
that it's worth thinking about if we happen to come up with something
more simple/clever down the line to avoid the shutdown vector.

> > > @@ -450,6 +453,12 @@ xfs_inode_item_pin(
> > >   * item which was previously pinned with a call to xfs_inode_item_pin().
> > >   *
> > >   * Also wake up anyone in xfs_iunpin_wait() if the count goes to 0.
> > > + *
> > > + * Note that unpin can race with inode cluster buffer freeing marking the buffer
> > > + * stale. In that case, flush completions are run from the buffer unpin call,
> > > + * which may happen before the inode is unpinned. If we lose the race, there
> > > + * will be no buffer attached to the log item, but the inode will be marked
> > > + * XFS_ISTALE.
> > >   */
> > >  STATIC void
> > >  xfs_inode_item_unpin(
> > > @@ -459,6 +468,7 @@ xfs_inode_item_unpin(
> > >  	struct xfs_inode	*ip = INODE_ITEM(lip)->ili_inode;
> > >  
> > >  	trace_xfs_inode_unpin(ip, _RET_IP_);
> > > +	ASSERT(lip->li_buf || xfs_iflags_test(ip, XFS_ISTALE));
> > >  	ASSERT(atomic_read(&ip->i_pincount) > 0);
> > >  	if (atomic_dec_and_test(&ip->i_pincount))
> > >  		wake_up_bit(&ip->i_flags, __XFS_IPINNED_BIT);
> > 
> > So I was wondering what happens to the attached buffer hold if shutdown
> > occurs after the inode is logged (i.e. transaction aborts or log write
> > fails).
> 
> Hmmm. Good question. 
> 
> There may be no other inodes on the buffer, and this inode may not
> be in the AIL, so there's no trigger for xfs_iflush_abort() to be
> run from buffer IO completion. So, yes, we could leave an inode
> attached to the buffer here....
> 
> 
> > I see there's an assert for the buffer being cleaned up before
> > the ili is freed, so presumably that case is handled. It looks like we
> > unconditionally abort a flush on inode reclaim if the fs is shutdown,
> > regardless of whether the inode is dirty and we drop the buffer from
> > there..?
> 
> Yes, that's where this shutdown race condition has always been
> handled. i.e. we know that inodes that are dirty in memory can be
> left dangling by the shutdown, and if it's a log IO error they may
> even still be pinned. Hence reclaim has to ensure that they are
> properly aborted before reclaim otherwise various "reclaiming dirty
> inode" asserts will fire.
> 

Makes sense.

> As it is, in the next patchset the cluster buffer is always inserted
> into the AIL as an ordered buffer so it is always committed in the
> same transaction as the inode. Hence the abort/unpin call on the
> buffer runs the inode IO done processing, the inode will get removed
> from the list, and we aren't directly reliant on inode reclaim
> running a flush abort to do that for us.
> 

That sounds more preferable. While I see why we need the current code
and ultimately it looks correct, it seems a bit of indirect "clean up
the broken mess" logic due to lack of handling in more direct codepaths.

> > > @@ -629,10 +639,15 @@ xfs_inode_item_init(
> > >   */
> > >  void
> > >  xfs_inode_item_destroy(
> > > -	xfs_inode_t	*ip)
> > > +	struct xfs_inode	*ip)
> > >  {
> > > -	kmem_free(ip->i_itemp->ili_item.li_lv_shadow);
> > > -	kmem_cache_free(xfs_ili_zone, ip->i_itemp);
> > > +	struct xfs_inode_log_item *iip = ip->i_itemp;
> > > +
> > > +	ASSERT(iip->ili_item.li_buf == NULL);
> > > +
> > > +	ip->i_itemp = NULL;
> > > +	kmem_free(iip->ili_item.li_lv_shadow);
> > > +	kmem_cache_free(xfs_ili_zone, iip);
> > >  }
> > >  
> > >  
> > > @@ -647,6 +662,13 @@ xfs_inode_item_destroy(
> > >   * list for other inodes that will run this function. We remove them from the
> > >   * buffer list so we can process all the inode IO completions in one AIL lock
> > >   * traversal.
> > > + *
> > > + * Note: Now that we attach the log item to the buffer when we first log the
> > > + * inode in memory, we can have unflushed inodes on the buffer list here. These
> > > + * inodes will have a zero ili_last_fields, so skip over them here. We do
> > > + * this check -after- we've checked for stale inodes, because we're guaranteed
> > > + * to have XFS_ISTALE set in the case that dirty inodes are in the CIL and have
> > > + * not yet had their dirtying transactions committed to disk.
> > >   */
> > >  void
> > >  xfs_iflush_done(
> > > @@ -670,14 +692,16 @@ xfs_iflush_done(
> > >  			continue;
> > >  		}
> > >  
> > > +		if (!iip->ili_last_fields)
> > > +			continue;
> > > +
> > 
> > Hmm.. reading the comment above, do we actually attach the log item to
> > the buffer any earlier? ISTM we attach the buffer to the log item via a
> > hold, but that's different from getting the ili on ->b_li_list such that
> > it's available here. Hm?
> 
> I think I've probably just put the comment and this check in the
> wrong patch when I split this all up. A later patch in the series
> moves the inode attachment to the buffer to the
> xfs_trans_log_inode() call, and that's when this situation arises
> and the check is needed.
> 

Ok, that makes much more sense.

> 
> > > @@ -706,14 +730,29 @@ xfs_iflush_done(
> > >  	 * them is safely on disk.
> > >  	 */
> > >  	list_for_each_entry_safe(lip, n, &tmp, li_bio_list) {
> > > +		bool	drop_buffer = false;
> > > +
> > >  		list_del_init(&lip->li_bio_list);
> > >  		iip = INODE_ITEM(lip);
> > >  
> > >  		spin_lock(&iip->ili_lock);
> > > +
> > > +		/*
> > > +		 * Remove the reference to the cluster buffer if the inode is
> > > +		 * clean in memory. Drop the buffer reference once we've dropped
> > > +		 * the locks we hold.
> > > +		 */
> > > +		ASSERT(iip->ili_item.li_buf == bp);
> > > +		if (!iip->ili_fields) {
> > > +			iip->ili_item.li_buf = NULL;
> > > +			drop_buffer = true;
> > > +		}
> > >  		iip->ili_last_fields = 0;
> > > +		iip->ili_flush_lsn = 0;
> > 
> > This also seems related to the behavior noted in the comment above.
> > Presumably we have to clear the flush lsn if clean inodes remain
> > attached to the buffer.. (but does that actually happen yet)?
> 
> I think I added that here when debugging an issue as a mechanism
> to check that the flush_lsn was only set while a flush was in
> progress. It's all hazy now because I had to rebase the parts of the
> patchset that change this section of code soooo many times that I
> kinda lost track of the where and why of the little details...
> 

Heh. I guess it's not clear to me if this is functional or defensive
logic, but I am obviously still working my way through the series..

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 



* Re: [PATCH 03/30] xfs: add an inode item lock
  2020-06-04  1:54     ` Dave Chinner
@ 2020-06-04 14:03       ` Brian Foster
  0 siblings, 0 replies; 80+ messages in thread
From: Brian Foster @ 2020-06-04 14:03 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Jun 04, 2020 at 11:54:56AM +1000, Dave Chinner wrote:
> On Tue, Jun 02, 2020 at 12:34:44PM -0400, Brian Foster wrote:
> > On Tue, Jun 02, 2020 at 07:42:24AM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > ...
> > > @@ -122,23 +117,30 @@ xfs_trans_log_inode(
> > >  	 * set however, then go ahead and bump the i_version counter
> > >  	 * unconditionally.
> > >  	 */
> > > -	if (!test_and_set_bit(XFS_LI_DIRTY, &ip->i_itemp->ili_item.li_flags) &&
> > > -	    IS_I_VERSION(VFS_I(ip))) {
> > > -		if (inode_maybe_inc_iversion(VFS_I(ip), flags & XFS_ILOG_CORE))
> > > -			flags |= XFS_ILOG_CORE;
> > > +	if (!test_and_set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags)) {
> > > +		if (IS_I_VERSION(inode) &&
> > > +		    inode_maybe_inc_iversion(inode, flags & XFS_ILOG_CORE))
> > > +			iversion_flags = XFS_ILOG_CORE;
> > >  	}
> > >  
> > > -	tp->t_flags |= XFS_TRANS_DIRTY;
> > > +	/*
> > > +	 * Record the specific change for fdatasync optimisation. This allows
> > > +	 * fdatasync to skip log forces for inodes that are only timestamp
> > > +	 * dirty. We do this before the change count so that the core being
> > > +	 * logged in this case does not impact on fdatasync behaviour.
> > > +	 */
> > 
> > We no longer do this before the change count logic so that part of the
> > comment is bogus.
> 
> Ugh. Another 6 patch conflicts to resolve coming right up....
> 
> > > +	spin_lock(&iip->ili_lock);
> > > +	iip->ili_fsync_fields |= flags;
> > >  
> > >  	/*
> > > -	 * Always OR in the bits from the ili_last_fields field.
> > > -	 * This is to coordinate with the xfs_iflush() and xfs_iflush_done()
> > > -	 * routines in the eventual clearing of the ili_fields bits.
> > > -	 * See the big comment in xfs_iflush() for an explanation of
> > > -	 * this coordination mechanism.
> > > +	 * Always OR in the bits from the ili_last_fields field.  This is to
> > > +	 * coordinate with the xfs_iflush() and xfs_iflush_done() routines in
> > > +	 * the eventual clearing of the ili_fields bits.  See the big comment in
> > > +	 * xfs_iflush() for an explanation of this coordination mechanism.
> > >  	 */
> > > -	flags |= ip->i_itemp->ili_last_fields;
> > > -	ip->i_itemp->ili_fields |= flags;
> > > +	iip->ili_fields |= (flags | iip->ili_last_fields |
> > > +			    iversion_flags);
> > > +	spin_unlock(&iip->ili_lock);
> > >  }
> > >  
> > >  int
> > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > > index 403c90309a8ff..0abf770b77498 100644
> > > --- a/fs/xfs/xfs_file.c
> > > +++ b/fs/xfs/xfs_file.c
> > > @@ -94,6 +94,7 @@ xfs_file_fsync(
> > >  {
> > >  	struct inode		*inode = file->f_mapping->host;
> > >  	struct xfs_inode	*ip = XFS_I(inode);
> > > +	struct xfs_inode_log_item *iip = ip->i_itemp;
> > >  	struct xfs_mount	*mp = ip->i_mount;
> > >  	int			error = 0;
> > >  	int			log_flushed = 0;
> > > @@ -137,13 +138,15 @@ xfs_file_fsync(
> > >  	xfs_ilock(ip, XFS_ILOCK_SHARED);
> > >  	if (xfs_ipincount(ip)) {
> > >  		if (!datasync ||
> > > -		    (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
> > > -			lsn = ip->i_itemp->ili_last_lsn;
> > > +		    (iip->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
> > > +			lsn = iip->ili_last_lsn;
> > 
> > I am still a little confused why the lock is elided in other read cases,
> > such as this one or perhaps the similar check in xfs_bmbt_to_iomap()..?
> 
> They are all still serialised against those fields changing the same
> way they currently are. i.e. they are all under the ILOCK, so races
> with changes made during IO submission cannot occur.  Hence the only
> thing that we can race with is IO completion clearing the fields, in
> which case the subsequent operations turn into no-ops if the item is
> now clean.
> 
> i.e:
> - ILOCK serialises transaction logging vs IO submission.
> - iflock serialises IO submission vs IO completion.
> - Nothing serialises transaction logging vs IO completion.
> 
> The latter is what the ili_lock is intended for; everything else is
> still protected by the existing serialisation mechanisms that they
> are now. Any races in areas outside xfs_trans_log_inode() vs
> xfs_iflush_done/abort() is largely outside the scope of this patch
> and this lock...
> 

Ok, but in this particular case we use the ili_lock around the
ili_fsync_fields reset (but not the read in the same function), and that
field is cleared when the inode is flushed. Is the lock used here for
the abort case?

I think I'll probably have to get through the rest of the series, see
how the lock is used with the logging changes in place, and then come
back and see if I can grok this aspect of it a little better..

> > Similarly, it looks like we set the ili_[flush|last]_lsn fields outside
> > of this lock (though last_lsn looks like it's also covered by ilock),
> > yet the update to the inode_log_item struct implies they should be
> > protected. What's the intent there?
> 
> The lsn fields are updated via xfs_trans_ail_lsn_copy(), which on 32
> bit systems takes the AIL lock, and I don't think it's a good idea
> to put the AIL lock inside the inode item lock.
> 

Ok.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 



* Re: [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous
  2020-06-30 16:52   ` Darrick J. Wong
@ 2020-06-30 21:51     ` Dave Chinner
  0 siblings, 0 replies; 80+ messages in thread
From: Dave Chinner @ 2020-06-30 21:51 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Jun 30, 2020 at 09:52:03AM -0700, Darrick J. Wong wrote:
> On Mon, Jun 29, 2020 at 04:01:30PM -0700, Darrick J. Wong wrote:
> > Both of these failure cases have been difficult to reproduce, which is
> > to say that I can't get them to repro reliably.  Turning PREEMPT on
> > seems to make it reproduce faster, which makes me wonder if something in
> > this patchset is screwing up concurrency handling or something?  KASAN
> > and kmemleak have nothing to say.  I've also noticed that the less
> > heavily loaded the underlying VM host's storage system, the less likely
> > it is to happen, though that could be a coincidence.
> > 
> > Anyway, if I figure something out I'll holler, but I thought it was past
> > time to braindump on the mailing list.
> 
> Last night, Dave and I did some live debugging of a failed VM test
> system, and discovered that the xfs_reclaim_inodes() call does not
> actually reclaim all the IRECLAIMABLE inodes.  Because we fail to call
> xfs_reclaim_inode() on all the inodes, there are still inodes in the
> incore inode xarray, and they still have dquots attached.
> 
> This would explain the symptoms I've seen -- since we didn't reclaim the
> inodes, we didn't dqdetach them either, and so the dqpurge_all will spin
> forever on the still-referenced dquots.  This also explains the slub
> complaints about active xfs_inode/xfs_inode_log_item objects if I turn
> off quotas, since we didn't clean those up either.
> 
> Further analysis (aka adding tracepoints) shows xfs_reclaim_inode_grab
> deciding to skip some inodes because IFLOCK is set.  Adding code to
> cycle the i_flags_lock ahead of the unlocked IFLOCK test didn't make the
> symptoms go away, so I instrumented the inode flush "lock" functions to
> see what was going on (full version available here [1]):

[...]

> Bingo!  The xfs_ail_push_all_sync in xfs_unmountfs takes a bunch of
> inode iflocks, starts the inode cluster buffer write, and since the AIL
> is now empty, returns.  The unmount process moves on to calling
> xfs_reclaim_inodes, which as you can see in the last four lines:
> 
>           umount-10409 [001]    44.118882: xfs_reclaim_inode_grab: dev 259:0 ino 0x8a
> 
> This ^^^ is logged at the start of xfs_reclaim_inode_grab.
> 
>           umount-10409 [001]    44.118883: xfs_reclaim_inode_grab_iflock: dev 259:0 ino 0x8a
> 
> This is logged when x_r_i_g observes that the IFLOCK is set and bails out.
> 
>      kworker/2:1-50    [002]    44.118883: xfs_ifunlock:         dev 259:0 ino 0x8a
> 
> And finally this is the inode cluster buffer IO completion calling
> xfs_buf_inode_iodone -> xfs_iflush_done from a workqueue.
> 
> So it seems to me that inode reclaim races with the AIL for the IFLOCK,
> and when unmount inode reclaim loses, it does the wrong thing.

Yeah, that's what I suspected when I finished up yesterday, but I
couldn't quite connect how the AIL wasn't waiting for the inode
completion.

The moment I looked at it again this morning, I realised that it was
simply that xfs_ail_push_all_sync() is woken when the AIL is
emptied, and that happens about 20 lines of code before the flush
lock is dropped. If the wakeup of the sleeping task is fast
enough, it can be running before the IO completion path finishes
and drops the flush lock.

And with a PREEMPT kernel, we might do preempts on wakeup (that was
the path to the scheduler bug we kept hitting), hence increasing the
chance that the unmount task will run before the IO completion
finishes and drops the inode flush lock.
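
i.e. the interleaving looks something like this (reconstructed from
the trace, so approximate):

	IO completion				unmount
	-------------				-------
	xfs_iflush_done()
	  AIL delete empties the AIL,
	  wakes the sync push waiter
						xfs_ail_push_all_sync() returns
						xfs_reclaim_inodes()
						  xfs_reclaim_inode_grab()
						    sees IFLOCK set, skips inode
	xfs_ifunlock()	<- too late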

> Questions: Do we need to teach xfs_reclaim_inodes_ag to increment
> @skipped if xfs_reclaim_inode_grab rejects an inode?  xfs_reclaim_inodes
> is the only consumer of the @skipped value, and elevated skipped will
> cause it to rerun the scan, so I think this will work.

No, we just need to get rid of the racy check in
xfs_reclaim_inode_grab(). I'm going to get rid of the whole skipped
thing, too.

> Or, do we need to wait for the ail items to complete after xfsaild does
> its xfs_buf_delwri_submit_nowait thing?

We've already waited for the -AIL items- to complete, and that's
really all we should be doing at the xfs_ail_push_all_sync layer.

The issue is that xfs_ail_push_all_sync() doesn't quite wait for IO
to complete so we've been conflating these two different operations
for a long time (essentially since we moved to logging everything
and tracking all dirty metadata in the AIL). In general, they mean
the same thing, but in this specific corner case the subtle
distinction actually matters.

It's easy enough to avoid - just get rid of what, independent of
this bug, this patchset turns into a questionable optimisation in
xfs_reclaim_inode_grab(). i.e. we no longer block reclaim on locks,
so optimisations to avoid blocking on locks are no longer needed....
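
Something like this against the racy check in
xfs_reclaim_inode_grab() (a sketch from memory - the exact context
may differ):

-	/*
-	 * Do unlocked checks to see if the inode already is being flushed
-	 * or in reclaim to avoid lock traffic.
-	 */
-	if ((flags & SYNC_TRYLOCK) &&
-	    __xfs_iflags_test(ip, XFS_IFLOCK | XFS_IRECLAIM))
-		return false;
+	if (__xfs_iflags_test(ip, XFS_IRECLAIM))
+		return false;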

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous
  2020-06-29 23:01 ` Darrick J. Wong
@ 2020-06-30 16:52   ` Darrick J. Wong
  2020-06-30 21:51     ` Dave Chinner
  0 siblings, 1 reply; 80+ messages in thread
From: Darrick J. Wong @ 2020-06-30 16:52 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Mon, Jun 29, 2020 at 04:01:30PM -0700, Darrick J. Wong wrote:
> On Mon, Jun 22, 2020 at 06:15:35PM +1000, Dave Chinner wrote:
> > Hi folks,
> > 
> > Inode flushing requires that we first lock an inode, then check it,
> > then lock the underlying buffer, flush the inode to the buffer and
> > finally add the inode to the buffer to be unlocked on IO completion.
> > We then walk all the other cached inodes in the buffer range and
> > optimistically lock and flush them to the buffer without blocking.
> 
> Well, I've been banging my head against this patchset for the past
> couple of weeks now, and I still can't get it to finish fstests
> reliably.
> 
> Last week, Dave and I were stymied by a bug in the scheduler that was
> fixed in -rc3, but even with that applied I still see weird failures.  I
> /think/ there are only two now:
> 
> 1) If I run xfs/305 (with all three quotas enabled) in a tight loop (and
> rmmod xfs after each run), after 20-30 minutes I will start to see the
> slub cache start complaining about leftovers in the xfs_ili (inode log
> item) and xfs_inode caches.
> 
> Unfortunately, due to the kernel's new security posture of never
> allowing kernel pointer values to be logged, the slub complaints are
> mostly useless because it no longer prints anything that would enable me
> to figure out /which/ inodes are being left behind:
> 
>  =============================================================================
>  BUG xfs_ili (Tainted: G    B            ): Objects remaining in xfs_ili on __kmem_cache_shutdown()
>  -----------------------------------------------------------------------------
>  
>  INFO: Slab 0x000000007e8837cf objects=31 used=9 fp=0x000000007017e948 flags=0x000000000010200
>  CPU: 1 PID: 80614 Comm: rmmod Tainted: G    B             5.8.0-rc3-djw #rc3
>  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1 04/01/2014
>  Call Trace:
>   dump_stack+0x78/0xa0
>   slab_err+0xb7/0xdc
>   ? trace_hardirqs_on+0x1c/0xf0
>   __kmem_cache_shutdown.cold+0x3a/0x163
>   ? __mutex_unlock_slowpath+0x45/0x2a0
>   kmem_cache_destroy+0x55/0x110
>   xfs_destroy_zones+0x6a/0xe2 [xfs]
>   exit_xfs_fs+0x5f/0xb7b [xfs]
>   __x64_sys_delete_module+0x120/0x210
>   ? __prepare_exit_to_usermode+0xe4/0x170
>   do_syscall_64+0x56/0xa0
>   entry_SYSCALL_64_after_hwframe+0x44/0xa9
>  RIP: 0033:0x7ff204672a3b
>  Code: Bad RIP value.
>  RSP: 002b:00007ffe60155378 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
>  RAX: ffffffffffffffda RBX: 0000558f2bfa2780 RCX: 00007ff204672a3b
>  RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000558f2bfa27e8
>  RBP: 00007ffe601553d8 R08: 0000000000000000 R09: 0000000000000000
>  R10: 00007ff2046eeac0 R11: 0000000000000206 R12: 00007ffe601555b0
>  R13: 00007ffe6015703d R14: 0000558f2bfa12a0 R15: 0000558f2bfa2780
>  INFO: Object 0x00000000a92e3c34 @offset=0
>  INFO: Object 0x00000000650eb3bf @offset=792
>  INFO: Object 0x00000000eabfef0f @offset=1320
>  INFO: Object 0x00000000cdaae406 @offset=4224
>  INFO: Object 0x000000007d9bbde1 @offset=4488
>  INFO: Object 0x00000000e35f4716 @offset=5016
>  INFO: Object 0x0000000008e636d2 @offset=5280
>  INFO: Object 0x00000000170762ee @offset=5808
>  INFO: Object 0x0000000046425f04 @offset=7920
> 
> Note all the 64-bit values that have the 32 upper bits set to 0; this
> is the pointer hashing safety algorithm at work.  I've patched around
> that bit of training-wheels drain bamage, but now I get to wait until it
> happens again.
> 
> 2) If /that/ doesn't happen, a regular fstests run (again with all three
> quotas enabled) will (usually very quickly) wedge in unmount:
> 
> [<0>] xfs_qm_dquot_walk+0x19c/0x2b0 [xfs]
> [<0>] xfs_qm_dqpurge_all+0x31/0x70 [xfs]
> [<0>] xfs_qm_unmount+0x1d/0x30 [xfs]
> [<0>] xfs_unmountfs+0xa0/0x1a0 [xfs]
> [<0>] xfs_fs_put_super+0x35/0x80 [xfs]
> [<0>] generic_shutdown_super+0x67/0x100
> [<0>] kill_block_super+0x21/0x50
> [<0>] deactivate_locked_super+0x31/0x70
> [<0>] cleanup_mnt+0x100/0x160
> [<0>] task_work_run+0x5f/0xa0
> [<0>] __prepare_exit_to_usermode+0x13d/0x170
> [<0>] do_syscall_64+0x62/0xa0
> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> AFAICT it's usually the root dquot and dqpurge won't touch it because
> the quota nrefs > 0.  Poking around in gdb, I find that whichever
> xfs_mount is stalled does not seem to have any vfs inodes attached to
> it, so it's clear that we flushed and freed all the incore inode state,
> which means that all the dquots should be unreferenced.
> 
> Both of these failure cases have been difficult to reproduce, which is
> to say that I can't get them to repro reliably.  Turning PREEMPT on
> seems to make it reproduce faster, which makes me wonder if something in
> this patchset is screwing up concurrency handling or something?  KASAN
> and kmemleak have nothing to say.  I've also noticed that the less
> heavily loaded the underlying VM host's storage system, the less likely
> it is to happen, though that could be a coincidence.
> 
> Anyway, if I figure something out I'll holler, but I thought it was past
> time to braindump on the mailing list.

Last night, Dave and I did some live debugging of a failed VM test
system, and discovered that the xfs_reclaim_inodes() call does not
actually reclaim all the IRECLAIMABLE inodes.  Because we fail to call
xfs_reclaim_inode() on all the inodes, there are still inodes in the
incore inode xarray, and they still have dquots attached.

This would explain the symptoms I've seen -- since we didn't reclaim the
inodes, we didn't dqdetach them either, and so the dqpurge_all will spin
forever on the still-referenced dquots.  This also explains the slub
complaints about active xfs_inode/xfs_inode_log_item objects if I turn
off quotas, since we didn't clean those up either.

Further analysis (aka adding tracepoints) shows xfs_reclaim_inode_grab
deciding to skip some inodes because IFLOCK is set.  Adding code to
cycle the i_flags_lock ahead of the unlocked IFLOCK test didn't make the
symptoms go away, so I instrumented the inode flush "lock" functions to
see what was going on (full version available here [1]):

          umount-10409 [001]    44.117599: console:              [   43.980314] XFS (pmem1): Unmounting Filesystem
          umount-10409 [001]    44.118314: xfs_dquot_dqdetach:   dev 259:0 ino 0x80
<snip>
   xfsaild/pmem1-10315 [002]    44.118395: xfs_iflock_nowait:    dev 259:0 ino 0x83
   xfsaild/pmem1-10315 [002]    44.118407: xfs_iflock_nowait:    dev 259:0 ino 0x80
   xfsaild/pmem1-10315 [002]    44.118416: xfs_iflock_nowait:    dev 259:0 ino 0x84
   xfsaild/pmem1-10315 [002]    44.118421: xfs_iflock_nowait:    dev 259:0 ino 0x85
   xfsaild/pmem1-10315 [002]    44.118426: xfs_iflock_nowait:    dev 259:0 ino 0x86
   xfsaild/pmem1-10315 [002]    44.118430: xfs_iflock_nowait:    dev 259:0 ino 0x87
   xfsaild/pmem1-10315 [002]    44.118435: xfs_iflock_nowait:    dev 259:0 ino 0x88
   xfsaild/pmem1-10315 [002]    44.118440: xfs_iflock_nowait:    dev 259:0 ino 0x89
   xfsaild/pmem1-10315 [002]    44.118445: xfs_iflock_nowait:    dev 259:0 ino 0x8a
   xfsaild/pmem1-10315 [002]    44.118449: xfs_iflock_nowait:    dev 259:0 ino 0x8b
   xfsaild/pmem1-10315 [002]    44.118454: xfs_iflock_nowait:    dev 259:0 ino 0x8c
   xfsaild/pmem1-10315 [002]    44.118458: xfs_iflock_nowait:    dev 259:0 ino 0x8d
   xfsaild/pmem1-10315 [002]    44.118463: xfs_iflock_nowait:    dev 259:0 ino 0x8e
   xfsaild/pmem1-10315 [002]    44.118467: xfs_iflock_nowait:    dev 259:0 ino 0x8f
   xfsaild/pmem1-10315 [002]    44.118472: xfs_iflock_nowait:    dev 259:0 ino 0x90
   xfsaild/pmem1-10315 [002]    44.118477: xfs_iflock_nowait:    dev 259:0 ino 0x91
   xfsaild/pmem1-10315 [002]    44.118481: xfs_iflock_nowait:    dev 259:0 ino 0x92
     kworker/2:1-50    [002]    44.118858: xfs_ifunlock:         dev 259:0 ino 0x83
     kworker/2:1-50    [002]    44.118862: xfs_ifunlock:         dev 259:0 ino 0x80
     kworker/2:1-50    [002]    44.118865: xfs_ifunlock:         dev 259:0 ino 0x84
     kworker/2:1-50    [002]    44.118868: xfs_ifunlock:         dev 259:0 ino 0x85
     kworker/2:1-50    [002]    44.118871: xfs_ifunlock:         dev 259:0 ino 0x86
          umount-10409 [001]    44.118872: xfs_reclaim_ag_inodes: dev 259:0 agno 0 agbno 0 len 0
     kworker/2:1-50    [002]    44.118874: xfs_ifunlock:         dev 259:0 ino 0x87
          umount-10409 [001]    44.118874: xfs_reclaim_inode_grab: dev 259:0 ino 0x80
          umount-10409 [001]    44.118875: xfs_reclaim_inode_grab: dev 259:0 ino 0x81
          umount-10409 [001]    44.118876: xfs_reclaim_inode_grab: dev 259:0 ino 0x82
          umount-10409 [001]    44.118877: xfs_reclaim_inode_grab: dev 259:0 ino 0x83
     kworker/2:1-50    [002]    44.118877: xfs_ifunlock:         dev 259:0 ino 0x88
          umount-10409 [001]    44.118877: xfs_reclaim_inode_grab: dev 259:0 ino 0x84
          umount-10409 [001]    44.118878: xfs_reclaim_inode_grab: dev 259:0 ino 0x85
          umount-10409 [001]    44.118879: xfs_reclaim_inode_grab: dev 259:0 ino 0x86
          umount-10409 [001]    44.118879: xfs_reclaim_inode_grab: dev 259:0 ino 0x87
     kworker/2:1-50    [002]    44.118880: xfs_ifunlock:         dev 259:0 ino 0x89
          umount-10409 [001]    44.118880: xfs_reclaim_inode_grab: dev 259:0 ino 0x88
          umount-10409 [001]    44.118881: xfs_reclaim_inode_grab: dev 259:0 ino 0x89
          umount-10409 [001]    44.118882: xfs_reclaim_inode_grab: dev 259:0 ino 0x8a
          umount-10409 [001]    44.118883: xfs_reclaim_inode_grab_iflock: dev 259:0 ino 0x8a
          umount-10409 [001]    44.118883: xfs_reclaim_inode_grab: dev 259:0 ino 0x8b
     kworker/2:1-50    [002]    44.118883: xfs_ifunlock:         dev 259:0 ino 0x8a

Bingo!  The xfs_ail_push_all_sync in xfs_unmountfs takes a bunch of
inode iflocks, starts the inode cluster buffer write, and since the AIL
is now empty, returns.  The unmount process moves on to calling
xfs_reclaim_inodes, and you can see the race in the last four lines:

          umount-10409 [001]    44.118882: xfs_reclaim_inode_grab: dev 259:0 ino 0x8a

This ^^^ is logged at the start of xfs_reclaim_inode_grab.

          umount-10409 [001]    44.118883: xfs_reclaim_inode_grab_iflock: dev 259:0 ino 0x8a

This is logged when x_r_i_g observes that the IFLOCK is set and bails out.

     kworker/2:1-50    [002]    44.118883: xfs_ifunlock:         dev 259:0 ino 0x8a

And finally this is the inode cluster buffer IO completion calling
xfs_buf_inode_iodone -> xfs_iflush_done from a workqueue.

So it seems to me that inode reclaim races with the AIL for the IFLOCK,
and when unmount inode reclaim loses, it does the wrong thing.

Questions: Do we need to teach xfs_reclaim_inodes_ag to increment
@skipped if xfs_reclaim_inode_grab rejects an inode?  xfs_reclaim_inodes
is the only consumer of the @skipped value, and elevated skipped will
cause it to rerun the scan, so I think this will work.
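
Something like this in the batch loop of xfs_reclaim_inodes_ag -- an
untested sketch, modulo whatever calling convention the series ends up
with for xfs_reclaim_inode_grab:

	for (i = 0; i < nr_found; i++) {
		struct xfs_inode *ip = batch[i];

		if (done || xfs_reclaim_inode_grab(ip)) {
			batch[i] = NULL;
			skipped++;	/* new: rejected inodes force a rescan */
			continue;
		}
		/* ...grab succeeded, reclaim batch[i] as before... */
	}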

Or, do we need to wait for the ail items to complete after xfsaild does
its xfs_buf_delwri_submit_nowait thing?

--D

[1] https://djwong.org/docs/tmp/barf.txt.gz


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous
  2020-06-22  8:15 Dave Chinner
@ 2020-06-29 23:01 ` Darrick J. Wong
  2020-06-30 16:52   ` Darrick J. Wong
  0 siblings, 1 reply; 80+ messages in thread
From: Darrick J. Wong @ 2020-06-29 23:01 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Mon, Jun 22, 2020 at 06:15:35PM +1000, Dave Chinner wrote:
> Hi folks,
> 
> Inode flushing requires that we first lock an inode, then check it,
> then lock the underlying buffer, flush the inode to the buffer and
> finally add the inode to the buffer to be unlocked on IO completion.
> We then walk all the other cached inodes in the buffer range and
> optimistically lock and flush them to the buffer without blocking.

Well, I've been banging my head against this patchset for the past
couple of weeks now, and I still can't get it to finish fstests
reliably.

Last week, Dave and I were stymied by a bug in the scheduler that was
fixed in -rc3, but even with that applied I still see weird failures.  I
/think/ there are only two now:

1) If I run xfs/305 (with all three quotas enabled) in a tight loop (and
rmmod xfs after each run), after 20-30 minutes I will start to see the
slub cache start complaining about leftovers in the xfs_ili (inode log
item) and xfs_inode caches.

Unfortunately, due to the kernel's new security posture of never
allowing kernel pointer values to be logged, the slub complaints are
mostly useless because it no longer prints anything that would enable me
to figure out /which/ inodes are being left behind:

 =============================================================================
 BUG xfs_ili (Tainted: G    B            ): Objects remaining in xfs_ili on __kmem_cache_shutdown()
 -----------------------------------------------------------------------------
 
 INFO: Slab 0x000000007e8837cf objects=31 used=9 fp=0x000000007017e948 flags=0x000000000010200
 CPU: 1 PID: 80614 Comm: rmmod Tainted: G    B             5.8.0-rc3-djw #rc3
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1 04/01/2014
 Call Trace:
  dump_stack+0x78/0xa0
  slab_err+0xb7/0xdc
  ? trace_hardirqs_on+0x1c/0xf0
  __kmem_cache_shutdown.cold+0x3a/0x163
  ? __mutex_unlock_slowpath+0x45/0x2a0
  kmem_cache_destroy+0x55/0x110
  xfs_destroy_zones+0x6a/0xe2 [xfs]
  exit_xfs_fs+0x5f/0xb7b [xfs]
  __x64_sys_delete_module+0x120/0x210
  ? __prepare_exit_to_usermode+0xe4/0x170
  do_syscall_64+0x56/0xa0
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
 RIP: 0033:0x7ff204672a3b
 Code: Bad RIP value.
 RSP: 002b:00007ffe60155378 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
 RAX: ffffffffffffffda RBX: 0000558f2bfa2780 RCX: 00007ff204672a3b
 RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000558f2bfa27e8
 RBP: 00007ffe601553d8 R08: 0000000000000000 R09: 0000000000000000
 R10: 00007ff2046eeac0 R11: 0000000000000206 R12: 00007ffe601555b0
 R13: 00007ffe6015703d R14: 0000558f2bfa12a0 R15: 0000558f2bfa2780
 INFO: Object 0x00000000a92e3c34 @offset=0
 INFO: Object 0x00000000650eb3bf @offset=792
 INFO: Object 0x00000000eabfef0f @offset=1320
 INFO: Object 0x00000000cdaae406 @offset=4224
 INFO: Object 0x000000007d9bbde1 @offset=4488
 INFO: Object 0x00000000e35f4716 @offset=5016
 INFO: Object 0x0000000008e636d2 @offset=5280
 INFO: Object 0x00000000170762ee @offset=5808
 INFO: Object 0x0000000046425f04 @offset=7920

Note all the 64-bit values that have the 32 upper bits set to 0; this
is the pointer hashing safety algorithm at work.  I've patched around
that bit of training-wheels drain bamage, but now I get to wait until it
happens again.
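
For reference, the workaround is the obvious tweak to the object dump
in mm/slub.c -- quoting from memory, so the variable names may be off:

	-	pr_err("INFO: Object 0x%p @offset=%tu\n", p, p - addr);
	+	pr_err("INFO: Object 0x%px @offset=%tu\n", p, p - addr);

(%px prints the raw pointer value instead of the hashed one.)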

2) If /that/ doesn't happen, a regular fstests run (again with all three
quotas enabled) will (usually very quickly) wedge in unmount:

[<0>] xfs_qm_dquot_walk+0x19c/0x2b0 [xfs]
[<0>] xfs_qm_dqpurge_all+0x31/0x70 [xfs]
[<0>] xfs_qm_unmount+0x1d/0x30 [xfs]
[<0>] xfs_unmountfs+0xa0/0x1a0 [xfs]
[<0>] xfs_fs_put_super+0x35/0x80 [xfs]
[<0>] generic_shutdown_super+0x67/0x100
[<0>] kill_block_super+0x21/0x50
[<0>] deactivate_locked_super+0x31/0x70
[<0>] cleanup_mnt+0x100/0x160
[<0>] task_work_run+0x5f/0xa0
[<0>] __prepare_exit_to_usermode+0x13d/0x170
[<0>] do_syscall_64+0x62/0xa0
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

AFAICT it's usually the root dquot and dqpurge won't touch it because
the quota nrefs > 0.  Poking around in gdb, I find that whichever
xfs_mount is stalled does not seem to have any vfs inodes attached to
it, so it's clear that we flushed and freed all the incore inode state,
which means that all the dquots should be unreferenced.

Both of these failure cases have been difficult to reproduce, which is
to say that I can't get them to repro reliably.  Turning PREEMPT on
seems to make it reproduce faster, which makes me wonder if something in
this patchset is screwing up concurrency handling?  KASAN
and kmemleak have nothing to say.  I've also noticed that the less
heavily loaded the underlying VM host's storage system, the less likely
it is to happen, though that could be a coincidence.

Anyway, if I figure something out I'll holler, but I thought it was past
time to braindump on the mailing list.

--D


^ permalink raw reply	[flat|nested] 80+ messages in thread

* [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous
@ 2020-06-22  8:15 Dave Chinner
  2020-06-29 23:01 ` Darrick J. Wong
  0 siblings, 1 reply; 80+ messages in thread
From: Dave Chinner @ 2020-06-22  8:15 UTC (permalink / raw)
  To: linux-xfs

Hi folks,

Inode flushing requires that we first lock an inode, then check it,
then lock the underlying buffer, flush the inode to the buffer and
finally add the inode to the buffer to be unlocked on IO completion.
We then walk all the other cached inodes in the buffer range and
optimistically lock and flush them to the buffer without blocking.

This cluster write effectively repeats the same code we do with the
initial inode, except now it has to special case that initial inode
that is already locked. Hence we have multiple copies of very
similar code, and it is a result of inode cluster flushing being
based on a specific inode rather than grabbing the buffer and
flushing all available inodes to it.

The problem with this at the moment is that we can't look up the
buffer until we have guaranteed that an inode is held exclusively
and it's not going away while we get the buffer through an imap
lookup. Hence we are kinda stuck locking an inode before we can look
up the buffer.

This is also a result of inodes being detached from the cluster
buffer except when IO is being done. This has the further problem
that the cluster buffer can be reclaimed from memory and then the
inode can be dirtied. At this point cleaning the inode requires a
read-modify-write cycle on the cluster buffer. If we then are put
under memory pressure, cleaning that dirty inode to reclaim it
requires allocating memory for the cluster buffer and this leads to
all sorts of problems.

We used synchronous inode writeback in reclaim as a throttle that
provided a forwards progress mechanism when RMW cycles were required
to clean inodes. Async writeback of inodes (e.g. via the AIL) would
immediately exhaust remaining memory reserves trying to allocate
inode cluster after inode cluster. The synchronous writeback of an
inode cluster allowed reclaim to release the inode cluster and have
it freed almost immediately which could then be used to allocate the
next inode cluster buffer. Hence the IO based throttling mechanism
largely guaranteed forwards progress in inode reclaim. By removing
the requirement for memory allocation for inode writeback at the
filesystem level, we can issue writeback asynchronously and no longer
have to worry about memory exhaustion.

Another issue is that if we have slow disks, we can build up dirty
inodes in memory that can then take hours for an operation like
unmount to flush. A RMW cycle per inode on a slow RAID6 device can
mean we only clean 50 inodes a second, and when there are hundreds
of thousands of dirty inodes that need to be cleaned this can take a
long time. Pinning the cluster buffers will greatly speed up inode
writeback on slow storage systems like this.

These limitations all stem from the same source: inode writeback is
inode centric, and they are largely solved by the same architectural
change: make inode writeback cluster buffer centric.  This series
makes that architectural change.

Firstly, we start by pinning the inode backing buffer in memory
when an inode is marked dirty (i.e. when it is logged). By tracking
the number of dirty inodes on a buffer as a counter rather than a
flag, we avoid the problem of overlapping inode dirtying and buffer
flushing racing to set/clear the dirty flag. Hence as long as there
is a dirty inode in memory, the buffer will not be able to be
reclaimed. We can safely do this inode cluster buffer lookup when we
dirty an inode as we do not hold the buffer locked - we merely take
a reference to it and then release it - and hence we don't cause any
new lock order issues.

When the inode is finally cleaned, the reference to the buffer can
be removed from the inode log item and the buffer released. This is
done from the inode completion callbacks that are attached to the
buffer when the inode is flushed.
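
Conceptually the buffer pin is just reference counting. A simplified
sketch of the idea (not the literal patch code):

	/* first time the inode is dirtied in a transaction context */
	if (!iip->ili_item.li_buf) {
		xfs_buf_hold(bp);		/* one ref per dirty inode */
		iip->ili_item.li_buf = bp;
	}

	/* inode flush completion: the inode is now clean */
	bp = iip->ili_item.li_buf;
	iip->ili_item.li_buf = NULL;
	xfs_buf_rele(bp);	/* last dirty inode gone: buffer reclaimable */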

Pinning the cluster buffer in this way immediately avoids the RMW
problem in inode writeback and reclaim contexts by moving the memory
allocation and the blocking buffer read into the transaction context
that dirties the inode.  This inverts our dirty inode throttling
mechanism - we now throttle the rate at which we can dirty inodes to
the rate at which we can allocate memory and read inode cluster
buffers into memory, rather than throttling reclaim to the rate at
which we
can clean dirty inodes.

Hence if we are under memory pressure, we'll block on memory
allocation when trying to dirty the referenced inode, rather than in
the memory reclaim path where we are trying to clean unreferenced
inodes to free memory.  Hence we no longer have to guarantee
forwards progress in inode reclaim as we aren't doing memory
allocation, and that means we can remove inode writeback from the
XFS inode shrinker completely without changing the system tolerance
for low memory operation.

Tracking the buffers via the inode log item also allows us to
completely rework the inode flushing mechanism. While the inode log
item is in the AIL, it is safe for the AIL to access any member of
the log item. Hence the AIL push mechanisms can access the buffer
attached to the inode without first having to lock the inode.

This means we can essentially lock the buffer directly and then
call xfs_iflush_cluster() without first going through xfs_iflush()
to find the buffer. Hence we can remove xfs_iflush() altogether,
because the two places that call it - the inode item push code and
inode reclaim - no longer need to flush inodes directly.

This can be further optimised by attaching the inode to the cluster
buffer when the inode is dirtied. i.e. when we add the buffer
reference to the inode log item, we also attach the inode to the
buffer for IO processing. This leads to the dirty inodes always
being attached to the buffer and hence we no longer need to add them
when we flush the inode and remove them when IO completes. Instead
the inodes are attached when the inode log item is dirtied, and
removed when the inode log item is cleaned.

With this structure in place, we no longer need to do
lookups to find the dirty inodes in the cache to attach to the
buffer in xfs_iflush_cluster() - they are already attached to the
buffer. Hence when the AIL pushes an inode, we just grab the buffer
from the log item, and then walk the buffer log item list to lock
and flush the dirty inodes attached to the buffer.
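
That is, the push side reduces to a straight list walk, along the lines
of this simplified sketch (error handling and locking subtleties
elided):

	struct xfs_log_item	*lip;

	list_for_each_entry(lip, &bp->b_li_list, li_bio_list) {
		struct xfs_inode *ip = INODE_ITEM(lip)->ili_inode;

		if (!xfs_ilock_nowait(ip, XFS_ILOCK_SHARED))
			continue;	/* busy elsewhere, skip it */
		/* ...flush the inode's in-core state into bp... */
		xfs_iunlock(ip, XFS_ILOCK_SHARED);
	}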

This greatly simplifies inode writeback, and removes another memory
allocation from the inode writeback path (the array used for the
radix tree gang lookup). And while the radix tree lookups are fast,
walking the linked list of dirty inodes is faster.

There is followup work I am doing that uses the inode cluster buffer
as a replacement in the AIL for tracking dirty inodes. This part of
the series is not ready yet as it has some intricate locking
requirements. That is an optimisation, so I've left that out because
solving the inode reclaim blocking problems is the important part of
this work.

In short, this series simplifies inode writeback and fixes the long
standing inode reclaim blocking issues without requiring any changes
to the memory reclaim infrastructure.

Note: dquots should probably be converted to cluster flushing in a
similar way, as they have many of the same issues as inode flushing.

Thoughts, comments and improvements welcome.

-Dave.

Version 4:

git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git xfs-async-inode-reclaim-4

- rebase on 5.8-rc2 + for-next
- fix buffer retry logic braino (p13)
- removed unnecessary asserts (p24)
- removed unnecessary delwri queue checks from
  xfs_inode_item_push (p24)
- rework return value from xfs_iflush_cluster to indicate -EAGAIN if
  no inodes were flushed and handle that case in the caller. (p28)
- rewrite comment about shutdown case in xfs_iflush_cluster (p28)
- always clear XFS_LI_FAILED for items requiring AIL processing
  (p29)


Version 3

git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git xfs-async-inode-reclaim-3

- rebase on 5.7 + for-next
- update comments (p3)
- update commit message (p4)
- renamed xfs_buf_ioerror_sync() (p13)
- added enum for return value from xfs_buf_iodone_error() (p13)
- moved clearing of buffer error to iodone functions (p13)
- whitespace (p13)
- rebase p14 (p13 conflicts)
- rebase p16 (p13 conflicts)
- removed a superfluous assert (p16)
- moved comment and check in xfs_iflush_done() from p16 to p25
- rebase p25 (p16 conflicts)



Version 2

git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git xfs-async-inode-reclaim-2

- describe ili_lock better (p2)
- clean up inode logging code some more (p2)
- move "early read completion" for xfs_buf_ioend() up into p3 from
  p4.
- fixed conflicts in p4 due to p3 changes.
- fixed conflicts in p5 due to p4 changes.
- s/_XBF_LOGRCVY/_XBF_LOG_RECOVERY/ (p5)
- renamed the buf log item iodone callback to xfs_buf_item_iodone and
  reused the xfs_buf_iodone() name for the catch-all buffer write
  iodone completion. (p6)
- history update for commit message (p7)
- subject update for p8
- rework loop in xfs_dquot_done() (p9)
- Fixed conflicts in p10 due to p6 changes
- got rid of entire comments around li_cb (p11)
- new patch to rework buffer io error callbacks
- new patch to unwind ->iop_error calls and remove ->iop_error
- new patch to lift xfs_clear_li_failed() out of
  xfs_ail_delete_one()
- rebased p12 on all the prior changes
- reworked LI_FAILED handling when pinning inodes to the cluster
  buffer (p12) 
- fixed comment about holding buffer references in
  xfs_trans_log_inode() (p12)
- fixed indenting of xfs_iflush_abort() (p12)
- added comments explaining "skipped" inode reclaim return value
  (p14)
- cleaned up error return stack in xfs_reclaim_inode() (p14)
- cleaned up skipped return in xfs_reclaim_inodes() (p14)
- fixed bug where skipped wasn't incremented if reclaim cursor was
  not zero. This could leave inodes between the start of the AG and
  the cursor unreclaimed (p15)
- reinstate the patch removing SYNC_WAIT from xfs_reclaim_inodes().
  Exposed "skipped" bug in p15.
- cleaned up inode reclaim comments (p18)
- split p19 into two - one to change xfs_ifree_cluster(), one
  for the buffer pinning.
- xfs_ifree_mark_inode_stale() now takes the cluster buffer and we
  get the perag from that rather than having to do a lookup in
  xfs_ifree_cluster().
- moved extra IO reference for xfs_iflush_cluster() from AIL pushing
  to initial xfs_iflush_cluster rework (p22 -> p20)
- fixed static declaration on xfs_iflush() (p22)
- fixed incorrect EIO return from xfs_iflush_cluster()
- rebase p23 because it all rejects now.
- fix INODE_ITEM() usage in p23
- removed long lines from commit message in p24
- new patch to fix logging of XFS_ISTALE inodes which pushes dirty
  inodes through reclaim.



Dave Chinner (30):
  xfs: Don't allow logging of XFS_ISTALE inodes
  xfs: remove logged flag from inode log item
  xfs: add an inode item lock
  xfs: mark inode buffers in cache
  xfs: mark dquot buffers in cache
  xfs: mark log recovery buffers for completion
  xfs: call xfs_buf_iodone directly
  xfs: clean up whacky buffer log item list reinit
  xfs: make inode IO completion buffer centric
  xfs: use direct calls for dquot IO completion
  xfs: clean up the buffer iodone callback functions
  xfs: get rid of log item callbacks
  xfs: handle buffer log item IO errors directly
  xfs: unwind log item error flagging
  xfs: move xfs_clear_li_failed out of xfs_ail_delete_one()
  xfs: pin inode backing buffer to the inode log item
  xfs: make inode reclaim almost non-blocking
  xfs: remove IO submission from xfs_reclaim_inode()
  xfs: allow multiple reclaimers per AG
  xfs: don't block inode reclaim on the ILOCK
  xfs: remove SYNC_TRYLOCK from inode reclaim
  xfs: remove SYNC_WAIT from xfs_reclaim_inodes()
  xfs: clean up inode reclaim comments
  xfs: rework stale inodes in xfs_ifree_cluster
  xfs: attach inodes to the cluster buffer when dirtied
  xfs: xfs_iflush() is no longer necessary
  xfs: rename xfs_iflush_int()
  xfs: rework xfs_iflush_cluster() dirty inode iteration
  xfs: factor xfs_iflush_done
  xfs: remove xfs_inobp_check()

 fs/xfs/libxfs/xfs_inode_buf.c   |  27 +-
 fs/xfs/libxfs/xfs_inode_buf.h   |   6 -
 fs/xfs/libxfs/xfs_trans_inode.c | 110 +++++--
 fs/xfs/xfs_buf.c                |  40 ++-
 fs/xfs/xfs_buf.h                |  48 ++-
 fs/xfs/xfs_buf_item.c           | 419 +++++++++++------------
 fs/xfs/xfs_buf_item.h           |   8 +-
 fs/xfs/xfs_buf_item_recover.c   |   5 +-
 fs/xfs/xfs_dquot.c              |  29 +-
 fs/xfs/xfs_dquot.h              |   1 +
 fs/xfs/xfs_dquot_item.c         |  18 -
 fs/xfs/xfs_dquot_item_recover.c |   2 +-
 fs/xfs/xfs_file.c               |   9 +-
 fs/xfs/xfs_icache.c             | 333 ++++++-------------
 fs/xfs/xfs_icache.h             |   2 +-
 fs/xfs/xfs_inode.c              | 567 ++++++++++++--------------------
 fs/xfs/xfs_inode.h              |   2 +-
 fs/xfs/xfs_inode_item.c         | 303 +++++++++--------
 fs/xfs/xfs_inode_item.h         |  24 +-
 fs/xfs/xfs_inode_item_recover.c |   2 +-
 fs/xfs/xfs_log_recover.c        |   5 +-
 fs/xfs/xfs_mount.c              |  15 +-
 fs/xfs/xfs_mount.h              |   1 -
 fs/xfs/xfs_super.c              |   3 -
 fs/xfs/xfs_trans.h              |   5 -
 fs/xfs/xfs_trans_ail.c          |  10 +-
 fs/xfs/xfs_trans_buf.c          |  15 +-
 27 files changed, 889 insertions(+), 1120 deletions(-)

-- 
2.26.2.761.g0e0b3e54be


^ permalink raw reply	[flat|nested] 80+ messages in thread

* [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous
@ 2020-06-04  7:45 Dave Chinner
  0 siblings, 0 replies; 80+ messages in thread
From: Dave Chinner @ 2020-06-04  7:45 UTC (permalink / raw)
  To: linux-xfs

Hi folks,

Inode flushing requires that we first lock an inode, then check it,
then lock the underlying buffer, flush the inode to the buffer and
finally add the inode to the buffer to be unlocked on IO completion.
We then walk all the other cached inodes in the buffer range and
optimistically lock and flush them to the buffer without blocking.

This cluster write effectively repeats the same code we do with the
initial inode, except now it has to special case that initial inode
that is already locked. Hence we have multiple copies of very
similar code, and it is a result of inode cluster flushing being
based on a specific inode rather than grabbing the buffer and
flushing all available inodes to it.

The problem with this at the moment is that we can't look up the
buffer until we have guaranteed that an inode is held exclusively
and it's not going away while we get the buffer through an imap
lookup. Hence we are kinda stuck locking an inode before we can look
up the buffer.

This is also a result of inodes being detached from the cluster
buffer except when IO is being done. This has the further problem
that the cluster buffer can be reclaimed from memory and then the
inode can be dirtied. At this point cleaning the inode requires a
read-modify-write cycle on the cluster buffer. If we then are put
under memory pressure, cleaning that dirty inode to reclaim it
requires allocating memory for the cluster buffer and this leads to
all sorts of problems.

We used synchronous inode writeback in reclaim as a throttle that
provided a forwards progress mechanism when RMW cycles were required
to clean inodes. Async writeback of inodes (e.g. via the AIL) would
immediately exhaust remaining memory reserves trying to allocate
inode cluster after inode cluster. The synchronous writeback of an
inode cluster allowed reclaim to release the inode cluster and have
it freed almost immediately which could then be used to allocate the
next inode cluster buffer. Hence the IO based throttling mechanism
largely guaranteed forwards progress in inode reclaim. By removing
the requirement for memory allocation for inode writeback at the
filesystem level, we can issue writeback asynchronously and no longer
have to worry about memory exhaustion.

Another issue is that if we have slow disks, we can build up dirty
inodes in memory that can then take hours for an operation like
unmount to flush. A RMW cycle per inode on a slow RAID6 device can
mean we only clean 50 inodes a second, and when there are hundreds
of thousands of dirty inodes that need to be cleaned this can take a
long time. Pinning the cluster buffers will greatly speed up inode
writeback on slow storage systems like this.

These limitations all stem from the same source: inode writeback is
inode centric, and they are largely solved by the same architectural
change: make inode writeback cluster buffer centric.  This series
makes that architectural change.

Firstly, we start by pinning the inode backing buffer in memory
when an inode is marked dirty (i.e. when it is logged). By tracking
the number of dirty inodes on a buffer as a counter rather than a
flag, we avoid the problem of overlapping inode dirtying and buffer
flushing racing to set/clear the dirty flag. Hence as long as there
is a dirty inode in memory, the buffer will not be able to be
reclaimed. We can safely do this inode cluster buffer lookup when we
dirty an inode as we do not hold the buffer locked - we merely take
a reference to it and then release it - and hence we don't cause any
new lock order issues.

When the inode is finally cleaned, the reference to the buffer can
be removed from the inode log item and the buffer released. This is
done from the inode completion callbacks that are attached to the
buffer when the inode is flushed.

Pinning the cluster buffer in this way immediately avoids the RMW
problem in inode writeback and reclaim contexts by moving the memory
allocation and the blocking buffer read into the transaction context
that dirties the inode.  This inverts our dirty inode throttling
mechanism - we now throttle the rate at which we can dirty inodes to
the rate at which we can allocate memory and read inode cluster
buffers into memory, rather than throttling reclaim to the rate at
which we
can clean dirty inodes.

Hence if we are under memory pressure, we'll block on memory
allocation when trying to dirty the referenced inode, rather than in
the memory reclaim path where we are trying to clean unreferenced
inodes to free memory.  Hence we no longer have to guarantee
forwards progress in inode reclaim as we aren't doing memory
allocation, and that means we can remove inode writeback from the
XFS inode shrinker completely without changing the system tolerance
for low memory operation.

Tracking the buffers via the inode log item also allows us to
completely rework the inode flushing mechanism. While the inode log
item is in the AIL, it is safe for the AIL to access any member of
the log item. Hence the AIL push mechanisms can access the buffer
attached to the inode without first having to lock the inode.

This means we can essentially lock the buffer directly and then
call xfs_iflush_cluster() without first going through xfs_iflush()
to find the buffer. Hence we can remove xfs_iflush() altogether,
because the two places that call it - the inode item push code and
inode reclaim - no longer need to flush inodes directly.
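
Roughly, the inode item push path then takes this shape. Again, a
sketch only - example_inode_item_push() is a hedged stand-in for the
real push function, and pin checks, error handling and the
already-queued cases are all omitted:

static uint
example_inode_item_push(
	struct xfs_log_item	*lip,
	struct list_head	*buffer_list)
{
	struct xfs_buf		*bp = lip->li_buf;

	/*
	 * The buffer is reachable directly from the log item, so
	 * there is no need to lock the inode or do an imap lookup
	 * to find it.
	 */
	if (!xfs_buf_trylock(bp))
		return XFS_ITEM_LOCKED;

	/*
	 * Flush all the dirty inodes attached to the buffer and
	 * queue it for write if that succeeded.
	 */
	if (!xfs_iflush_cluster(bp))
		xfs_buf_delwri_queue(bp, buffer_list);
	xfs_buf_unlock(bp);
	return XFS_ITEM_SUCCESS;
}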

This can be further optimised by attaching the inode to the cluster
buffer when the inode is dirtied. i.e. when we add the buffer
reference to the inode log item, we also attach the inode to the
buffer for IO processing. This leads to the dirty inodes always
being attached to the buffer and hence we no longer need to add them
when we flush the inode and remove them when IO completes. Instead
the inodes are attached when the inode log item is dirtied, and
removed when the inode log item is cleaned.

With this structure in place, we no longer need to do
lookups to find the dirty inodes in the cache to attach to the
buffer in xfs_iflush_cluster() - they are already attached to the
buffer. Hence when the AIL pushes an inode, we just grab the buffer
from the log item, and then walk the buffer log item list to lock
and flush the dirty inodes attached to the buffer.
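
The dirty inode iteration then reduces to a simple list walk,
something like this sketch. example_flush_inode() is a hypothetical
stand-in for the low level inode flush helper, and the stale inode
and shutdown corner cases are ignored:

static int
example_iflush_cluster(
	struct xfs_buf		*bp)
{
	struct xfs_log_item	*lip;
	int			clcount = 0;

	/* Every dirty inode is already attached to the buffer. */
	list_for_each_entry(lip, &bp->b_li_list, li_bio_list) {
		struct xfs_inode *ip = INODE_ITEM(lip)->ili_inode;

		/* Stay non-blocking: skip inodes we can't lock. */
		if (!xfs_ilock_nowait(ip, XFS_ILOCK_SHARED))
			continue;

		if (example_flush_inode(ip, bp) == 0)
			clcount++;
		xfs_iunlock(ip, XFS_ILOCK_SHARED);
	}

	/* Nothing flushed means nothing to submit for IO. */
	return clcount ? 0 : -EAGAIN;
}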

This greatly simplifies inode writeback, and removes another memory
allocation from the inode writeback path (the array used for the
radix tree gang lookup). And while the radix tree lookups are fast,
walking the linked list of dirty inodes is faster.

There is followup work I am doing that uses the inode cluster buffer
as a replacement in the AIL for tracking dirty inodes. This part of
the series is not ready yet as it has some intricate locking
requirements. That is an optimisation, so I've left that out because
solving the inode reclaim blocking problems is the important part of
this work.

In short, this series simplifies inode writeback and fixes the long
standing inode reclaim blocking issues without requiring any changes
to the memory reclaim infrastructure.

Note: dquots should probably be converted to cluster flushing in a
similar way, as they have many of the same issues as inode flushing.

Thoughts, comments and improvements welcome.

-Dave.

Version 3

git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git xfs-async-inode-reclaim-3

- rebase on 5.7 + for-next
- update comments (p3)
- update commit message (p4)
- renamed xfs_buf_ioerror_sync() (p13)
- added enum for return value from xfs_buf_iodone_error() (p13)
- moved clearing of buffer error to iodone functions (p13)
- whitespace (p13)
- rebase p14 (p13 conflicts)
- rebase p16 (p13 conflicts)
- removed a superfluous assert (p16)
- moved comment and check in xfs_iflush_done() from p16 to p25
- rebase p25 (p16 conflicts)



Version 2

git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git xfs-async-inode-reclaim-2

- describe ili_lock better (p2)
- clean up inode logging code some more (p2)
- move "early read completion" for xfs_buf_ioend() up into p3 from
  p4.
- fixed conflicts in p4 due to p3 changes.
- fixed conflicts in p5 due to p4 changes.
- s/_XBF_LOGRCVY/_XBF_LOG_RECOVERY/ (p5)
- renamed the buf log item iodone callback to xfs_buf_item_iodone and
  reused the xfs_buf_iodone() name for the catch-all buffer write
  iodone completion. (p6)
- history update for commit message (p7)
- subject update for p8
- rework loop in xfs_dquot_done() (p9)
- Fixed conflicts in p10 due to p6 changes
- got rid of entire comments around li_cb (p11)
- new patch to rework buffer io error callbacks
- new patch to unwind ->iop_error calls and remove ->iop_error
- new patch to lift xfs_clear_li_failed() out of
  xfs_ail_delete_one()
- rebased p12 on all the prior changes
- reworked LI_FAILED handling when pinning inodes to the cluster
  buffer (p12) 
- fixed comment about holding buffer references in
  xfs_trans_log_inode() (p12)
- fixed indenting of xfs_iflush_abort() (p12)
- added comments explaining "skipped" inode reclaim return value
  (p14)
- cleaned up error return stack in xfs_reclaim_inode() (p14)
- cleaned up skipped return in xfs_reclaim_inodes() (p14)
- fixed bug where skipped wasn't incremented if reclaim cursor was
  not zero. This could leave inodes between the start of the AG and
  the cursor unreclaimed (p15)
- reinstate the patch removing SYNC_WAIT from xfs_reclaim_inodes().
  Exposed "skipped" bug in p15.
- cleaned up inode reclaim comments (p18)
- split p19 into two - one to change xfs_ifree_cluster(), one
  for the buffer pinning.
- xfs_ifree_mark_inode_stale() now takes the cluster buffer and we
  get the perag from that rather than having to do a lookup in
  xfs_ifree_cluster().
- moved extra IO reference for xfs_iflush_cluster() from AIL pushing
  to initial xfs_iflush_cluster rework (p22 -> p20)
- fixed static declaration on xfs_iflush() (p22)
- fixed incorrect EIO return from xfs_iflush_cluster()
- rebase p23 because it all rejects now.
- fix INODE_ITEM() usage in p23
- removed long lines from commit message in p24
- new patch to fix logging of XFS_ISTALE inodes which pushes dirty
  inodes through reclaim.



Dave Chinner (30):
  xfs: Don't allow logging of XFS_ISTALE inodes
  xfs: remove logged flag from inode log item
  xfs: add an inode item lock
  xfs: mark inode buffers in cache
  xfs: mark dquot buffers in cache
  xfs: mark log recovery buffers for completion
  xfs: call xfs_buf_iodone directly
  xfs: clean up whacky buffer log item list reinit
  xfs: make inode IO completion buffer centric
  xfs: use direct calls for dquot IO completion
  xfs: clean up the buffer iodone callback functions
  xfs: get rid of log item callbacks
  xfs: handle buffer log item IO errors directly
  xfs: unwind log item error flagging
  xfs: move xfs_clear_li_failed out of xfs_ail_delete_one()
  xfs: pin inode backing buffer to the inode log item
  xfs: make inode reclaim almost non-blocking
  xfs: remove IO submission from xfs_reclaim_inode()
  xfs: allow multiple reclaimers per AG
  xfs: don't block inode reclaim on the ILOCK
  xfs: remove SYNC_TRYLOCK from inode reclaim
  xfs: remove SYNC_WAIT from xfs_reclaim_inodes()
  xfs: clean up inode reclaim comments
  xfs: rework stale inodes in xfs_ifree_cluster
  xfs: attach inodes to the cluster buffer when dirtied
  xfs: xfs_iflush() is no longer necessary
  xfs: rename xfs_iflush_int()
  xfs: rework xfs_iflush_cluster() dirty inode iteration
  xfs: factor xfs_iflush_done
  xfs: remove xfs_inobp_check()

 fs/xfs/libxfs/xfs_inode_buf.c   |  27 +-
 fs/xfs/libxfs/xfs_inode_buf.h   |   6 -
 fs/xfs/libxfs/xfs_trans_inode.c | 110 +++++--
 fs/xfs/xfs_buf.c                |  40 ++-
 fs/xfs/xfs_buf.h                |  48 +--
 fs/xfs/xfs_buf_item.c           | 420 ++++++++++++------------
 fs/xfs/xfs_buf_item.h           |   8 +-
 fs/xfs/xfs_buf_item_recover.c   |   5 +-
 fs/xfs/xfs_dquot.c              |  29 +-
 fs/xfs/xfs_dquot.h              |   1 +
 fs/xfs/xfs_dquot_item.c         |  18 --
 fs/xfs/xfs_dquot_item_recover.c |   2 +-
 fs/xfs/xfs_file.c               |   9 +-
 fs/xfs/xfs_icache.c             | 333 ++++++-------------
 fs/xfs/xfs_icache.h             |   2 +-
 fs/xfs/xfs_inode.c              | 554 ++++++++++++--------------------
 fs/xfs/xfs_inode.h              |   2 +-
 fs/xfs/xfs_inode_item.c         | 301 +++++++++--------
 fs/xfs/xfs_inode_item.h         |  24 +-
 fs/xfs/xfs_inode_item_recover.c |   2 +-
 fs/xfs/xfs_log_recover.c        |   5 +-
 fs/xfs/xfs_mount.c              |  15 +-
 fs/xfs/xfs_mount.h              |   1 -
 fs/xfs/xfs_super.c              |   3 -
 fs/xfs/xfs_trans.h              |   5 -
 fs/xfs/xfs_trans_ail.c          |  10 +-
 fs/xfs/xfs_trans_buf.c          |  15 +-
 27 files changed, 881 insertions(+), 1114 deletions(-)

-- 
2.26.2.761.g0e0b3e54be


Thread overview: 80+ messages
2020-06-01 21:42 [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
2020-06-01 21:42 ` [PATCH 01/30] xfs: Don't allow logging of XFS_ISTALE inodes Dave Chinner
2020-06-02  4:30   ` Darrick J. Wong
2020-06-02  7:06     ` Dave Chinner
2020-06-02 16:32   ` Brian Foster
2020-06-01 21:42 ` [PATCH 02/30] xfs: remove logged flag from inode log item Dave Chinner
2020-06-02 16:32   ` Brian Foster
2020-06-01 21:42 ` [PATCH 03/30] xfs: add an inode item lock Dave Chinner
2020-06-02 16:34   ` Brian Foster
2020-06-04  1:54     ` Dave Chinner
2020-06-04 14:03       ` Brian Foster
2020-06-01 21:42 ` [PATCH 04/30] xfs: mark inode buffers in cache Dave Chinner
2020-06-02 16:45   ` Brian Foster
2020-06-02 19:22     ` Darrick J. Wong
2020-06-02 21:29     ` Dave Chinner
2020-06-03 14:57       ` Brian Foster
2020-06-03 21:21         ` Dave Chinner
2020-06-01 21:42 ` [PATCH 05/30] xfs: mark dquot " Dave Chinner
2020-06-02 16:45   ` Brian Foster
2020-06-02 19:00   ` Darrick J. Wong
2020-06-01 21:42 ` [PATCH 06/30] xfs: mark log recovery buffers for completion Dave Chinner
2020-06-02 16:45   ` Brian Foster
2020-06-02 19:24   ` Darrick J. Wong
2020-06-01 21:42 ` [PATCH 07/30] xfs: call xfs_buf_iodone directly Dave Chinner
2020-06-02 16:47   ` Brian Foster
2020-06-02 21:38     ` Dave Chinner
2020-06-03 14:58       ` Brian Foster
2020-06-01 21:42 ` [PATCH 08/30] xfs: clean up whacky buffer log item list reinit Dave Chinner
2020-06-02 16:47   ` Brian Foster
2020-06-01 21:42 ` [PATCH 09/30] xfs: make inode IO completion buffer centric Dave Chinner
2020-06-03 14:58   ` Brian Foster
2020-06-01 21:42 ` [PATCH 10/30] xfs: use direct calls for dquot IO completion Dave Chinner
2020-06-02 19:25   ` Darrick J. Wong
2020-06-03 14:58   ` Brian Foster
2020-06-01 21:42 ` [PATCH 11/30] xfs: clean up the buffer iodone callback functions Dave Chinner
2020-06-03 14:58   ` Brian Foster
2020-06-01 21:42 ` [PATCH 12/30] xfs: get rid of log item callbacks Dave Chinner
2020-06-03 14:58   ` Brian Foster
2020-06-01 21:42 ` [PATCH 13/30] xfs: handle buffer log item IO errors directly Dave Chinner
2020-06-02 20:39   ` Darrick J. Wong
2020-06-02 22:17     ` Dave Chinner
2020-06-03 15:02   ` Brian Foster
2020-06-03 21:34     ` Dave Chinner
2020-06-01 21:42 ` [PATCH 14/30] xfs: unwind log item error flagging Dave Chinner
2020-06-02 20:45   ` Darrick J. Wong
2020-06-03 15:02   ` Brian Foster
2020-06-01 21:42 ` [PATCH 15/30] xfs: move xfs_clear_li_failed out of xfs_ail_delete_one() Dave Chinner
2020-06-02 20:47   ` Darrick J. Wong
2020-06-03 15:02   ` Brian Foster
2020-06-01 21:42 ` [PATCH 16/30] xfs: pin inode backing buffer to the inode log item Dave Chinner
2020-06-02 22:30   ` Darrick J. Wong
2020-06-02 22:53     ` Dave Chinner
2020-06-03 18:58   ` Brian Foster
2020-06-03 22:15     ` Dave Chinner
2020-06-04 14:03       ` Brian Foster
2020-06-01 21:42 ` [PATCH 17/30] xfs: make inode reclaim almost non-blocking Dave Chinner
2020-06-01 21:42 ` [PATCH 18/30] xfs: remove IO submission from xfs_reclaim_inode() Dave Chinner
2020-06-02 22:36   ` Darrick J. Wong
2020-06-01 21:42 ` [PATCH 19/30] xfs: allow multiple reclaimers per AG Dave Chinner
2020-06-01 21:42 ` [PATCH 20/30] xfs: don't block inode reclaim on the ILOCK Dave Chinner
2020-06-01 21:42 ` [PATCH 21/30] xfs: remove SYNC_TRYLOCK from inode reclaim Dave Chinner
2020-06-01 21:42 ` [PATCH 22/30] xfs: remove SYNC_WAIT from xfs_reclaim_inodes() Dave Chinner
2020-06-02 22:43   ` Darrick J. Wong
2020-06-01 21:42 ` [PATCH 23/30] xfs: clean up inode reclaim comments Dave Chinner
2020-06-02 22:45   ` Darrick J. Wong
2020-06-01 21:42 ` [PATCH 24/30] xfs: rework stale inodes in xfs_ifree_cluster Dave Chinner
2020-06-02 23:01   ` Darrick J. Wong
2020-06-01 21:42 ` [PATCH 25/30] xfs: attach inodes to the cluster buffer when dirtied Dave Chinner
2020-06-02 23:03   ` Darrick J. Wong
2020-06-01 21:42 ` [PATCH 26/30] xfs: xfs_iflush() is no longer necessary Dave Chinner
2020-06-01 21:42 ` [PATCH 27/30] xfs: rename xfs_iflush_int() Dave Chinner
2020-06-01 21:42 ` [PATCH 28/30] xfs: rework xfs_iflush_cluster() dirty inode iteration Dave Chinner
2020-06-02 23:23   ` Darrick J. Wong
2020-06-01 21:42 ` [PATCH 29/30] xfs: factor xfs_iflush_done Dave Chinner
2020-06-01 21:42 ` [PATCH 30/30] xfs: remove xfs_inobp_check() Dave Chinner
2020-06-04  7:45 [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
2020-06-22  8:15 Dave Chinner
2020-06-29 23:01 ` Darrick J. Wong
2020-06-30 16:52   ` Darrick J. Wong
2020-06-30 21:51     ` Dave Chinner
