* [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous
@ 2020-06-22  8:15 Dave Chinner
From: Dave Chinner @ 2020-06-22  8:15 UTC
  To: linux-xfs

Hi folks,

Inode flushing requires that we first lock an inode, then check it,
then lock the underlying buffer, flush the inode to the buffer and
finally add the inode to the buffer to be unlocked on IO completion.
We then walk all the other cached inodes in the buffer range and
optimistically lock and flush them to the buffer without blocking.

This cluster write effectively repeats the same code we do with the
initial inode, except now it has to special case that initial inode
that is already locked. Hence we have multiple copies of very
similar code, and it is a result of inode cluster flushing being
based on a specific inode rather than grabbing the buffer and
flushing all available inodes to it.

The problem with this at the moment is that we can't look up the
buffer until we have guaranteed that an inode is held exclusively
and it's not going away while we get the buffer through an imap
lookup. Hence we are kinda stuck locking an inode before we can look
up the buffer.

This is also a result of inodes being detached from the cluster
buffer except when IO is being done. This has the further problem
that the cluster buffer can be reclaimed from memory and then the
inode can be dirtied. At this point cleaning the inode requires a
read-modify-write cycle on the cluster buffer. If we then are put
under memory pressure, cleaning that dirty inode to reclaim it
requires allocating memory for the cluster buffer and this leads to
all sorts of problems.

We used synchronous inode writeback in reclaim as a throttle that
provided a forwards progress mechanism when RMW cycles were required
to clean inodes. Async writeback of inodes (e.g. via the AIL) would
immediately exhaust remaining memory reserves trying to allocate
inode cluster after inode cluster. The synchronous writeback of an
inode cluster allowed reclaim to release the inode cluster and have
it freed almost immediately which could then be used to allocate the
next inode cluster buffer. Hence the IO based throttling mechanism
largely guaranteed forwards progress in inode reclaim. By removing
the requirement for memory allocation for inode writeback at the
filesystem level, we can issue writeback asynchronously and not have
to worry about memory exhaustion anymore.

Another issue is that if we have slow disks, we can build up dirty
inodes in memory that can then take hours for an operation like
unmount to flush. An RMW cycle per inode on a slow RAID6 device can
mean we only clean 50 inodes a second, and when there are hundreds
of thousands of dirty inodes that need to be cleaned this can take a
long time. Pinning the cluster buffers will greatly speed up inode
writeback on slow storage systems like this.

These limitations all stem from the same source: inode writeback is
inode centric, and they are largely solved by the same architectural
change: make inode writeback cluster buffer centric.  This series
makes that architectural change.

Firstly, we start by pinning the inode backing buffer in memory
when an inode is marked dirty (i.e. when it is logged). By tracking
the number of dirty inodes on a buffer as a counter rather than a
flag, we avoid the problem of overlapping inode dirtying and buffer
flushing racing to set/clear the dirty flag. Hence as long as there
is a dirty inode in memory, the buffer will not be able to be
reclaimed. We can safely do this inode cluster buffer lookup when we
dirty an inode as we do not hold the buffer locked - we merely take
a reference to it and then release it - and hence we don't cause any
new lock order issues.
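
Roughly, the idea looks like the sketch below. This is simplified
from the actual patches - locking, error handling and shutdown cases
are elided, the li_buf field is from the log item pinning patch later
in this series, and the exact xfs_imap_to_bp() calling convention may
differ from what the series uses:

	/* First time the inode is dirtied: pin the cluster buffer. */
	if (!iip->ili_item.li_buf) {
		struct xfs_buf	*bp;

		/* Look up (reading in if necessary) the cluster buffer. */
		error = xfs_imap_to_bp(ip->i_mount, tp, &ip->i_imap,
				       &dip, &bp, 0);
		if (error)
			return error;

		/*
		 * Take a reference for the inode log item, then drop
		 * the transaction's hold. We don't hold the buffer
		 * locked beyond this point, but memory reclaim can no
		 * longer free it until the inode is cleaned and the
		 * reference is dropped.
		 */
		xfs_buf_hold(bp);
		iip->ili_item.li_buf = bp;
		xfs_trans_brelse(tp, bp);
	}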

When the inode is finally cleaned, the reference to the buffer can
be removed from the inode log item and the buffer released. This is
done from the inode completion callbacks that are attached to the
buffer when the inode is flushed.

Pinning the cluster buffer in this way immediately avoids the RMW
problem in inode writeback and reclaim contexts by moving the memory
allocation and the blocking buffer read into the transaction context
that dirties the inode.  This inverts our dirty inode throttling
mechanism - we now throttle the rate at which we can dirty inodes
to the rate at which we can allocate memory and read inode cluster
buffers into memory, rather than throttling reclaim to the rate at
which we can clean dirty inodes.

Hence if we are under memory pressure, we'll block on memory
allocation when trying to dirty the referenced inode, rather than in
the memory reclaim path where we are trying to clean unreferenced
inodes to free memory.  Hence we no longer have to guarantee
forwards progress in inode reclaim as we aren't doing memory
allocation, and that means we can remove inode writeback from the
XFS inode shrinker completely without changing the system tolerance
for low memory operation.

Tracking the buffers via the inode log item also allows us to
completely rework the inode flushing mechanism. While the inode log
item is in the AIL, it is safe for the AIL to access any member of
the log item. Hence the AIL push mechanisms can access the buffer
attached to the inode without first having to lock the inode.

This means we can essentially lock the buffer directly and then
call xfs_iflush_cluster() without first going through xfs_iflush()
to find the buffer. Hence we can remove xfs_iflush() altogether,
because the two places that call it - the inode item push code and
inode reclaim - no longer need to flush inodes directly.
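
In sketch form - again heavily simplified, ignoring pinned inodes,
stale buffers and the -EAGAIN handling described in the version 4
changelog below - the push path becomes something like:

	STATIC uint
	xfs_inode_item_push(
		struct xfs_log_item	*lip,
		struct list_head	*buffer_list)
	{
		/* The buffer is pinned to the log item. */
		struct xfs_buf		*bp = lip->li_buf;

		if (xfs_ipincount(INODE_ITEM(lip)->ili_inode) > 0)
			return XFS_ITEM_PINNED;
		if (!xfs_buf_trylock(bp))
			return XFS_ITEM_LOCKED;

		/* Flush all the dirty inodes attached to the buffer. */
		if (xfs_iflush_cluster(bp) == 0)
			xfs_buf_delwri_queue(bp, buffer_list);
		xfs_buf_unlock(bp);
		return XFS_ITEM_SUCCESS;
	}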

This can be further optimised by attaching the inode to the cluster
buffer when the inode is dirtied. i.e. when we add the buffer
reference to the inode log item, we also attach the inode to the
buffer for IO processing. This leads to the dirty inodes always
being attached to the buffer and hence we no longer need to add them
when we flush the inode and remove them when IO completes. Instead
the inodes are attached when the inode log item is dirtied, and
removed when the inode log item is cleaned.

With this structure in place, we no longer need to do
lookups to find the dirty inodes in the cache to attach to the
buffer in xfs_iflush_cluster() - they are already attached to the
buffer. Hence when the AIL pushes an inode, we just grab the buffer
from the log item, and then walk the buffer log item list to lock
and flush the dirty inodes attached to the buffer.
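
A sketch of what the xfs_iflush_cluster() iteration turns into, with
the pin, stale and error handling the real code needs omitted (and
noting the per-inode flush helper gets renamed later in the series):

	/*
	 * Walk the dirty inodes attached to the buffer and try to
	 * flush each one. Inodes we can't lock without blocking are
	 * skipped - a later push will pick them up.
	 */
	list_for_each_entry(lip, &bp->b_li_list, li_bio_list) {
		struct xfs_inode	*ip = INODE_ITEM(lip)->ili_inode;

		if (!xfs_ilock_nowait(ip, XFS_ILOCK_SHARED))
			continue;
		if (!xfs_iflock_nowait(ip)) {
			xfs_iunlock(ip, XFS_ILOCK_SHARED);
			continue;
		}

		/*
		 * Flush the inode's in-memory state into the buffer.
		 * The flush lock remains held until IO completion
		 * releases it in xfs_iflush_done().
		 */
		if (!xfs_iflush_int(ip, bp))
			clcount++;
		xfs_iunlock(ip, XFS_ILOCK_SHARED);
	}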

This greatly simplifies inode writeback, and removes another memory
allocation from the inode writeback path (the array used for the
radix tree gang lookup). And while the radix tree lookups are fast,
walking the linked list of dirty inodes is faster.

There is followup work I am doing that uses the inode cluster buffer
as a replacement in the AIL for tracking dirty inodes. This part of
the series is not ready yet as it has some intricate locking
requirements. That is an optimisation, so I've left that out because
solving the inode reclaim blocking problems is the important part of
this work.

In short, this series simplifies inode writeback and fixes the long
standing inode reclaim blocking issues without requiring any changes
to the memory reclaim infrastructure.

Note: dquots should probably be converted to cluster flushing in a
similar way, as they have many of the same issues as inode flushing.

Thoughts, comments and improvements welcome.

-Dave.

Version 4:

git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git xfs-async-inode-reclaim-4

- rebase on 5.8-rc2 + for-next
- fix buffer retry logic braino (p13)
- removed unnecessary asserts (p24)
- removed unnecessary delwri queue checks from
  xfs_inode_item_push (p24)
- rework return value from xfs_iflush_cluster to indicate -EAGAIN if
  no inodes were flushed and handle that case in the caller. (p28)
- rewrite comment about shutdown case in xfs_iflush_cluster (p28)
- always clear XFS_LI_FAILED for items requiring AIL processing
  (p29)


Version 3

git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git xfs-async-inode-reclaim-3

- rebase on 5.7 + for-next
- update comments (p3)
- update commit message (p4)
- renamed xfs_buf_ioerror_sync() (p13)
- added enum for return value from xfs_buf_iodone_error() (p13)
- moved clearing of buffer error to iodone functions (p13)
- whitespace (p13)
- rebase p14 (p13 conflicts)
- rebase p16 (p13 conflicts)
- removed a superfluous assert (p16)
- moved comment and check in xfs_iflush_done() from p16 to p25
- rebase p25 (p16 conflicts)



Version 2

git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git xfs-async-inode-reclaim-2

- describe ili_lock better (p2)
- clean up inode logging code some more (p2)
- move "early read completion" for xfs_buf_ioend() up into p3 from
  p4.
- fixed conflicts in p4 due to p3 changes.
- fixed conflicts in p5 due to p4 changes.
- s/_XBF_LOGRCVY/_XBF_LOG_RECOVERY/ (p5)
- renamed the buf log item iodone callback to xfs_buf_item_iodone and
  reused the xfs_buf_iodone() name for the catch-all buffer write
  iodone completion. (p6)
- history update for commit message (p7)
- subject update for p8
- rework loop in xfs_dquot_done() (p9)
- Fixed conflicts in p10 due to p6 changes
- got rid of entire comments around li_cb (p11)
- new patch to rework buffer io error callbacks
- new patch to unwind ->iop_error calls and remove ->iop_error
- new patch to lift xfs_clear_li_failed() out of
  xfs_ail_delete_one()
- rebased p12 on all the prior changes
- reworked LI_FAILED handling when pinning inodes to the cluster
  buffer (p12) 
- fixed comment about holding buffer references in
  xfs_trans_log_inode() (p12)
- fixed indenting of xfs_iflush_abort() (p12)
- added comments explaining "skipped" inode reclaim return value
  (p14)
- cleaned up error return stack in xfs_reclaim_inode() (p14)
- cleaned up skipped return in xfs_reclaim_inodes() (p14)
- fixed bug where skipped wasn't incremented if reclaim cursor was
  not zero. This could leave inodes between the start of the AG and
  the cursor unreclaimed (p15)
- reinstate the patch removing SYNC_WAIT from xfs_reclaim_inodes().
  Exposed "skipped" bug in p15.
- cleaned up inode reclaim comments (p18)
- split p19 into two - one to change xfs_ifree_cluster(), one
  for the buffer pinning.
- xfs_ifree_mark_inode_stale() now takes the cluster buffer and we
  get the perag from that rather than having to do a lookup in
  xfs_ifree_cluster().
- moved extra IO reference for xfs_iflush_cluster() from AIL pushing
  to initial xfs_iflush_cluster rework (p22 -> p20)
- fixed static declaration on xfs_iflush() (p22)
- fixed incorrect EIO return from xfs_iflush_cluster()
- rebase p23 because it all rejects now.
- fix INODE_ITEM() usage in p23
- removed long lines from commit message in p24
- new patch to fix logging of XFS_ISTALE inodes which pushes dirty
  inodes through reclaim.



Dave Chinner (30):
  xfs: Don't allow logging of XFS_ISTALE inodes
  xfs: remove logged flag from inode log item
  xfs: add an inode item lock
  xfs: mark inode buffers in cache
  xfs: mark dquot buffers in cache
  xfs: mark log recovery buffers for completion
  xfs: call xfs_buf_iodone directly
  xfs: clean up whacky buffer log item list reinit
  xfs: make inode IO completion buffer centric
  xfs: use direct calls for dquot IO completion
  xfs: clean up the buffer iodone callback functions
  xfs: get rid of log item callbacks
  xfs: handle buffer log item IO errors directly
  xfs: unwind log item error flagging
  xfs: move xfs_clear_li_failed out of xfs_ail_delete_one()
  xfs: pin inode backing buffer to the inode log item
  xfs: make inode reclaim almost non-blocking
  xfs: remove IO submission from xfs_reclaim_inode()
  xfs: allow multiple reclaimers per AG
  xfs: don't block inode reclaim on the ILOCK
  xfs: remove SYNC_TRYLOCK from inode reclaim
  xfs: remove SYNC_WAIT from xfs_reclaim_inodes()
  xfs: clean up inode reclaim comments
  xfs: rework stale inodes in xfs_ifree_cluster
  xfs: attach inodes to the cluster buffer when dirtied
  xfs: xfs_iflush() is no longer necessary
  xfs: rename xfs_iflush_int()
  xfs: rework xfs_iflush_cluster() dirty inode iteration
  xfs: factor xfs_iflush_done
  xfs: remove xfs_inobp_check()

 fs/xfs/libxfs/xfs_inode_buf.c   |  27 +-
 fs/xfs/libxfs/xfs_inode_buf.h   |   6 -
 fs/xfs/libxfs/xfs_trans_inode.c | 110 +++++--
 fs/xfs/xfs_buf.c                |  40 ++-
 fs/xfs/xfs_buf.h                |  48 ++-
 fs/xfs/xfs_buf_item.c           | 419 +++++++++++------------
 fs/xfs/xfs_buf_item.h           |   8 +-
 fs/xfs/xfs_buf_item_recover.c   |   5 +-
 fs/xfs/xfs_dquot.c              |  29 +-
 fs/xfs/xfs_dquot.h              |   1 +
 fs/xfs/xfs_dquot_item.c         |  18 -
 fs/xfs/xfs_dquot_item_recover.c |   2 +-
 fs/xfs/xfs_file.c               |   9 +-
 fs/xfs/xfs_icache.c             | 333 ++++++-------------
 fs/xfs/xfs_icache.h             |   2 +-
 fs/xfs/xfs_inode.c              | 567 ++++++++++++--------------------
 fs/xfs/xfs_inode.h              |   2 +-
 fs/xfs/xfs_inode_item.c         | 303 +++++++++--------
 fs/xfs/xfs_inode_item.h         |  24 +-
 fs/xfs/xfs_inode_item_recover.c |   2 +-
 fs/xfs/xfs_log_recover.c        |   5 +-
 fs/xfs/xfs_mount.c              |  15 +-
 fs/xfs/xfs_mount.h              |   1 -
 fs/xfs/xfs_super.c              |   3 -
 fs/xfs/xfs_trans.h              |   5 -
 fs/xfs/xfs_trans_ail.c          |  10 +-
 fs/xfs/xfs_trans_buf.c          |  15 +-
 27 files changed, 889 insertions(+), 1120 deletions(-)

-- 
2.26.2.761.g0e0b3e54be



* [PATCH 01/30] xfs: Don't allow logging of XFS_ISTALE inodes
From: Dave Chinner @ 2020-06-22  8:15 UTC
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

In tracking down a problem in this patchset, I discovered we are
reclaiming dirty stale inodes. This wasn't discovered until inodes
were always attached to the cluster buffer and then the rcu callback
that freed inodes was assert failing because the inode still had an
active pointer to the cluster buffer after it had been reclaimed.

Debugging the issue indicated that this was a pre-existing issue
resulting from the way the inodes are handled in xfs_inactive_ifree.
When we free a cluster buffer from xfs_ifree_cluster, all the inodes
in cache are marked XFS_ISTALE. Those that are clean have nothing
else done to them and so eventually get cleaned up by background
reclaim. i.e. it is assumed we'll never dirty/relog an inode marked
XFS_ISTALE.

On journal commit, dirty stale inodes are handled by both the
buffer and inode log items: they are either run through
xfs_istale_done() and removed from the AIL (buffer log item commit)
or simply unpinned, because the buffer log item will clean them
anyway. What happens
to any specific inode is entirely dependent on which log item wins
the commit race, but the result is the same - stale inodes are
clean, not attached to the cluster buffer, and not in the AIL. Hence
inode reclaim can just free these inodes without further care.

However, if the stale inode is relogged, it gets dirtied again and
relogged into the CIL. Most of the time this isn't an issue, because
relogging simply changes the inode's location in the current
checkpoint. Problems arise, however, when the CIL checkpoints
between two transactions in the xfs_inactive_ifree() deferops
processing. This results in the XFS_ISTALE inode being redirtied
and inserted into the CIL without any of the other stale cluster
buffer infrastructure being in place.

Hence on journal commit, it simply gets unpinned, so it remains
dirty in memory. Everything in inode writeback avoids XFS_ISTALE
inodes so it can't be written back, and it is not tracked in the AIL
so there's not even a trigger to attempt to clean the inode. Hence
the inode just sits dirty in memory until inode reclaim comes along,
sees that it is XFS_ISTALE, and goes to reclaim it. This reclaiming
of a dirty inode caused use after free, list corruptions and other
nasty issues later in this patchset.

Hence this patch addresses a violation of the "never log XFS_ISTALE
inodes" rule caused by the deferops processing rolling a transaction
and relogging a stale inode in xfs_inactive_ifree. It also adds a
bunch of asserts to catch this problem in debug kernels so that
we don't reintroduce this problem in future.

Reproducer for this issue was generic/558 on a v4 filesystem.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_trans_inode.c |  2 ++
 fs/xfs/xfs_icache.c             |  3 ++-
 fs/xfs/xfs_inode.c              | 25 ++++++++++++++++++++++---
 3 files changed, 26 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
index b5dfb6654842..4504d215cd59 100644
--- a/fs/xfs/libxfs/xfs_trans_inode.c
+++ b/fs/xfs/libxfs/xfs_trans_inode.c
@@ -36,6 +36,7 @@ xfs_trans_ijoin(
 
 	ASSERT(iip->ili_lock_flags == 0);
 	iip->ili_lock_flags = lock_flags;
+	ASSERT(!xfs_iflags_test(ip, XFS_ISTALE));
 
 	/*
 	 * Get a log_item_desc to point at the new item.
@@ -89,6 +90,7 @@ xfs_trans_log_inode(
 
 	ASSERT(ip->i_itemp != NULL);
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
+	ASSERT(!xfs_iflags_test(ip, XFS_ISTALE));
 
 	/*
 	 * Don't bother with i_lock for the I_DIRTY_TIME check here, as races
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 5daef654956c..59dea8178ae3 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1141,7 +1141,7 @@ xfs_reclaim_inode(
 			goto out_ifunlock;
 		xfs_iunpin_wait(ip);
 	}
-	if (xfs_iflags_test(ip, XFS_ISTALE) || xfs_inode_clean(ip)) {
+	if (xfs_inode_clean(ip)) {
 		xfs_ifunlock(ip);
 		goto reclaim;
 	}
@@ -1228,6 +1228,7 @@ xfs_reclaim_inode(
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
 	xfs_qm_dqdetach(ip);
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	ASSERT(xfs_inode_clean(ip));
 
 	__xfs_inode_free(ip);
 	return error;
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 9aea7d68d8ab..6d70daf1c92a 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1740,10 +1740,31 @@ xfs_inactive_ifree(
 		return error;
 	}
 
+	/*
+	 * We do not hold the inode locked across the entire rolling transaction
+	 * here. We only need to hold it for the first transaction that
+	 * xfs_ifree() builds, which may mark the inode XFS_ISTALE if the
+	 * underlying cluster buffer is freed. Relogging an XFS_ISTALE inode
+	 * here breaks the relationship between cluster buffer invalidation and
+	 * stale inode invalidation on cluster buffer item journal commit
+	 * completion, and can result in leaving dirty stale inodes hanging
+	 * around in memory.
+	 *
+	 * We have no need for serialising this inode operation against other
+	 * operations - we freed the inode and hence reallocation is required
+	 * and that will serialise on reallocating the space the deferops need
+	 * to free. Hence we can unlock the inode on the first commit of
+	 * the transaction rather than roll it right through the deferops. This
+	 * avoids relogging the XFS_ISTALE inode.
+	 *
+	 * We check that xfs_ifree() hasn't grown an internal transaction roll
+	 * by asserting that the inode is still locked when it returns.
+	 */
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
-	xfs_trans_ijoin(tp, ip, 0);
+	xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
 
 	error = xfs_ifree(tp, ip);
+	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
 	if (error) {
 		/*
 		 * If we fail to free the inode, shut down.  The cancel
@@ -1756,7 +1777,6 @@ xfs_inactive_ifree(
 			xfs_force_shutdown(mp, SHUTDOWN_META_IO_ERROR);
 		}
 		xfs_trans_cancel(tp);
-		xfs_iunlock(ip, XFS_ILOCK_EXCL);
 		return error;
 	}
 
@@ -1774,7 +1794,6 @@ xfs_inactive_ifree(
 		xfs_notice(mp, "%s: xfs_trans_commit returned error %d",
 			__func__, error);
 
-	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 	return 0;
 }
 
-- 
2.26.2.761.g0e0b3e54be



* [PATCH 02/30] xfs: remove logged flag from inode log item
From: Dave Chinner @ 2020-06-22  8:15 UTC
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

This was used to track if the item had logged fields being flushed
to disk. We log everything in the inode these days, so this logic is
no longer needed. Remove it.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_inode.c      | 13 ++++---------
 fs/xfs/xfs_inode_item.c | 35 ++++++++++-------------------------
 fs/xfs/xfs_inode_item.h |  1 -
 3 files changed, 14 insertions(+), 35 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 6d70daf1c92a..2e5fa6dea83a 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2679,7 +2679,6 @@ xfs_ifree_cluster(
 		list_for_each_entry(lip, &bp->b_li_list, li_bio_list) {
 			if (lip->li_type == XFS_LI_INODE) {
 				iip = (struct xfs_inode_log_item *)lip;
-				ASSERT(iip->ili_logged == 1);
 				lip->li_cb = xfs_istale_done;
 				xfs_trans_ail_copy_lsn(mp->m_ail,
 							&iip->ili_flush_lsn,
@@ -2708,7 +2707,6 @@ xfs_ifree_cluster(
 			iip->ili_last_fields = iip->ili_fields;
 			iip->ili_fields = 0;
 			iip->ili_fsync_fields = 0;
-			iip->ili_logged = 1;
 			xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,
 						&iip->ili_item.li_lsn);
 
@@ -3840,19 +3838,16 @@ xfs_iflush_int(
 	 *
 	 * We can play with the ili_fields bits here, because the inode lock
 	 * must be held exclusively in order to set bits there and the flush
-	 * lock protects the ili_last_fields bits.  Set ili_logged so the flush
-	 * done routine can tell whether or not to look in the AIL.  Also, store
-	 * the current LSN of the inode so that we can tell whether the item has
-	 * moved in the AIL from xfs_iflush_done().  In order to read the lsn we
-	 * need the AIL lock, because it is a 64 bit value that cannot be read
-	 * atomically.
+	 * lock protects the ili_last_fields bits.  Store the current LSN of the
+	 * inode so that we can tell whether the item has moved in the AIL from
+	 * xfs_iflush_done().  In order to read the lsn we need the AIL lock,
+	 * because it is a 64 bit value that cannot be read atomically.
 	 */
 	error = 0;
 flush_out:
 	iip->ili_last_fields = iip->ili_fields;
 	iip->ili_fields = 0;
 	iip->ili_fsync_fields = 0;
-	iip->ili_logged = 1;
 
 	xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,
 				&iip->ili_item.li_lsn);
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index ba47bf65b772..b17384aa8df4 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -528,8 +528,6 @@ xfs_inode_item_push(
 	}
 
 	ASSERT(iip->ili_fields != 0 || XFS_FORCED_SHUTDOWN(ip->i_mount));
-	ASSERT(iip->ili_logged == 0 || XFS_FORCED_SHUTDOWN(ip->i_mount));
-
 	spin_unlock(&lip->li_ailp->ail_lock);
 
 	error = xfs_iflush(ip, &bp);
@@ -690,30 +688,24 @@ xfs_iflush_done(
 			continue;
 
 		list_move_tail(&blip->li_bio_list, &tmp);
-		/*
-		 * while we have the item, do the unlocked check for needing
-		 * the AIL lock.
-		 */
+
+		/* Do an unlocked check for needing the AIL lock. */
 		iip = INODE_ITEM(blip);
-		if ((iip->ili_logged && blip->li_lsn == iip->ili_flush_lsn) ||
+		if (blip->li_lsn == iip->ili_flush_lsn ||
 		    test_bit(XFS_LI_FAILED, &blip->li_flags))
 			need_ail++;
 	}
 
 	/* make sure we capture the state of the initial inode. */
 	iip = INODE_ITEM(lip);
-	if ((iip->ili_logged && lip->li_lsn == iip->ili_flush_lsn) ||
+	if (lip->li_lsn == iip->ili_flush_lsn ||
 	    test_bit(XFS_LI_FAILED, &lip->li_flags))
 		need_ail++;
 
 	/*
-	 * We only want to pull the item from the AIL if it is
-	 * actually there and its location in the log has not
-	 * changed since we started the flush.  Thus, we only bother
-	 * if the ili_logged flag is set and the inode's lsn has not
-	 * changed.  First we check the lsn outside
-	 * the lock since it's cheaper, and then we recheck while
-	 * holding the lock before removing the inode from the AIL.
+	 * We only want to pull the item from the AIL if it is actually there
+	 * and its location in the log has not changed since we started the
+	 * flush.  Thus, we only bother if the inode's lsn has not changed.
 	 */
 	if (need_ail) {
 		xfs_lsn_t	tail_lsn = 0;
@@ -721,8 +713,7 @@ xfs_iflush_done(
 		/* this is an opencoded batch version of xfs_trans_ail_delete */
 		spin_lock(&ailp->ail_lock);
 		list_for_each_entry(blip, &tmp, li_bio_list) {
-			if (INODE_ITEM(blip)->ili_logged &&
-			    blip->li_lsn == INODE_ITEM(blip)->ili_flush_lsn) {
+			if (blip->li_lsn == INODE_ITEM(blip)->ili_flush_lsn) {
 				/*
 				 * xfs_ail_update_finish() only cares about the
 				 * lsn of the first tail item removed, any
@@ -740,14 +731,13 @@ xfs_iflush_done(
 	}
 
 	/*
-	 * clean up and unlock the flush lock now we are done. We can clear the
+	 * Clean up and unlock the flush lock now we are done. We can clear the
 	 * ili_last_fields bits now that we know that the data corresponding to
 	 * them is safely on disk.
 	 */
 	list_for_each_entry_safe(blip, n, &tmp, li_bio_list) {
 		list_del_init(&blip->li_bio_list);
 		iip = INODE_ITEM(blip);
-		iip->ili_logged = 0;
 		iip->ili_last_fields = 0;
 		xfs_ifunlock(iip->ili_inode);
 	}
@@ -768,16 +758,11 @@ xfs_iflush_abort(
 
 	if (iip) {
 		xfs_trans_ail_delete(&iip->ili_item, 0);
-		iip->ili_logged = 0;
-		/*
-		 * Clear the ili_last_fields bits now that we know that the
-		 * data corresponding to them is safely on disk.
-		 */
-		iip->ili_last_fields = 0;
 		/*
 		 * Clear the inode logging fields so no more flushes are
 		 * attempted.
 		 */
+		iip->ili_last_fields = 0;
 		iip->ili_fields = 0;
 		iip->ili_fsync_fields = 0;
 	}
diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h
index 60b34bb66e8e..4de5070e0765 100644
--- a/fs/xfs/xfs_inode_item.h
+++ b/fs/xfs/xfs_inode_item.h
@@ -19,7 +19,6 @@ struct xfs_inode_log_item {
 	xfs_lsn_t		ili_flush_lsn;	   /* lsn at last flush */
 	xfs_lsn_t		ili_last_lsn;	   /* lsn at last transaction */
 	unsigned short		ili_lock_flags;	   /* lock flags */
-	unsigned short		ili_logged;	   /* flushed logged data */
 	unsigned int		ili_last_fields;   /* fields when flushed */
 	unsigned int		ili_fields;	   /* fields to be logged */
 	unsigned int		ili_fsync_fields;  /* logged since last fsync */
-- 
2.26.2.761.g0e0b3e54be



* [PATCH 03/30] xfs: add an inode item lock
From: Dave Chinner @ 2020-06-22  8:15 UTC
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

The inode log item is kind of special in that it can be aggregating
new changes in memory at the same time that existing changes are
being written back to disk. This means there are fields in the log
item that are accessed concurrently from contexts that don't share
any locking at all.

e.g. ili_last_fields is updated at flush time under both the
ILOCK_EXCL and the flush lock, cleared under just the flush lock at
IO completion time, and read under the ILOCK_EXCL when the inode is
logged.  Hence there is no actual serialisation between reading the
field during logging of the inode in transactions vs clearing the
field in IO completion.

We currently get away with this by the fact that we are only
clearing fields in IO completion, and nothing bad happens if we
accidentally log more of the inode than we actually modify. Worst
case is we consume a tiny bit more memory and log bandwidth.

However, if we want to do more complex state manipulations on the
log item that requires updates at all three of these potential
locations, we need to have some mechanism of serialising those
operations. To do this, introduce a spinlock into the log item to
serialise internal state.

This could be done via the xfs_inode i_flags_lock, but this then
leads to potential lock inversion issues where inode flag updates
need to occur inside locks that best nest inside the inode log item
locks (e.g. marking inodes stale during inode cluster freeing).
Using a separate spinlock avoids these sorts of problems and
simplifies future code.

This does not touch the use of ili_fields in the item formatting
code - that is entirely protected by the ILOCK_EXCL at this point in
time, so it remains untouched.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_trans_inode.c | 52 ++++++++++++++++-----------------
 fs/xfs/xfs_file.c               |  9 ++++--
 fs/xfs/xfs_inode.c              | 20 ++++++++-----
 fs/xfs/xfs_inode_item.c         |  7 +++++
 fs/xfs/xfs_inode_item.h         | 18 ++++++++++--
 5 files changed, 66 insertions(+), 40 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
index 4504d215cd59..c66d9d1dd58b 100644
--- a/fs/xfs/libxfs/xfs_trans_inode.c
+++ b/fs/xfs/libxfs/xfs_trans_inode.c
@@ -82,16 +82,20 @@ xfs_trans_ichgtime(
  */
 void
 xfs_trans_log_inode(
-	xfs_trans_t	*tp,
-	xfs_inode_t	*ip,
-	uint		flags)
+	struct xfs_trans	*tp,
+	struct xfs_inode	*ip,
+	uint			flags)
 {
-	struct inode	*inode = VFS_I(ip);
+	struct xfs_inode_log_item *iip = ip->i_itemp;
+	struct inode		*inode = VFS_I(ip);
+	uint			iversion_flags = 0;
 
-	ASSERT(ip->i_itemp != NULL);
+	ASSERT(iip);
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
 	ASSERT(!xfs_iflags_test(ip, XFS_ISTALE));
 
+	tp->t_flags |= XFS_TRANS_DIRTY;
+
 	/*
 	 * Don't bother with i_lock for the I_DIRTY_TIME check here, as races
 	 * don't matter - we either will need an extra transaction in 24 hours
@@ -104,15 +108,6 @@ xfs_trans_log_inode(
 		spin_unlock(&inode->i_lock);
 	}
 
-	/*
-	 * Record the specific change for fdatasync optimisation. This
-	 * allows fdatasync to skip log forces for inodes that are only
-	 * timestamp dirty. We do this before the change count so that
-	 * the core being logged in this case does not impact on fdatasync
-	 * behaviour.
-	 */
-	ip->i_itemp->ili_fsync_fields |= flags;
-
 	/*
 	 * First time we log the inode in a transaction, bump the inode change
 	 * counter if it is configured for this to occur. While we have the
@@ -122,23 +117,28 @@ xfs_trans_log_inode(
 	 * set however, then go ahead and bump the i_version counter
 	 * unconditionally.
 	 */
-	if (!test_and_set_bit(XFS_LI_DIRTY, &ip->i_itemp->ili_item.li_flags) &&
-	    IS_I_VERSION(VFS_I(ip))) {
-		if (inode_maybe_inc_iversion(VFS_I(ip), flags & XFS_ILOG_CORE))
-			flags |= XFS_ILOG_CORE;
+	if (!test_and_set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags)) {
+		if (IS_I_VERSION(inode) &&
+		    inode_maybe_inc_iversion(inode, flags & XFS_ILOG_CORE))
+			iversion_flags = XFS_ILOG_CORE;
 	}
 
-	tp->t_flags |= XFS_TRANS_DIRTY;
+	/*
+	 * Record the specific change for fdatasync optimisation. This allows
+	 * fdatasync to skip log forces for inodes that are only timestamp
+	 * dirty.
+	 */
+	spin_lock(&iip->ili_lock);
+	iip->ili_fsync_fields |= flags;
 
 	/*
-	 * Always OR in the bits from the ili_last_fields field.
-	 * This is to coordinate with the xfs_iflush() and xfs_iflush_done()
-	 * routines in the eventual clearing of the ili_fields bits.
-	 * See the big comment in xfs_iflush() for an explanation of
-	 * this coordination mechanism.
+	 * Always OR in the bits from the ili_last_fields field.  This is to
+	 * coordinate with the xfs_iflush() and xfs_iflush_done() routines in
+	 * the eventual clearing of the ili_fields bits.  See the big comment in
+	 * xfs_iflush() for an explanation of this coordination mechanism.
 	 */
-	flags |= ip->i_itemp->ili_last_fields;
-	ip->i_itemp->ili_fields |= flags;
+	iip->ili_fields |= (flags | iip->ili_last_fields | iversion_flags);
+	spin_unlock(&iip->ili_lock);
 }
 
 int
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 00db81eac80d..7b05f8fd7b3d 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -94,6 +94,7 @@ xfs_file_fsync(
 {
 	struct inode		*inode = file->f_mapping->host;
 	struct xfs_inode	*ip = XFS_I(inode);
+	struct xfs_inode_log_item *iip = ip->i_itemp;
 	struct xfs_mount	*mp = ip->i_mount;
 	int			error = 0;
 	int			log_flushed = 0;
@@ -137,13 +138,15 @@ xfs_file_fsync(
 	xfs_ilock(ip, XFS_ILOCK_SHARED);
 	if (xfs_ipincount(ip)) {
 		if (!datasync ||
-		    (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
-			lsn = ip->i_itemp->ili_last_lsn;
+		    (iip->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
+			lsn = iip->ili_last_lsn;
 	}
 
 	if (lsn) {
 		error = xfs_log_force_lsn(mp, lsn, XFS_LOG_SYNC, &log_flushed);
-		ip->i_itemp->ili_fsync_fields = 0;
+		spin_lock(&iip->ili_lock);
+		iip->ili_fsync_fields = 0;
+		spin_unlock(&iip->ili_lock);
 	}
 	xfs_iunlock(ip, XFS_ILOCK_SHARED);
 
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 2e5fa6dea83a..e2296f0b7f11 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2704,9 +2704,11 @@ xfs_ifree_cluster(
 				continue;
 
 			iip = ip->i_itemp;
+			spin_lock(&iip->ili_lock);
 			iip->ili_last_fields = iip->ili_fields;
 			iip->ili_fields = 0;
 			iip->ili_fsync_fields = 0;
+			spin_unlock(&iip->ili_lock);
 			xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,
 						&iip->ili_item.li_lsn);
 
@@ -2742,6 +2744,7 @@ xfs_ifree(
 {
 	int			error;
 	struct xfs_icluster	xic = { 0 };
+	struct xfs_inode_log_item *iip = ip->i_itemp;
 
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
 	ASSERT(VFS_I(ip)->i_nlink == 0);
@@ -2779,7 +2782,9 @@ xfs_ifree(
 	ip->i_df.if_format = XFS_DINODE_FMT_EXTENTS;
 
 	/* Don't attempt to replay owner changes for a deleted inode */
-	ip->i_itemp->ili_fields &= ~(XFS_ILOG_AOWNER|XFS_ILOG_DOWNER);
+	spin_lock(&iip->ili_lock);
+	iip->ili_fields &= ~(XFS_ILOG_AOWNER | XFS_ILOG_DOWNER);
+	spin_unlock(&iip->ili_lock);
 
 	/*
 	 * Bump the generation count so no one will be confused
@@ -3835,20 +3840,19 @@ xfs_iflush_int(
 	 * know that the information those bits represent is permanently on
 	 * disk.  As long as the flush completes before the inode is logged
 	 * again, then both ili_fields and ili_last_fields will be cleared.
-	 *
-	 * We can play with the ili_fields bits here, because the inode lock
-	 * must be held exclusively in order to set bits there and the flush
-	 * lock protects the ili_last_fields bits.  Store the current LSN of the
-	 * inode so that we can tell whether the item has moved in the AIL from
-	 * xfs_iflush_done().  In order to read the lsn we need the AIL lock,
-	 * because it is a 64 bit value that cannot be read atomically.
 	 */
 	error = 0;
 flush_out:
+	spin_lock(&iip->ili_lock);
 	iip->ili_last_fields = iip->ili_fields;
 	iip->ili_fields = 0;
 	iip->ili_fsync_fields = 0;
+	spin_unlock(&iip->ili_lock);
 
+	/*
+	 * Store the current LSN of the inode so that we can tell whether the
+	 * item has moved in the AIL from xfs_iflush_done().
+	 */
 	xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,
 				&iip->ili_item.li_lsn);
 
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index b17384aa8df4..6ef9cbcfc94a 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -637,6 +637,7 @@ xfs_inode_item_init(
 	iip = ip->i_itemp = kmem_zone_zalloc(xfs_ili_zone, 0);
 
 	iip->ili_inode = ip;
+	spin_lock_init(&iip->ili_lock);
 	xfs_log_item_init(mp, &iip->ili_item, XFS_LI_INODE,
 						&xfs_inode_item_ops);
 }
@@ -738,7 +739,11 @@ xfs_iflush_done(
 	list_for_each_entry_safe(blip, n, &tmp, li_bio_list) {
 		list_del_init(&blip->li_bio_list);
 		iip = INODE_ITEM(blip);
+
+		spin_lock(&iip->ili_lock);
 		iip->ili_last_fields = 0;
+		spin_unlock(&iip->ili_lock);
+
 		xfs_ifunlock(iip->ili_inode);
 	}
 	list_del(&tmp);
@@ -762,9 +767,11 @@ xfs_iflush_abort(
 		 * Clear the inode logging fields so no more flushes are
 		 * attempted.
 		 */
+		spin_lock(&iip->ili_lock);
 		iip->ili_last_fields = 0;
 		iip->ili_fields = 0;
 		iip->ili_fsync_fields = 0;
+		spin_unlock(&iip->ili_lock);
 	}
 	/*
 	 * Release the inode's flush lock since we're done with it.
diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h
index 4de5070e0765..4a10a1b92ee9 100644
--- a/fs/xfs/xfs_inode_item.h
+++ b/fs/xfs/xfs_inode_item.h
@@ -16,12 +16,24 @@ struct xfs_mount;
 struct xfs_inode_log_item {
 	struct xfs_log_item	ili_item;	   /* common portion */
 	struct xfs_inode	*ili_inode;	   /* inode ptr */
-	xfs_lsn_t		ili_flush_lsn;	   /* lsn at last flush */
-	xfs_lsn_t		ili_last_lsn;	   /* lsn at last transaction */
-	unsigned short		ili_lock_flags;	   /* lock flags */
+	unsigned short		ili_lock_flags;	   /* inode lock flags */
+	/*
+	 * The ili_lock protects the interactions between the dirty state and
+	 * the flush state of the inode log item. This allows us to do atomic
+	 * modifications of multiple state fields without having to hold a
+	 * specific inode lock to serialise them.
+	 *
+	 * We need atomic changes between inode dirtying, inode flushing and
+	 * inode completion, but these all hold different combinations of
+	 * ILOCK and iflock and hence we need some other method of serialising
+	 * updates to the flush state.
+	 */
+	spinlock_t		ili_lock;	   /* flush state lock */
 	unsigned int		ili_last_fields;   /* fields when flushed */
 	unsigned int		ili_fields;	   /* fields to be logged */
 	unsigned int		ili_fsync_fields;  /* logged since last fsync */
+	xfs_lsn_t		ili_flush_lsn;	   /* lsn at last flush */
+	xfs_lsn_t		ili_last_lsn;	   /* lsn at last transaction */
 };
 
 static inline int xfs_inode_clean(xfs_inode_t *ip)
-- 
2.26.2.761.g0e0b3e54be



* [PATCH 04/30] xfs: mark inode buffers in cache
From: Dave Chinner @ 2020-06-22  8:15 UTC
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Inode buffers always have write IO callbacks, so by marking them
directly we can avoid needing to attach ->b_iodone functions to
them. This avoids an indirect call, and makes future modifications
much simpler.

While this is largely a refactor of existing functionality, we
broaden the scope of the flag to beyond where inodes are explicitly
attached because future changes need to know what type of log items
are attached to the buffer. Adding this buffer flag may invoke the
inode iodone callback in cases where it wouldn't have been
previously, but this is not a functional change because the callback
is identical to the normal buffer write iodone callback when inodes
are not attached.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_buf.c       | 21 ++++++++++++++++-----
 fs/xfs/xfs_buf.h       | 38 +++++++++++++++++++++++++-------------
 fs/xfs/xfs_buf_item.c  | 42 +++++++++++++++++++++++++++++++-----------
 fs/xfs/xfs_buf_item.h  |  1 +
 fs/xfs/xfs_inode.c     |  2 +-
 fs/xfs/xfs_trans_buf.c |  3 +++
 6 files changed, 77 insertions(+), 30 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 20b748f7e186..ae0c923574df 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -14,6 +14,8 @@
 #include "xfs_mount.h"
 #include "xfs_trace.h"
 #include "xfs_log.h"
+#include "xfs_trans.h"
+#include "xfs_buf_item.h"
 #include "xfs_errortag.h"
 #include "xfs_error.h"
 
@@ -1202,12 +1204,21 @@ xfs_buf_ioend(
 		bp->b_flags |= XBF_DONE;
 	}
 
-	if (bp->b_iodone)
+	if (read)
+		goto out_finish;
+
+	if (bp->b_flags & _XBF_INODES) {
+		xfs_buf_inode_iodone(bp);
+		return;
+	}
+
+	if (bp->b_iodone) {
 		(*(bp->b_iodone))(bp);
-	else if (bp->b_flags & XBF_ASYNC)
-		xfs_buf_relse(bp);
-	else
-		complete(&bp->b_iowait);
+		return;
+	}
+
+out_finish:
+	xfs_buf_ioend_finish(bp);
 }
 
 static void
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index 050c53b739e2..2400cb90a04c 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -30,15 +30,18 @@
 #define XBF_STALE	 (1 << 6) /* buffer has been staled, do not find it */
 #define XBF_WRITE_FAIL	 (1 << 7) /* async writes have failed on this buffer */
 
-/* flags used only as arguments to access routines */
-#define XBF_TRYLOCK	 (1 << 16)/* lock requested, but do not wait */
-#define XBF_UNMAPPED	 (1 << 17)/* do not map the buffer */
+/* buffer type flags for write callbacks */
+#define _XBF_INODES	 (1 << 16)/* inode buffer */
 
 /* flags used only internally */
 #define _XBF_PAGES	 (1 << 20)/* backed by refcounted pages */
 #define _XBF_KMEM	 (1 << 21)/* backed by heap memory */
 #define _XBF_DELWRI_Q	 (1 << 22)/* buffer on a delwri queue */
 
+/* flags used only as arguments to access routines */
+#define XBF_TRYLOCK	 (1 << 30)/* lock requested, but do not wait */
+#define XBF_UNMAPPED	 (1 << 31)/* do not map the buffer */
+
 typedef unsigned int xfs_buf_flags_t;
 
 #define XFS_BUF_FLAGS \
@@ -50,12 +53,13 @@ typedef unsigned int xfs_buf_flags_t;
 	{ XBF_DONE,		"DONE" }, \
 	{ XBF_STALE,		"STALE" }, \
 	{ XBF_WRITE_FAIL,	"WRITE_FAIL" }, \
-	{ XBF_TRYLOCK,		"TRYLOCK" },	/* should never be set */\
-	{ XBF_UNMAPPED,		"UNMAPPED" },	/* ditto */\
+	{ _XBF_INODES,		"INODES" }, \
 	{ _XBF_PAGES,		"PAGES" }, \
 	{ _XBF_KMEM,		"KMEM" }, \
-	{ _XBF_DELWRI_Q,	"DELWRI_Q" }
-
+	{ _XBF_DELWRI_Q,	"DELWRI_Q" }, \
+	/* The following interface flags should never be set */ \
+	{ XBF_TRYLOCK,		"TRYLOCK" }, \
+	{ XBF_UNMAPPED,		"UNMAPPED" }
 
 /*
  * Internal state flags.
@@ -257,9 +261,23 @@ extern void xfs_buf_unlock(xfs_buf_t *);
 #define xfs_buf_islocked(bp) \
 	((bp)->b_sema.count <= 0)
 
+static inline void xfs_buf_relse(xfs_buf_t *bp)
+{
+	xfs_buf_unlock(bp);
+	xfs_buf_rele(bp);
+}
+
 /* Buffer Read and Write Routines */
 extern int xfs_bwrite(struct xfs_buf *bp);
 extern void xfs_buf_ioend(struct xfs_buf *bp);
+static inline void xfs_buf_ioend_finish(struct xfs_buf *bp)
+{
+	if (bp->b_flags & XBF_ASYNC)
+		xfs_buf_relse(bp);
+	else
+		complete(&bp->b_iowait);
+}
+
 extern void __xfs_buf_ioerror(struct xfs_buf *bp, int error,
 		xfs_failaddr_t failaddr);
 #define xfs_buf_ioerror(bp, err) __xfs_buf_ioerror((bp), (err), __this_address)
@@ -324,12 +342,6 @@ static inline int xfs_buf_ispinned(struct xfs_buf *bp)
 	return atomic_read(&bp->b_pin_count);
 }
 
-static inline void xfs_buf_relse(xfs_buf_t *bp)
-{
-	xfs_buf_unlock(bp);
-	xfs_buf_rele(bp);
-}
-
 static inline int
 xfs_buf_verify_cksum(struct xfs_buf *bp, unsigned long cksum_offset)
 {
diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 9e75e8d6042e..8659cf4282a6 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -1158,20 +1158,15 @@ xfs_buf_iodone_callback_error(
 	return false;
 }
 
-/*
- * This is the iodone() function for buffers which have had callbacks attached
- * to them by xfs_buf_attach_iodone(). We need to iterate the items on the
- * callback list, mark the buffer as having no more callbacks and then push the
- * buffer through IO completion processing.
- */
-void
-xfs_buf_iodone_callbacks(
+static void
+xfs_buf_run_callbacks(
 	struct xfs_buf		*bp)
 {
+
 	/*
-	 * If there is an error, process it. Some errors require us
-	 * to run callbacks after failure processing is done so we
-	 * detect that and take appropriate action.
+	 * If there is an error, process it. Some errors require us to run
+	 * callbacks after failure processing is done so we detect that and take
+	 * appropriate action.
 	 */
 	if (bp->b_error && xfs_buf_iodone_callback_error(bp))
 		return;
@@ -1188,9 +1183,34 @@ xfs_buf_iodone_callbacks(
 	bp->b_log_item = NULL;
 	list_del_init(&bp->b_li_list);
 	bp->b_iodone = NULL;
+}
+
+/*
+ * This is the iodone() function for buffers which have had callbacks attached
+ * to them by xfs_buf_attach_iodone(). We need to iterate the items on the
+ * callback list, mark the buffer as having no more callbacks and then push the
+ * buffer through IO completion processing.
+ */
+void
+xfs_buf_iodone_callbacks(
+	struct xfs_buf		*bp)
+{
+	xfs_buf_run_callbacks(bp);
 	xfs_buf_ioend(bp);
 }
 
+/*
+ * Inode buffer iodone callback function.
+ */
+void
+xfs_buf_inode_iodone(
+	struct xfs_buf		*bp)
+{
+	xfs_buf_run_callbacks(bp);
+	xfs_buf_ioend_finish(bp);
+}
+
+
 /*
  * This is the iodone() function for buffers which have been
  * logged.  It is called when they are eventually flushed out.
diff --git a/fs/xfs/xfs_buf_item.h b/fs/xfs/xfs_buf_item.h
index c9c57e2da932..a342933ad9b8 100644
--- a/fs/xfs/xfs_buf_item.h
+++ b/fs/xfs/xfs_buf_item.h
@@ -59,6 +59,7 @@ void	xfs_buf_attach_iodone(struct xfs_buf *,
 			      struct xfs_log_item *);
 void	xfs_buf_iodone_callbacks(struct xfs_buf *);
 void	xfs_buf_iodone(struct xfs_buf *, struct xfs_log_item *);
+void	xfs_buf_inode_iodone(struct xfs_buf *);
 bool	xfs_buf_log_check_iovec(struct xfs_log_iovec *iovec);
 
 extern kmem_zone_t	*xfs_buf_item_zone;
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index e2296f0b7f11..b0760a900e11 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3862,13 +3862,13 @@ xfs_iflush_int(
 	 * completion on the buffer to remove the inode from the AIL and release
 	 * the flush lock.
 	 */
+	bp->b_flags |= _XBF_INODES;
 	xfs_buf_attach_iodone(bp, xfs_iflush_done, &iip->ili_item);
 
 	/* generate the checksum. */
 	xfs_dinode_calc_crc(mp, dip);
 
 	ASSERT(!list_empty(&bp->b_li_list));
-	ASSERT(bp->b_iodone != NULL);
 	return error;
 }
 
diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c
index 08174ffa2118..552d0869aa0f 100644
--- a/fs/xfs/xfs_trans_buf.c
+++ b/fs/xfs/xfs_trans_buf.c
@@ -626,6 +626,7 @@ xfs_trans_inode_buf(
 	ASSERT(atomic_read(&bip->bli_refcount) > 0);
 
 	bip->bli_flags |= XFS_BLI_INODE_BUF;
+	bp->b_flags |= _XBF_INODES;
 	xfs_trans_buf_set_type(tp, bp, XFS_BLFT_DINO_BUF);
 }
 
@@ -651,6 +652,7 @@ xfs_trans_stale_inode_buf(
 
 	bip->bli_flags |= XFS_BLI_STALE_INODE;
 	bip->bli_item.li_cb = xfs_buf_iodone;
+	bp->b_flags |= _XBF_INODES;
 	xfs_trans_buf_set_type(tp, bp, XFS_BLFT_DINO_BUF);
 }
 
@@ -675,6 +677,7 @@ xfs_trans_inode_alloc_buf(
 	ASSERT(atomic_read(&bip->bli_refcount) > 0);
 
 	bip->bli_flags |= XFS_BLI_INODE_ALLOC_BUF;
+	bp->b_flags |= _XBF_INODES;
 	xfs_trans_buf_set_type(tp, bp, XFS_BLFT_DINO_BUF);
 }
 
-- 
2.26.2.761.g0e0b3e54be



* [PATCH 05/30] xfs: mark dquot buffers in cache
From: Dave Chinner @ 2020-06-22  8:15 UTC
  To: linux-xfs

dquot buffers always have write IO callbacks, so by marking them
directly we can avoid needing to attach ->b_iodone functions to
them. This avoids an indirect call, and makes future modifications
much simpler.

This is largely a rearrangement of the code at this point - there
are no IO completion functionality changes, just modifications to
how the code is run.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_buf.c       |  5 +++++
 fs/xfs/xfs_buf.h       |  2 ++
 fs/xfs/xfs_buf_item.c  | 10 ++++++++++
 fs/xfs/xfs_buf_item.h  |  1 +
 fs/xfs/xfs_dquot.c     |  1 +
 fs/xfs/xfs_trans_buf.c |  1 +
 6 files changed, 20 insertions(+)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index ae0c923574df..517932675b12 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1212,6 +1212,11 @@ xfs_buf_ioend(
 		return;
 	}
 
+	if (bp->b_flags & _XBF_DQUOTS) {
+		xfs_buf_dquot_iodone(bp);
+		return;
+	}
+
 	if (bp->b_iodone) {
 		(*(bp->b_iodone))(bp);
 		return;
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index 2400cb90a04c..c1d0843206dd 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -32,6 +32,7 @@
 
 /* buffer type flags for write callbacks */
 #define _XBF_INODES	 (1 << 16)/* inode buffer */
+#define _XBF_DQUOTS	 (1 << 17)/* dquot buffer */
 
 /* flags used only internally */
 #define _XBF_PAGES	 (1 << 20)/* backed by refcounted pages */
@@ -54,6 +55,7 @@ typedef unsigned int xfs_buf_flags_t;
 	{ XBF_STALE,		"STALE" }, \
 	{ XBF_WRITE_FAIL,	"WRITE_FAIL" }, \
 	{ _XBF_INODES,		"INODES" }, \
+	{ _XBF_DQUOTS,		"DQUOTS" }, \
 	{ _XBF_PAGES,		"PAGES" }, \
 	{ _XBF_KMEM,		"KMEM" }, \
 	{ _XBF_DELWRI_Q,	"DELWRI_Q" }, \
diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 8659cf4282a6..a42cdf9ccc47 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -1210,6 +1210,16 @@ xfs_buf_inode_iodone(
 	xfs_buf_ioend_finish(bp);
 }
 
+/*
+ * Dquot buffer iodone callback function.
+ */
+void
+xfs_buf_dquot_iodone(
+	struct xfs_buf		*bp)
+{
+	xfs_buf_run_callbacks(bp);
+	xfs_buf_ioend_finish(bp);
+}
 
 /*
  * This is the iodone() function for buffers which have been
diff --git a/fs/xfs/xfs_buf_item.h b/fs/xfs/xfs_buf_item.h
index a342933ad9b8..27d13d29b5bb 100644
--- a/fs/xfs/xfs_buf_item.h
+++ b/fs/xfs/xfs_buf_item.h
@@ -60,6 +60,7 @@ void	xfs_buf_attach_iodone(struct xfs_buf *,
 void	xfs_buf_iodone_callbacks(struct xfs_buf *);
 void	xfs_buf_iodone(struct xfs_buf *, struct xfs_log_item *);
 void	xfs_buf_inode_iodone(struct xfs_buf *);
+void	xfs_buf_dquot_iodone(struct xfs_buf *);
 bool	xfs_buf_log_check_iovec(struct xfs_log_iovec *iovec);
 
 extern kmem_zone_t	*xfs_buf_item_zone;
diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
index d5b7f03e93c8..2e2146fa0914 100644
--- a/fs/xfs/xfs_dquot.c
+++ b/fs/xfs/xfs_dquot.c
@@ -1179,6 +1179,7 @@ xfs_qm_dqflush(
 	 * Attach an iodone routine so that we can remove this dquot from the
 	 * AIL and release the flush lock once the dquot is synced to disk.
 	 */
+	bp->b_flags |= _XBF_DQUOTS;
 	xfs_buf_attach_iodone(bp, xfs_qm_dqflush_done,
 				  &dqp->q_logitem.qli_item);
 
diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c
index 552d0869aa0f..93d62cb864c1 100644
--- a/fs/xfs/xfs_trans_buf.c
+++ b/fs/xfs/xfs_trans_buf.c
@@ -788,5 +788,6 @@ xfs_trans_dquot_buf(
 		break;
 	}
 
+	bp->b_flags |= _XBF_DQUOTS;
 	xfs_trans_buf_set_type(tp, bp, type);
 }
-- 
2.26.2.761.g0e0b3e54be



* [PATCH 06/30] xfs: mark log recovery buffers for completion
From: Dave Chinner @ 2020-06-22  8:15 UTC
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Log recovery has its own buffer write completion handler for
buffers that it directly recovers. Convert these to direct calls by
flagging these buffers as being log recovery buffers. The flag will
get cleared by the log recovery IO completion routine, so it will
never leak out of log recovery.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_buf.c                | 10 ++++++++++
 fs/xfs/xfs_buf.h                |  2 ++
 fs/xfs/xfs_buf_item_recover.c   |  5 ++---
 fs/xfs/xfs_dquot_item_recover.c |  2 +-
 fs/xfs/xfs_inode_item_recover.c |  2 +-
 fs/xfs/xfs_log_recover.c        |  5 ++---
 6 files changed, 18 insertions(+), 8 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 517932675b12..6a2c942372fc 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -14,6 +14,7 @@
 #include "xfs_mount.h"
 #include "xfs_trace.h"
 #include "xfs_log.h"
+#include "xfs_log_recover.h"
 #include "xfs_trans.h"
 #include "xfs_buf_item.h"
 #include "xfs_errortag.h"
@@ -1207,6 +1208,15 @@ xfs_buf_ioend(
 	if (read)
 		goto out_finish;
 
+	/*
+	 * If this is a log recovery buffer, we aren't doing transactional IO
+	 * yet so we need to let it handle IO completions.
+	 */
+	if (bp->b_flags & _XBF_LOGRECOVERY) {
+		xlog_recover_iodone(bp);
+		return;
+	}
+
 	if (bp->b_flags & _XBF_INODES) {
 		xfs_buf_inode_iodone(bp);
 		return;
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index c1d0843206dd..30dabc5bae96 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -33,6 +33,7 @@
 /* buffer type flags for write callbacks */
 #define _XBF_INODES	 (1 << 16)/* inode buffer */
 #define _XBF_DQUOTS	 (1 << 17)/* dquot buffer */
+#define _XBF_LOGRECOVERY	 (1 << 18)/* log recovery buffer */
 
 /* flags used only internally */
 #define _XBF_PAGES	 (1 << 20)/* backed by refcounted pages */
@@ -56,6 +57,7 @@ typedef unsigned int xfs_buf_flags_t;
 	{ XBF_WRITE_FAIL,	"WRITE_FAIL" }, \
 	{ _XBF_INODES,		"INODES" }, \
 	{ _XBF_DQUOTS,		"DQUOTS" }, \
+	{ _XBF_LOGRECOVERY,		"LOG_RECOVERY" }, \
 	{ _XBF_PAGES,		"PAGES" }, \
 	{ _XBF_KMEM,		"KMEM" }, \
 	{ _XBF_DELWRI_Q,	"DELWRI_Q" }, \
diff --git a/fs/xfs/xfs_buf_item_recover.c b/fs/xfs/xfs_buf_item_recover.c
index 04faa7310c4f..74c851f60eee 100644
--- a/fs/xfs/xfs_buf_item_recover.c
+++ b/fs/xfs/xfs_buf_item_recover.c
@@ -419,8 +419,7 @@ xlog_recover_validate_buf_type(
 	if (bp->b_ops) {
 		struct xfs_buf_log_item	*bip;
 
-		ASSERT(!bp->b_iodone || bp->b_iodone == xlog_recover_iodone);
-		bp->b_iodone = xlog_recover_iodone;
+		bp->b_flags |= _XBF_LOGRECOVERY;
 		xfs_buf_item_init(bp, mp);
 		bip = bp->b_log_item;
 		bip->bli_item.li_lsn = current_lsn;
@@ -963,7 +962,7 @@ xlog_recover_buf_commit_pass2(
 		error = xfs_bwrite(bp);
 	} else {
 		ASSERT(bp->b_mount == mp);
-		bp->b_iodone = xlog_recover_iodone;
+		bp->b_flags |= _XBF_LOGRECOVERY;
 		xfs_buf_delwri_queue(bp, buffer_list);
 	}
 
diff --git a/fs/xfs/xfs_dquot_item_recover.c b/fs/xfs/xfs_dquot_item_recover.c
index 3400be4c88f0..f9ea9f55aa7c 100644
--- a/fs/xfs/xfs_dquot_item_recover.c
+++ b/fs/xfs/xfs_dquot_item_recover.c
@@ -153,7 +153,7 @@ xlog_recover_dquot_commit_pass2(
 
 	ASSERT(dq_f->qlf_size == 2);
 	ASSERT(bp->b_mount == mp);
-	bp->b_iodone = xlog_recover_iodone;
+	bp->b_flags |= _XBF_LOGRECOVERY;
 	xfs_buf_delwri_queue(bp, buffer_list);
 
 out_release:
diff --git a/fs/xfs/xfs_inode_item_recover.c b/fs/xfs/xfs_inode_item_recover.c
index dc3e26ff16c9..5e0d291835b3 100644
--- a/fs/xfs/xfs_inode_item_recover.c
+++ b/fs/xfs/xfs_inode_item_recover.c
@@ -376,7 +376,7 @@ xlog_recover_inode_commit_pass2(
 	xfs_dinode_calc_crc(log->l_mp, dip);
 
 	ASSERT(bp->b_mount == mp);
-	bp->b_iodone = xlog_recover_iodone;
+	bp->b_flags |= _XBF_LOGRECOVERY;
 	xfs_buf_delwri_queue(bp, buffer_list);
 
 out_release:
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index ec015df55b77..52a65a74208f 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -287,9 +287,8 @@ xlog_recover_iodone(
 	if (bp->b_log_item)
 		xfs_buf_item_relse(bp);
 	ASSERT(bp->b_log_item == NULL);
-
-	bp->b_iodone = NULL;
-	xfs_buf_ioend(bp);
+	bp->b_flags &= ~_XBF_LOGRECOVERY;
+	xfs_buf_ioend_finish(bp);
 }
 
 /*
-- 
2.26.2.761.g0e0b3e54be


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 07/30] xfs: call xfs_buf_iodone directly
  2020-06-22  8:15 [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (5 preceding siblings ...)
  2020-06-22  8:15 ` [PATCH 06/30] xfs: mark log recovery buffers for completion Dave Chinner
@ 2020-06-22  8:15 ` Dave Chinner
  2020-06-22  8:15 ` [PATCH 08/30] xfs: clean up whacky buffer log item list reinit Dave Chinner
                   ` (23 subsequent siblings)
  30 siblings, 0 replies; 43+ messages in thread
From: Dave Chinner @ 2020-06-22  8:15 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

All unmarked dirty buffers should be in the AIL and have log items
attached to them. Hence when they are written, we will run a
callback to remove the item from the AIL if appropriate. Now that
we've handled inode and dquot buffers, all remaining calls are to
xfs_buf_iodone() and so we can hard code this rather than use an
indirect call.
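
As an illustration, here is a minimal userspace sketch of the resulting
write completion dispatch. The struct, the stub handlers and main() are
stand-ins invented for the sketch; only the flag and function names
follow this series:

#include <stdio.h>

/* Stand-in flags and buffer type so the sketch compiles in userspace. */
#define _XBF_INODES		(1u << 16)
#define _XBF_DQUOTS		(1u << 17)
#define _XBF_LOGRECOVERY	(1u << 18)

struct xfs_buf { unsigned int b_flags; };

static void xlog_recover_iodone(struct xfs_buf *bp)  { puts("recovery"); }
static void xfs_buf_inode_iodone(struct xfs_buf *bp) { puts("inode"); }
static void xfs_buf_dquot_iodone(struct xfs_buf *bp) { puts("dquot"); }
static void xfs_buf_iodone(struct xfs_buf *bp)       { puts("dirty buffer"); }

/* Write completion dispatch: typed buffers first, then the direct call. */
static void buf_write_ioend(struct xfs_buf *bp)
{
	if (bp->b_flags & _XBF_LOGRECOVERY) {
		xlog_recover_iodone(bp);
		return;
	}
	if (bp->b_flags & _XBF_INODES) {
		xfs_buf_inode_iodone(bp);
		return;
	}
	if (bp->b_flags & _XBF_DQUOTS) {
		xfs_buf_dquot_iodone(bp);
		return;
	}
	xfs_buf_iodone(bp);	/* all remaining dirty buffers */
}

int main(void)
{
	struct xfs_buf bp = { .b_flags = _XBF_INODES };

	buf_write_ioend(&bp);
	return 0;
}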

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_buf.c       | 24 ++++++++----------------
 fs/xfs/xfs_buf.h       |  6 +-----
 fs/xfs/xfs_buf_item.c  | 40 ++++++++++------------------------------
 fs/xfs/xfs_buf_item.h  |  4 ++--
 fs/xfs/xfs_trans_buf.c | 13 +++----------
 5 files changed, 24 insertions(+), 63 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 6a2c942372fc..dda0c9445879 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -658,7 +658,6 @@ xfs_buf_find(
 	 */
 	if (bp->b_flags & XBF_STALE) {
 		ASSERT((bp->b_flags & _XBF_DELWRI_Q) == 0);
-		ASSERT(bp->b_iodone == NULL);
 		bp->b_flags &= _XBF_KMEM | _XBF_PAGES;
 		bp->b_ops = NULL;
 	}
@@ -1194,10 +1193,13 @@ xfs_buf_ioend(
 	if (!bp->b_error && bp->b_io_error)
 		xfs_buf_ioerror(bp, bp->b_io_error);
 
-	/* Only validate buffers that were read without errors */
-	if (read && !bp->b_error && bp->b_ops) {
-		ASSERT(!bp->b_iodone);
-		bp->b_ops->verify_read(bp);
+	if (read) {
+		if (!bp->b_error && bp->b_ops)
+			bp->b_ops->verify_read(bp);
+		if (!bp->b_error)
+			bp->b_flags |= XBF_DONE;
+		xfs_buf_ioend_finish(bp);
+		return;
 	}
 
 	if (!bp->b_error) {
@@ -1205,9 +1207,6 @@ xfs_buf_ioend(
 		bp->b_flags |= XBF_DONE;
 	}
 
-	if (read)
-		goto out_finish;
-
 	/*
 	 * If this is a log recovery buffer, we aren't doing transactional IO
 	 * yet so we need to let it handle IO completions.
@@ -1226,14 +1225,7 @@ xfs_buf_ioend(
 		xfs_buf_dquot_iodone(bp);
 		return;
 	}
-
-	if (bp->b_iodone) {
-		(*(bp->b_iodone))(bp);
-		return;
-	}
-
-out_finish:
-	xfs_buf_ioend_finish(bp);
+	xfs_buf_iodone(bp);
 }
 
 static void
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index 30dabc5bae96..755b652e695a 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -18,6 +18,7 @@
 /*
  *	Base types
  */
+struct xfs_buf;
 
 #define XFS_BUF_DADDR_NULL	((xfs_daddr_t) (-1LL))
 
@@ -102,10 +103,6 @@ typedef struct xfs_buftarg {
 	struct ratelimit_state	bt_ioerror_rl;
 } xfs_buftarg_t;
 
-struct xfs_buf;
-typedef void (*xfs_buf_iodone_t)(struct xfs_buf *);
-
-
 #define XB_PAGES	2
 
 struct xfs_buf_map {
@@ -158,7 +155,6 @@ typedef struct xfs_buf {
 	xfs_buftarg_t		*b_target;	/* buffer target (device) */
 	void			*b_addr;	/* virtual address of buffer */
 	struct work_struct	b_ioend_work;
-	xfs_buf_iodone_t	b_iodone;	/* I/O completion function */
 	struct completion	b_iowait;	/* queue for I/O waiters */
 	struct xfs_buf_log_item	*b_log_item;
 	struct list_head	b_li_list;	/* Log items list head */
diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index a42cdf9ccc47..d87ae6363a13 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -460,7 +460,6 @@ xfs_buf_item_unpin(
 			xfs_buf_do_callbacks(bp);
 			bp->b_log_item = NULL;
 			list_del_init(&bp->b_li_list);
-			bp->b_iodone = NULL;
 		} else {
 			xfs_trans_ail_delete(lip, SHUTDOWN_LOG_IO_ERROR);
 			xfs_buf_item_relse(bp);
@@ -936,11 +935,7 @@ xfs_buf_item_free(
 }
 
 /*
- * This is called when the buf log item is no longer needed.  It should
- * free the buf log item associated with the given buffer and clear
- * the buffer's pointer to the buf log item.  If there are no more
- * items in the list, clear the b_iodone field of the buffer (see
- * xfs_buf_attach_iodone() below).
+ * xfs_buf_item_relse() is called when the buf log item is no longer needed.
  */
 void
 xfs_buf_item_relse(
@@ -952,9 +947,6 @@ xfs_buf_item_relse(
 	ASSERT(!test_bit(XFS_LI_IN_AIL, &bip->bli_item.li_flags));
 
 	bp->b_log_item = NULL;
-	if (list_empty(&bp->b_li_list))
-		bp->b_iodone = NULL;
-
 	xfs_buf_rele(bp);
 	xfs_buf_item_free(bip);
 }
@@ -962,10 +954,7 @@ xfs_buf_item_relse(
 
 /*
  * Add the given log item with its callback to the list of callbacks
- * to be called when the buffer's I/O completes.  If it is not set
- * already, set the buffer's b_iodone() routine to be
- * xfs_buf_iodone_callbacks() and link the log item into the list of
- * items rooted at b_li_list.
+ * to be called when the buffer's I/O completes.
  */
 void
 xfs_buf_attach_iodone(
@@ -977,10 +966,6 @@ xfs_buf_attach_iodone(
 
 	lip->li_cb = cb;
 	list_add_tail(&lip->li_bio_list, &bp->b_li_list);
-
-	ASSERT(bp->b_iodone == NULL ||
-	       bp->b_iodone == xfs_buf_iodone_callbacks);
-	bp->b_iodone = xfs_buf_iodone_callbacks;
 }
 
 /*
@@ -1096,7 +1081,6 @@ xfs_buf_iodone_callback_error(
 		goto out_stale;
 
 	trace_xfs_buf_item_iodone_async(bp, _RET_IP_);
-	ASSERT(bp->b_iodone != NULL);
 
 	cfg = xfs_error_get_cfg(mp, XFS_ERR_METADATA, bp->b_error);
 
@@ -1182,28 +1166,24 @@ xfs_buf_run_callbacks(
 	xfs_buf_do_callbacks(bp);
 	bp->b_log_item = NULL;
 	list_del_init(&bp->b_li_list);
-	bp->b_iodone = NULL;
 }
 
 /*
- * This is the iodone() function for buffers which have had callbacks attached
- * to them by xfs_buf_attach_iodone(). We need to iterate the items on the
- * callback list, mark the buffer as having no more callbacks and then push the
- * buffer through IO completion processing.
+ * Inode buffer iodone callback function.
  */
 void
-xfs_buf_iodone_callbacks(
+xfs_buf_inode_iodone(
 	struct xfs_buf		*bp)
 {
 	xfs_buf_run_callbacks(bp);
-	xfs_buf_ioend(bp);
+	xfs_buf_ioend_finish(bp);
 }
 
 /*
- * Inode buffer iodone callback function.
+ * Dquot buffer iodone callback function.
  */
 void
-xfs_buf_inode_iodone(
+xfs_buf_dquot_iodone(
 	struct xfs_buf		*bp)
 {
 	xfs_buf_run_callbacks(bp);
@@ -1211,10 +1191,10 @@ xfs_buf_inode_iodone(
 }
 
 /*
- * Dquot buffer iodone callback function.
+ * Dirty buffer iodone callback function.
  */
 void
-xfs_buf_dquot_iodone(
+xfs_buf_iodone(
 	struct xfs_buf		*bp)
 {
 	xfs_buf_run_callbacks(bp);
@@ -1229,7 +1209,7 @@ xfs_buf_dquot_iodone(
  * care of cleaning up the buffer itself.
  */
 void
-xfs_buf_iodone(
+xfs_buf_item_iodone(
 	struct xfs_buf		*bp,
 	struct xfs_log_item	*lip)
 {
diff --git a/fs/xfs/xfs_buf_item.h b/fs/xfs/xfs_buf_item.h
index 27d13d29b5bb..610cd0019328 100644
--- a/fs/xfs/xfs_buf_item.h
+++ b/fs/xfs/xfs_buf_item.h
@@ -57,10 +57,10 @@ bool	xfs_buf_item_dirty_format(struct xfs_buf_log_item *);
 void	xfs_buf_attach_iodone(struct xfs_buf *,
 			      void(*)(struct xfs_buf *, struct xfs_log_item *),
 			      struct xfs_log_item *);
-void	xfs_buf_iodone_callbacks(struct xfs_buf *);
-void	xfs_buf_iodone(struct xfs_buf *, struct xfs_log_item *);
+void	xfs_buf_item_iodone(struct xfs_buf *, struct xfs_log_item *);
 void	xfs_buf_inode_iodone(struct xfs_buf *);
 void	xfs_buf_dquot_iodone(struct xfs_buf *);
+void	xfs_buf_iodone(struct xfs_buf *);
 bool	xfs_buf_log_check_iovec(struct xfs_log_iovec *iovec);
 
 extern kmem_zone_t	*xfs_buf_item_zone;
diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c
index 93d62cb864c1..6752676b94fe 100644
--- a/fs/xfs/xfs_trans_buf.c
+++ b/fs/xfs/xfs_trans_buf.c
@@ -465,24 +465,17 @@ xfs_trans_dirty_buf(
 
 	ASSERT(bp->b_transp == tp);
 	ASSERT(bip != NULL);
-	ASSERT(bp->b_iodone == NULL ||
-	       bp->b_iodone == xfs_buf_iodone_callbacks);
 
 	/*
 	 * Mark the buffer as needing to be written out eventually,
 	 * and set its iodone function to remove the buffer's buf log
 	 * item from the AIL and free it when the buffer is flushed
-	 * to disk.  See xfs_buf_attach_iodone() for more details
-	 * on li_cb and xfs_buf_iodone_callbacks().
-	 * If we end up aborting this transaction, we trap this buffer
-	 * inside the b_bdstrat callback so that this won't get written to
-	 * disk.
+	 * to disk.
 	 */
 	bp->b_flags |= XBF_DONE;
 
 	ASSERT(atomic_read(&bip->bli_refcount) > 0);
-	bp->b_iodone = xfs_buf_iodone_callbacks;
-	bip->bli_item.li_cb = xfs_buf_iodone;
+	bip->bli_item.li_cb = xfs_buf_item_iodone;
 
 	/*
 	 * If we invalidated the buffer within this transaction, then
@@ -651,7 +644,7 @@ xfs_trans_stale_inode_buf(
 	ASSERT(atomic_read(&bip->bli_refcount) > 0);
 
 	bip->bli_flags |= XFS_BLI_STALE_INODE;
-	bip->bli_item.li_cb = xfs_buf_iodone;
+	bip->bli_item.li_cb = xfs_buf_item_iodone;
 	bp->b_flags |= _XBF_INODES;
 	xfs_trans_buf_set_type(tp, bp, XFS_BLFT_DINO_BUF);
 }
-- 
2.26.2.761.g0e0b3e54be


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 08/30] xfs: clean up whacky buffer log item list reinit
  2020-06-22  8:15 [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (6 preceding siblings ...)
  2020-06-22  8:15 ` [PATCH 07/30] xfs: call xfs_buf_iodone directly Dave Chinner
@ 2020-06-22  8:15 ` Dave Chinner
  2020-06-22  8:15 ` [PATCH 09/30] xfs: make inode IO completion buffer centric Dave Chinner
                   ` (22 subsequent siblings)
  30 siblings, 0 replies; 43+ messages in thread
From: Dave Chinner @ 2020-06-22  8:15 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

When we've emptied the buffer log item list, we do a list_del_init
on the list head to reset its pointers to itself. This is
unnecessary as the list is already empty at this point - it is a
left-over fragment from the list_head conversion of the buffer log
item list. Remove these calls.
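
To illustrate why the call is redundant, here is a minimal userspace
model of the list_head semantics; the structure and helpers are
simplified stand-ins for the kernel's <linux/list.h>:

#include <assert.h>

/* Minimal model of the kernel's circular doubly-linked list head. */
struct list_head { struct list_head *next, *prev; };

static void INIT_LIST_HEAD(struct list_head *h) { h->next = h; h->prev = h; }

static void list_del_init(struct list_head *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
	INIT_LIST_HEAD(e);
}

static int list_empty(const struct list_head *h) { return h->next == h; }

int main(void)
{
	struct list_head list;

	INIT_LIST_HEAD(&list);
	/*
	 * An empty list head already points to itself, so list_del_init()
	 * rewrites it with the values it already holds - a no-op, which is
	 * why the calls removed by this patch were unnecessary.
	 */
	list_del_init(&list);
	assert(list_empty(&list));
	return 0;
}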

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_buf_item.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index d87ae6363a13..5b3cd5e90947 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -459,7 +459,6 @@ xfs_buf_item_unpin(
 		if (bip->bli_flags & XFS_BLI_STALE_INODE) {
 			xfs_buf_do_callbacks(bp);
 			bp->b_log_item = NULL;
-			list_del_init(&bp->b_li_list);
 		} else {
 			xfs_trans_ail_delete(lip, SHUTDOWN_LOG_IO_ERROR);
 			xfs_buf_item_relse(bp);
@@ -1165,7 +1164,6 @@ xfs_buf_run_callbacks(
 
 	xfs_buf_do_callbacks(bp);
 	bp->b_log_item = NULL;
-	list_del_init(&bp->b_li_list);
 }
 
 /*
-- 
2.26.2.761.g0e0b3e54be


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 09/30] xfs: make inode IO completion buffer centric
  2020-06-22  8:15 [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (7 preceding siblings ...)
  2020-06-22  8:15 ` [PATCH 08/30] xfs: clean up whacky buffer log item list reinit Dave Chinner
@ 2020-06-22  8:15 ` Dave Chinner
  2020-06-22  8:15 ` [PATCH 10/30] xfs: use direct calls for dquot IO completion Dave Chinner
                   ` (21 subsequent siblings)
  30 siblings, 0 replies; 43+ messages in thread
From: Dave Chinner @ 2020-06-22  8:15 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Having different IO completion callbacks for different inode states
makes things complex. We can detect if the inode is stale via the
XFS_ISTALE flag in IO completion, so we don't need a special
callback just for this.

This means inodes only have a single iodone callback, and inode IO
completion is entirely buffer centric at this point. Hence we no
longer need to use a log item callback at all as we can just call
xfs_iflush_done() directly from the buffer completions and walk the
buffer log item list to complete all the inodes under IO.
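
A minimal userspace sketch of the classification this enables; the
inode struct and flag value are stand-ins for the sketch, and the real
code tests the flag via xfs_iflags_test() under the inode flags lock:

#include <stdio.h>

/* Stand-in inode carrying only the flag this sketch needs. */
#define XFS_ISTALE	(1u << 1)
struct xfs_inode { unsigned int i_flags; };

/*
 * At buffer IO completion, every attached inode can be classified from
 * its own state: stale inodes are aborted, everything else completes,
 * so no separate per-item callback is needed.
 */
static void inode_iodone_dispatch(struct xfs_inode *ip)
{
	if (ip->i_flags & XFS_ISTALE)
		puts("stale: xfs_iflush_abort()");
	else
		puts("flushed: clear flush state, remove from AIL");
}

int main(void)
{
	struct xfs_inode stale = { .i_flags = XFS_ISTALE };
	struct xfs_inode live = { .i_flags = 0 };

	inode_iodone_dispatch(&stale);
	inode_iodone_dispatch(&live);
	return 0;
}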

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_buf_item.c   | 35 ++++++++++++++++++----
 fs/xfs/xfs_inode.c      |  6 ++--
 fs/xfs/xfs_inode_item.c | 65 ++++++++++++++---------------------------
 fs/xfs/xfs_inode_item.h |  5 ++--
 4 files changed, 56 insertions(+), 55 deletions(-)

diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 5b3cd5e90947..a4e416af5c61 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -13,6 +13,8 @@
 #include "xfs_mount.h"
 #include "xfs_trans.h"
 #include "xfs_buf_item.h"
+#include "xfs_inode.h"
+#include "xfs_inode_item.h"
 #include "xfs_trans_priv.h"
 #include "xfs_trace.h"
 #include "xfs_log.h"
@@ -457,7 +459,8 @@ xfs_buf_item_unpin(
 		 * the AIL lock.
 		 */
 		if (bip->bli_flags & XFS_BLI_STALE_INODE) {
-			xfs_buf_do_callbacks(bp);
+			lip->li_cb(bp, lip);
+			xfs_iflush_done(bp);
 			bp->b_log_item = NULL;
 		} else {
 			xfs_trans_ail_delete(lip, SHUTDOWN_LOG_IO_ERROR);
@@ -1141,8 +1144,8 @@ xfs_buf_iodone_callback_error(
 	return false;
 }
 
-static void
-xfs_buf_run_callbacks(
+static inline bool
+xfs_buf_had_callback_errors(
 	struct xfs_buf		*bp)
 {
 
@@ -1152,7 +1155,7 @@ xfs_buf_run_callbacks(
 	 * appropriate action.
 	 */
 	if (bp->b_error && xfs_buf_iodone_callback_error(bp))
-		return;
+		return true;
 
 	/*
 	 * Successful IO or permanent error. Either way, we can clear the
@@ -1161,7 +1164,16 @@ xfs_buf_run_callbacks(
 	bp->b_last_error = 0;
 	bp->b_retries = 0;
 	bp->b_first_retry_time = 0;
+	return false;
+}
 
+static void
+xfs_buf_run_callbacks(
+	struct xfs_buf		*bp)
+{
+
+	if (xfs_buf_had_callback_errors(bp))
+		return;
 	xfs_buf_do_callbacks(bp);
 	bp->b_log_item = NULL;
 }
@@ -1173,7 +1185,20 @@ void
 xfs_buf_inode_iodone(
 	struct xfs_buf		*bp)
 {
-	xfs_buf_run_callbacks(bp);
+	struct xfs_buf_log_item *blip = bp->b_log_item;
+	struct xfs_log_item	*lip;
+
+	if (xfs_buf_had_callback_errors(bp))
+		return;
+
+	/* If there is a buf_log_item attached, run its callback */
+	if (blip) {
+		lip = &blip->bli_item;
+		lip->li_cb(bp, lip);
+		bp->b_log_item = NULL;
+	}
+
+	xfs_iflush_done(bp);
 	xfs_buf_ioend_finish(bp);
 }
 
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index b0760a900e11..4875b6e38597 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2679,7 +2679,6 @@ xfs_ifree_cluster(
 		list_for_each_entry(lip, &bp->b_li_list, li_bio_list) {
 			if (lip->li_type == XFS_LI_INODE) {
 				iip = (struct xfs_inode_log_item *)lip;
-				lip->li_cb = xfs_istale_done;
 				xfs_trans_ail_copy_lsn(mp->m_ail,
 							&iip->ili_flush_lsn,
 							&iip->ili_item.li_lsn);
@@ -2712,8 +2711,7 @@ xfs_ifree_cluster(
 			xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,
 						&iip->ili_item.li_lsn);
 
-			xfs_buf_attach_iodone(bp, xfs_istale_done,
-						  &iip->ili_item);
+			xfs_buf_attach_iodone(bp, NULL, &iip->ili_item);
 
 			if (ip != free_ip)
 				xfs_iunlock(ip, XFS_ILOCK_EXCL);
@@ -3863,7 +3861,7 @@ xfs_iflush_int(
 	 * the flush lock.
 	 */
 	bp->b_flags |= _XBF_INODES;
-	xfs_buf_attach_iodone(bp, xfs_iflush_done, &iip->ili_item);
+	xfs_buf_attach_iodone(bp, NULL, &iip->ili_item);
 
 	/* generate the checksum. */
 	xfs_dinode_calc_crc(mp, dip);
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 6ef9cbcfc94a..7049f2ae8d18 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -668,40 +668,34 @@ xfs_inode_item_destroy(
  */
 void
 xfs_iflush_done(
-	struct xfs_buf		*bp,
-	struct xfs_log_item	*lip)
+	struct xfs_buf		*bp)
 {
 	struct xfs_inode_log_item *iip;
-	struct xfs_log_item	*blip, *n;
-	struct xfs_ail		*ailp = lip->li_ailp;
+	struct xfs_log_item	*lip, *n;
+	struct xfs_ail		*ailp = bp->b_mount->m_ail;
 	int			need_ail = 0;
 	LIST_HEAD(tmp);
 
 	/*
-	 * Scan the buffer IO completions for other inodes being completed and
-	 * attach them to the current inode log item.
+	 * Pull the attached inodes from the buffer one at a time and take the
+	 * appropriate action on them.
 	 */
-
-	list_add_tail(&lip->li_bio_list, &tmp);
-
-	list_for_each_entry_safe(blip, n, &bp->b_li_list, li_bio_list) {
-		if (lip->li_cb != xfs_iflush_done)
+	list_for_each_entry_safe(lip, n, &bp->b_li_list, li_bio_list) {
+		iip = INODE_ITEM(lip);
+		if (xfs_iflags_test(iip->ili_inode, XFS_ISTALE)) {
+			list_del_init(&lip->li_bio_list);
+			xfs_iflush_abort(iip->ili_inode);
 			continue;
+		}
 
-		list_move_tail(&blip->li_bio_list, &tmp);
+		list_move_tail(&lip->li_bio_list, &tmp);
 
 		/* Do an unlocked check for needing the AIL lock. */
-		iip = INODE_ITEM(blip);
-		if (blip->li_lsn == iip->ili_flush_lsn ||
-		    test_bit(XFS_LI_FAILED, &blip->li_flags))
+		if (lip->li_lsn == iip->ili_flush_lsn ||
+		    test_bit(XFS_LI_FAILED, &lip->li_flags))
 			need_ail++;
 	}
-
-	/* make sure we capture the state of the initial inode. */
-	iip = INODE_ITEM(lip);
-	if (lip->li_lsn == iip->ili_flush_lsn ||
-	    test_bit(XFS_LI_FAILED, &lip->li_flags))
-		need_ail++;
+	ASSERT(list_empty(&bp->b_li_list));
 
 	/*
 	 * We only want to pull the item from the AIL if it is actually there
@@ -713,19 +707,13 @@ xfs_iflush_done(
 
 		/* this is an opencoded batch version of xfs_trans_ail_delete */
 		spin_lock(&ailp->ail_lock);
-		list_for_each_entry(blip, &tmp, li_bio_list) {
-			if (blip->li_lsn == INODE_ITEM(blip)->ili_flush_lsn) {
-				/*
-				 * xfs_ail_update_finish() only cares about the
-				 * lsn of the first tail item removed, any
-				 * others will be at the same or higher lsn so
-				 * we just ignore them.
-				 */
-				xfs_lsn_t lsn = xfs_ail_delete_one(ailp, blip);
+		list_for_each_entry(lip, &tmp, li_bio_list) {
+			if (lip->li_lsn == INODE_ITEM(lip)->ili_flush_lsn) {
+				xfs_lsn_t lsn = xfs_ail_delete_one(ailp, lip);
 				if (!tail_lsn && lsn)
 					tail_lsn = lsn;
 			} else {
-				xfs_clear_li_failed(blip);
+				xfs_clear_li_failed(lip);
 			}
 		}
 		xfs_ail_update_finish(ailp, tail_lsn);
@@ -736,9 +724,9 @@ xfs_iflush_done(
 	 * ili_last_fields bits now that we know that the data corresponding to
 	 * them is safely on disk.
 	 */
-	list_for_each_entry_safe(blip, n, &tmp, li_bio_list) {
-		list_del_init(&blip->li_bio_list);
-		iip = INODE_ITEM(blip);
+	list_for_each_entry_safe(lip, n, &tmp, li_bio_list) {
+		list_del_init(&lip->li_bio_list);
+		iip = INODE_ITEM(lip);
 
 		spin_lock(&iip->ili_lock);
 		iip->ili_last_fields = 0;
@@ -746,7 +734,6 @@ xfs_iflush_done(
 
 		xfs_ifunlock(iip->ili_inode);
 	}
-	list_del(&tmp);
 }
 
 /*
@@ -779,14 +766,6 @@ xfs_iflush_abort(
 	xfs_ifunlock(ip);
 }
 
-void
-xfs_istale_done(
-	struct xfs_buf		*bp,
-	struct xfs_log_item	*lip)
-{
-	xfs_iflush_abort(INODE_ITEM(lip)->ili_inode);
-}
-
 /*
  * convert an xfs_inode_log_format struct from the old 32 bit version
  * (which can have different field alignments) to the native 64 bit version
diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h
index 4a10a1b92ee9..048b5e7dee90 100644
--- a/fs/xfs/xfs_inode_item.h
+++ b/fs/xfs/xfs_inode_item.h
@@ -36,15 +36,14 @@ struct xfs_inode_log_item {
 	xfs_lsn_t		ili_last_lsn;	   /* lsn at last transaction */
 };
 
-static inline int xfs_inode_clean(xfs_inode_t *ip)
+static inline int xfs_inode_clean(struct xfs_inode *ip)
 {
 	return !ip->i_itemp || !(ip->i_itemp->ili_fields & XFS_ILOG_ALL);
 }
 
 extern void xfs_inode_item_init(struct xfs_inode *, struct xfs_mount *);
 extern void xfs_inode_item_destroy(struct xfs_inode *);
-extern void xfs_iflush_done(struct xfs_buf *, struct xfs_log_item *);
-extern void xfs_istale_done(struct xfs_buf *, struct xfs_log_item *);
+extern void xfs_iflush_done(struct xfs_buf *);
 extern void xfs_iflush_abort(struct xfs_inode *);
 extern int xfs_inode_item_format_convert(xfs_log_iovec_t *,
 					 struct xfs_inode_log_format *);
-- 
2.26.2.761.g0e0b3e54be


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 10/30] xfs: use direct calls for dquot IO completion
  2020-06-22  8:15 [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (8 preceding siblings ...)
  2020-06-22  8:15 ` [PATCH 09/30] xfs: make inode IO completion buffer centric Dave Chinner
@ 2020-06-22  8:15 ` Dave Chinner
  2020-06-22  8:15 ` [PATCH 11/30] xfs: clean up the buffer iodone callback functions Dave Chinner
                   ` (20 subsequent siblings)
  30 siblings, 0 replies; 43+ messages in thread
From: Dave Chinner @ 2020-06-22  8:15 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Similar to inodes, we can call the dquot IO completion functions
directly from the buffer completion code, removing another user of
log item callbacks for IO completion processing.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_buf_item.c | 18 +++++++++++++++++-
 fs/xfs/xfs_dquot.c    | 18 ++++++++++++++----
 fs/xfs/xfs_dquot.h    |  1 +
 3 files changed, 32 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index a4e416af5c61..f46e5ec28111 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -15,6 +15,9 @@
 #include "xfs_buf_item.h"
 #include "xfs_inode.h"
 #include "xfs_inode_item.h"
+#include "xfs_quota.h"
+#include "xfs_dquot_item.h"
+#include "xfs_dquot.h"
 #include "xfs_trans_priv.h"
 #include "xfs_trace.h"
 #include "xfs_log.h"
@@ -1209,7 +1212,20 @@ void
 xfs_buf_dquot_iodone(
 	struct xfs_buf		*bp)
 {
-	xfs_buf_run_callbacks(bp);
+	struct xfs_buf_log_item *blip = bp->b_log_item;
+	struct xfs_log_item	*lip;
+
+	if (xfs_buf_had_callback_errors(bp))
+		return;
+
+	/* a newly allocated dquot buffer might have a log item attached */
+	if (blip) {
+		lip = &blip->bli_item;
+		lip->li_cb(bp, lip);
+		bp->b_log_item = NULL;
+	}
+
+	xfs_dquot_done(bp);
 	xfs_buf_ioend_finish(bp);
 }
 
diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
index 2e2146fa0914..403bc4e9f21f 100644
--- a/fs/xfs/xfs_dquot.c
+++ b/fs/xfs/xfs_dquot.c
@@ -1048,9 +1048,8 @@ xfs_qm_dqrele(
  * from the AIL if it has not been re-logged, and unlocking the dquot's
  * flush lock. This behavior is very similar to that of inodes..
  */
-STATIC void
+static void
 xfs_qm_dqflush_done(
-	struct xfs_buf		*bp,
 	struct xfs_log_item	*lip)
 {
 	struct xfs_dq_logitem	*qip = (struct xfs_dq_logitem *)lip;
@@ -1091,6 +1090,18 @@ xfs_qm_dqflush_done(
 	xfs_dqfunlock(dqp);
 }
 
+void
+xfs_dquot_done(
+	struct xfs_buf		*bp)
+{
+	struct xfs_log_item	*lip, *n;
+
+	list_for_each_entry_safe(lip, n, &bp->b_li_list, li_bio_list) {
+		list_del_init(&lip->li_bio_list);
+		xfs_qm_dqflush_done(lip);
+	}
+}
+
 /*
  * Write a modified dquot to disk.
  * The dquot must be locked and the flush lock too taken by caller.
@@ -1180,8 +1191,7 @@ xfs_qm_dqflush(
 	 * AIL and release the flush lock once the dquot is synced to disk.
 	 */
 	bp->b_flags |= _XBF_DQUOTS;
-	xfs_buf_attach_iodone(bp, xfs_qm_dqflush_done,
-				  &dqp->q_logitem.qli_item);
+	xfs_buf_attach_iodone(bp, NULL, &dqp->q_logitem.qli_item);
 
 	/*
 	 * If the buffer is pinned then push on the log so we won't
diff --git a/fs/xfs/xfs_dquot.h b/fs/xfs/xfs_dquot.h
index 71e36c85e20b..fe9cc3e08ed6 100644
--- a/fs/xfs/xfs_dquot.h
+++ b/fs/xfs/xfs_dquot.h
@@ -174,6 +174,7 @@ void		xfs_qm_dqput(struct xfs_dquot *dqp);
 void		xfs_dqlock2(struct xfs_dquot *, struct xfs_dquot *);
 
 void		xfs_dquot_set_prealloc_limits(struct xfs_dquot *);
+void		xfs_dquot_done(struct xfs_buf *);
 
 static inline struct xfs_dquot *xfs_qm_dqhold(struct xfs_dquot *dqp)
 {
-- 
2.26.2.761.g0e0b3e54be


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 11/30] xfs: clean up the buffer iodone callback functions
  2020-06-22  8:15 [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (9 preceding siblings ...)
  2020-06-22  8:15 ` [PATCH 10/30] xfs: use direct calls for dquot IO completion Dave Chinner
@ 2020-06-22  8:15 ` Dave Chinner
  2020-06-22  8:15 ` [PATCH 12/30] xfs: get rid of log item callbacks Dave Chinner
                   ` (19 subsequent siblings)
  30 siblings, 0 replies; 43+ messages in thread
From: Dave Chinner @ 2020-06-22  8:15 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Now that we've sorted inode and dquot buffers, we can apply the same
cleanups to dirty buffers with buffer log items. They only have one
callback, too, so we don't need the log item callback. Collapse the
iodone functions and remove all the now unnecessary infrastructure
around callback processing.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_buf_item.c  | 140 +++++++++--------------------------------
 fs/xfs/xfs_buf_item.h  |   1 -
 fs/xfs/xfs_trans_buf.c |   2 -
 3 files changed, 29 insertions(+), 114 deletions(-)

diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index f46e5ec28111..0ece5de9dd71 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -30,7 +30,7 @@ static inline struct xfs_buf_log_item *BUF_ITEM(struct xfs_log_item *lip)
 	return container_of(lip, struct xfs_buf_log_item, bli_item);
 }
 
-STATIC void	xfs_buf_do_callbacks(struct xfs_buf *bp);
+static void xfs_buf_item_done(struct xfs_buf *bp);
 
 /* Is this log iovec plausibly large enough to contain the buffer log format? */
 bool
@@ -462,9 +462,8 @@ xfs_buf_item_unpin(
 		 * the AIL lock.
 		 */
 		if (bip->bli_flags & XFS_BLI_STALE_INODE) {
-			lip->li_cb(bp, lip);
+			xfs_buf_item_done(bp);
 			xfs_iflush_done(bp);
-			bp->b_log_item = NULL;
 		} else {
 			xfs_trans_ail_delete(lip, SHUTDOWN_LOG_IO_ERROR);
 			xfs_buf_item_relse(bp);
@@ -973,46 +972,6 @@ xfs_buf_attach_iodone(
 	list_add_tail(&lip->li_bio_list, &bp->b_li_list);
 }
 
-/*
- * We can have many callbacks on a buffer. Running the callbacks individually
- * can cause a lot of contention on the AIL lock, so we allow for a single
- * callback to be able to scan the remaining items in bp->b_li_list for other
- * items of the same type and callback to be processed in the first call.
- *
- * As a result, the loop walking the callback list below will also modify the
- * list. it removes the first item from the list and then runs the callback.
- * The loop then restarts from the new first item int the list. This allows the
- * callback to scan and modify the list attached to the buffer and we don't
- * have to care about maintaining a next item pointer.
- */
-STATIC void
-xfs_buf_do_callbacks(
-	struct xfs_buf		*bp)
-{
-	struct xfs_buf_log_item *blip = bp->b_log_item;
-	struct xfs_log_item	*lip;
-
-	/* If there is a buf_log_item attached, run its callback */
-	if (blip) {
-		lip = &blip->bli_item;
-		lip->li_cb(bp, lip);
-	}
-
-	while (!list_empty(&bp->b_li_list)) {
-		lip = list_first_entry(&bp->b_li_list, struct xfs_log_item,
-				       li_bio_list);
-
-		/*
-		 * Remove the item from the list, so we don't have any
-		 * confusion if the item is added to another buf.
-		 * Don't touch the log item after calling its
-		 * callback, because it could have freed itself.
-		 */
-		list_del_init(&lip->li_bio_list);
-		lip->li_cb(bp, lip);
-	}
-}
-
 /*
  * Invoke the error state callback for each log item affected by the failed I/O.
  *
@@ -1025,8 +984,8 @@ STATIC void
 xfs_buf_do_callbacks_fail(
 	struct xfs_buf		*bp)
 {
+	struct xfs_ail		*ailp = bp->b_mount->m_ail;
 	struct xfs_log_item	*lip;
-	struct xfs_ail		*ailp;
 
 	/*
 	 * Buffer log item errors are handled directly by xfs_buf_item_push()
@@ -1036,9 +995,6 @@ xfs_buf_do_callbacks_fail(
 	if (list_empty(&bp->b_li_list))
 		return;
 
-	lip = list_first_entry(&bp->b_li_list, struct xfs_log_item,
-			li_bio_list);
-	ailp = lip->li_ailp;
 	spin_lock(&ailp->ail_lock);
 	list_for_each_entry(lip, &bp->b_li_list, li_bio_list) {
 		if (lip->li_ops->iop_error)
@@ -1051,22 +1007,11 @@ static bool
 xfs_buf_iodone_callback_error(
 	struct xfs_buf		*bp)
 {
-	struct xfs_buf_log_item	*bip = bp->b_log_item;
-	struct xfs_log_item	*lip;
-	struct xfs_mount	*mp;
+	struct xfs_mount	*mp = bp->b_mount;
 	static ulong		lasttime;
 	static xfs_buftarg_t	*lasttarg;
 	struct xfs_error_cfg	*cfg;
 
-	/*
-	 * The failed buffer might not have a buf_log_item attached or the
-	 * log_item list might be empty. Get the mp from the available
-	 * xfs_log_item
-	 */
-	lip = list_first_entry_or_null(&bp->b_li_list, struct xfs_log_item,
-				       li_bio_list);
-	mp = lip ? lip->li_mountp : bip->bli_item.li_mountp;
-
 	/*
 	 * If we've already decided to shutdown the filesystem because of
 	 * I/O errors, there's no point in giving this a retry.
@@ -1171,14 +1116,27 @@ xfs_buf_had_callback_errors(
 }
 
 static void
-xfs_buf_run_callbacks(
+xfs_buf_item_done(
 	struct xfs_buf		*bp)
 {
+	struct xfs_buf_log_item	*bip = bp->b_log_item;
 
-	if (xfs_buf_had_callback_errors(bp))
+	if (!bip)
 		return;
-	xfs_buf_do_callbacks(bp);
+
+	/*
+	 * If we are forcibly shutting down, this may well be off the AIL
+	 * already. That's because we simulate the log-committed callbacks to
+	 * unpin these buffers. Or we may never have put this item on AIL
+	 * because of the transaction was aborted forcibly.
+	 * xfs_trans_ail_delete() takes care of these.
+	 *
+	 * Either way, AIL is useless if we're forcing a shutdown.
+	 */
+	xfs_trans_ail_delete(&bip->bli_item, SHUTDOWN_CORRUPT_INCORE);
 	bp->b_log_item = NULL;
+	xfs_buf_item_free(bip);
+	xfs_buf_rele(bp);
 }
 
 /*
@@ -1188,19 +1146,10 @@ void
 xfs_buf_inode_iodone(
 	struct xfs_buf		*bp)
 {
-	struct xfs_buf_log_item *blip = bp->b_log_item;
-	struct xfs_log_item	*lip;
-
 	if (xfs_buf_had_callback_errors(bp))
 		return;
 
-	/* If there is a buf_log_item attached, run its callback */
-	if (blip) {
-		lip = &blip->bli_item;
-		lip->li_cb(bp, lip);
-		bp->b_log_item = NULL;
-	}
-
+	xfs_buf_item_done(bp);
 	xfs_iflush_done(bp);
 	xfs_buf_ioend_finish(bp);
 }
@@ -1212,59 +1161,28 @@ void
 xfs_buf_dquot_iodone(
 	struct xfs_buf		*bp)
 {
-	struct xfs_buf_log_item *blip = bp->b_log_item;
-	struct xfs_log_item	*lip;
-
 	if (xfs_buf_had_callback_errors(bp))
 		return;
 
 	/* a newly allocated dquot buffer might have a log item attached */
-	if (blip) {
-		lip = &blip->bli_item;
-		lip->li_cb(bp, lip);
-		bp->b_log_item = NULL;
-	}
-
+	xfs_buf_item_done(bp);
 	xfs_dquot_done(bp);
 	xfs_buf_ioend_finish(bp);
 }
 
 /*
  * Dirty buffer iodone callback function.
+ *
+ * Note that for things like remote attribute buffers, there may not be a buffer
+ * log item here, so processing the buffer log item must remain optional.
  */
 void
 xfs_buf_iodone(
 	struct xfs_buf		*bp)
 {
-	xfs_buf_run_callbacks(bp);
-	xfs_buf_ioend_finish(bp);
-}
-
-/*
- * This is the iodone() function for buffers which have been
- * logged.  It is called when they are eventually flushed out.
- * It should remove the buf item from the AIL, and free the buf item.
- * It is called by xfs_buf_iodone_callbacks() above which will take
- * care of cleaning up the buffer itself.
- */
-void
-xfs_buf_item_iodone(
-	struct xfs_buf		*bp,
-	struct xfs_log_item	*lip)
-{
-	ASSERT(BUF_ITEM(lip)->bli_buf == bp);
-
-	xfs_buf_rele(bp);
+	if (xfs_buf_had_callback_errors(bp))
+		return;
 
-	/*
-	 * If we are forcibly shutting down, this may well be off the AIL
-	 * already. That's because we simulate the log-committed callbacks to
-	 * unpin these buffers. Or we may never have put this item on AIL
-	 * because of the transaction was aborted forcibly.
-	 * xfs_trans_ail_delete() takes care of these.
-	 *
-	 * Either way, AIL is useless if we're forcing a shutdown.
-	 */
-	xfs_trans_ail_delete(lip, SHUTDOWN_CORRUPT_INCORE);
-	xfs_buf_item_free(BUF_ITEM(lip));
+	xfs_buf_item_done(bp);
+	xfs_buf_ioend_finish(bp);
 }
diff --git a/fs/xfs/xfs_buf_item.h b/fs/xfs/xfs_buf_item.h
index 610cd0019328..7c0bd2a210af 100644
--- a/fs/xfs/xfs_buf_item.h
+++ b/fs/xfs/xfs_buf_item.h
@@ -57,7 +57,6 @@ bool	xfs_buf_item_dirty_format(struct xfs_buf_log_item *);
 void	xfs_buf_attach_iodone(struct xfs_buf *,
 			      void(*)(struct xfs_buf *, struct xfs_log_item *),
 			      struct xfs_log_item *);
-void	xfs_buf_item_iodone(struct xfs_buf *, struct xfs_log_item *);
 void	xfs_buf_inode_iodone(struct xfs_buf *);
 void	xfs_buf_dquot_iodone(struct xfs_buf *);
 void	xfs_buf_iodone(struct xfs_buf *);
diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c
index 6752676b94fe..11cd666cd99a 100644
--- a/fs/xfs/xfs_trans_buf.c
+++ b/fs/xfs/xfs_trans_buf.c
@@ -475,7 +475,6 @@ xfs_trans_dirty_buf(
 	bp->b_flags |= XBF_DONE;
 
 	ASSERT(atomic_read(&bip->bli_refcount) > 0);
-	bip->bli_item.li_cb = xfs_buf_item_iodone;
 
 	/*
 	 * If we invalidated the buffer within this transaction, then
@@ -644,7 +643,6 @@ xfs_trans_stale_inode_buf(
 	ASSERT(atomic_read(&bip->bli_refcount) > 0);
 
 	bip->bli_flags |= XFS_BLI_STALE_INODE;
-	bip->bli_item.li_cb = xfs_buf_item_iodone;
 	bp->b_flags |= _XBF_INODES;
 	xfs_trans_buf_set_type(tp, bp, XFS_BLFT_DINO_BUF);
 }
-- 
2.26.2.761.g0e0b3e54be


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 12/30] xfs: get rid of log item callbacks
  2020-06-22  8:15 [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (10 preceding siblings ...)
  2020-06-22  8:15 ` [PATCH 11/30] xfs: clean up the buffer iodone callback functions Dave Chinner
@ 2020-06-22  8:15 ` Dave Chinner
  2020-06-22  8:15 ` [PATCH 13/30] xfs: handle buffer log item IO errors directly Dave Chinner
                   ` (18 subsequent siblings)
  30 siblings, 0 replies; 43+ messages in thread
From: Dave Chinner @ 2020-06-22  8:15 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

They are not used anymore, so remove them from the log item and the
buffer iodone attachment interfaces.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_buf_item.c | 17 -----------------
 fs/xfs/xfs_buf_item.h |  3 ---
 fs/xfs/xfs_dquot.c    |  6 +++---
 fs/xfs/xfs_inode.c    |  5 +++--
 fs/xfs/xfs_trans.h    |  4 ----
 5 files changed, 6 insertions(+), 29 deletions(-)

diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 0ece5de9dd71..09bfe9c52dbd 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -955,23 +955,6 @@ xfs_buf_item_relse(
 	xfs_buf_item_free(bip);
 }
 
-
-/*
- * Add the given log item with its callback to the list of callbacks
- * to be called when the buffer's I/O completes.
- */
-void
-xfs_buf_attach_iodone(
-	struct xfs_buf		*bp,
-	void			(*cb)(struct xfs_buf *, struct xfs_log_item *),
-	struct xfs_log_item	*lip)
-{
-	ASSERT(xfs_buf_islocked(bp));
-
-	lip->li_cb = cb;
-	list_add_tail(&lip->li_bio_list, &bp->b_li_list);
-}
-
 /*
  * Invoke the error state callback for each log item affected by the failed I/O.
  *
diff --git a/fs/xfs/xfs_buf_item.h b/fs/xfs/xfs_buf_item.h
index 7c0bd2a210af..23507cbb4c41 100644
--- a/fs/xfs/xfs_buf_item.h
+++ b/fs/xfs/xfs_buf_item.h
@@ -54,9 +54,6 @@ void	xfs_buf_item_relse(struct xfs_buf *);
 bool	xfs_buf_item_put(struct xfs_buf_log_item *);
 void	xfs_buf_item_log(struct xfs_buf_log_item *, uint, uint);
 bool	xfs_buf_item_dirty_format(struct xfs_buf_log_item *);
-void	xfs_buf_attach_iodone(struct xfs_buf *,
-			      void(*)(struct xfs_buf *, struct xfs_log_item *),
-			      struct xfs_log_item *);
 void	xfs_buf_inode_iodone(struct xfs_buf *);
 void	xfs_buf_dquot_iodone(struct xfs_buf *);
 void	xfs_buf_iodone(struct xfs_buf *);
diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
index 403bc4e9f21f..d5984a926d1d 100644
--- a/fs/xfs/xfs_dquot.c
+++ b/fs/xfs/xfs_dquot.c
@@ -1187,11 +1187,11 @@ xfs_qm_dqflush(
 	}
 
 	/*
-	 * Attach an iodone routine so that we can remove this dquot from the
-	 * AIL and release the flush lock once the dquot is synced to disk.
+	 * Attach the dquot to the buffer so that we can remove this dquot from
+	 * the AIL and release the flush lock once the dquot is synced to disk.
 	 */
 	bp->b_flags |= _XBF_DQUOTS;
-	xfs_buf_attach_iodone(bp, NULL, &dqp->q_logitem.qli_item);
+	list_add_tail(&dqp->q_logitem.qli_item.li_bio_list, &bp->b_li_list);
 
 	/*
 	 * If the buffer is pinned then push on the log so we won't
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 4875b6e38597..8d638f14ab4e 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2711,7 +2711,8 @@ xfs_ifree_cluster(
 			xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,
 						&iip->ili_item.li_lsn);
 
-			xfs_buf_attach_iodone(bp, NULL, &iip->ili_item);
+			list_add_tail(&iip->ili_item.li_bio_list,
+						&bp->b_li_list);
 
 			if (ip != free_ip)
 				xfs_iunlock(ip, XFS_ILOCK_EXCL);
@@ -3861,7 +3862,7 @@ xfs_iflush_int(
 	 * the flush lock.
 	 */
 	bp->b_flags |= _XBF_INODES;
-	xfs_buf_attach_iodone(bp, NULL, &iip->ili_item);
+	list_add_tail(&iip->ili_item.li_bio_list, &bp->b_li_list);
 
 	/* generate the checksum. */
 	xfs_dinode_calc_crc(mp, dip);
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index 8308bf6d7e40..99a9ab9cab25 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -37,10 +37,6 @@ struct xfs_log_item {
 	unsigned long			li_flags;	/* misc flags */
 	struct xfs_buf			*li_buf;	/* real buffer pointer */
 	struct list_head		li_bio_list;	/* buffer item list */
-	void				(*li_cb)(struct xfs_buf *,
-						 struct xfs_log_item *);
-							/* buffer item iodone */
-							/* callback func */
 	const struct xfs_item_ops	*li_ops;	/* function list */
 
 	/* delayed logging */
-- 
2.26.2.761.g0e0b3e54be


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 13/30] xfs: handle buffer log item IO errors directly
  2020-06-22  8:15 [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (11 preceding siblings ...)
  2020-06-22  8:15 ` [PATCH 12/30] xfs: get rid of log item callbacks Dave Chinner
@ 2020-06-22  8:15 ` Dave Chinner
  2020-06-23  2:38   ` Darrick J. Wong
  2020-06-22  8:15 ` [PATCH 14/30] xfs: unwind log item error flagging Dave Chinner
                   ` (17 subsequent siblings)
  30 siblings, 1 reply; 43+ messages in thread
From: Dave Chinner @ 2020-06-22  8:15 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Currently, when a buffer with attached log items has an IO error,
it calls ->iop_error for each attached log item. These all call
xfs_set_li_failed() to handle the error, but we are about to change
the way log items manage buffers. Hence we first need to remove the
per-item dependency on buffer handling done by xfs_set_li_failed().

We already have specific buffer type IO completion routines, so move
the log item error handling out of the generic error handling and
into the log item specific functions so we can implement per-type
error handling easily.

This requires a more complex return value from the error handling
code so that we can take the correct action the failure handling
requires.  This results in some repeated boilerplate in the
functions, but that can be cleaned up later once all the changes
cascade through this code.
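
A minimal userspace sketch of that repeated boilerplate; the buffer
struct and the empty stub helpers are stand-ins invented for the
sketch, and only the dispatch on the three return values mirrors the
patch:

#include <assert.h>
#include <stdio.h>

/* Stand-in buffer and stub helpers so the sketch compiles in userspace. */
struct xfs_buf { int b_error; };

enum {
	XBF_IOERROR_FINISH,	/* clear retry state, run completions */
	XBF_IOERROR_DONE,	/* write resubmitted, nothing more to do */
	XBF_IOERROR_FAIL,	/* transient failure, fail items, release */
};

static int xfs_buf_iodone_error(struct xfs_buf *bp) { return XBF_IOERROR_FINISH; }
static void xfs_buf_do_callbacks_fail(struct xfs_buf *bp) { }
static void xfs_buf_clear_ioerror_retry_state(struct xfs_buf *bp) { }
static void run_type_specific_completions(struct xfs_buf *bp) { puts("done"); }

/* The boilerplate each typed iodone function repeats after this patch. */
static void typed_iodone(struct xfs_buf *bp)
{
	if (bp->b_error) {
		int ret = xfs_buf_iodone_error(bp);

		if (ret == XBF_IOERROR_FINISH)
			goto finish_iodone;
		if (ret == XBF_IOERROR_DONE)
			return;		/* write was resubmitted */
		assert(ret == XBF_IOERROR_FAIL);
		xfs_buf_do_callbacks_fail(bp);
		return;
	}

finish_iodone:
	xfs_buf_clear_ioerror_retry_state(bp);
	run_type_specific_completions(bp);
}

int main(void)
{
	struct xfs_buf bp = { .b_error = 0 };

	typed_iodone(&bp);
	return 0;
}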

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_buf_item.c | 214 ++++++++++++++++++++++++++++--------------
 1 file changed, 144 insertions(+), 70 deletions(-)

diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 09bfe9c52dbd..f80fc5bd3bff 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -986,21 +986,24 @@ xfs_buf_do_callbacks_fail(
 	spin_unlock(&ailp->ail_lock);
 }
 
+/*
+ * Decide if we're going to retry the write after a failure, and prepare
+ * the buffer for retrying the write.
+ */
 static bool
-xfs_buf_iodone_callback_error(
+xfs_buf_ioerror_fail_without_retry(
 	struct xfs_buf		*bp)
 {
 	struct xfs_mount	*mp = bp->b_mount;
 	static ulong		lasttime;
 	static xfs_buftarg_t	*lasttarg;
-	struct xfs_error_cfg	*cfg;
 
 	/*
 	 * If we've already decided to shutdown the filesystem because of
 	 * I/O errors, there's no point in giving this a retry.
 	 */
 	if (XFS_FORCED_SHUTDOWN(mp))
-		goto out_stale;
+		return true;
 
 	if (bp->b_target != lasttarg ||
 	    time_after(jiffies, (lasttime + 5*HZ))) {
@@ -1011,91 +1014,114 @@ xfs_buf_iodone_callback_error(
 
 	/* synchronous writes will have callers process the error */
 	if (!(bp->b_flags & XBF_ASYNC))
-		goto out_stale;
-
-	trace_xfs_buf_item_iodone_async(bp, _RET_IP_);
-
-	cfg = xfs_error_get_cfg(mp, XFS_ERR_METADATA, bp->b_error);
+		return true;
+	return false;
+}
 
-	/*
-	 * If the write was asynchronous then no one will be looking for the
-	 * error.  If this is the first failure of this type, clear the error
-	 * state and write the buffer out again. This means we always retry an
-	 * async write failure at least once, but we also need to set the buffer
-	 * up to behave correctly now for repeated failures.
-	 */
-	if (!(bp->b_flags & (XBF_STALE | XBF_WRITE_FAIL)) ||
-	     bp->b_last_error != bp->b_error) {
-		bp->b_flags |= (XBF_WRITE | XBF_DONE | XBF_WRITE_FAIL);
-		bp->b_last_error = bp->b_error;
-		if (cfg->retry_timeout != XFS_ERR_RETRY_FOREVER &&
-		    !bp->b_first_retry_time)
-			bp->b_first_retry_time = jiffies;
+static bool
+xfs_buf_ioerror_retry(
+	struct xfs_buf		*bp,
+	struct xfs_error_cfg	*cfg)
+{
+	if ((bp->b_flags & (XBF_STALE | XBF_WRITE_FAIL)) &&
+	    bp->b_last_error == bp->b_error)
+		return false;
 
-		xfs_buf_ioerror(bp, 0);
-		xfs_buf_submit(bp);
-		return true;
-	}
+	bp->b_flags |= (XBF_WRITE | XBF_DONE | XBF_WRITE_FAIL);
+	bp->b_last_error = bp->b_error;
+	if (cfg->retry_timeout != XFS_ERR_RETRY_FOREVER &&
+	    !bp->b_first_retry_time)
+		bp->b_first_retry_time = jiffies;
+	return true;
+}
 
-	/*
-	 * Repeated failure on an async write. Take action according to the
-	 * error configuration we have been set up to use.
-	 */
+/*
+ * Account for this latest trip around the retry handler, and decide if
+ * we've failed enough times to constitute a permanent failure.
+ */
+static bool
+xfs_buf_ioerror_permanent(
+	struct xfs_buf		*bp,
+	struct xfs_error_cfg	*cfg)
+{
+	struct xfs_mount	*mp = bp->b_mount;
 
 	if (cfg->max_retries != XFS_ERR_RETRY_FOREVER &&
 	    ++bp->b_retries > cfg->max_retries)
-			goto permanent_error;
+		return true;
 	if (cfg->retry_timeout != XFS_ERR_RETRY_FOREVER &&
 	    time_after(jiffies, cfg->retry_timeout + bp->b_first_retry_time))
-			goto permanent_error;
+		return true;
 
 	/* At unmount we may treat errors differently */
 	if ((mp->m_flags & XFS_MOUNT_UNMOUNTING) && mp->m_fail_unmount)
-		goto permanent_error;
-
-	/*
-	 * Still a transient error, run IO completion failure callbacks and let
-	 * the higher layers retry the buffer.
-	 */
-	xfs_buf_do_callbacks_fail(bp);
-	xfs_buf_ioerror(bp, 0);
-	xfs_buf_relse(bp);
-	return true;
+		return true;
 
-	/*
-	 * Permanent error - we need to trigger a shutdown if we haven't already
-	 * to indicate that inconsistency will result from this action.
-	 */
-permanent_error:
-	xfs_force_shutdown(mp, SHUTDOWN_META_IO_ERROR);
-out_stale:
-	xfs_buf_stale(bp);
-	bp->b_flags |= XBF_DONE;
-	trace_xfs_buf_error_relse(bp, _RET_IP_);
 	return false;
 }
 
-static inline bool
-xfs_buf_had_callback_errors(
+/*
+ * On a sync write or shutdown we just want to stale the buffer and let the
+ * caller handle the error in bp->b_error appropriately.
+ *
+ * If the write was asynchronous then no one will be looking for the error.  If
+ * this is the first failure of this type, clear the error state and write the
+ * buffer out again. This means we always retry an async write failure at least
+ * once, but we also need to set the buffer up to behave correctly now for
+ * repeated failures.
+ *
+ * If we get repeated async write failures, then we take action according to the
+ * error configuration we have been set up to use.
+ *
+ * Multi-state return value:
+ *
+ * XBF_IOERROR_FINISH: clear IO error retry state and run callback completions
+ * XBF_IOERROR_DONE: resubmitted immediately, do not run any completions
+ * XBF_IOERROR_FAIL: transient error, run failure callback completions and then
+ *    release the buffer
+ */
+enum {
+	XBF_IOERROR_FINISH,
+	XBF_IOERROR_DONE,
+	XBF_IOERROR_FAIL,
+};
+
+static int
+xfs_buf_iodone_error(
 	struct xfs_buf		*bp)
 {
+	struct xfs_mount	*mp = bp->b_mount;
+	struct xfs_error_cfg	*cfg;
 
-	/*
-	 * If there is an error, process it. Some errors require us to run
-	 * callbacks after failure processing is done so we detect that and take
-	 * appropriate action.
-	 */
-	if (bp->b_error && xfs_buf_iodone_callback_error(bp))
-		return true;
+	if (xfs_buf_ioerror_fail_without_retry(bp))
+		goto out_stale;
+
+	trace_xfs_buf_item_iodone_async(bp, _RET_IP_);
+
+	cfg = xfs_error_get_cfg(mp, XFS_ERR_METADATA, bp->b_error);
+	if (xfs_buf_ioerror_retry(bp, cfg)) {
+		xfs_buf_ioerror(bp, 0);
+		xfs_buf_submit(bp);
+		return XBF_IOERROR_DONE;
+	}
 
 	/*
-	 * Successful IO or permanent error. Either way, we can clear the
-	 * retry state here in preparation for the next error that may occur.
+	 * Permanent error - we need to trigger a shutdown if we haven't already
+	 * to indicate that inconsistency will result from this action.
 	 */
-	bp->b_last_error = 0;
-	bp->b_retries = 0;
-	bp->b_first_retry_time = 0;
-	return false;
+	if (xfs_buf_ioerror_permanent(bp, cfg)) {
+		xfs_force_shutdown(mp, SHUTDOWN_META_IO_ERROR);
+		goto out_stale;
+	}
+
+	/* Still considered a transient error. Caller will schedule retries. */
+	return XBF_IOERROR_FAIL;
+
+out_stale:
+	xfs_buf_stale(bp);
+	bp->b_flags |= XBF_DONE;
+	trace_xfs_buf_error_relse(bp, _RET_IP_);
+	return XBF_IOERROR_FINISH;
 }
 
 static void
@@ -1122,6 +1148,15 @@ xfs_buf_item_done(
 	xfs_buf_rele(bp);
 }
 
+static inline void
+xfs_buf_clear_ioerror_retry_state(
+	struct xfs_buf		*bp)
+{
+	bp->b_last_error = 0;
+	bp->b_retries = 0;
+	bp->b_first_retry_time = 0;
+}
+
 /*
  * Inode buffer iodone callback function.
  */
@@ -1129,9 +1164,22 @@ void
 xfs_buf_inode_iodone(
 	struct xfs_buf		*bp)
 {
-	if (xfs_buf_had_callback_errors(bp))
+	if (bp->b_error) {
+		int ret = xfs_buf_iodone_error(bp);
+
+		if (ret == XBF_IOERROR_FINISH)
+			goto finish_iodone;
+		if (ret == XBF_IOERROR_DONE)
+			return;
+		ASSERT(ret == XBF_IOERROR_FAIL);
+		xfs_buf_do_callbacks_fail(bp);
+		xfs_buf_ioerror(bp, 0);
+		xfs_buf_relse(bp);
 		return;
+	}
 
+finish_iodone:
+	xfs_buf_clear_ioerror_retry_state(bp);
 	xfs_buf_item_done(bp);
 	xfs_iflush_done(bp);
 	xfs_buf_ioend_finish(bp);
@@ -1144,9 +1192,22 @@ void
 xfs_buf_dquot_iodone(
 	struct xfs_buf		*bp)
 {
-	if (xfs_buf_had_callback_errors(bp))
+	if (bp->b_error) {
+		int ret = xfs_buf_iodone_error(bp);
+
+		if (ret == XBF_IOERROR_FINISH)
+			goto finish_iodone;
+		if (ret == XBF_IOERROR_DONE)
+			return;
+		ASSERT(ret == XBF_IOERROR_FAIL);
+		xfs_buf_do_callbacks_fail(bp);
+		xfs_buf_ioerror(bp, 0);
+		xfs_buf_relse(bp);
 		return;
+	}
 
+finish_iodone:
+	xfs_buf_clear_ioerror_retry_state(bp);
 	/* a newly allocated dquot buffer might have a log item attached */
 	xfs_buf_item_done(bp);
 	xfs_dquot_done(bp);
@@ -1163,9 +1224,22 @@ void
 xfs_buf_iodone(
 	struct xfs_buf		*bp)
 {
-	if (xfs_buf_had_callback_errors(bp))
+	if (bp->b_error) {
+		int ret = xfs_buf_iodone_error(bp);
+
+		if (ret == XBF_IOERROR_FINISH)
+			goto finish_iodone;
+		if (ret == XBF_IOERROR_DONE)
+			return;
+		ASSERT(ret == XBF_IOERROR_FAIL);
+		xfs_buf_do_callbacks_fail(bp);
+		xfs_buf_ioerror(bp, 0);
+		xfs_buf_relse(bp);
 		return;
+	}
 
+finish_iodone:
+	xfs_buf_clear_ioerror_retry_state(bp);
 	xfs_buf_item_done(bp);
 	xfs_buf_ioend_finish(bp);
 }
-- 
2.26.2.761.g0e0b3e54be


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 14/30] xfs: unwind log item error flagging
  2020-06-22  8:15 [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (12 preceding siblings ...)
  2020-06-22  8:15 ` [PATCH 13/30] xfs: handle buffer log item IO errors directly Dave Chinner
@ 2020-06-22  8:15 ` Dave Chinner
  2020-06-22  8:15 ` [PATCH 15/30] xfs: move xfs_clear_li_failed out of xfs_ail_delete_one() Dave Chinner
                   ` (16 subsequent siblings)
  30 siblings, 0 replies; 43+ messages in thread
From: Dave Chinner @ 2020-06-22  8:15 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

When a buffer IO error occurs, we want to mark all the log items
attached to the buffer as failed. Open code the error handling loop
so that we can modify the flagging for the different types of
objects directly and independently of each other.

This also allows us to remove the ->iop_error method from the log
item operations.
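
A minimal userspace sketch of the open-coded flagging; the item struct
is a stand-in, and the real xfs_set_li_failed() takes the buffer as a
second argument and runs under the AIL lock:

#include <stdio.h>

/* Stand-in log item: just a type name and a failed flag for the sketch. */
struct xfs_log_item {
	const char	*li_type;
	int		li_failed;
};

static void xfs_set_li_failed(struct xfs_log_item *lip)
{
	lip->li_failed = 1;
	printf("%s log item flagged failed\n", lip->li_type);
}

int main(void)
{
	/* Items that would hang off bp->b_li_list on a real inode buffer. */
	struct xfs_log_item items[] = {
		{ .li_type = "inode" },
		{ .li_type = "inode" },
	};
	unsigned int i;

	/*
	 * The open-coded loop: with ->iop_error gone, each typed iodone
	 * function flags its own items directly, so inode, dquot and
	 * plain buffer error handling can diverge independently.
	 */
	for (i = 0; i < sizeof(items) / sizeof(items[0]); i++)
		xfs_set_li_failed(&items[i]);
	return 0;
}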

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_buf_item.c   | 48 ++++++++++++-----------------------------
 fs/xfs/xfs_dquot_item.c | 18 ----------------
 fs/xfs/xfs_inode_item.c | 18 ----------------
 fs/xfs/xfs_trans.h      |  1 -
 4 files changed, 14 insertions(+), 71 deletions(-)

diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index f80fc5bd3bff..d61f20b989cd 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -12,6 +12,7 @@
 #include "xfs_bit.h"
 #include "xfs_mount.h"
 #include "xfs_trans.h"
+#include "xfs_trans_priv.h"
 #include "xfs_buf_item.h"
 #include "xfs_inode.h"
 #include "xfs_inode_item.h"
@@ -955,37 +956,6 @@ xfs_buf_item_relse(
 	xfs_buf_item_free(bip);
 }
 
-/*
- * Invoke the error state callback for each log item affected by the failed I/O.
- *
- * If a metadata buffer write fails with a non-permanent error, the buffer is
- * eventually resubmitted and so the completion callbacks are not run. The error
- * state may need to be propagated to the log items attached to the buffer,
- * however, so the next AIL push of the item knows hot to handle it correctly.
- */
-STATIC void
-xfs_buf_do_callbacks_fail(
-	struct xfs_buf		*bp)
-{
-	struct xfs_ail		*ailp = bp->b_mount->m_ail;
-	struct xfs_log_item	*lip;
-
-	/*
-	 * Buffer log item errors are handled directly by xfs_buf_item_push()
-	 * and xfs_buf_iodone_callback_error, and they have no IO error
-	 * callbacks. Check only for items in b_li_list.
-	 */
-	if (list_empty(&bp->b_li_list))
-		return;
-
-	spin_lock(&ailp->ail_lock);
-	list_for_each_entry(lip, &bp->b_li_list, li_bio_list) {
-		if (lip->li_ops->iop_error)
-			lip->li_ops->iop_error(lip, bp);
-	}
-	spin_unlock(&ailp->ail_lock);
-}
-
 /*
  * Decide if we're going to retry the write after a failure, and prepare
  * the buffer for retrying the write.
@@ -1165,6 +1135,7 @@ xfs_buf_inode_iodone(
 	struct xfs_buf		*bp)
 {
 	if (bp->b_error) {
+		struct xfs_log_item *lip;
 		int ret = xfs_buf_iodone_error(bp);
 
 		if (ret == XBF_IOERROR_FINISH)
@@ -1172,7 +1143,11 @@ xfs_buf_inode_iodone(
 		if (ret == XBF_IOERROR_DONE)
 			return;
 		ASSERT(ret == XBF_IOERROR_FAIL);
-		xfs_buf_do_callbacks_fail(bp);
+		spin_lock(&bp->b_mount->m_ail->ail_lock);
+		list_for_each_entry(lip, &bp->b_li_list, li_bio_list) {
+			xfs_set_li_failed(lip, bp);
+		}
+		spin_unlock(&bp->b_mount->m_ail->ail_lock);
 		xfs_buf_ioerror(bp, 0);
 		xfs_buf_relse(bp);
 		return;
@@ -1193,6 +1168,7 @@ xfs_buf_dquot_iodone(
 	struct xfs_buf		*bp)
 {
 	if (bp->b_error) {
+		struct xfs_log_item *lip;
 		int ret = xfs_buf_iodone_error(bp);
 
 		if (ret == XBF_IOERROR_FINISH)
@@ -1200,7 +1176,11 @@ xfs_buf_dquot_iodone(
 		if (ret == XBF_IOERROR_DONE)
 			return;
 		ASSERT(ret == XBF_IOERROR_FAIL);
-		xfs_buf_do_callbacks_fail(bp);
+		spin_lock(&bp->b_mount->m_ail->ail_lock);
+		list_for_each_entry(lip, &bp->b_li_list, li_bio_list) {
+			xfs_set_li_failed(lip, bp);
+		}
+		spin_unlock(&bp->b_mount->m_ail->ail_lock);
 		xfs_buf_ioerror(bp, 0);
 		xfs_buf_relse(bp);
 		return;
@@ -1232,7 +1212,7 @@ xfs_buf_iodone(
 		if (ret == XBF_IOERROR_DONE)
 			return;
 		ASSERT(ret == XBF_IOERROR_FAIL);
-		xfs_buf_do_callbacks_fail(bp);
+		ASSERT(list_empty(&bp->b_li_list));
 		xfs_buf_ioerror(bp, 0);
 		xfs_buf_relse(bp);
 		return;
diff --git a/fs/xfs/xfs_dquot_item.c b/fs/xfs/xfs_dquot_item.c
index 349c92d26570..d7e4de7151d7 100644
--- a/fs/xfs/xfs_dquot_item.c
+++ b/fs/xfs/xfs_dquot_item.c
@@ -113,23 +113,6 @@ xfs_qm_dqunpin_wait(
 	wait_event(dqp->q_pinwait, (atomic_read(&dqp->q_pincount) == 0));
 }
 
-/*
- * Callback used to mark a buffer with XFS_LI_FAILED when items in the buffer
- * have been failed during writeback
- *
- * this informs the AIL that the dquot is already flush locked on the next push,
- * and acquires a hold on the buffer to ensure that it isn't reclaimed before
- * dirty data makes it to disk.
- */
-STATIC void
-xfs_dquot_item_error(
-	struct xfs_log_item	*lip,
-	struct xfs_buf		*bp)
-{
-	ASSERT(!completion_done(&DQUOT_ITEM(lip)->qli_dquot->q_flush));
-	xfs_set_li_failed(lip, bp);
-}
-
 STATIC uint
 xfs_qm_dquot_logitem_push(
 	struct xfs_log_item	*lip,
@@ -216,7 +199,6 @@ static const struct xfs_item_ops xfs_dquot_item_ops = {
 	.iop_release	= xfs_qm_dquot_logitem_release,
 	.iop_committing	= xfs_qm_dquot_logitem_committing,
 	.iop_push	= xfs_qm_dquot_logitem_push,
-	.iop_error	= xfs_dquot_item_error
 };
 
 /*
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 7049f2ae8d18..86c783dec2ba 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -464,23 +464,6 @@ xfs_inode_item_unpin(
 		wake_up_bit(&ip->i_flags, __XFS_IPINNED_BIT);
 }
 
-/*
- * Callback used to mark a buffer with XFS_LI_FAILED when items in the buffer
- * have been failed during writeback
- *
- * This informs the AIL that the inode is already flush locked on the next push,
- * and acquires a hold on the buffer to ensure that it isn't reclaimed before
- * dirty data makes it to disk.
- */
-STATIC void
-xfs_inode_item_error(
-	struct xfs_log_item	*lip,
-	struct xfs_buf		*bp)
-{
-	ASSERT(xfs_isiflocked(INODE_ITEM(lip)->ili_inode));
-	xfs_set_li_failed(lip, bp);
-}
-
 STATIC uint
 xfs_inode_item_push(
 	struct xfs_log_item	*lip,
@@ -619,7 +602,6 @@ static const struct xfs_item_ops xfs_inode_item_ops = {
 	.iop_committed	= xfs_inode_item_committed,
 	.iop_push	= xfs_inode_item_push,
 	.iop_committing	= xfs_inode_item_committing,
-	.iop_error	= xfs_inode_item_error
 };
 
 
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index 99a9ab9cab25..b752501818d2 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -74,7 +74,6 @@ struct xfs_item_ops {
 	void (*iop_committing)(struct xfs_log_item *, xfs_lsn_t commit_lsn);
 	void (*iop_release)(struct xfs_log_item *);
 	xfs_lsn_t (*iop_committed)(struct xfs_log_item *, xfs_lsn_t);
-	void (*iop_error)(struct xfs_log_item *, xfs_buf_t *);
 	int (*iop_recover)(struct xfs_log_item *lip, struct xfs_trans *tp);
 	bool (*iop_match)(struct xfs_log_item *item, uint64_t id);
 };
-- 
2.26.2.761.g0e0b3e54be


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 15/30] xfs: move xfs_clear_li_failed out of xfs_ail_delete_one()
  2020-06-22  8:15 [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (13 preceding siblings ...)
  2020-06-22  8:15 ` [PATCH 14/30] xfs: unwind log item error flagging Dave Chinner
@ 2020-06-22  8:15 ` Dave Chinner
  2020-06-22  8:15 ` [PATCH 16/30] xfs: pin inode backing buffer to the inode log item Dave Chinner
                   ` (15 subsequent siblings)
  30 siblings, 0 replies; 43+ messages in thread
From: Dave Chinner @ 2020-06-22  8:15 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

xfs_ail_delete_one() is called directly from dquot and inode IO
completion, as well as from the generic xfs_trans_ail_delete()
function. Inodes are about to have their own failure handling, and
dquots will in future, too. Pull the clearing of the LI_FAILED flag
up into the callers so we can customise the code appropriately.
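
For illustration, this is what the dquot flush completion path looks
like after the change, condensed from the diff below; the failed state
is now cleared unconditionally under the AIL lock, in both branches:

	spin_lock(&ailp->ail_lock);
	xfs_clear_li_failed(lip);
	if (lip->li_lsn == qip->qli_flush_lsn) {
		/* xfs_ail_update_finish() drops the AIL lock */
		tail_lsn = xfs_ail_delete_one(ailp, lip);
		xfs_ail_update_finish(ailp, tail_lsn);
	} else {
		spin_unlock(&ailp->ail_lock);
	}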

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_dquot.c      | 6 +-----
 fs/xfs/xfs_inode_item.c | 3 +--
 fs/xfs/xfs_trans_ail.c  | 2 +-
 3 files changed, 3 insertions(+), 8 deletions(-)

diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
index d5984a926d1d..76353c9a723e 100644
--- a/fs/xfs/xfs_dquot.c
+++ b/fs/xfs/xfs_dquot.c
@@ -1070,16 +1070,12 @@ xfs_qm_dqflush_done(
 	     test_bit(XFS_LI_FAILED, &lip->li_flags))) {
 
 		spin_lock(&ailp->ail_lock);
+		xfs_clear_li_failed(lip);
 		if (lip->li_lsn == qip->qli_flush_lsn) {
 			/* xfs_ail_update_finish() drops the AIL lock */
 			tail_lsn = xfs_ail_delete_one(ailp, lip);
 			xfs_ail_update_finish(ailp, tail_lsn);
 		} else {
-			/*
-			 * Clear the failed state since we are about to drop the
-			 * flush lock
-			 */
-			xfs_clear_li_failed(lip);
 			spin_unlock(&ailp->ail_lock);
 		}
 	}
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 86c783dec2ba..0ba75764a8dc 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -690,12 +690,11 @@ xfs_iflush_done(
 		/* this is an opencoded batch version of xfs_trans_ail_delete */
 		spin_lock(&ailp->ail_lock);
 		list_for_each_entry(lip, &tmp, li_bio_list) {
+			xfs_clear_li_failed(lip);
 			if (lip->li_lsn == INODE_ITEM(lip)->ili_flush_lsn) {
 				xfs_lsn_t lsn = xfs_ail_delete_one(ailp, lip);
 				if (!tail_lsn && lsn)
 					tail_lsn = lsn;
-			} else {
-				xfs_clear_li_failed(lip);
 			}
 		}
 		xfs_ail_update_finish(ailp, tail_lsn);
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index ac5019361a13..ac33f6393f99 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -843,7 +843,6 @@ xfs_ail_delete_one(
 
 	trace_xfs_ail_delete(lip, mlip->li_lsn, lip->li_lsn);
 	xfs_ail_delete(ailp, lip);
-	xfs_clear_li_failed(lip);
 	clear_bit(XFS_LI_IN_AIL, &lip->li_flags);
 	lip->li_lsn = 0;
 
@@ -874,6 +873,7 @@ xfs_trans_ail_delete(
 	}
 
 	/* xfs_ail_update_finish() drops the AIL lock */
+	xfs_clear_li_failed(lip);
 	tail_lsn = xfs_ail_delete_one(ailp, lip);
 	xfs_ail_update_finish(ailp, tail_lsn);
 }
-- 
2.26.2.761.g0e0b3e54be



* [PATCH 16/30] xfs: pin inode backing buffer to the inode log item
  2020-06-22  8:15 [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (14 preceding siblings ...)
  2020-06-22  8:15 ` [PATCH 15/30] xfs: move xfs_clear_li_failed out of xfs_ail_delete_one() Dave Chinner
@ 2020-06-22  8:15 ` Dave Chinner
  2020-06-23  2:39   ` Darrick J. Wong
  2020-06-22  8:15 ` [PATCH 17/30] xfs: make inode reclaim almost non-blocking Dave Chinner
                   ` (14 subsequent siblings)
  30 siblings, 1 reply; 43+ messages in thread
From: Dave Chinner @ 2020-06-22  8:15 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

When we dirty an inode, we are going to have to write it to disk at
some point in the near future. This requires the inode cluster
backing buffer to be present in memory. Unfortunately, under severe
memory pressure we can reclaim the inode backing buffer while the
inode is dirty in memory, resulting in stalling the AIL pushing
because it has to do a read-modify-write cycle on the cluster
buffer.

When we have no memory available, the read of the cluster buffer
blocks the AIL pushing process, and this causes all sorts of issues
for memory reclaim as it requires inode writeback to make forwards
progress. Allocating a cluster buffer causes more memory pressure,
which results in more cluster buffers being reclaimed, forcing more
RMW cycles to be done in the AIL context, and everything then backs
up on AIL progress. Only the synchronous inode cluster writeback in
the inode reclaim code provides some level of
forwards progress guarantees that prevent OOM-killer rampages in
this situation.

Fix this by pinning the inode backing buffer to the inode log item
when the inode is first dirtied (i.e. in xfs_trans_log_inode()).
This means the first modification of an inode that has been held
in cache for a long time may block on a cluster buffer read, but
we can do that in transaction context and block safely until the
buffer has been allocated and read.

Once we have the cluster buffer, the inode log item takes a
reference to it, pinning it in memory, and the buffer is attached to
the log item for future use. This means we can always grab the cluster
buffer from the inode log item when we need it.

When the inode is finally cleaned and removed from the AIL, we can
drop the reference the inode log item holds on the cluster buffer.
Once all inodes on the cluster buffer are clean, the cluster buffer
will be unpinned and it will be available for memory reclaim to
reclaim again.

This avoids the issues with needing to do RMW cycles in the AIL
pushing context, and hence allows complete non-blocking inode
flushing to be performed by the AIL pushing context.
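
Condensed from the xfs_trans_log_inode() hunk below, the first
modification of a clean inode now reads and pins the cluster buffer
roughly like this (ili_lock handling elided):

	if (!iip->ili_item.li_buf) {
		struct xfs_buf	*bp;
		int		error;

		/* read the cluster buffer in transaction context */
		error = xfs_imap_to_bp(ip->i_mount, tp, &ip->i_imap,
					NULL, &bp, 0);
		if (error) {
			xfs_force_shutdown(ip->i_mount, SHUTDOWN_META_IO_ERROR);
			return;
		}

		/*
		 * Hold a buffer reference for the log item, then release
		 * the transaction's reference to the buffer.
		 */
		xfs_buf_hold(bp);
		xfs_trans_brelse(tp, bp);
		iip->ili_item.li_buf = bp;
	}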

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_inode_buf.c   |  3 +-
 fs/xfs/libxfs/xfs_trans_inode.c | 53 ++++++++++++++++++++++++----
 fs/xfs/xfs_buf_item.c           |  4 +--
 fs/xfs/xfs_inode_item.c         | 61 ++++++++++++++++++++++++++-------
 fs/xfs/xfs_trans_ail.c          |  8 +++--
 5 files changed, 105 insertions(+), 24 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index 6f84ea85fdd8..1af97235785c 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -176,7 +176,8 @@ xfs_imap_to_bp(
 	}
 
 	*bpp = bp;
-	*dipp = xfs_buf_offset(bp, imap->im_boffset);
+	if (dipp)
+		*dipp = xfs_buf_offset(bp, imap->im_boffset);
 	return 0;
 }
 
diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
index c66d9d1dd58b..ad5974365c58 100644
--- a/fs/xfs/libxfs/xfs_trans_inode.c
+++ b/fs/xfs/libxfs/xfs_trans_inode.c
@@ -8,6 +8,8 @@
 #include "xfs_shared.h"
 #include "xfs_format.h"
 #include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
 #include "xfs_inode.h"
 #include "xfs_trans.h"
 #include "xfs_trans_priv.h"
@@ -72,13 +74,19 @@ xfs_trans_ichgtime(
 }
 
 /*
- * This is called to mark the fields indicated in fieldmask as needing
- * to be logged when the transaction is committed.  The inode must
- * already be associated with the given transaction.
+ * This is called to mark the fields indicated in fieldmask as needing to be
+ * logged when the transaction is committed.  The inode must already be
+ * associated with the given transaction.
  *
- * The values for fieldmask are defined in xfs_inode_item.h.  We always
- * log all of the core inode if any of it has changed, and we always log
- * all of the inline data/extents/b-tree root if any of them has changed.
+ * The values for fieldmask are defined in xfs_inode_item.h.  We always log all
+ * of the core inode if any of it has changed, and we always log all of the
+ * inline data/extents/b-tree root if any of them has changed.
+ *
+ * Grab and pin the cluster buffer associated with this inode to avoid RMW
+ * cycles at inode writeback time. Avoid the need to add error handling to every
+ * xfs_trans_log_inode() call by shutting down on read error.  This will cause
+ * transactions to fail and everything to error out, just like if we return a
+ * read error in a dirty transaction and cancel it.
  */
 void
 xfs_trans_log_inode(
@@ -131,6 +139,39 @@ xfs_trans_log_inode(
 	spin_lock(&iip->ili_lock);
 	iip->ili_fsync_fields |= flags;
 
+	if (!iip->ili_item.li_buf) {
+		struct xfs_buf	*bp;
+		int		error;
+
+		/*
+		 * We hold the ILOCK here, so this inode is not going to be
+		 * flushed while we are here. Further, because there is no
+		 * buffer attached to the item, we know that there is no IO in
+		 * progress, so nothing will clear the ili_fields while we read
+		 * in the buffer. Hence we can safely drop the spin lock and
+		 * read the buffer knowing that the state will not change from
+		 * here.
+		 */
+		spin_unlock(&iip->ili_lock);
+		error = xfs_imap_to_bp(ip->i_mount, tp, &ip->i_imap, NULL,
+					&bp, 0);
+		if (error) {
+			xfs_force_shutdown(ip->i_mount, SHUTDOWN_META_IO_ERROR);
+			return;
+		}
+
+		/*
+		 * We need an explicit buffer reference for the log item but
+		 * don't want the buffer to remain attached to the transaction.
+		 * Hold the buffer but release the transaction reference.
+		 */
+		xfs_buf_hold(bp);
+		xfs_trans_brelse(tp, bp);
+
+		spin_lock(&iip->ili_lock);
+		iip->ili_item.li_buf = bp;
+	}
+
 	/*
 	 * Always OR in the bits from the ili_last_fields field.  This is to
 	 * coordinate with the xfs_iflush() and xfs_iflush_done() routines in
diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index d61f20b989cd..ecb3362395af 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -1143,11 +1143,9 @@ xfs_buf_inode_iodone(
 		if (ret == XBF_IOERROR_DONE)
 			return;
 		ASSERT(ret == XBF_IOERROR_FAIL);
-		spin_lock(&bp->b_mount->m_ail->ail_lock);
 		list_for_each_entry(lip, &bp->b_li_list, li_bio_list) {
-			xfs_set_li_failed(lip, bp);
+			set_bit(XFS_LI_FAILED, &lip->li_flags);
 		}
-		spin_unlock(&bp->b_mount->m_ail->ail_lock);
 		xfs_buf_ioerror(bp, 0);
 		xfs_buf_relse(bp);
 		return;
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 0ba75764a8dc..64bdda72f7b2 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -439,6 +439,7 @@ xfs_inode_item_pin(
 	struct xfs_inode	*ip = INODE_ITEM(lip)->ili_inode;
 
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
+	ASSERT(lip->li_buf);
 
 	trace_xfs_inode_pin(ip, _RET_IP_);
 	atomic_inc(&ip->i_pincount);
@@ -450,6 +451,12 @@ xfs_inode_item_pin(
  * item which was previously pinned with a call to xfs_inode_item_pin().
  *
  * Also wake up anyone in xfs_iunpin_wait() if the count goes to 0.
+ *
+ * Note that unpin can race with inode cluster buffer freeing marking the buffer
+ * stale. In that case, flush completions are run from the buffer unpin call,
+ * which may happen before the inode is unpinned. If we lose the race, there
+ * will be no buffer attached to the log item, but the inode will be marked
+ * XFS_ISTALE.
  */
 STATIC void
 xfs_inode_item_unpin(
@@ -459,6 +466,7 @@ xfs_inode_item_unpin(
 	struct xfs_inode	*ip = INODE_ITEM(lip)->ili_inode;
 
 	trace_xfs_inode_unpin(ip, _RET_IP_);
+	ASSERT(lip->li_buf || xfs_iflags_test(ip, XFS_ISTALE));
 	ASSERT(atomic_read(&ip->i_pincount) > 0);
 	if (atomic_dec_and_test(&ip->i_pincount))
 		wake_up_bit(&ip->i_flags, __XFS_IPINNED_BIT);
@@ -629,10 +637,15 @@ xfs_inode_item_init(
  */
 void
 xfs_inode_item_destroy(
-	xfs_inode_t	*ip)
+	struct xfs_inode	*ip)
 {
-	kmem_free(ip->i_itemp->ili_item.li_lv_shadow);
-	kmem_cache_free(xfs_ili_zone, ip->i_itemp);
+	struct xfs_inode_log_item *iip = ip->i_itemp;
+
+	ASSERT(iip->ili_item.li_buf == NULL);
+
+	ip->i_itemp = NULL;
+	kmem_free(iip->ili_item.li_lv_shadow);
+	kmem_cache_free(xfs_ili_zone, iip);
 }
 
 
@@ -673,11 +686,10 @@ xfs_iflush_done(
 		list_move_tail(&lip->li_bio_list, &tmp);
 
 		/* Do an unlocked check for needing the AIL lock. */
-		if (lip->li_lsn == iip->ili_flush_lsn ||
+		if (iip->ili_flush_lsn == lip->li_lsn ||
 		    test_bit(XFS_LI_FAILED, &lip->li_flags))
 			need_ail++;
 	}
-	ASSERT(list_empty(&bp->b_li_list));
 
 	/*
 	 * We only want to pull the item from the AIL if it is actually there
@@ -690,7 +702,7 @@ xfs_iflush_done(
 		/* this is an opencoded batch version of xfs_trans_ail_delete */
 		spin_lock(&ailp->ail_lock);
 		list_for_each_entry(lip, &tmp, li_bio_list) {
-			xfs_clear_li_failed(lip);
+			clear_bit(XFS_LI_FAILED, &lip->li_flags);
 			if (lip->li_lsn == INODE_ITEM(lip)->ili_flush_lsn) {
 				xfs_lsn_t lsn = xfs_ail_delete_one(ailp, lip);
 				if (!tail_lsn && lsn)
@@ -706,14 +718,29 @@ xfs_iflush_done(
 	 * them is safely on disk.
 	 */
 	list_for_each_entry_safe(lip, n, &tmp, li_bio_list) {
+		bool	drop_buffer = false;
+
 		list_del_init(&lip->li_bio_list);
 		iip = INODE_ITEM(lip);
 
 		spin_lock(&iip->ili_lock);
+
+		/*
+		 * Remove the reference to the cluster buffer if the inode is
+		 * clean in memory. Drop the buffer reference once we've dropped
+		 * the locks we hold.
+		 */
+		ASSERT(iip->ili_item.li_buf == bp);
+		if (!iip->ili_fields) {
+			iip->ili_item.li_buf = NULL;
+			drop_buffer = true;
+		}
 		iip->ili_last_fields = 0;
+		iip->ili_flush_lsn = 0;
 		spin_unlock(&iip->ili_lock);
-
 		xfs_ifunlock(iip->ili_inode);
+		if (drop_buffer)
+			xfs_buf_rele(bp);
 	}
 }
 
@@ -725,12 +752,20 @@ xfs_iflush_done(
  */
 void
 xfs_iflush_abort(
-	struct xfs_inode		*ip)
+	struct xfs_inode	*ip)
 {
-	struct xfs_inode_log_item	*iip = ip->i_itemp;
+	struct xfs_inode_log_item *iip = ip->i_itemp;
+	struct xfs_buf		*bp = NULL;
 
 	if (iip) {
+		/*
+		 * Clear the failed bit before removing the item from the AIL so
+		 * xfs_trans_ail_delete() doesn't try to clear and release the
+		 * buffer attached to the log item before we are done with it.
+		 */
+		clear_bit(XFS_LI_FAILED, &iip->ili_item.li_flags);
 		xfs_trans_ail_delete(&iip->ili_item, 0);
+
 		/*
 		 * Clear the inode logging fields so no more flushes are
 		 * attempted.
@@ -739,12 +774,14 @@ xfs_iflush_abort(
 		iip->ili_last_fields = 0;
 		iip->ili_fields = 0;
 		iip->ili_fsync_fields = 0;
+		iip->ili_flush_lsn = 0;
+		bp = iip->ili_item.li_buf;
+		iip->ili_item.li_buf = NULL;
 		spin_unlock(&iip->ili_lock);
 	}
-	/*
-	 * Release the inode's flush lock since we're done with it.
-	 */
 	xfs_ifunlock(ip);
+	if (bp)
+		xfs_buf_rele(bp);
 }
 
 /*
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index ac33f6393f99..c3be6e440134 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -377,8 +377,12 @@ xfsaild_resubmit_item(
 	}
 
 	/* protected by ail_lock */
-	list_for_each_entry(lip, &bp->b_li_list, li_bio_list)
-		xfs_clear_li_failed(lip);
+	list_for_each_entry(lip, &bp->b_li_list, li_bio_list) {
+		if (bp->b_flags & _XBF_INODES)
+			clear_bit(XFS_LI_FAILED, &lip->li_flags);
+		else
+			xfs_clear_li_failed(lip);
+	}
 
 	xfs_buf_unlock(bp);
 	return XFS_ITEM_SUCCESS;
-- 
2.26.2.761.g0e0b3e54be



* [PATCH 17/30] xfs: make inode reclaim almost non-blocking
  2020-06-22  8:15 [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (15 preceding siblings ...)
  2020-06-22  8:15 ` [PATCH 16/30] xfs: pin inode backing buffer to the inode log item Dave Chinner
@ 2020-06-22  8:15 ` Dave Chinner
  2020-06-22  8:15 ` [PATCH 18/30] xfs: remove IO submission from xfs_reclaim_inode() Dave Chinner
                   ` (13 subsequent siblings)
  30 siblings, 0 replies; 43+ messages in thread
From: Dave Chinner @ 2020-06-22  8:15 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Now that dirty inode writeback doesn't cause read-modify-write
cycles on the inode cluster buffer under memory pressure, the need
to throttle memory reclaim to the rate at which we can clean dirty
inodes goes away. That is due to the fact that we no longer thrash
inode cluster buffers under memory pressure to clean dirty inodes.

This means inode writeback no longer stalls on memory allocation
or read IO, and hence can be done asynchronously without generating
memory pressure. As a result, blocking inode writeback in reclaim is
no longer necessary to prevent reclaim priority windup as cleaning
dirty inodes is no longer dependent on having memory reserves
available for the filesystem to make progress reclaiming inodes.

Hence we can convert inode reclaim to be non-blocking for shrinker
callouts, both for direct reclaim and kswapd.

On a vanilla kernel, running a 16-way fsmark create workload on a
4 node/16p/16GB RAM machine, I can reliably pin 14.75GB of RAM via
userspace mlock(). The OOM killer gets invoked at 15GB of
pinned RAM.

Without the inode cluster pinning, this non-blocking reclaim patch
triggers premature OOM killer invocation with the same memory
pinning, sometimes with as much as 45% of RAM being free.  It's
trivially easy to trigger the OOM killer when reclaim does not
block.

With pinning inode clusters in RAM and then adding this patch, I can
reliably pin 14.5GB of RAM and still have the fsmark workload run to
completion. The OOM killer gets invoked at 14.75GB of pinned RAM,
which is only slightly less than with the vanilla kernel. It is much
more reliable than async reclaim alone.

simoops shows that allocation stalls go away when async reclaim is
used. Vanilla kernel:

Run time: 1924 seconds
Read latency (p50: 3,305,472) (p95: 3,723,264) (p99: 4,001,792)
Write latency (p50: 184,064) (p95: 553,984) (p99: 807,936)
Allocation latency (p50: 2,641,920) (p95: 3,911,680) (p99: 4,464,640)
work rate = 13.45/sec (avg 13.44/sec) (p50: 13.46) (p95: 13.58) (p99: 13.70)
alloc stall rate = 3.80/sec (avg: 2.59) (p50: 2.54) (p95: 2.96) (p99: 3.02)

With inode cluster pinning and async reclaim:

Run time: 1924 seconds
Read latency (p50: 3,305,472) (p95: 3,715,072) (p99: 3,977,216)
Write latency (p50: 187,648) (p95: 553,984) (p99: 789,504)
Allocation latency (p50: 2,748,416) (p95: 3,919,872) (p99: 4,448,256)
work rate = 13.28/sec (avg 13.32/sec) (p50: 13.26) (p95: 13.34) (p99: 13.34)
alloc stall rate = 0.02/sec (avg: 0.02) (p50: 0.01) (p95: 0.03) (p99: 0.03)

Latencies don't really change much, nor does the work rate. However,
allocation almost never stalls with these changes, whilst the
vanilla kernel is sometimes reporting 20 stalls/s over a 60s sample
period. This difference is due to inode reclaim being largely
non-blocking now.

IOWs, once we have pinned inode cluster buffers, we can make inode
reclaim non-blocking without a major risk of premature and/or
spurious OOM killer invocation, and without any changes to memory
reclaim infrastructure.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_icache.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 59dea8178ae3..01f8efc5f59c 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1402,7 +1402,7 @@ xfs_reclaim_inodes_nr(
 	xfs_reclaim_work_queue(mp);
 	xfs_ail_push_all(mp->m_ail);
 
-	return xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK | SYNC_WAIT, &nr_to_scan);
+	return xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK, &nr_to_scan);
 }
 
 /*
-- 
2.26.2.761.g0e0b3e54be



* [PATCH 18/30] xfs: remove IO submission from xfs_reclaim_inode()
  2020-06-22  8:15 [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (16 preceding siblings ...)
  2020-06-22  8:15 ` [PATCH 17/30] xfs: make inode reclaim almost non-blocking Dave Chinner
@ 2020-06-22  8:15 ` Dave Chinner
  2020-06-22  8:15 ` [PATCH 19/30] xfs: allow multiple reclaimers per AG Dave Chinner
                   ` (12 subsequent siblings)
  30 siblings, 0 replies; 43+ messages in thread
From: Dave Chinner @ 2020-06-22  8:15 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

We no longer need to issue IO from shrinker based inode reclaim to
prevent spurious OOM killer invocation. This leaves only the global
filesystem management operations such as unmount needing to
writeback dirty inodes and reclaim them.

Instead of using the reclaim pass to write dirty inodes before
reclaiming them, use the AIL to push all the dirty inodes before we
try to reclaim them. This allows us to remove all the conditional
SYNC_WAIT locking and the writeback code from xfs_reclaim_inode()
and greatly simplify the checks we need to do to reclaim an inode.
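
With the IO submission gone, the blocking side of xfs_reclaim_inodes()
reduces to a push-then-rescan loop, condensed from the diff below:

	xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan);
	if (!(mode & SYNC_WAIT))
		return 0;

	do {
		/* let the AIL clean the dirty inodes, then rescan */
		xfs_ail_push_all_sync(mp->m_ail);
		skipped = xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan);
	} while (skipped > 0);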

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_icache.c | 117 ++++++++++++--------------------------------
 1 file changed, 31 insertions(+), 86 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 01f8efc5f59c..fa4df9b4edb5 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1111,24 +1111,17 @@ xfs_reclaim_inode_grab(
  *	dirty, async	=> requeue
  *	dirty, sync	=> flush, wait and reclaim
  */
-STATIC int
+static bool
 xfs_reclaim_inode(
 	struct xfs_inode	*ip,
 	struct xfs_perag	*pag,
 	int			sync_mode)
 {
-	struct xfs_buf		*bp = NULL;
 	xfs_ino_t		ino = ip->i_ino; /* for radix_tree_delete */
-	int			error;
 
-restart:
-	error = 0;
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
-	if (!xfs_iflock_nowait(ip)) {
-		if (!(sync_mode & SYNC_WAIT))
-			goto out;
-		xfs_iflock(ip);
-	}
+	if (!xfs_iflock_nowait(ip))
+		goto out;
 
 	if (XFS_FORCED_SHUTDOWN(ip->i_mount)) {
 		xfs_iunpin_wait(ip);
@@ -1136,52 +1129,12 @@ xfs_reclaim_inode(
 		xfs_iflush_abort(ip);
 		goto reclaim;
 	}
-	if (xfs_ipincount(ip)) {
-		if (!(sync_mode & SYNC_WAIT))
-			goto out_ifunlock;
-		xfs_iunpin_wait(ip);
-	}
-	if (xfs_inode_clean(ip)) {
-		xfs_ifunlock(ip);
-		goto reclaim;
-	}
-
-	/*
-	 * Never flush out dirty data during non-blocking reclaim, as it would
-	 * just contend with AIL pushing trying to do the same job.
-	 */
-	if (!(sync_mode & SYNC_WAIT))
+	if (xfs_ipincount(ip))
+		goto out_ifunlock;
+	if (!xfs_inode_clean(ip))
 		goto out_ifunlock;
 
-	/*
-	 * Now we have an inode that needs flushing.
-	 *
-	 * Note that xfs_iflush will never block on the inode buffer lock, as
-	 * xfs_ifree_cluster() can lock the inode buffer before it locks the
-	 * ip->i_lock, and we are doing the exact opposite here.  As a result,
-	 * doing a blocking xfs_imap_to_bp() to get the cluster buffer would
-	 * result in an ABBA deadlock with xfs_ifree_cluster().
-	 *
-	 * As xfs_ifree_cluser() must gather all inodes that are active in the
-	 * cache to mark them stale, if we hit this case we don't actually want
-	 * to do IO here - we want the inode marked stale so we can simply
-	 * reclaim it.  Hence if we get an EAGAIN error here,  just unlock the
-	 * inode, back off and try again.  Hopefully the next pass through will
-	 * see the stale flag set on the inode.
-	 */
-	error = xfs_iflush(ip, &bp);
-	if (error == -EAGAIN) {
-		xfs_iunlock(ip, XFS_ILOCK_EXCL);
-		/* backoff longer than in xfs_ifree_cluster */
-		delay(2);
-		goto restart;
-	}
-
-	if (!error) {
-		error = xfs_bwrite(bp);
-		xfs_buf_relse(bp);
-	}
-
+	xfs_ifunlock(ip);
 reclaim:
 	ASSERT(!xfs_isiflocked(ip));
 
@@ -1231,21 +1184,14 @@ xfs_reclaim_inode(
 	ASSERT(xfs_inode_clean(ip));
 
 	__xfs_inode_free(ip);
-	return error;
+	return true;
 
 out_ifunlock:
 	xfs_ifunlock(ip);
 out:
-	xfs_iflags_clear(ip, XFS_IRECLAIM);
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
-	/*
-	 * We could return -EAGAIN here to make reclaim rescan the inode tree in
-	 * a short while. However, this just burns CPU time scanning the tree
-	 * waiting for IO to complete and the reclaim work never goes back to
-	 * the idle state. Instead, return 0 to let the next scheduled
-	 * background reclaim attempt to reclaim the inode again.
-	 */
-	return 0;
+	xfs_iflags_clear(ip, XFS_IRECLAIM);
+	return false;
 }
 
 /*
@@ -1253,21 +1199,22 @@ xfs_reclaim_inode(
  * corrupted, we still want to try to reclaim all the inodes. If we don't,
  * then a shut down during filesystem unmount reclaim walk leak all the
  * unreclaimed inodes.
+ *
+ * Returns non-zero if any AGs or inodes were skipped in the reclaim pass
+ * so that callers that want to block until all dirty inodes are written back
+ * and reclaimed can sanely loop.
  */
-STATIC int
+static int
 xfs_reclaim_inodes_ag(
 	struct xfs_mount	*mp,
 	int			flags,
 	int			*nr_to_scan)
 {
 	struct xfs_perag	*pag;
-	int			error = 0;
-	int			last_error = 0;
 	xfs_agnumber_t		ag;
 	int			trylock = flags & SYNC_TRYLOCK;
 	int			skipped;
 
-restart:
 	ag = 0;
 	skipped = 0;
 	while ((pag = xfs_perag_get_tag(mp, ag, XFS_ICI_RECLAIM_TAG))) {
@@ -1341,9 +1288,8 @@ xfs_reclaim_inodes_ag(
 			for (i = 0; i < nr_found; i++) {
 				if (!batch[i])
 					continue;
-				error = xfs_reclaim_inode(batch[i], pag, flags);
-				if (error && last_error != -EFSCORRUPTED)
-					last_error = error;
+				if (!xfs_reclaim_inode(batch[i], pag, flags))
+					skipped++;
 			}
 
 			*nr_to_scan -= XFS_LOOKUP_BATCH;
@@ -1359,19 +1305,7 @@ xfs_reclaim_inodes_ag(
 		mutex_unlock(&pag->pag_ici_reclaim_lock);
 		xfs_perag_put(pag);
 	}
-
-	/*
-	 * if we skipped any AG, and we still have scan count remaining, do
-	 * another pass this time using blocking reclaim semantics (i.e
-	 * waiting on the reclaim locks and ignoring the reclaim cursors). This
-	 * ensure that when we get more reclaimers than AGs we block rather
-	 * than spin trying to execute reclaim.
-	 */
-	if (skipped && (flags & SYNC_WAIT) && *nr_to_scan > 0) {
-		trylock = 0;
-		goto restart;
-	}
-	return last_error;
+	return skipped;
 }
 
 int
@@ -1380,8 +1314,18 @@ xfs_reclaim_inodes(
 	int		mode)
 {
 	int		nr_to_scan = INT_MAX;
+	int		skipped;
 
-	return xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan);
+	xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan);
+	if (!(mode & SYNC_WAIT))
+		return 0;
+
+	do {
+		xfs_ail_push_all_sync(mp->m_ail);
+		skipped = xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan);
+	} while (skipped > 0);
+
+	return 0;
 }
 
 /*
@@ -1402,7 +1346,8 @@ xfs_reclaim_inodes_nr(
 	xfs_reclaim_work_queue(mp);
 	xfs_ail_push_all(mp->m_ail);
 
-	return xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK, &nr_to_scan);
+	xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK, &nr_to_scan);
+	return 0;
 }
 
 /*
-- 
2.26.2.761.g0e0b3e54be



* [PATCH 19/30] xfs: allow multiple reclaimers per AG
  2020-06-22  8:15 [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (17 preceding siblings ...)
  2020-06-22  8:15 ` [PATCH 18/30] xfs: remove IO submission from xfs_reclaim_inode() Dave Chinner
@ 2020-06-22  8:15 ` Dave Chinner
  2020-06-22  8:15 ` [PATCH 20/30] xfs: don't block inode reclaim on the ILOCK Dave Chinner
                   ` (11 subsequent siblings)
  30 siblings, 0 replies; 43+ messages in thread
From: Dave Chinner @ 2020-06-22  8:15 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Inode reclaim still throttles direct reclaim on the per-ag
reclaim locks. This is no longer necessary as reclaim can run
non-blocking now. Hence we can remove these locks so that we don't
arbitrarily block reclaimers just because there are more direct
reclaimers than there are AGs.

This can result in multiple reclaimers working on the same range of
an AG, but this doesn't cause any apparent issues. Optimising the
spread of concurrent reclaimers for best efficiency can be done in a
future patchset.
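
Condensed from the diff below, the per-AG cursor handling becomes
lockless; concurrent reclaimers may race on the cursor, and a lost
update simply means part of the AG gets scanned again:

	/* racing on the cursor is benign; note a possible partial scan */
	first_index = READ_ONCE(pag->pag_ici_reclaim_cursor);
	if (first_index)
		skipped++;

	/* ... scan the AG for reclaimable inodes from first_index ... */

	if (done)
		first_index = 0;
	WRITE_ONCE(pag->pag_ici_reclaim_cursor, first_index);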

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_icache.c | 31 ++++++++++++-------------------
 fs/xfs/xfs_mount.c  |  4 ----
 fs/xfs/xfs_mount.h  |  1 -
 3 files changed, 12 insertions(+), 24 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index fa4df9b4edb5..592eab23c6e7 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1211,12 +1211,9 @@ xfs_reclaim_inodes_ag(
 	int			*nr_to_scan)
 {
 	struct xfs_perag	*pag;
-	xfs_agnumber_t		ag;
-	int			trylock = flags & SYNC_TRYLOCK;
-	int			skipped;
+	xfs_agnumber_t		ag = 0;
+	int			skipped = 0;
 
-	ag = 0;
-	skipped = 0;
 	while ((pag = xfs_perag_get_tag(mp, ag, XFS_ICI_RECLAIM_TAG))) {
 		unsigned long	first_index = 0;
 		int		done = 0;
@@ -1224,15 +1221,13 @@ xfs_reclaim_inodes_ag(
 
 		ag = pag->pag_agno + 1;
 
-		if (trylock) {
-			if (!mutex_trylock(&pag->pag_ici_reclaim_lock)) {
-				skipped++;
-				xfs_perag_put(pag);
-				continue;
-			}
-			first_index = pag->pag_ici_reclaim_cursor;
-		} else
-			mutex_lock(&pag->pag_ici_reclaim_lock);
+		/*
+		 * If the cursor is not zero, we haven't scanned the whole AG
+		 * so we might have skipped inodes here.
+		 */
+		first_index = READ_ONCE(pag->pag_ici_reclaim_cursor);
+		if (first_index)
+			skipped++;
 
 		do {
 			struct xfs_inode *batch[XFS_LOOKUP_BATCH];
@@ -1298,11 +1293,9 @@ xfs_reclaim_inodes_ag(
 
 		} while (nr_found && !done && *nr_to_scan > 0);
 
-		if (trylock && !done)
-			pag->pag_ici_reclaim_cursor = first_index;
-		else
-			pag->pag_ici_reclaim_cursor = 0;
-		mutex_unlock(&pag->pag_ici_reclaim_lock);
+		if (done)
+			first_index = 0;
+		WRITE_ONCE(pag->pag_ici_reclaim_cursor, first_index);
 		xfs_perag_put(pag);
 	}
 	return skipped;
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index d5dcf9869860..03158b42a194 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -148,7 +148,6 @@ xfs_free_perag(
 		ASSERT(atomic_read(&pag->pag_ref) == 0);
 		xfs_iunlink_destroy(pag);
 		xfs_buf_hash_destroy(pag);
-		mutex_destroy(&pag->pag_ici_reclaim_lock);
 		call_rcu(&pag->rcu_head, __xfs_free_perag);
 	}
 }
@@ -200,7 +199,6 @@ xfs_initialize_perag(
 		pag->pag_agno = index;
 		pag->pag_mount = mp;
 		spin_lock_init(&pag->pag_ici_lock);
-		mutex_init(&pag->pag_ici_reclaim_lock);
 		INIT_RADIX_TREE(&pag->pag_ici_root, GFP_ATOMIC);
 		if (xfs_buf_hash_init(pag))
 			goto out_free_pag;
@@ -242,7 +240,6 @@ xfs_initialize_perag(
 out_hash_destroy:
 	xfs_buf_hash_destroy(pag);
 out_free_pag:
-	mutex_destroy(&pag->pag_ici_reclaim_lock);
 	kmem_free(pag);
 out_unwind_new_pags:
 	/* unwind any prior newly initialized pags */
@@ -252,7 +249,6 @@ xfs_initialize_perag(
 			break;
 		xfs_buf_hash_destroy(pag);
 		xfs_iunlink_destroy(pag);
-		mutex_destroy(&pag->pag_ici_reclaim_lock);
 		kmem_free(pag);
 	}
 	return error;
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 3725d25ad97e..a72cfcaa4ad1 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -354,7 +354,6 @@ typedef struct xfs_perag {
 	spinlock_t	pag_ici_lock;	/* incore inode cache lock */
 	struct radix_tree_root pag_ici_root;	/* incore inode cache root */
 	int		pag_ici_reclaimable;	/* reclaimable inodes */
-	struct mutex	pag_ici_reclaim_lock;	/* serialisation point */
 	unsigned long	pag_ici_reclaim_cursor;	/* reclaim restart point */
 
 	/* buffer cache index */
-- 
2.26.2.761.g0e0b3e54be



* [PATCH 20/30] xfs: don't block inode reclaim on the ILOCK
  2020-06-22  8:15 [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (18 preceding siblings ...)
  2020-06-22  8:15 ` [PATCH 19/30] xfs: allow multiple reclaimers per AG Dave Chinner
@ 2020-06-22  8:15 ` Dave Chinner
  2020-06-22  8:15 ` [PATCH 21/30] xfs: remove SYNC_TRYLOCK from inode reclaim Dave Chinner
                   ` (10 subsequent siblings)
  30 siblings, 0 replies; 43+ messages in thread
From: Dave Chinner @ 2020-06-22  8:15 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

When we attempt to reclaim an inode, the first thing we do is take
the inode lock. This is blocking right now, so if the inode is being
accessed by something else (e.g. being flushed to the cluster
buffer) we will block here.

Change this to a trylock so that we do not block inode reclaim
unnecessarily here.
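
The entry to xfs_reclaim_inode() is then fully non-blocking, condensed
from the diff below:

	if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
		goto out;
	if (!xfs_iflock_nowait(ip))
		goto out_iunlock;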

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_icache.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 592eab23c6e7..f387ec21dd35 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1119,9 +1119,10 @@ xfs_reclaim_inode(
 {
 	xfs_ino_t		ino = ip->i_ino; /* for radix_tree_delete */
 
-	xfs_ilock(ip, XFS_ILOCK_EXCL);
-	if (!xfs_iflock_nowait(ip))
+	if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
 		goto out;
+	if (!xfs_iflock_nowait(ip))
+		goto out_iunlock;
 
 	if (XFS_FORCED_SHUTDOWN(ip->i_mount)) {
 		xfs_iunpin_wait(ip);
@@ -1188,8 +1189,9 @@ xfs_reclaim_inode(
 
 out_ifunlock:
 	xfs_ifunlock(ip);
-out:
+out_iunlock:
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+out:
 	xfs_iflags_clear(ip, XFS_IRECLAIM);
 	return false;
 }
-- 
2.26.2.761.g0e0b3e54be



* [PATCH 21/30] xfs: remove SYNC_TRYLOCK from inode reclaim
  2020-06-22  8:15 [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (19 preceding siblings ...)
  2020-06-22  8:15 ` [PATCH 20/30] xfs: don't block inode reclaim on the ILOCK Dave Chinner
@ 2020-06-22  8:15 ` Dave Chinner
  2020-07-01  4:48   ` [PATCH 21/30 V2] " Dave Chinner
  2020-06-22  8:15 ` [PATCH 22/30] xfs: remove SYNC_WAIT from xfs_reclaim_inodes() Dave Chinner
                   ` (9 subsequent siblings)
  30 siblings, 1 reply; 43+ messages in thread
From: Dave Chinner @ 2020-06-22  8:15 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

All background reclaim is SYNC_TRYLOCK already, and even blocking
reclaim (SYNC_WAIT) can use trylock mechanisms as
xfs_reclaim_inodes_ag() will keep cycling until there are no more
reclaimable inodes. Hence we can kill SYNC_TRYLOCK from inode
reclaim and make everything unconditionally non-blocking.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_icache.c | 29 ++++++++++++-----------------
 1 file changed, 12 insertions(+), 17 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index f387ec21dd35..57ba6d176e42 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -174,7 +174,7 @@ xfs_reclaim_worker(
 	struct xfs_mount *mp = container_of(to_delayed_work(work),
 					struct xfs_mount, m_reclaim_work);
 
-	xfs_reclaim_inodes(mp, SYNC_TRYLOCK);
+	xfs_reclaim_inodes(mp, 0);
 	xfs_reclaim_work_queue(mp);
 }
 
@@ -1030,10 +1030,9 @@ xfs_cowblocks_worker(
  * Grab the inode for reclaim exclusively.
  * Return 0 if we grabbed it, non-zero otherwise.
  */
-STATIC int
+static int
 xfs_reclaim_inode_grab(
-	struct xfs_inode	*ip,
-	int			flags)
+	struct xfs_inode	*ip)
 {
 	ASSERT(rcu_read_lock_held());
 
@@ -1042,12 +1041,10 @@ xfs_reclaim_inode_grab(
 		return 1;
 
 	/*
-	 * If we are asked for non-blocking operation, do unlocked checks to
-	 * see if the inode already is being flushed or in reclaim to avoid
-	 * lock traffic.
+	 * Do unlocked checks to see if the inode already is being flushed or in
+	 * reclaim to avoid lock traffic.
 	 */
-	if ((flags & SYNC_TRYLOCK) &&
-	    __xfs_iflags_test(ip, XFS_IFLOCK | XFS_IRECLAIM))
+	if (__xfs_iflags_test(ip, XFS_IFLOCK | XFS_IRECLAIM))
 		return 1;
 
 	/*
@@ -1114,8 +1111,7 @@ xfs_reclaim_inode_grab(
 static bool
 xfs_reclaim_inode(
 	struct xfs_inode	*ip,
-	struct xfs_perag	*pag,
-	int			sync_mode)
+	struct xfs_perag	*pag)
 {
 	xfs_ino_t		ino = ip->i_ino; /* for radix_tree_delete */
 
@@ -1209,7 +1205,6 @@ xfs_reclaim_inode(
 static int
 xfs_reclaim_inodes_ag(
 	struct xfs_mount	*mp,
-	int			flags,
 	int			*nr_to_scan)
 {
 	struct xfs_perag	*pag;
@@ -1254,7 +1249,7 @@ xfs_reclaim_inodes_ag(
 			for (i = 0; i < nr_found; i++) {
 				struct xfs_inode *ip = batch[i];
 
-				if (done || xfs_reclaim_inode_grab(ip, flags))
+				if (done || xfs_reclaim_inode_grab(ip))
 					batch[i] = NULL;
 
 				/*
@@ -1285,7 +1280,7 @@ xfs_reclaim_inodes_ag(
 			for (i = 0; i < nr_found; i++) {
 				if (!batch[i])
 					continue;
-				if (!xfs_reclaim_inode(batch[i], pag, flags))
+				if (!xfs_reclaim_inode(batch[i], pag))
 					skipped++;
 			}
 
@@ -1311,13 +1306,13 @@ xfs_reclaim_inodes(
 	int		nr_to_scan = INT_MAX;
 	int		skipped;
 
-	xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan);
+	xfs_reclaim_inodes_ag(mp, &nr_to_scan);
 	if (!(mode & SYNC_WAIT))
 		return 0;
 
 	do {
 		xfs_ail_push_all_sync(mp->m_ail);
-		skipped = xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan);
+		skipped = xfs_reclaim_inodes_ag(mp, &nr_to_scan);
 	} while (skipped > 0);
 
 	return 0;
@@ -1341,7 +1336,7 @@ xfs_reclaim_inodes_nr(
 	xfs_reclaim_work_queue(mp);
 	xfs_ail_push_all(mp->m_ail);
 
-	xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK, &nr_to_scan);
+	xfs_reclaim_inodes_ag(mp, &nr_to_scan);
 	return 0;
 }
 
-- 
2.26.2.761.g0e0b3e54be



* [PATCH 22/30] xfs: remove SYNC_WAIT from xfs_reclaim_inodes()
  2020-06-22  8:15 [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (20 preceding siblings ...)
  2020-06-22  8:15 ` [PATCH 21/30] xfs: remove SYNC_TRYLOCK from inode reclaim Dave Chinner
@ 2020-06-22  8:15 ` Dave Chinner
  2020-07-01  4:51   ` [PATCH 22/30 V2] " Dave Chinner
  2020-06-22  8:15 ` [PATCH 23/30] xfs: clean up inode reclaim comments Dave Chinner
                   ` (8 subsequent siblings)
  30 siblings, 1 reply; 43+ messages in thread
From: Dave Chinner @ 2020-06-22  8:15 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Clean up xfs_reclaim_inodes() callers. Most callers want blocking
behaviour, so just make the existing SYNC_WAIT behaviour the
default.

For the xfs_reclaim_worker(), just call xfs_reclaim_inodes_ag()
directly because we just want optimistic clean inode reclaim to be
done in the background.

For xfs_quiesce_attr() we can just remove the inode reclaim calls as
they are a historic relic that was required to flush dirty inodes
that contained unlogged changes. We now log all changes to the
inodes, so the sync AIL push from xfs_log_quiesce() called by
xfs_quiesce_attr() will do all the required inode writeback for
freeze.
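
The background worker then becomes a thin wrapper around a single
non-blocking AG scan, condensed from the diff below:

	void
	xfs_reclaim_worker(
		struct work_struct *work)
	{
		struct xfs_mount *mp = container_of(to_delayed_work(work),
						struct xfs_mount, m_reclaim_work);
		int		nr_to_scan = INT_MAX;

		xfs_reclaim_inodes_ag(mp, &nr_to_scan);
		xfs_reclaim_work_queue(mp);
	}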

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_icache.c | 48 ++++++++++++++++++++-------------------------
 fs/xfs/xfs_icache.h |  2 +-
 fs/xfs/xfs_mount.c  | 11 +++++------
 fs/xfs/xfs_super.c  |  3 ---
 4 files changed, 27 insertions(+), 37 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 57ba6d176e42..0189769f885b 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -160,24 +160,6 @@ xfs_reclaim_work_queue(
 	rcu_read_unlock();
 }
 
-/*
- * This is a fast pass over the inode cache to try to get reclaim moving on as
- * many inodes as possible in a short period of time. It kicks itself every few
- * seconds, as well as being kicked by the inode cache shrinker when memory
- * goes low. It scans as quickly as possible avoiding locked inodes or those
- * already being flushed, and once done schedules a future pass.
- */
-void
-xfs_reclaim_worker(
-	struct work_struct *work)
-{
-	struct xfs_mount *mp = container_of(to_delayed_work(work),
-					struct xfs_mount, m_reclaim_work);
-
-	xfs_reclaim_inodes(mp, 0);
-	xfs_reclaim_work_queue(mp);
-}
-
 static void
 xfs_perag_set_reclaim_tag(
 	struct xfs_perag	*pag)
@@ -1298,24 +1280,17 @@ xfs_reclaim_inodes_ag(
 	return skipped;
 }
 
-int
+void
 xfs_reclaim_inodes(
-	xfs_mount_t	*mp,
-	int		mode)
+	struct xfs_mount	*mp)
 {
 	int		nr_to_scan = INT_MAX;
 	int		skipped;
 
-	xfs_reclaim_inodes_ag(mp, &nr_to_scan);
-	if (!(mode & SYNC_WAIT))
-		return 0;
-
 	do {
 		xfs_ail_push_all_sync(mp->m_ail);
 		skipped = xfs_reclaim_inodes_ag(mp, &nr_to_scan);
 	} while (skipped > 0);
-
-	return 0;
 }
 
 /*
@@ -1434,6 +1409,25 @@ xfs_inode_matches_eofb(
 	return true;
 }
 
+/*
+ * This is a fast pass over the inode cache to try to get reclaim moving on as
+ * many inodes as possible in a short period of time. It kicks itself every few
+ * seconds, as well as being kicked by the inode cache shrinker when memory
+ * goes low. It scans as quickly as possible avoiding locked inodes or those
+ * already being flushed, and once done schedules a future pass.
+ */
+void
+xfs_reclaim_worker(
+	struct work_struct *work)
+{
+	struct xfs_mount *mp = container_of(to_delayed_work(work),
+					struct xfs_mount, m_reclaim_work);
+	int		nr_to_scan = INT_MAX;
+
+	xfs_reclaim_inodes_ag(mp, &nr_to_scan);
+	xfs_reclaim_work_queue(mp);
+}
+
 STATIC int
 xfs_inode_free_eofblocks(
 	struct xfs_inode	*ip,
diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
index 93b54e7d55f0..ae92ca53de42 100644
--- a/fs/xfs/xfs_icache.h
+++ b/fs/xfs/xfs_icache.h
@@ -51,7 +51,7 @@ void xfs_inode_free(struct xfs_inode *ip);
 
 void xfs_reclaim_worker(struct work_struct *work);
 
-int xfs_reclaim_inodes(struct xfs_mount *mp, int mode);
+void xfs_reclaim_inodes(struct xfs_mount *mp);
 int xfs_reclaim_inodes_count(struct xfs_mount *mp);
 long xfs_reclaim_inodes_nr(struct xfs_mount *mp, int nr_to_scan);
 
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 03158b42a194..c8ae49a1e99c 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -1011,7 +1011,7 @@ xfs_mountfs(
 	 * quota inodes.
 	 */
 	cancel_delayed_work_sync(&mp->m_reclaim_work);
-	xfs_reclaim_inodes(mp, SYNC_WAIT);
+	xfs_reclaim_inodes(mp);
 	xfs_health_unmount(mp);
  out_log_dealloc:
 	mp->m_flags |= XFS_MOUNT_UNMOUNTING;
@@ -1088,13 +1088,12 @@ xfs_unmountfs(
 	xfs_ail_push_all_sync(mp->m_ail);
 
 	/*
-	 * And reclaim all inodes.  At this point there should be no dirty
-	 * inodes and none should be pinned or locked, but use synchronous
-	 * reclaim just to be sure. We can stop background inode reclaim
-	 * here as well if it is still running.
+	 * Reclaim all inodes. At this point there should be no dirty inodes and
+	 * none should be pinned or locked. Stop background inode reclaim here
+	 * if it is still running.
 	 */
 	cancel_delayed_work_sync(&mp->m_reclaim_work);
-	xfs_reclaim_inodes(mp, SYNC_WAIT);
+	xfs_reclaim_inodes(mp);
 	xfs_health_unmount(mp);
 
 	xfs_qm_unmount(mp);
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 379cbff438bc..5a5d9453cf51 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -890,9 +890,6 @@ xfs_quiesce_attr(
 	/* force the log to unpin objects from the now complete transactions */
 	xfs_log_force(mp, XFS_LOG_SYNC);
 
-	/* reclaim inodes to do any IO before the freeze completes */
-	xfs_reclaim_inodes(mp, 0);
-	xfs_reclaim_inodes(mp, SYNC_WAIT);
 
 	/* Push the superblock and write an unmount record */
 	error = xfs_log_sbcount(mp);
-- 
2.26.2.761.g0e0b3e54be



* [PATCH 23/30] xfs: clean up inode reclaim comments
  2020-06-22  8:15 [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (21 preceding siblings ...)
  2020-06-22  8:15 ` [PATCH 22/30] xfs: remove SYNC_WAIT from xfs_reclaim_inodes() Dave Chinner
@ 2020-06-22  8:15 ` Dave Chinner
  2020-06-22  8:15 ` [PATCH 24/30] xfs: rework stale inodes in xfs_ifree_cluster Dave Chinner
                   ` (7 subsequent siblings)
  30 siblings, 0 replies; 43+ messages in thread
From: Dave Chinner @ 2020-06-22  8:15 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Inode reclaim is quite different now to the way described in various
comments, so update all the comments explaining what it does and how
it works.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_icache.c | 128 ++++++++++++--------------------------------
 1 file changed, 35 insertions(+), 93 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 0189769f885b..94cef017a107 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -141,11 +141,8 @@ xfs_inode_free(
 }
 
 /*
- * Queue a new inode reclaim pass if there are reclaimable inodes and there
- * isn't a reclaim pass already in progress. By default it runs every 5s based
- * on the xfs periodic sync default of 30s. Perhaps this should have it's own
- * tunable, but that can be done if this method proves to be ineffective or too
- * aggressive.
+ * Queue background inode reclaim work if there are reclaimable inodes and there
+ * isn't reclaim work already scheduled or in progress.
  */
 static void
 xfs_reclaim_work_queue(
@@ -600,48 +597,31 @@ xfs_iget_cache_miss(
 }
 
 /*
- * Look up an inode by number in the given file system.
- * The inode is looked up in the cache held in each AG.
- * If the inode is found in the cache, initialise the vfs inode
- * if necessary.
+ * Look up an inode by number in the given file system.  The inode is looked up
+ * in the cache held in each AG.  If the inode is found in the cache, initialise
+ * the vfs inode if necessary.
  *
- * If it is not in core, read it in from the file system's device,
- * add it to the cache and initialise the vfs inode.
+ * If it is not in core, read it in from the file system's device, add it to the
+ * cache and initialise the vfs inode.
  *
  * The inode is locked according to the value of the lock_flags parameter.
- * This flag parameter indicates how and if the inode's IO lock and inode lock
- * should be taken.
- *
- * mp -- the mount point structure for the current file system.  It points
- *       to the inode hash table.
- * tp -- a pointer to the current transaction if there is one.  This is
- *       simply passed through to the xfs_iread() call.
- * ino -- the number of the inode desired.  This is the unique identifier
- *        within the file system for the inode being requested.
- * lock_flags -- flags indicating how to lock the inode.  See the comment
- *		 for xfs_ilock() for a list of valid values.
+ * Inode lookup is only done during metadata operations and not as part of the
+ * data IO path. Hence we only allow locking of the XFS_ILOCK during lookup.
  */
 int
 xfs_iget(
-	xfs_mount_t	*mp,
-	xfs_trans_t	*tp,
-	xfs_ino_t	ino,
-	uint		flags,
-	uint		lock_flags,
-	xfs_inode_t	**ipp)
+	struct xfs_mount	*mp,
+	struct xfs_trans	*tp,
+	xfs_ino_t		ino,
+	uint			flags,
+	uint			lock_flags,
+	struct xfs_inode	**ipp)
 {
-	xfs_inode_t	*ip;
-	int		error;
-	xfs_perag_t	*pag;
-	xfs_agino_t	agino;
+	struct xfs_inode	*ip;
+	struct xfs_perag	*pag;
+	xfs_agino_t		agino;
+	int			error;
 
-	/*
-	 * xfs_reclaim_inode() uses the ILOCK to ensure an inode
-	 * doesn't get freed while it's being referenced during a
-	 * radix tree traversal here.  It assumes this function
-	 * aqcuires only the ILOCK (and therefore it has no need to
-	 * involve the IOLOCK in this synchronization).
-	 */
 	ASSERT((lock_flags & (XFS_IOLOCK_EXCL | XFS_IOLOCK_SHARED)) == 0);
 
 	/* reject inode numbers outside existing AGs */
@@ -758,15 +738,7 @@ xfs_inode_walk_ag_grab(
 
 	ASSERT(rcu_read_lock_held());
 
-	/*
-	 * check for stale RCU freed inode
-	 *
-	 * If the inode has been reallocated, it doesn't matter if it's not in
-	 * the AG we are walking - we are walking for writeback, so if it
-	 * passes all the "valid inode" checks and is dirty, then we'll write
-	 * it back anyway.  If it has been reallocated and still being
-	 * initialised, the XFS_INEW check below will catch it.
-	 */
+	/* Check for stale RCU freed inode */
 	spin_lock(&ip->i_flags_lock);
 	if (!ip->i_ino)
 		goto out_unlock_noent;
@@ -1052,43 +1024,16 @@ xfs_reclaim_inode_grab(
 }
 
 /*
- * Inodes in different states need to be treated differently. The following
- * table lists the inode states and the reclaim actions necessary:
- *
- *	inode state	     iflush ret		required action
- *      ---------------      ----------         ---------------
- *	bad			-		reclaim
- *	shutdown		EIO		unpin and reclaim
- *	clean, unpinned		0		reclaim
- *	stale, unpinned		0		reclaim
- *	clean, pinned(*)	0		requeue
- *	stale, pinned		EAGAIN		requeue
- *	dirty, async		-		requeue
- *	dirty, sync		0		reclaim
+ * Inode reclaim is non-blocking, so the default action if progress cannot be
+ * made is to "requeue" the inode for reclaim by unlocking it and clearing the
+ * XFS_IRECLAIM flag.  If we are in a shutdown state, we don't care about
+ * blocking anymore and hence we can wait for the inode to be able to reclaim
+ * it.
  *
- * (*) dgc: I don't think the clean, pinned state is possible but it gets
- * handled anyway given the order of checks implemented.
- *
- * Also, because we get the flush lock first, we know that any inode that has
- * been flushed delwri has had the flush completed by the time we check that
- * the inode is clean.
- *
- * Note that because the inode is flushed delayed write by AIL pushing, the
- * flush lock may already be held here and waiting on it can result in very
- * long latencies.  Hence for sync reclaims, where we wait on the flush lock,
- * the caller should push the AIL first before trying to reclaim inodes to
- * minimise the amount of time spent waiting.  For background relaim, we only
- * bother to reclaim clean inodes anyway.
- *
- * Hence the order of actions after gaining the locks should be:
- *	bad		=> reclaim
- *	shutdown	=> unpin and reclaim
- *	pinned, async	=> requeue
- *	pinned, sync	=> unpin
- *	stale		=> reclaim
- *	clean		=> reclaim
- *	dirty, async	=> requeue
- *	dirty, sync	=> flush, wait and reclaim
+ * We do no IO here - if callers require inodes to be cleaned they must push the
+ * AIL first to trigger writeback of dirty inodes.  This enables writeback to be
+ * done in the background in a non-blocking manner, and enables memory reclaim
+ * to make progress without blocking.
  */
 static bool
 xfs_reclaim_inode(
@@ -1294,13 +1239,11 @@ xfs_reclaim_inodes(
 }
 
 /*
- * Scan a certain number of inodes for reclaim.
- *
- * When called we make sure that there is a background (fast) inode reclaim in
- * progress, while we will throttle the speed of reclaim via doing synchronous
- * reclaim of inodes. That means if we come across dirty inodes, we wait for
- * them to be cleaned, which we hope will not be very long due to the
- * background walker having already kicked the IO off on those dirty inodes.
+ * The shrinker infrastructure determines how many inodes we should scan for
+ * reclaim. We want as many clean inodes ready to reclaim as possible, so we
+ * push the AIL here. We also want to proactively free up memory if we can to
+ * minimise the amount of work memory reclaim has to do so we kick the
+ * background reclaim if it isn't already scheduled.
  */
 long
 xfs_reclaim_inodes_nr(
@@ -1413,8 +1356,7 @@ xfs_inode_matches_eofb(
  * This is a fast pass over the inode cache to try to get reclaim moving on as
  * many inodes as possible in a short period of time. It kicks itself every few
  * seconds, as well as being kicked by the inode cache shrinker when memory
- * goes low. It scans as quickly as possible avoiding locked inodes or those
- * already being flushed, and once done schedules a future pass.
+ * goes low.
  */
 void
 xfs_reclaim_worker(
-- 
2.26.2.761.g0e0b3e54be



* [PATCH 24/30] xfs: rework stale inodes in xfs_ifree_cluster
  2020-06-22  8:15 [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (22 preceding siblings ...)
  2020-06-22  8:15 ` [PATCH 23/30] xfs: clean up inode reclaim comments Dave Chinner
@ 2020-06-22  8:15 ` Dave Chinner
  2020-06-22  8:16 ` [PATCH 25/30] xfs: attach inodes to the cluster buffer when dirtied Dave Chinner
                   ` (6 subsequent siblings)
  30 siblings, 0 replies; 43+ messages in thread
From: Dave Chinner @ 2020-06-22  8:15 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Once inodes pin the cluster buffer and are attached to it whenever
they are dirty, we no longer have a guarantee that the items are
flush locked when we lock the cluster buffer. Hence we cannot just
walk the buffer log item list and modify the attached inodes.

If the inode is not flush locked, we have to ILOCK it first and then
flush lock it to do all the prerequisite checks needed to avoid
races with other code. This is already handled by
xfs_ifree_get_one_inode(), so rework the inode iteration loop and
function to update all inodes in cache whether they are attached to
the buffer or not.

Note: we also remove the copying of the log item lsn to the
ili_flush_lsn as xfs_iflush_done() now uses the XFS_ISTALE flag to
trigger aborts and so flush lsn matching is not needed in IO
completion for processing freed inodes.
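
Condensed from the diff below, once the inode is marked stale the
per-inode logic in xfs_ifree_mark_inode_stale() distinguishes three
cases (RCU and ILOCK handling elided):

	ip->i_flags |= XFS_ISTALE;

	iip = ip->i_itemp;
	if (!xfs_iflock_nowait(ip)) {
		/*
		 * Already flush locked: the inode is attached to the
		 * buffer and IO completion will remove it from the AIL.
		 */
		goto out_iunlock;
	}

	if (xfs_inode_clean(ip)) {
		/* clean inodes can be released immediately */
		xfs_ifunlock(ip);
		goto out_iunlock;
	}

	/* dirty in memory: set up flush state and attach to the buffer */
	spin_lock(&iip->ili_lock);
	iip->ili_last_fields = iip->ili_fields;
	iip->ili_fields = 0;
	iip->ili_fsync_fields = 0;
	spin_unlock(&iip->ili_lock);
	list_add_tail(&iip->ili_item.li_bio_list, &bp->b_li_list);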

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_inode.c | 161 +++++++++++++++++----------------------------
 1 file changed, 62 insertions(+), 99 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 8d638f14ab4e..e62e7133679a 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2517,17 +2517,19 @@ xfs_iunlink_remove(
 }
 
 /*
- * Look up the inode number specified and mark it stale if it is found. If it is
- * dirty, return the inode so it can be attached to the cluster buffer so it can
- * be processed appropriately when the cluster free transaction completes.
+ * Look up the inode number specified and if it is not already marked XFS_ISTALE
+ * mark it stale. We should only find clean inodes in this lookup that aren't
+ * already stale.
  */
-static struct xfs_inode *
-xfs_ifree_get_one_inode(
-	struct xfs_perag	*pag,
+static void
+xfs_ifree_mark_inode_stale(
+	struct xfs_buf		*bp,
 	struct xfs_inode	*free_ip,
 	xfs_ino_t		inum)
 {
-	struct xfs_mount	*mp = pag->pag_mount;
+	struct xfs_mount	*mp = bp->b_mount;
+	struct xfs_perag	*pag = bp->b_pag;
+	struct xfs_inode_log_item *iip;
 	struct xfs_inode	*ip;
 
 retry:
@@ -2535,8 +2537,10 @@ xfs_ifree_get_one_inode(
 	ip = radix_tree_lookup(&pag->pag_ici_root, XFS_INO_TO_AGINO(mp, inum));
 
 	/* Inode not in memory, nothing to do */
-	if (!ip)
-		goto out_rcu_unlock;
+	if (!ip) {
+		rcu_read_unlock();
+		return;
+	}
 
 	/*
 	 * because this is an RCU protected lookup, we could find a recently
@@ -2547,9 +2551,9 @@ xfs_ifree_get_one_inode(
 	spin_lock(&ip->i_flags_lock);
 	if (ip->i_ino != inum || __xfs_iflags_test(ip, XFS_ISTALE)) {
 		spin_unlock(&ip->i_flags_lock);
-		goto out_rcu_unlock;
+		rcu_read_unlock();
+		return;
 	}
-	spin_unlock(&ip->i_flags_lock);
 
 	/*
 	 * Don't try to lock/unlock the current inode, but we _cannot_ skip the
@@ -2559,43 +2563,53 @@ xfs_ifree_get_one_inode(
 	 */
 	if (ip != free_ip) {
 		if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL)) {
+			spin_unlock(&ip->i_flags_lock);
 			rcu_read_unlock();
 			delay(1);
 			goto retry;
 		}
-
-		/*
-		 * Check the inode number again in case we're racing with
-		 * freeing in xfs_reclaim_inode().  See the comments in that
-		 * function for more information as to why the initial check is
-		 * not sufficient.
-		 */
-		if (ip->i_ino != inum) {
-			xfs_iunlock(ip, XFS_ILOCK_EXCL);
-			goto out_rcu_unlock;
-		}
 	}
+	ip->i_flags |= XFS_ISTALE;
+	spin_unlock(&ip->i_flags_lock);
 	rcu_read_unlock();
 
-	xfs_iflock(ip);
-	xfs_iflags_set(ip, XFS_ISTALE);
+	/*
+	 * If we can't get the flush lock, the inode is already attached.  All
+	 * we needed to do here is mark the inode stale so buffer IO completion
+	 * will remove it from the AIL.
+	 */
+	iip = ip->i_itemp;
+	if (!xfs_iflock_nowait(ip)) {
+		ASSERT(!list_empty(&iip->ili_item.li_bio_list));
+		ASSERT(iip->ili_last_fields);
+		goto out_iunlock;
+	}
+	ASSERT(!iip || list_empty(&iip->ili_item.li_bio_list));
 
 	/*
-	 * We don't need to attach clean inodes or those only with unlogged
-	 * changes (which we throw away, anyway).
+	 * Clean inodes can be released immediately.  Everything else has to go
+	 * through xfs_iflush_abort() on journal commit as the flock
+	 * synchronises removal of the inode from the cluster buffer against
+	 * inode reclaim.
 	 */
-	if (!ip->i_itemp || xfs_inode_clean(ip)) {
-		ASSERT(ip != free_ip);
+	if (xfs_inode_clean(ip)) {
 		xfs_ifunlock(ip);
-		xfs_iunlock(ip, XFS_ILOCK_EXCL);
-		goto out_no_inode;
+		goto out_iunlock;
 	}
-	return ip;
 
-out_rcu_unlock:
-	rcu_read_unlock();
-out_no_inode:
-	return NULL;
+	/* we have a dirty inode in memory that has not yet been flushed. */
+	ASSERT(iip->ili_fields);
+	spin_lock(&iip->ili_lock);
+	iip->ili_last_fields = iip->ili_fields;
+	iip->ili_fields = 0;
+	iip->ili_fsync_fields = 0;
+	spin_unlock(&iip->ili_lock);
+	list_add_tail(&iip->ili_item.li_bio_list, &bp->b_li_list);
+	ASSERT(iip->ili_last_fields);
+
+out_iunlock:
+	if (ip != free_ip)
+		xfs_iunlock(ip, XFS_ILOCK_EXCL);
 }
 
 /*
@@ -2605,26 +2619,20 @@ xfs_ifree_get_one_inode(
  */
 STATIC int
 xfs_ifree_cluster(
-	xfs_inode_t		*free_ip,
-	xfs_trans_t		*tp,
+	struct xfs_inode	*free_ip,
+	struct xfs_trans	*tp,
 	struct xfs_icluster	*xic)
 {
-	xfs_mount_t		*mp = free_ip->i_mount;
+	struct xfs_mount	*mp = free_ip->i_mount;
+	struct xfs_ino_geometry	*igeo = M_IGEO(mp);
+	struct xfs_buf		*bp;
+	xfs_daddr_t		blkno;
+	xfs_ino_t		inum = xic->first_ino;
 	int			nbufs;
 	int			i, j;
 	int			ioffset;
-	xfs_daddr_t		blkno;
-	xfs_buf_t		*bp;
-	xfs_inode_t		*ip;
-	struct xfs_inode_log_item *iip;
-	struct xfs_log_item	*lip;
-	struct xfs_perag	*pag;
-	struct xfs_ino_geometry	*igeo = M_IGEO(mp);
-	xfs_ino_t		inum;
 	int			error;
 
-	inum = xic->first_ino;
-	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, inum));
 	nbufs = igeo->ialloc_blks / igeo->blocks_per_cluster;
 
 	for (j = 0; j < nbufs; j++, inum += igeo->inodes_per_cluster) {
@@ -2653,10 +2661,8 @@ xfs_ifree_cluster(
 		error = xfs_trans_get_buf(tp, mp->m_ddev_targp, blkno,
 				mp->m_bsize * igeo->blocks_per_cluster,
 				XBF_UNMAPPED, &bp);
-		if (error) {
-			xfs_perag_put(pag);
+		if (error)
 			return error;
-		}
 
 		/*
 		 * This buffer may not have been correctly initialised as we
@@ -2670,59 +2676,16 @@ xfs_ifree_cluster(
 		bp->b_ops = &xfs_inode_buf_ops;
 
 		/*
-		 * Walk the inodes already attached to the buffer and mark them
-		 * stale. These will all have the flush locks held, so an
-		 * in-memory inode walk can't lock them. By marking them all
-		 * stale first, we will not attempt to lock them in the loop
-		 * below as the XFS_ISTALE flag will be set.
+		 * Now we need to set all the cached clean inodes as XFS_ISTALE,
+		 * too. This requires lookups, and will skip inodes that we've
+		 * already marked XFS_ISTALE.
 		 */
-		list_for_each_entry(lip, &bp->b_li_list, li_bio_list) {
-			if (lip->li_type == XFS_LI_INODE) {
-				iip = (struct xfs_inode_log_item *)lip;
-				xfs_trans_ail_copy_lsn(mp->m_ail,
-							&iip->ili_flush_lsn,
-							&iip->ili_item.li_lsn);
-				xfs_iflags_set(iip->ili_inode, XFS_ISTALE);
-			}
-		}
-
-
-		/*
-		 * For each inode in memory attempt to add it to the inode
-		 * buffer and set it up for being staled on buffer IO
-		 * completion.  This is safe as we've locked out tail pushing
-		 * and flushing by locking the buffer.
-		 *
-		 * We have already marked every inode that was part of a
-		 * transaction stale above, which means there is no point in
-		 * even trying to lock them.
-		 */
-		for (i = 0; i < igeo->inodes_per_cluster; i++) {
-			ip = xfs_ifree_get_one_inode(pag, free_ip, inum + i);
-			if (!ip)
-				continue;
-
-			iip = ip->i_itemp;
-			spin_lock(&iip->ili_lock);
-			iip->ili_last_fields = iip->ili_fields;
-			iip->ili_fields = 0;
-			iip->ili_fsync_fields = 0;
-			spin_unlock(&iip->ili_lock);
-			xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,
-						&iip->ili_item.li_lsn);
-
-			list_add_tail(&iip->ili_item.li_bio_list,
-						&bp->b_li_list);
-
-			if (ip != free_ip)
-				xfs_iunlock(ip, XFS_ILOCK_EXCL);
-		}
+		for (i = 0; i < igeo->inodes_per_cluster; i++)
+			xfs_ifree_mark_inode_stale(bp, free_ip, inum + i);
 
 		xfs_trans_stale_inode_buf(tp, bp);
 		xfs_trans_binval(tp, bp);
 	}
-
-	xfs_perag_put(pag);
 	return 0;
 }
 
-- 
2.26.2.761.g0e0b3e54be



* [PATCH 25/30] xfs: attach inodes to the cluster buffer when dirtied
  2020-06-22  8:15 [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (23 preceding siblings ...)
  2020-06-22  8:15 ` [PATCH 24/30] xfs: rework stale inodes in xfs_ifree_cluster Dave Chinner
@ 2020-06-22  8:16 ` Dave Chinner
  2020-06-22  8:16 ` [PATCH 26/30] xfs: xfs_iflush() is no longer necessary Dave Chinner
                   ` (5 subsequent siblings)
  30 siblings, 0 replies; 43+ messages in thread
From: Dave Chinner @ 2020-06-22  8:16 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Rather than attach inodes to the cluster buffer just when we are
doing IO, attach the inodes to the cluster buffer when they are
dirtied. This means the buffer always carries a list of dirty inodes
that reference it, and we can use that list to make more fundamental
changes to inode writeback that aren't otherwise possible.
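
Schematically, the logging-time attach looks like this (a condensed
sketch of the xfs_trans_log_inode() hunk below, with the surrounding
logging code and the eventual ili_lock unlock omitted):

	/*
	 * First time this inode is logged: take our own reference to
	 * the cluster buffer, point the log item at it, and link the
	 * item onto the buffer's list of attached inode log items
	 * before dropping the transaction's buffer reference.
	 */
	xfs_buf_hold(bp);
	spin_lock(&iip->ili_lock);
	iip->ili_item.li_buf = bp;
	bp->b_flags |= _XBF_INODES;
	list_add_tail(&iip->ili_item.li_bio_list, &bp->b_li_list);
	xfs_trans_brelse(tp, bp);

xfs_iflush_abort() gains the matching list_del_init() so an aborted
inode also drops off the buffer list.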

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_trans_inode.c |  9 ++++++---
 fs/xfs/xfs_buf_item.c           |  1 +
 fs/xfs/xfs_icache.c             |  1 +
 fs/xfs/xfs_inode.c              | 24 +++++-------------------
 fs/xfs/xfs_inode_item.c         | 16 ++++++++++++++--
 5 files changed, 27 insertions(+), 24 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
index ad5974365c58..e15129647e00 100644
--- a/fs/xfs/libxfs/xfs_trans_inode.c
+++ b/fs/xfs/libxfs/xfs_trans_inode.c
@@ -163,13 +163,16 @@ xfs_trans_log_inode(
 		/*
 		 * We need an explicit buffer reference for the log item but
 		 * don't want the buffer to remain attached to the transaction.
-		 * Hold the buffer but release the transaction reference.
+		 * Hold the buffer but release the transaction reference once
+		 * we've attached the inode log item to the buffer log item
+		 * list.
 		 */
 		xfs_buf_hold(bp);
-		xfs_trans_brelse(tp, bp);
-
 		spin_lock(&iip->ili_lock);
 		iip->ili_item.li_buf = bp;
+		bp->b_flags |= _XBF_INODES;
+		list_add_tail(&iip->ili_item.li_bio_list, &bp->b_li_list);
+		xfs_trans_brelse(tp, bp);
 	}
 
 	/*
diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index ecb3362395af..e9428c30862a 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -465,6 +465,7 @@ xfs_buf_item_unpin(
 		if (bip->bli_flags & XFS_BLI_STALE_INODE) {
 			xfs_buf_item_done(bp);
 			xfs_iflush_done(bp);
+			ASSERT(list_empty(&bp->b_li_list));
 		} else {
 			xfs_trans_ail_delete(lip, SHUTDOWN_LOG_IO_ERROR);
 			xfs_buf_item_relse(bp);
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 94cef017a107..a973f180c6cd 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -115,6 +115,7 @@ __xfs_inode_free(
 {
 	/* asserts to verify all state is correct here */
 	ASSERT(atomic_read(&ip->i_pincount) == 0);
+	ASSERT(!ip->i_itemp || list_empty(&ip->i_itemp->ili_item.li_bio_list));
 	XFS_STATS_DEC(ip->i_mount, vn_active);
 
 	call_rcu(&VFS_I(ip)->i_rcu, xfs_inode_free_callback);
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index e62e7133679a..382e9a8bbfbf 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2584,27 +2584,24 @@ xfs_ifree_mark_inode_stale(
 		ASSERT(iip->ili_last_fields);
 		goto out_iunlock;
 	}
-	ASSERT(!iip || list_empty(&iip->ili_item.li_bio_list));
 
 	/*
-	 * Clean inodes can be released immediately.  Everything else has to go
-	 * through xfs_iflush_abort() on journal commit as the flock
-	 * synchronises removal of the inode from the cluster buffer against
-	 * inode reclaim.
+	 * Inodes not attached to the buffer can be released immediately.
+	 * Everything else has to go through xfs_iflush_abort() on journal
+	 * commit as the flock synchronises removal of the inode from the
+	 * cluster buffer against inode reclaim.
 	 */
-	if (xfs_inode_clean(ip)) {
+	if (!iip || list_empty(&iip->ili_item.li_bio_list)) {
 		xfs_ifunlock(ip);
 		goto out_iunlock;
 	}
 
 	/* we have a dirty inode in memory that has not yet been flushed. */
-	ASSERT(iip->ili_fields);
 	spin_lock(&iip->ili_lock);
 	iip->ili_last_fields = iip->ili_fields;
 	iip->ili_fields = 0;
 	iip->ili_fsync_fields = 0;
 	spin_unlock(&iip->ili_lock);
-	list_add_tail(&iip->ili_item.li_bio_list, &bp->b_li_list);
 	ASSERT(iip->ili_last_fields);
 
 out_iunlock:
@@ -3818,19 +3815,8 @@ xfs_iflush_int(
 	xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,
 				&iip->ili_item.li_lsn);
 
-	/*
-	 * Attach the inode item callback to the buffer whether the flush
-	 * succeeded or not. If not, the caller will shut down and fail I/O
-	 * completion on the buffer to remove the inode from the AIL and release
-	 * the flush lock.
-	 */
-	bp->b_flags |= _XBF_INODES;
-	list_add_tail(&iip->ili_item.li_bio_list, &bp->b_li_list);
-
 	/* generate the checksum. */
 	xfs_dinode_calc_crc(mp, dip);
-
-	ASSERT(!list_empty(&bp->b_li_list));
 	return error;
 }
 
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 64bdda72f7b2..697248b7eb2b 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -660,6 +660,10 @@ xfs_inode_item_destroy(
  * list for other inodes that will run this function. We remove them from the
  * buffer list so we can process all the inode IO completions in one AIL lock
  * traversal.
+ *
+ * Note: Now that we attach the log item to the buffer when we first log the
+ * inode in memory, we can have unflushed inodes on the buffer list here. These
+ * inodes will have a zero ili_last_fields, so skip over them here.
  */
 void
 xfs_iflush_done(
@@ -677,12 +681,15 @@ xfs_iflush_done(
 	 */
 	list_for_each_entry_safe(lip, n, &bp->b_li_list, li_bio_list) {
 		iip = INODE_ITEM(lip);
+
 		if (xfs_iflags_test(iip->ili_inode, XFS_ISTALE)) {
-			list_del_init(&lip->li_bio_list);
 			xfs_iflush_abort(iip->ili_inode);
 			continue;
 		}
 
+		if (!iip->ili_last_fields)
+			continue;
+
 		list_move_tail(&lip->li_bio_list, &tmp);
 
 		/* Do an unlocked check for needing the AIL lock. */
@@ -728,12 +735,16 @@ xfs_iflush_done(
 		/*
 		 * Remove the reference to the cluster buffer if the inode is
 		 * clean in memory. Drop the buffer reference once we've dropped
-		 * the locks we hold.
+		 * the locks we hold. If the inode is dirty in memory, we need
+		 * to put the inode item back on the buffer list for another
+		 * pass through the flush machinery.
 		 */
 		ASSERT(iip->ili_item.li_buf == bp);
 		if (!iip->ili_fields) {
 			iip->ili_item.li_buf = NULL;
 			drop_buffer = true;
+		} else {
+			list_add(&lip->li_bio_list, &bp->b_li_list);
 		}
 		iip->ili_last_fields = 0;
 		iip->ili_flush_lsn = 0;
@@ -777,6 +788,7 @@ xfs_iflush_abort(
 		iip->ili_flush_lsn = 0;
 		bp = iip->ili_item.li_buf;
 		iip->ili_item.li_buf = NULL;
+		list_del_init(&iip->ili_item.li_bio_list);
 		spin_unlock(&iip->ili_lock);
 	}
 	xfs_ifunlock(ip);
-- 
2.26.2.761.g0e0b3e54be



* [PATCH 26/30] xfs: xfs_iflush() is no longer necessary
  2020-06-22  8:15 [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (24 preceding siblings ...)
  2020-06-22  8:16 ` [PATCH 25/30] xfs: attach inodes to the cluster buffer when dirtied Dave Chinner
@ 2020-06-22  8:16 ` Dave Chinner
  2020-06-22  8:16 ` [PATCH 27/30] xfs: rename xfs_iflush_int() Dave Chinner
                   ` (4 subsequent siblings)
  30 siblings, 0 replies; 43+ messages in thread
From: Dave Chinner @ 2020-06-22  8:16 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Now that we have a cached buffer on inode log items, we don't need
to do buffer lookups when flushing inodes anymore - all we need
to do is lock the buffer and we are ready to go.

This largely gets rid of the need for xfs_iflush(), which is
essentially just a mechanism to look up the buffer and flush the
inode to it. Instead, we can just call xfs_iflush_cluster() with a
few modifications to ensure it also flushes the inode we already
hold locked.

This allows the AIL inode item pushing to be almost entirely
non-blocking in XFS - we won't block unless memory allocation
for the cluster inode lookup blocks or the block device queues are
full.

Writeback during inode reclaim becomes a little more complex because
we now have to lock the buffer ourselves, but otherwise this change
is largely a functional no-op that removes a whole lot of code.
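
With that, the push path reduces to roughly the following (a sketch
condensed from the xfs_inode_item_push() changes below; the pin,
stale and flush-lock short-circuits are omitted here):

	/* bp is the cluster buffer cached on the inode log item */
	if (!xfs_buf_trylock(bp))
		return XFS_ITEM_LOCKED;
	spin_unlock(&lip->li_ailp->ail_lock);

	/* hold a reference across the flush and delwri queueing */
	xfs_buf_hold(bp);
	error = xfs_iflush_cluster(ip, bp);
	if (!error) {
		if (!xfs_buf_delwri_queue(bp, buffer_list))
			rval = XFS_ITEM_FLUSHING;
		xfs_buf_relse(bp);
	} else {
		rval = XFS_ITEM_LOCKED;
	}
	spin_lock(&lip->li_ailp->ail_lock);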

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_inode.c      | 107 ++++++----------------------------------
 fs/xfs/xfs_inode.h      |   2 +-
 fs/xfs/xfs_inode_item.c |  51 +++++++------------
 3 files changed, 33 insertions(+), 127 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 382e9a8bbfbf..74d09487f702 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3450,7 +3450,18 @@ xfs_rename(
 	return error;
 }
 
-STATIC int
+/*
+ * Non-blocking flush of dirty inode metadata into the backing buffer.
+ *
+ * The caller must have a reference to the inode and hold the cluster buffer
+ * locked. The function will walk across all the inodes on the cluster buffer it
+ * can find and lock without blocking, and flush them to the cluster buffer.
+ *
+ * On success, the caller must write out the buffer returned in *bp and
+ * release it. On failure, the filesystem will be shut down, the buffer will
+ * have been unlocked and released, and EFSCORRUPTED will be returned.
+ */
+int
 xfs_iflush_cluster(
 	struct xfs_inode	*ip,
 	struct xfs_buf		*bp)
@@ -3485,8 +3496,6 @@ xfs_iflush_cluster(
 
 	for (i = 0; i < nr_found; i++) {
 		cip = cilist[i];
-		if (cip == ip)
-			continue;
 
 		/*
 		 * because this is an RCU protected lookup, we could find a
@@ -3577,99 +3586,11 @@ xfs_iflush_cluster(
 	kmem_free(cilist);
 out_put:
 	xfs_perag_put(pag);
-	return error;
-}
-
-/*
- * Flush dirty inode metadata into the backing buffer.
- *
- * The caller must have the inode lock and the inode flush lock held.  The
- * inode lock will still be held upon return to the caller, and the inode
- * flush lock will be released after the inode has reached the disk.
- *
- * The caller must write out the buffer returned in *bpp and release it.
- */
-int
-xfs_iflush(
-	struct xfs_inode	*ip,
-	struct xfs_buf		**bpp)
-{
-	struct xfs_mount	*mp = ip->i_mount;
-	struct xfs_buf		*bp = NULL;
-	struct xfs_dinode	*dip;
-	int			error;
-
-	XFS_STATS_INC(mp, xs_iflush_count);
-
-	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL|XFS_ILOCK_SHARED));
-	ASSERT(xfs_isiflocked(ip));
-	ASSERT(ip->i_df.if_format != XFS_DINODE_FMT_BTREE ||
-	       ip->i_df.if_nextents > XFS_IFORK_MAXEXT(ip, XFS_DATA_FORK));
-
-	*bpp = NULL;
-
-	xfs_iunpin_wait(ip);
-
-	/*
-	 * For stale inodes we cannot rely on the backing buffer remaining
-	 * stale in cache for the remaining life of the stale inode and so
-	 * xfs_imap_to_bp() below may give us a buffer that no longer contains
-	 * inodes below. We have to check this after ensuring the inode is
-	 * unpinned so that it is safe to reclaim the stale inode after the
-	 * flush call.
-	 */
-	if (xfs_iflags_test(ip, XFS_ISTALE)) {
-		xfs_ifunlock(ip);
-		return 0;
-	}
-
-	/*
-	 * Get the buffer containing the on-disk inode. We are doing a try-lock
-	 * operation here, so we may get an EAGAIN error. In that case, return
-	 * leaving the inode dirty.
-	 *
-	 * If we get any other error, we effectively have a corruption situation
-	 * and we cannot flush the inode. Abort the flush and shut down.
-	 */
-	error = xfs_imap_to_bp(mp, NULL, &ip->i_imap, &dip, &bp, XBF_TRYLOCK);
-	if (error == -EAGAIN) {
-		xfs_ifunlock(ip);
-		return error;
-	}
-	if (error)
-		goto abort;
-
-	/*
-	 * If the buffer is pinned then push on the log now so we won't
-	 * get stuck waiting in the write for too long.
-	 */
-	if (xfs_buf_ispinned(bp))
-		xfs_log_force(mp, 0);
-
-	/*
-	 * Flush the provided inode then attempt to gather others from the
-	 * cluster into the write.
-	 *
-	 * Note: Once we attempt to flush an inode, we must run buffer
-	 * completion callbacks on any failure. If this fails, simulate an I/O
-	 * failure on the buffer and shut down.
-	 */
-	error = xfs_iflush_int(ip, bp);
-	if (!error)
-		error = xfs_iflush_cluster(ip, bp);
 	if (error) {
 		bp->b_flags |= XBF_ASYNC;
 		xfs_buf_ioend_fail(bp);
-		goto shutdown;
+		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
 	}
-
-	*bpp = bp;
-	return 0;
-
-abort:
-	xfs_iflush_abort(ip);
-shutdown:
-	xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
 	return error;
 }
 
@@ -3687,7 +3608,7 @@ xfs_iflush_int(
 	ASSERT(xfs_isiflocked(ip));
 	ASSERT(ip->i_df.if_format != XFS_DINODE_FMT_BTREE ||
 	       ip->i_df.if_nextents > XFS_IFORK_MAXEXT(ip, XFS_DATA_FORK));
-	ASSERT(iip != NULL && iip->ili_fields != 0);
+	ASSERT(iip->ili_item.li_buf == bp);
 
 	dip = xfs_buf_offset(bp, ip->i_imap.im_boffset);
 
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 47d3b391030d..9ae1b6a830e1 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -426,7 +426,7 @@ int		xfs_log_force_inode(struct xfs_inode *ip);
 void		xfs_iunpin_wait(xfs_inode_t *);
 #define xfs_ipincount(ip)	((unsigned int) atomic_read(&ip->i_pincount))
 
-int		xfs_iflush(struct xfs_inode *, struct xfs_buf **);
+int		xfs_iflush_cluster(struct xfs_inode *, struct xfs_buf *);
 void		xfs_lock_two_inodes(struct xfs_inode *ip0, uint ip0_mode,
 				struct xfs_inode *ip1, uint ip1_mode);
 
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 697248b7eb2b..e8eda2ac25fb 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -485,53 +485,38 @@ xfs_inode_item_push(
 	uint			rval = XFS_ITEM_SUCCESS;
 	int			error;
 
-	if (xfs_ipincount(ip) > 0)
+	ASSERT(iip->ili_item.li_buf);
+
+	if (xfs_ipincount(ip) > 0 || xfs_buf_ispinned(bp) ||
+	    (ip->i_flags & XFS_ISTALE))
 		return XFS_ITEM_PINNED;
 
-	if (!xfs_ilock_nowait(ip, XFS_ILOCK_SHARED))
-		return XFS_ITEM_LOCKED;
+	/* If the inode is already flush locked, we're already flushing. */
+	if (xfs_isiflocked(ip))
+		return XFS_ITEM_FLUSHING;
 
-	/*
-	 * Re-check the pincount now that we stabilized the value by
-	 * taking the ilock.
-	 */
-	if (xfs_ipincount(ip) > 0) {
-		rval = XFS_ITEM_PINNED;
-		goto out_unlock;
-	}
+	if (!xfs_buf_trylock(bp))
+		return XFS_ITEM_LOCKED;
 
-	/*
-	 * Stale inode items should force out the iclog.
-	 */
-	if (ip->i_flags & XFS_ISTALE) {
-		rval = XFS_ITEM_PINNED;
-		goto out_unlock;
-	}
+	spin_unlock(&lip->li_ailp->ail_lock);
 
 	/*
-	 * Someone else is already flushing the inode.  Nothing we can do
-	 * here but wait for the flush to finish and remove the item from
-	 * the AIL.
+	 * We need to hold a reference for flushing the cluster buffer as it may
+	 * fail the buffer without IO submission. In which case, we better get a
+	 * reference for that completion because otherwise we don't get a
+	 * reference for IO until we queue the buffer for delwri submission.
 	 */
-	if (!xfs_iflock_nowait(ip)) {
-		rval = XFS_ITEM_FLUSHING;
-		goto out_unlock;
-	}
-
-	ASSERT(iip->ili_fields != 0 || XFS_FORCED_SHUTDOWN(ip->i_mount));
-	spin_unlock(&lip->li_ailp->ail_lock);
-
-	error = xfs_iflush(ip, &bp);
+	xfs_buf_hold(bp);
+	error = xfs_iflush_cluster(ip, bp);
 	if (!error) {
 		if (!xfs_buf_delwri_queue(bp, buffer_list))
 			rval = XFS_ITEM_FLUSHING;
 		xfs_buf_relse(bp);
-	} else if (error == -EAGAIN)
+	} else {
 		rval = XFS_ITEM_LOCKED;
+	}
 
 	spin_lock(&lip->li_ailp->ail_lock);
-out_unlock:
-	xfs_iunlock(ip, XFS_ILOCK_SHARED);
 	return rval;
 }
 
-- 
2.26.2.761.g0e0b3e54be



* [PATCH 27/30] xfs: rename xfs_iflush_int()
  2020-06-22  8:15 [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (25 preceding siblings ...)
  2020-06-22  8:16 ` [PATCH 26/30] xfs: xfs_iflush() is no longer necessary Dave Chinner
@ 2020-06-22  8:16 ` Dave Chinner
  2020-06-22  8:16 ` [PATCH 28/30] xfs: rework xfs_iflush_cluster() dirty inode iteration Dave Chinner
                   ` (3 subsequent siblings)
  30 siblings, 0 replies; 43+ messages in thread
From: Dave Chinner @ 2020-06-22  8:16 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

With xfs_iflush() gone, we can rename xfs_iflush_int() back to
xfs_iflush(). Also move it up above xfs_iflush_cluster() so we don't
need the forward definition any more.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_inode.c | 293 ++++++++++++++++++++++-----------------------
 1 file changed, 146 insertions(+), 147 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 74d09487f702..b03cc784f73e 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -44,7 +44,6 @@ kmem_zone_t *xfs_inode_zone;
  */
 #define	XFS_ITRUNC_MAX_EXTENTS	2
 
-STATIC int xfs_iflush_int(struct xfs_inode *, struct xfs_buf *);
 STATIC int xfs_iunlink(struct xfs_trans *, struct xfs_inode *);
 STATIC int xfs_iunlink_remove(struct xfs_trans *, struct xfs_inode *);
 
@@ -3450,152 +3449,8 @@ xfs_rename(
 	return error;
 }
 
-/*
- * Non-blocking flush of dirty inode metadata into the backing buffer.
- *
- * The caller must have a reference to the inode and hold the cluster buffer
- * locked. The function will walk across all the inodes on the cluster buffer it
- * can find and lock without blocking, and flush them to the cluster buffer.
- *
- * On success, the caller must write out the buffer returned in *bp and
- * release it. On failure, the filesystem will be shut down, the buffer will
- * have been unlocked and released, and EFSCORRUPTED will be returned.
- */
-int
-xfs_iflush_cluster(
-	struct xfs_inode	*ip,
-	struct xfs_buf		*bp)
-{
-	struct xfs_mount	*mp = ip->i_mount;
-	struct xfs_perag	*pag;
-	unsigned long		first_index, mask;
-	int			cilist_size;
-	struct xfs_inode	**cilist;
-	struct xfs_inode	*cip;
-	struct xfs_ino_geometry	*igeo = M_IGEO(mp);
-	int			error = 0;
-	int			nr_found;
-	int			clcount = 0;
-	int			i;
-
-	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
-
-	cilist_size = igeo->inodes_per_cluster * sizeof(struct xfs_inode *);
-	cilist = kmem_alloc(cilist_size, KM_MAYFAIL|KM_NOFS);
-	if (!cilist)
-		goto out_put;
-
-	mask = ~(igeo->inodes_per_cluster - 1);
-	first_index = XFS_INO_TO_AGINO(mp, ip->i_ino) & mask;
-	rcu_read_lock();
-	/* really need a gang lookup range call here */
-	nr_found = radix_tree_gang_lookup(&pag->pag_ici_root, (void**)cilist,
-					first_index, igeo->inodes_per_cluster);
-	if (nr_found == 0)
-		goto out_free;
-
-	for (i = 0; i < nr_found; i++) {
-		cip = cilist[i];
-
-		/*
-		 * because this is an RCU protected lookup, we could find a
-		 * recently freed or even reallocated inode during the lookup.
-		 * We need to check under the i_flags_lock for a valid inode
-		 * here. Skip it if it is not valid or the wrong inode.
-		 */
-		spin_lock(&cip->i_flags_lock);
-		if (!cip->i_ino ||
-		    __xfs_iflags_test(cip, XFS_ISTALE)) {
-			spin_unlock(&cip->i_flags_lock);
-			continue;
-		}
-
-		/*
-		 * Once we fall off the end of the cluster, no point checking
-		 * any more inodes in the list because they will also all be
-		 * outside the cluster.
-		 */
-		if ((XFS_INO_TO_AGINO(mp, cip->i_ino) & mask) != first_index) {
-			spin_unlock(&cip->i_flags_lock);
-			break;
-		}
-		spin_unlock(&cip->i_flags_lock);
-
-		/*
-		 * Do an un-protected check to see if the inode is dirty and
-		 * is a candidate for flushing.  These checks will be repeated
-		 * later after the appropriate locks are acquired.
-		 */
-		if (xfs_inode_clean(cip) && xfs_ipincount(cip) == 0)
-			continue;
-
-		/*
-		 * Try to get locks.  If any are unavailable or it is pinned,
-		 * then this inode cannot be flushed and is skipped.
-		 */
-
-		if (!xfs_ilock_nowait(cip, XFS_ILOCK_SHARED))
-			continue;
-		if (!xfs_iflock_nowait(cip)) {
-			xfs_iunlock(cip, XFS_ILOCK_SHARED);
-			continue;
-		}
-		if (xfs_ipincount(cip)) {
-			xfs_ifunlock(cip);
-			xfs_iunlock(cip, XFS_ILOCK_SHARED);
-			continue;
-		}
-
-
-		/*
-		 * Check the inode number again, just to be certain we are not
-		 * racing with freeing in xfs_reclaim_inode(). See the comments
-		 * in that function for more information as to why the initial
-		 * check is not sufficient.
-		 */
-		if (!cip->i_ino) {
-			xfs_ifunlock(cip);
-			xfs_iunlock(cip, XFS_ILOCK_SHARED);
-			continue;
-		}
-
-		/*
-		 * arriving here means that this inode can be flushed.  First
-		 * re-check that it's dirty before flushing.
-		 */
-		if (!xfs_inode_clean(cip)) {
-			error = xfs_iflush_int(cip, bp);
-			if (error) {
-				xfs_iunlock(cip, XFS_ILOCK_SHARED);
-				goto out_free;
-			}
-			clcount++;
-		} else {
-			xfs_ifunlock(cip);
-		}
-		xfs_iunlock(cip, XFS_ILOCK_SHARED);
-	}
-
-	if (clcount) {
-		XFS_STATS_INC(mp, xs_icluster_flushcnt);
-		XFS_STATS_ADD(mp, xs_icluster_flushinode, clcount);
-	}
-
-out_free:
-	rcu_read_unlock();
-	kmem_free(cilist);
-out_put:
-	xfs_perag_put(pag);
-	if (error) {
-		bp->b_flags |= XBF_ASYNC;
-		xfs_buf_ioend_fail(bp);
-		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
-	}
-	return error;
-}
-
-STATIC int
-xfs_iflush_int(
+static int
+xfs_iflush(
 	struct xfs_inode	*ip,
 	struct xfs_buf		*bp)
 {
@@ -3741,6 +3596,150 @@ xfs_iflush_int(
 	return error;
 }
 
+/*
+ * Non-blocking flush of dirty inode metadata into the backing buffer.
+ *
+ * The caller must have a reference to the inode and hold the cluster buffer
+ * locked. The function will walk across all the inodes on the cluster buffer it
+ * can find and lock without blocking, and flush them to the cluster buffer.
+ *
+ * On success, the caller must write out the buffer returned in *bp and
+ * release it. On failure, the filesystem will be shut down, the buffer will
+ * have been unlocked and released, and EFSCORRUPTED will be returned.
+ */
+int
+xfs_iflush_cluster(
+	struct xfs_inode	*ip,
+	struct xfs_buf		*bp)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_perag	*pag;
+	unsigned long		first_index, mask;
+	int			cilist_size;
+	struct xfs_inode	**cilist;
+	struct xfs_inode	*cip;
+	struct xfs_ino_geometry	*igeo = M_IGEO(mp);
+	int			error = 0;
+	int			nr_found;
+	int			clcount = 0;
+	int			i;
+
+	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
+
+	cilist_size = igeo->inodes_per_cluster * sizeof(struct xfs_inode *);
+	cilist = kmem_alloc(cilist_size, KM_MAYFAIL|KM_NOFS);
+	if (!cilist)
+		goto out_put;
+
+	mask = ~(igeo->inodes_per_cluster - 1);
+	first_index = XFS_INO_TO_AGINO(mp, ip->i_ino) & mask;
+	rcu_read_lock();
+	/* really need a gang lookup range call here */
+	nr_found = radix_tree_gang_lookup(&pag->pag_ici_root, (void**)cilist,
+					first_index, igeo->inodes_per_cluster);
+	if (nr_found == 0)
+		goto out_free;
+
+	for (i = 0; i < nr_found; i++) {
+		cip = cilist[i];
+
+		/*
+		 * because this is an RCU protected lookup, we could find a
+		 * recently freed or even reallocated inode during the lookup.
+		 * We need to check under the i_flags_lock for a valid inode
+		 * here. Skip it if it is not valid or the wrong inode.
+		 */
+		spin_lock(&cip->i_flags_lock);
+		if (!cip->i_ino ||
+		    __xfs_iflags_test(cip, XFS_ISTALE)) {
+			spin_unlock(&cip->i_flags_lock);
+			continue;
+		}
+
+		/*
+		 * Once we fall off the end of the cluster, no point checking
+		 * any more inodes in the list because they will also all be
+		 * outside the cluster.
+		 */
+		if ((XFS_INO_TO_AGINO(mp, cip->i_ino) & mask) != first_index) {
+			spin_unlock(&cip->i_flags_lock);
+			break;
+		}
+		spin_unlock(&cip->i_flags_lock);
+
+		/*
+		 * Do an un-protected check to see if the inode is dirty and
+		 * is a candidate for flushing.  These checks will be repeated
+		 * later after the appropriate locks are acquired.
+		 */
+		if (xfs_inode_clean(cip) && xfs_ipincount(cip) == 0)
+			continue;
+
+		/*
+		 * Try to get locks.  If any are unavailable or it is pinned,
+		 * then this inode cannot be flushed and is skipped.
+		 */
+
+		if (!xfs_ilock_nowait(cip, XFS_ILOCK_SHARED))
+			continue;
+		if (!xfs_iflock_nowait(cip)) {
+			xfs_iunlock(cip, XFS_ILOCK_SHARED);
+			continue;
+		}
+		if (xfs_ipincount(cip)) {
+			xfs_ifunlock(cip);
+			xfs_iunlock(cip, XFS_ILOCK_SHARED);
+			continue;
+		}
+
+
+		/*
+		 * Check the inode number again, just to be certain we are not
+		 * racing with freeing in xfs_reclaim_inode(). See the comments
+		 * in that function for more information as to why the initial
+		 * check is not sufficient.
+		 */
+		if (!cip->i_ino) {
+			xfs_ifunlock(cip);
+			xfs_iunlock(cip, XFS_ILOCK_SHARED);
+			continue;
+		}
+
+		/*
+		 * arriving here means that this inode can be flushed.  First
+		 * re-check that it's dirty before flushing.
+		 */
+		if (!xfs_inode_clean(cip)) {
+			error = xfs_iflush(cip, bp);
+			if (error) {
+				xfs_iunlock(cip, XFS_ILOCK_SHARED);
+				goto out_free;
+			}
+			clcount++;
+		} else {
+			xfs_ifunlock(cip);
+		}
+		xfs_iunlock(cip, XFS_ILOCK_SHARED);
+	}
+
+	if (clcount) {
+		XFS_STATS_INC(mp, xs_icluster_flushcnt);
+		XFS_STATS_ADD(mp, xs_icluster_flushinode, clcount);
+	}
+
+out_free:
+	rcu_read_unlock();
+	kmem_free(cilist);
+out_put:
+	xfs_perag_put(pag);
+	if (error) {
+		bp->b_flags |= XBF_ASYNC;
+		xfs_buf_ioend_fail(bp);
+		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
+	}
+	return error;
+}
+
 /* Release an inode. */
 void
 xfs_irele(
-- 
2.26.2.761.g0e0b3e54be



* [PATCH 28/30] xfs: rework xfs_iflush_cluster() dirty inode iteration
  2020-06-22  8:15 [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (26 preceding siblings ...)
  2020-06-22  8:16 ` [PATCH 27/30] xfs: rename xfs_iflush_int() Dave Chinner
@ 2020-06-22  8:16 ` Dave Chinner
  2020-06-22  8:16 ` [PATCH 29/30] xfs: factor xfs_iflush_done Dave Chinner
                   ` (2 subsequent siblings)
  30 siblings, 0 replies; 43+ messages in thread
From: Dave Chinner @ 2020-06-22  8:16 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Now that we have all the dirty inodes attached to the cluster
buffer, we don't actually have to do radix tree lookups to find
them. Sure, the radix tree is efficient, but walking a linked list
of just the dirty inodes attached to the buffer is much better.

We are also no longer dependent on having a locked inode passed into
the function to determine where to start the lookup. This means we
can drop it from the function call and treat all inodes the same.

We also make xfs_iflush_cluster skip inodes marked with
XFS_IRECLAIM. This way we avoid races with inodes that reclaim is
actively referencing or that are being re-initialised by inode lookup. If
they are actually dirty, they'll get written by a future cluster
flush....

We also add a shutdown check after obtaining the flush lock so that
we catch inodes that are dirty in memory and may have inconsistent
state due to the shutdown in progress. We abort these inodes
directly, so they remove themselves from the buffer list and the
AIL rather than having to wait for the buffer to be failed and the
completion callbacks to run.
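
The core of the new iteration then looks roughly like this (a sketch
condensed from the xfs_iflush_cluster() loop below; the pin count,
clean inode and shutdown abort handling is left out of this outline):

	list_for_each_entry_safe(lip, n, &bp->b_li_list, li_bio_list) {
		ip = ((struct xfs_inode_log_item *)lip)->ili_inode;

		/* take the ILOCK under the i_flags_lock so reclaim
		 * cannot free the inode underneath us */
		spin_lock(&ip->i_flags_lock);
		if (__xfs_iflags_test(ip, XFS_IRECLAIM | XFS_IFLOCK) ||
		    !xfs_ilock_nowait(ip, XFS_ILOCK_SHARED)) {
			spin_unlock(&ip->i_flags_lock);
			continue;
		}
		spin_unlock(&ip->i_flags_lock);

		/* already flush locked means already in the buffer */
		if (!xfs_iflock_nowait(ip)) {
			xfs_iunlock(ip, XFS_ILOCK_SHARED);
			continue;
		}

		error = xfs_iflush(ip, bp);
		xfs_iunlock(ip, XFS_ILOCK_SHARED);
		if (error)
			break;
	}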

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_inode.c      | 171 ++++++++++++++++++----------------------
 fs/xfs/xfs_inode.h      |   2 +-
 fs/xfs/xfs_inode_item.c |   8 +-
 3 files changed, 83 insertions(+), 98 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index b03cc784f73e..05017475c766 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3603,141 +3603,120 @@ xfs_iflush(
  * locked. The function will walk across all the inodes on the cluster buffer it
  * can find and lock without blocking, and flush them to the cluster buffer.
  *
- * On success, the caller must write out the buffer returned in *bp and
- * release it. On failure, the filesystem will be shut down, the buffer will
- * have been unlocked and released, and EFSCORRUPTED will be returned.
+ * On successful flushing of at least one inode, the caller must write out the
+ * buffer and release it. If no inodes are flushed, -EAGAIN will be returned and
+ * the caller needs to release the buffer. On failure, the filesystem will be
+ * shut down, the buffer will have been unlocked and released, and EFSCORRUPTED
+ * will be returned.
  */
 int
 xfs_iflush_cluster(
-	struct xfs_inode	*ip,
 	struct xfs_buf		*bp)
 {
-	struct xfs_mount	*mp = ip->i_mount;
-	struct xfs_perag	*pag;
-	unsigned long		first_index, mask;
-	int			cilist_size;
-	struct xfs_inode	**cilist;
-	struct xfs_inode	*cip;
-	struct xfs_ino_geometry	*igeo = M_IGEO(mp);
-	int			error = 0;
-	int			nr_found;
+	struct xfs_mount	*mp = bp->b_mount;
+	struct xfs_log_item	*lip, *n;
+	struct xfs_inode	*ip;
+	struct xfs_inode_log_item *iip;
 	int			clcount = 0;
-	int			i;
-
-	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
-
-	cilist_size = igeo->inodes_per_cluster * sizeof(struct xfs_inode *);
-	cilist = kmem_alloc(cilist_size, KM_MAYFAIL|KM_NOFS);
-	if (!cilist)
-		goto out_put;
-
-	mask = ~(igeo->inodes_per_cluster - 1);
-	first_index = XFS_INO_TO_AGINO(mp, ip->i_ino) & mask;
-	rcu_read_lock();
-	/* really need a gang lookup range call here */
-	nr_found = radix_tree_gang_lookup(&pag->pag_ici_root, (void**)cilist,
-					first_index, igeo->inodes_per_cluster);
-	if (nr_found == 0)
-		goto out_free;
+	int			error = 0;
 
-	for (i = 0; i < nr_found; i++) {
-		cip = cilist[i];
+	/*
+	 * We must use the safe variant here as on shutdown xfs_iflush_abort()
+	 * can remove itself from the list.
+	 */
+	list_for_each_entry_safe(lip, n, &bp->b_li_list, li_bio_list) {
+		iip = (struct xfs_inode_log_item *)lip;
+		ip = iip->ili_inode;
 
 		/*
-		 * because this is an RCU protected lookup, we could find a
-		 * recently freed or even reallocated inode during the lookup.
-		 * We need to check under the i_flags_lock for a valid inode
-		 * here. Skip it if it is not valid or the wrong inode.
+		 * Quick and dirty check to avoid locks if possible.
 		 */
-		spin_lock(&cip->i_flags_lock);
-		if (!cip->i_ino ||
-		    __xfs_iflags_test(cip, XFS_ISTALE)) {
-			spin_unlock(&cip->i_flags_lock);
+		if (__xfs_iflags_test(ip, XFS_IRECLAIM | XFS_IFLOCK))
+			continue;
+		if (xfs_ipincount(ip))
 			continue;
-		}
 
 		/*
-		 * Once we fall off the end of the cluster, no point checking
-		 * any more inodes in the list because they will also all be
-		 * outside the cluster.
+		 * The inode is still attached to the buffer, which means it is
+		 * dirty but reclaim might try to grab it. Check carefully for
+		 * that, and grab the ilock while still holding the i_flags_lock
+		 * to guarantee reclaim will not be able to reclaim this inode
+		 * once we drop the i_flags_lock.
 		 */
-		if ((XFS_INO_TO_AGINO(mp, cip->i_ino) & mask) != first_index) {
-			spin_unlock(&cip->i_flags_lock);
-			break;
+		spin_lock(&ip->i_flags_lock);
+		ASSERT(!__xfs_iflags_test(ip, XFS_ISTALE));
+		if (__xfs_iflags_test(ip, XFS_IRECLAIM | XFS_IFLOCK)) {
+			spin_unlock(&ip->i_flags_lock);
+			continue;
 		}
-		spin_unlock(&cip->i_flags_lock);
 
 		/*
-		 * Do an un-protected check to see if the inode is dirty and
-		 * is a candidate for flushing.  These checks will be repeated
-		 * later after the appropriate locks are acquired.
+		 * ILOCK will pin the inode against reclaim and prevent
+		 * concurrent transactions modifying the inode while we are
+		 * flushing the inode.
 		 */
-		if (xfs_inode_clean(cip) && xfs_ipincount(cip) == 0)
+		if (!xfs_ilock_nowait(ip, XFS_ILOCK_SHARED)) {
+			spin_unlock(&ip->i_flags_lock);
 			continue;
+		}
+		spin_unlock(&ip->i_flags_lock);
 
 		/*
-		 * Try to get locks.  If any are unavailable or it is pinned,
-		 * then this inode cannot be flushed and is skipped.
+		 * Skip inodes that are already flush locked as they have
+		 * already been written to the buffer.
 		 */
-
-		if (!xfs_ilock_nowait(cip, XFS_ILOCK_SHARED))
-			continue;
-		if (!xfs_iflock_nowait(cip)) {
-			xfs_iunlock(cip, XFS_ILOCK_SHARED);
+		if (!xfs_iflock_nowait(ip)) {
+			xfs_iunlock(ip, XFS_ILOCK_SHARED);
 			continue;
 		}
-		if (xfs_ipincount(cip)) {
-			xfs_ifunlock(cip);
-			xfs_iunlock(cip, XFS_ILOCK_SHARED);
-			continue;
-		}
-
 
 		/*
-		 * Check the inode number again, just to be certain we are not
-		 * racing with freeing in xfs_reclaim_inode(). See the comments
-		 * in that function for more information as to why the initial
-		 * check is not sufficient.
+		 * Abort flushing this inode if we are shut down because the
+		 * inode may not currently be in the AIL. This can occur when
+		 * log I/O failure unpins the inode without inserting into the
+		 * AIL, leaving a dirty/unpinned inode attached to the buffer
+		 * that otherwise looks like it should be flushed.
 		 */
-		if (!cip->i_ino) {
-			xfs_ifunlock(cip);
-			xfs_iunlock(cip, XFS_ILOCK_SHARED);
+		if (XFS_FORCED_SHUTDOWN(mp)) {
+			xfs_iunpin_wait(ip);
+			/* xfs_iflush_abort() drops the flush lock */
+			xfs_iflush_abort(ip);
+			xfs_iunlock(ip, XFS_ILOCK_SHARED);
+			error = -EIO;
 			continue;
 		}
 
-		/*
-		 * arriving here means that this inode can be flushed.  First
-		 * re-check that it's dirty before flushing.
-		 */
-		if (!xfs_inode_clean(cip)) {
-			error = xfs_iflush(cip, bp);
-			if (error) {
-				xfs_iunlock(cip, XFS_ILOCK_SHARED);
-				goto out_free;
-			}
-			clcount++;
-		} else {
-			xfs_ifunlock(cip);
+		/* don't block waiting on a log force to unpin dirty inodes */
+		if (xfs_ipincount(ip)) {
+			xfs_ifunlock(ip);
+			xfs_iunlock(ip, XFS_ILOCK_SHARED);
+			continue;
 		}
-		xfs_iunlock(cip, XFS_ILOCK_SHARED);
-	}
 
-	if (clcount) {
-		XFS_STATS_INC(mp, xs_icluster_flushcnt);
-		XFS_STATS_ADD(mp, xs_icluster_flushinode, clcount);
+		if (!xfs_inode_clean(ip))
+			error = xfs_iflush(ip, bp);
+		else
+			xfs_ifunlock(ip);
+		xfs_iunlock(ip, XFS_ILOCK_SHARED);
+		if (error)
+			break;
+		clcount++;
 	}
 
-out_free:
-	rcu_read_unlock();
-	kmem_free(cilist);
-out_put:
-	xfs_perag_put(pag);
 	if (error) {
 		bp->b_flags |= XBF_ASYNC;
 		xfs_buf_ioend_fail(bp);
 		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
+		return error;
 	}
-	return error;
+
+	if (!clcount)
+		return -EAGAIN;
+
+	XFS_STATS_INC(mp, xs_icluster_flushcnt);
+	XFS_STATS_ADD(mp, xs_icluster_flushinode, clcount);
+	return 0;
+
 }
 
 /* Release an inode. */
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 9ae1b6a830e1..7a8adb76c17f 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -426,7 +426,7 @@ int		xfs_log_force_inode(struct xfs_inode *ip);
 void		xfs_iunpin_wait(xfs_inode_t *);
 #define xfs_ipincount(ip)	((unsigned int) atomic_read(&ip->i_pincount))
 
-int		xfs_iflush_cluster(struct xfs_inode *, struct xfs_buf *);
+int		xfs_iflush_cluster(struct xfs_buf *);
 void		xfs_lock_two_inodes(struct xfs_inode *ip0, uint ip0_mode,
 				struct xfs_inode *ip1, uint ip1_mode);
 
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index e8eda2ac25fb..4e7fce8d4f7c 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -507,12 +507,18 @@ xfs_inode_item_push(
 	 * reference for IO until we queue the buffer for delwri submission.
 	 */
 	xfs_buf_hold(bp);
-	error = xfs_iflush_cluster(ip, bp);
+	error = xfs_iflush_cluster(bp);
 	if (!error) {
 		if (!xfs_buf_delwri_queue(bp, buffer_list))
 			rval = XFS_ITEM_FLUSHING;
 		xfs_buf_relse(bp);
 	} else {
+		/*
+		 * Release the buffer if we were unable to flush anything. On
+		 * any other error, the buffer has already been released.
+		 */
+		if (error == -EAGAIN)
+			xfs_buf_relse(bp);
 		rval = XFS_ITEM_LOCKED;
 	}
 
-- 
2.26.2.761.g0e0b3e54be



* [PATCH 29/30] xfs: factor xfs_iflush_done
  2020-06-22  8:15 [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (27 preceding siblings ...)
  2020-06-22  8:16 ` [PATCH 28/30] xfs: rework xfs_iflush_cluster() dirty inode iteration Dave Chinner
@ 2020-06-22  8:16 ` Dave Chinner
  2020-06-22 22:16   ` [PATCH 29/30 V2] " Dave Chinner
  2020-06-22  8:16 ` [PATCH 30/30] xfs: remove xfs_inobp_check() Dave Chinner
  2020-06-29 23:01 ` [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous Darrick J. Wong
  30 siblings, 1 reply; 43+ messages in thread
From: Dave Chinner @ 2020-06-22  8:16 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

xfs_iflush_done() performs three distinct operations on the inodes
attached to the buffer. Separate these operations out into functions
so that it is easier to modify them independently in future.
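
The new shape is roughly the sketch below, where ail_updates and
flushed_inodes are the two local LIST_HEAD()s from the real code and
needs_ail_removal() is a hypothetical stand-in for the unlocked flush
lsn/XFS_LI_FAILED check (the stale and unflushed inode skips are also
omitted):

	list_for_each_entry_safe(lip, n, &bp->b_li_list, li_bio_list) {
		if (needs_ail_removal(lip))	/* hypothetical helper */
			list_move_tail(&lip->li_bio_list, &ail_updates);
		else
			list_move_tail(&lip->li_bio_list, &flushed_inodes);
	}

	if (!list_empty(&ail_updates)) {
		xfs_iflush_ail_updates(bp->b_mount->m_ail, &ail_updates);
		list_splice_tail(&ail_updates, &flushed_inodes);
	}
	xfs_iflush_finish(bp, &flushed_inodes);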

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_inode_item.c | 155 +++++++++++++++++++++-------------------
 1 file changed, 81 insertions(+), 74 deletions(-)

diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 4e7fce8d4f7c..90a386dc0965 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -641,101 +641,63 @@ xfs_inode_item_destroy(
 
 
 /*
- * This is the inode flushing I/O completion routine.  It is called
- * from interrupt level when the buffer containing the inode is
- * flushed to disk.  It is responsible for removing the inode item
- * from the AIL if it has not been re-logged, and unlocking the inode's
- * flush lock.
- *
- * To reduce AIL lock traffic as much as possible, we scan the buffer log item
- * list for other inodes that will run this function. We remove them from the
- * buffer list so we can process all the inode IO completions in one AIL lock
- * traversal.
- *
- * Note: Now that we attach the log item to the buffer when we first log the
- * inode in memory, we can have unflushed inodes on the buffer list here. These
- * inodes will have a zero ili_last_fields, so skip over them here.
+ * We only want to pull the item from the AIL if it is actually there
+ * and its location in the log has not changed since we started the
+ * flush.  Thus, we only bother if the inode's lsn has not changed.
  */
 void
-xfs_iflush_done(
-	struct xfs_buf		*bp)
+xfs_iflush_ail_updates(
+	struct xfs_ail		*ailp,
+	struct list_head	*list)
 {
-	struct xfs_inode_log_item *iip;
-	struct xfs_log_item	*lip, *n;
-	struct xfs_ail		*ailp = bp->b_mount->m_ail;
-	int			need_ail = 0;
-	LIST_HEAD(tmp);
+	struct xfs_log_item	*lip;
+	xfs_lsn_t		tail_lsn = 0;
 
-	/*
-	 * Pull the attached inodes from the buffer one at a time and take the
-	 * appropriate action on them.
-	 */
-	list_for_each_entry_safe(lip, n, &bp->b_li_list, li_bio_list) {
-		iip = INODE_ITEM(lip);
-
-		if (xfs_iflags_test(iip->ili_inode, XFS_ISTALE)) {
-			xfs_iflush_abort(iip->ili_inode);
-			continue;
-		}
+	/* this is an opencoded batch version of xfs_trans_ail_delete */
+	spin_lock(&ailp->ail_lock);
+	list_for_each_entry(lip, list, li_bio_list) {
+		xfs_lsn_t	lsn;
 
-		if (!iip->ili_last_fields)
+		clear_bit(XFS_LI_FAILED, &lip->li_flags);
+		if (INODE_ITEM(lip)->ili_flush_lsn != lip->li_lsn)
 			continue;
 
-		list_move_tail(&lip->li_bio_list, &tmp);
-
-		/* Do an unlocked check for needing the AIL lock. */
-		if (iip->ili_flush_lsn == lip->li_lsn ||
-		    test_bit(XFS_LI_FAILED, &lip->li_flags))
-			need_ail++;
+		lsn = xfs_ail_delete_one(ailp, lip);
+		if (!tail_lsn && lsn)
+			tail_lsn = lsn;
 	}
+	xfs_ail_update_finish(ailp, tail_lsn);
+}
 
-	/*
-	 * We only want to pull the item from the AIL if it is actually there
-	 * and its location in the log has not changed since we started the
-	 * flush.  Thus, we only bother if the inode's lsn has not changed.
-	 */
-	if (need_ail) {
-		xfs_lsn_t	tail_lsn = 0;
-
-		/* this is an opencoded batch version of xfs_trans_ail_delete */
-		spin_lock(&ailp->ail_lock);
-		list_for_each_entry(lip, &tmp, li_bio_list) {
-			clear_bit(XFS_LI_FAILED, &lip->li_flags);
-			if (lip->li_lsn == INODE_ITEM(lip)->ili_flush_lsn) {
-				xfs_lsn_t lsn = xfs_ail_delete_one(ailp, lip);
-				if (!tail_lsn && lsn)
-					tail_lsn = lsn;
-			}
-		}
-		xfs_ail_update_finish(ailp, tail_lsn);
-	}
+/*
+ * Walk the list of inodes that have completed their IOs. If they are clean
+ * remove them from the list and dissociate them from the buffer. Buffers that
+ * are still dirty remain linked to the buffer and on the list. Caller must
+ * handle them appropriately.
+ */
+void
+xfs_iflush_finish(
+	struct xfs_buf		*bp,
+	struct list_head	*list)
+{
+	struct xfs_log_item	*lip, *n;
 
-	/*
-	 * Clean up and unlock the flush lock now we are done. We can clear the
-	 * ili_last_fields bits now that we know that the data corresponding to
-	 * them is safely on disk.
-	 */
-	list_for_each_entry_safe(lip, n, &tmp, li_bio_list) {
+	list_for_each_entry_safe(lip, n, list, li_bio_list) {
+		struct xfs_inode_log_item *iip = INODE_ITEM(lip);
 		bool	drop_buffer = false;
 
-		list_del_init(&lip->li_bio_list);
-		iip = INODE_ITEM(lip);
-
 		spin_lock(&iip->ili_lock);
 
 		/*
 		 * Remove the reference to the cluster buffer if the inode is
-		 * clean in memory. Drop the buffer reference once we've dropped
-		 * the locks we hold. If the inode is dirty in memory, we need
-		 * to put the inode item back on the buffer list for another
-		 * pass through the flush machinery.
+		 * clean in memory and drop the buffer reference once we've
+		 * dropped the locks we hold.
 		 */
 		ASSERT(iip->ili_item.li_buf == bp);
 		if (!iip->ili_fields) {
 			iip->ili_item.li_buf = NULL;
+			list_del_init(&lip->li_bio_list);
 			drop_buffer = true;
-		} else {
-			list_add(&lip->li_bio_list, &bp->b_li_list);
 		}
 		iip->ili_last_fields = 0;
 		iip->ili_flush_lsn = 0;
@@ -746,6 +708,51 @@ xfs_iflush_done(
 	}
 }
 
+/*
+ * Inode buffer IO completion routine.  It is responsible for removing inodes
+ * attached to the buffer from the AIL if they have not been re-logged, as well
+ * as completing the flush and unlocking the inode.
+ */
+void
+xfs_iflush_done(
+	struct xfs_buf		*bp)
+{
+	struct xfs_log_item	*lip, *n;
+	LIST_HEAD(flushed_inodes);
+	LIST_HEAD(ail_updates);
+
+	/*
+	 * Pull the attached inodes from the buffer one at a time and take the
+	 * appropriate action on them.
+	 */
+	list_for_each_entry_safe(lip, n, &bp->b_li_list, li_bio_list) {
+		struct xfs_inode_log_item *iip = INODE_ITEM(lip);
+
+		if (xfs_iflags_test(iip->ili_inode, XFS_ISTALE)) {
+			xfs_iflush_abort(iip->ili_inode);
+			continue;
+		}
+		if (!iip->ili_last_fields)
+			continue;
+
+		/* Do an unlocked check for needing the AIL lock. */
+		if (iip->ili_flush_lsn == lip->li_lsn ||
+		    test_bit(XFS_LI_FAILED, &lip->li_flags))
+			list_move_tail(&lip->li_bio_list, &ail_updates);
+		else
+			list_move_tail(&lip->li_bio_list, &flushed_inodes);
+	}
+
+	if (!list_empty(&ail_updates)) {
+		xfs_iflush_ail_updates(bp->b_mount->m_ail, &ail_updates);
+		list_splice_tail(&ail_updates, &flushed_inodes);
+	}
+
+	xfs_iflush_finish(bp, &flushed_inodes);
+	if (!list_empty(&flushed_inodes))
+		list_splice_tail(&flushed_inodes, &bp->b_li_list);
+}
+
 /*
  * This is the inode flushing abort routine.  It is called from xfs_iflush when
  * the filesystem is shutting down to clean up the inode state.  It is
-- 
2.26.2.761.g0e0b3e54be



* [PATCH 30/30] xfs: remove xfs_inobp_check()
  2020-06-22  8:15 [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (28 preceding siblings ...)
  2020-06-22  8:16 ` [PATCH 29/30] xfs: factor xfs_iflush_done Dave Chinner
@ 2020-06-22  8:16 ` Dave Chinner
  2020-06-29 23:01 ` [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous Darrick J. Wong
  30 siblings, 0 replies; 43+ messages in thread
From: Dave Chinner @ 2020-06-22  8:16 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

This debug code is called on every xfs_iflush() call, which then
checks every inode in the buffer for a non-zero unlinked list field.
Hence it checks every inode in the cluster buffer every time a
single inode on that cluster is flushed. This is resulting in:

-   38.91%     5.33%  [kernel]  [k] xfs_iflush
   - 17.70% xfs_iflush
      - 9.93% xfs_inobp_check
           4.36% xfs_buf_offset

10% of the CPU time spent flushing inodes is repeatedly checking
unlinked fields in the buffer. We don't need to do this.

The other place we call xfs_inobp_check() is
xfs_iunlink_update_dinode(), and this is after we've done this
assert for the agino we are about to write into that inode:

	ASSERT(xfs_verify_agino_or_null(mp, agno, next_agino));

which means we've already checked that the agino we are about to
write is not 0 on debug kernels. The inode buffer verifiers do
everything else we need, so let's just remove this debug code.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_inode_buf.c | 24 ------------------------
 fs/xfs/libxfs/xfs_inode_buf.h |  6 ------
 fs/xfs/xfs_inode.c            |  2 --
 3 files changed, 32 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index 1af97235785c..6b6f67595bf4 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -20,30 +20,6 @@
 
 #include <linux/iversion.h>
 
-/*
- * Check that none of the inode's in the buffer have a next
- * unlinked field of 0.
- */
-#if defined(DEBUG)
-void
-xfs_inobp_check(
-	xfs_mount_t	*mp,
-	xfs_buf_t	*bp)
-{
-	int		i;
-	xfs_dinode_t	*dip;
-
-	for (i = 0; i < M_IGEO(mp)->inodes_per_cluster; i++) {
-		dip = xfs_buf_offset(bp, i * mp->m_sb.sb_inodesize);
-		if (!dip->di_next_unlinked)  {
-			xfs_alert(mp,
-	"Detected bogus zero next_unlinked field in inode %d buffer 0x%llx.",
-				i, (long long)bp->b_bn);
-		}
-	}
-}
-#endif
-
 /*
  * If we are doing readahead on an inode buffer, we might be in log recovery
  * reading an inode allocation buffer that hasn't yet been replayed, and hence
diff --git a/fs/xfs/libxfs/xfs_inode_buf.h b/fs/xfs/libxfs/xfs_inode_buf.h
index 865ac493c72a..6b08b9d060c2 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.h
+++ b/fs/xfs/libxfs/xfs_inode_buf.h
@@ -52,12 +52,6 @@ int	xfs_inode_from_disk(struct xfs_inode *ip, struct xfs_dinode *from);
 void	xfs_log_dinode_to_disk(struct xfs_log_dinode *from,
 			       struct xfs_dinode *to);
 
-#if defined(DEBUG)
-void	xfs_inobp_check(struct xfs_mount *, struct xfs_buf *);
-#else
-#define	xfs_inobp_check(mp, bp)
-#endif /* DEBUG */
-
 xfs_failaddr_t xfs_dinode_verify(struct xfs_mount *mp, xfs_ino_t ino,
 			   struct xfs_dinode *dip);
 xfs_failaddr_t xfs_inode_validate_extsize(struct xfs_mount *mp,
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 05017475c766..bae84b3eeb9a 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2165,7 +2165,6 @@ xfs_iunlink_update_dinode(
 	xfs_dinode_calc_crc(mp, dip);
 	xfs_trans_inode_buf(tp, ibp);
 	xfs_trans_log_buf(tp, ibp, offset, offset + sizeof(xfs_agino_t) - 1);
-	xfs_inobp_check(mp, ibp);
 }
 
 /* Set an in-core inode's unlinked pointer and return the old value. */
@@ -3558,7 +3557,6 @@ xfs_iflush(
 	xfs_iflush_fork(ip, dip, iip, XFS_DATA_FORK);
 	if (XFS_IFORK_Q(ip))
 		xfs_iflush_fork(ip, dip, iip, XFS_ATTR_FORK);
-	xfs_inobp_check(mp, bp);
 
 	/*
 	 * We've recorded everything logged in the inode, so we'd like to clear
-- 
2.26.2.761.g0e0b3e54be



* [PATCH 29/30 V2] xfs: factor xfs_iflush_done
  2020-06-22  8:16 ` [PATCH 29/30] xfs: factor xfs_iflush_done Dave Chinner
@ 2020-06-22 22:16   ` Dave Chinner
  0 siblings, 0 replies; 43+ messages in thread
From: Dave Chinner @ 2020-06-22 22:16 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

xfs_iflush_done() performs three distinct operations on the inodes
attached to the buffer. Separate these operations out into functions
so that it is easier to modify them independently in future.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
---
V2 - fix static function warnings from kbuild robot

 fs/xfs/xfs_inode_item.c | 157 +++++++++++++++++++++++++-----------------------
 1 file changed, 82 insertions(+), 75 deletions(-)

diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 4e7fce8d4f7c..3840117f8a5e 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -641,101 +641,63 @@ xfs_inode_item_destroy(
 
 
 /*
- * This is the inode flushing I/O completion routine.  It is called
- * from interrupt level when the buffer containing the inode is
- * flushed to disk.  It is responsible for removing the inode item
- * from the AIL if it has not been re-logged, and unlocking the inode's
- * flush lock.
- *
- * To reduce AIL lock traffic as much as possible, we scan the buffer log item
- * list for other inodes that will run this function. We remove them from the
- * buffer list so we can process all the inode IO completions in one AIL lock
- * traversal.
- *
- * Note: Now that we attach the log item to the buffer when we first log the
- * inode in memory, we can have unflushed inodes on the buffer list here. These
- * inodes will have a zero ili_last_fields, so skip over them here.
+ * We only want to pull the item from the AIL if it is actually there
+ * and its location in the log has not changed since we started the
+ * flush.  Thus, we only bother if the inode's lsn has not changed.
  */
-void
-xfs_iflush_done(
-	struct xfs_buf		*bp)
+static void
+xfs_iflush_ail_updates(
+	struct xfs_ail		*ailp,
+	struct list_head	*list)
 {
-	struct xfs_inode_log_item *iip;
-	struct xfs_log_item	*lip, *n;
-	struct xfs_ail		*ailp = bp->b_mount->m_ail;
-	int			need_ail = 0;
-	LIST_HEAD(tmp);
+	struct xfs_log_item	*lip;
+	xfs_lsn_t		tail_lsn = 0;
 
-	/*
-	 * Pull the attached inodes from the buffer one at a time and take the
-	 * appropriate action on them.
-	 */
-	list_for_each_entry_safe(lip, n, &bp->b_li_list, li_bio_list) {
-		iip = INODE_ITEM(lip);
-
-		if (xfs_iflags_test(iip->ili_inode, XFS_ISTALE)) {
-			xfs_iflush_abort(iip->ili_inode);
-			continue;
-		}
+	/* this is an opencoded batch version of xfs_trans_ail_delete */
+	spin_lock(&ailp->ail_lock);
+	list_for_each_entry(lip, list, li_bio_list) {
+		xfs_lsn_t	lsn;
 
-		if (!iip->ili_last_fields)
+		clear_bit(XFS_LI_FAILED, &lip->li_flags);
+		if (INODE_ITEM(lip)->ili_flush_lsn != lip->li_lsn)
 			continue;
 
-		list_move_tail(&lip->li_bio_list, &tmp);
-
-		/* Do an unlocked check for needing the AIL lock. */
-		if (iip->ili_flush_lsn == lip->li_lsn ||
-		    test_bit(XFS_LI_FAILED, &lip->li_flags))
-			need_ail++;
+		lsn = xfs_ail_delete_one(ailp, lip);
+		if (!tail_lsn && lsn)
+			tail_lsn = lsn;
 	}
+	xfs_ail_update_finish(ailp, tail_lsn);
+}
 
-	/*
-	 * We only want to pull the item from the AIL if it is actually there
-	 * and its location in the log has not changed since we started the
-	 * flush.  Thus, we only bother if the inode's lsn has not changed.
-	 */
-	if (need_ail) {
-		xfs_lsn_t	tail_lsn = 0;
-
-		/* this is an opencoded batch version of xfs_trans_ail_delete */
-		spin_lock(&ailp->ail_lock);
-		list_for_each_entry(lip, &tmp, li_bio_list) {
-			clear_bit(XFS_LI_FAILED, &lip->li_flags);
-			if (lip->li_lsn == INODE_ITEM(lip)->ili_flush_lsn) {
-				xfs_lsn_t lsn = xfs_ail_delete_one(ailp, lip);
-				if (!tail_lsn && lsn)
-					tail_lsn = lsn;
-			}
-		}
-		xfs_ail_update_finish(ailp, tail_lsn);
-	}
+/*
+ * Walk the list of inodes that have completed their IOs. If they are clean
+ * remove them from the list and dissociate them from the buffer. Buffers that
+ * are still dirty remain linked to the buffer and on the list. Caller must
+ * handle them appropriately.
+ */
+static void
+xfs_iflush_finish(
+	struct xfs_buf		*bp,
+	struct list_head	*list)
+{
+	struct xfs_log_item	*lip, *n;
 
-	/*
-	 * Clean up and unlock the flush lock now we are done. We can clear the
-	 * ili_last_fields bits now that we know that the data corresponding to
-	 * them is safely on disk.
-	 */
-	list_for_each_entry_safe(lip, n, &tmp, li_bio_list) {
+	list_for_each_entry_safe(lip, n, list, li_bio_list) {
+		struct xfs_inode_log_item *iip = INODE_ITEM(lip);
 		bool	drop_buffer = false;
 
-		list_del_init(&lip->li_bio_list);
-		iip = INODE_ITEM(lip);
-
 		spin_lock(&iip->ili_lock);
 
 		/*
 		 * Remove the reference to the cluster buffer if the inode is
-		 * clean in memory. Drop the buffer reference once we've dropped
-		 * the locks we hold. If the inode is dirty in memory, we need
-		 * to put the inode item back on the buffer list for another
-		 * pass through the flush machinery.
+		 * clean in memory and drop the buffer reference once we've
+		 * dropped the locks we hold.
 		 */
 		ASSERT(iip->ili_item.li_buf == bp);
 		if (!iip->ili_fields) {
 			iip->ili_item.li_buf = NULL;
+			list_del_init(&lip->li_bio_list);
 			drop_buffer = true;
-		} else {
-			list_add(&lip->li_bio_list, &bp->b_li_list);
 		}
 		iip->ili_last_fields = 0;
 		iip->ili_flush_lsn = 0;
@@ -746,6 +708,51 @@ xfs_iflush_done(
 	}
 }
 
+/*
+ * Inode buffer IO completion routine.  It is responsible for removing inodes
+ * attached to the buffer from the AIL if they have not been re-logged, as well
+ * as completing the flush and unlocking the inode.
+ */
+void
+xfs_iflush_done(
+	struct xfs_buf		*bp)
+{
+	struct xfs_log_item	*lip, *n;
+	LIST_HEAD(flushed_inodes);
+	LIST_HEAD(ail_updates);
+
+	/*
+	 * Pull the attached inodes from the buffer one at a time and take the
+	 * appropriate action on them.
+	 */
+	list_for_each_entry_safe(lip, n, &bp->b_li_list, li_bio_list) {
+		struct xfs_inode_log_item *iip = INODE_ITEM(lip);
+
+		if (xfs_iflags_test(iip->ili_inode, XFS_ISTALE)) {
+			xfs_iflush_abort(iip->ili_inode);
+			continue;
+		}
+		if (!iip->ili_last_fields)
+			continue;
+
+		/* Do an unlocked check for needing the AIL lock. */
+		if (iip->ili_flush_lsn == lip->li_lsn ||
+		    test_bit(XFS_LI_FAILED, &lip->li_flags))
+			list_move_tail(&lip->li_bio_list, &ail_updates);
+		else
+			list_move_tail(&lip->li_bio_list, &flushed_inodes);
+	}
+
+	if (!list_empty(&ail_updates)) {
+		xfs_iflush_ail_updates(bp->b_mount->m_ail, &ail_updates);
+		list_splice_tail(&ail_updates, &flushed_inodes);
+	}
+
+	xfs_iflush_finish(bp, &flushed_inodes);
+	if (!list_empty(&flushed_inodes))
+		list_splice_tail(&flushed_inodes, &bp->b_li_list);
+}
+
 /*
  * This is the inode flushing abort routine.  It is called from xfs_iflush when
  * the filesystem is shutting down to clean up the inode state.  It is

^ permalink raw reply related	[flat|nested] 43+ messages in thread
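
The shape of this rewrite is a reusable pattern: classify items using
only unlocked checks, then take the contended lock once for the whole
batch. A minimal userspace analogue of the pattern (hypothetical types
and stub helpers, not XFS code):

#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct item {
        struct item     *next;
        bool            needs_index_update;  /* ~ "still at ili_flush_lsn" */
};

static pthread_mutex_t index_lock = PTHREAD_MUTEX_INITIALIZER;

static void remove_from_index(struct item *it) { (void)it; }    /* stub */
static void finish_item(struct item *it) { (void)it; }          /* stub */

static void complete_items(struct item *list)
{
        struct item     *updates = NULL, *done = NULL, *it, *next;

        /* Pass 1: classify with unlocked checks only. */
        for (it = list; it; it = next) {
                next = it->next;
                if (it->needs_index_update) {
                        it->next = updates;
                        updates = it;
                } else {
                        it->next = done;
                        done = it;
                }
        }

        /* Pass 2: one lock round trip covers every index removal. */
        if (updates) {
                pthread_mutex_lock(&index_lock);
                for (it = updates; it; it = it->next)
                        remove_from_index(it);
                pthread_mutex_unlock(&index_lock);
        }

        /* Pass 3: per-item completion with no shared lock held. */
        for (it = updates; it; it = next) {
                next = it->next;
                finish_item(it);
        }
        for (it = done; it; it = next) {
                next = it->next;
                finish_item(it);
        }
}

In the patch above, the AIL lock is taken at most once per buffer
completion rather than once per attached inode, which is the point of
splitting xfs_iflush_done() into the two helpers.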

* Re: [PATCH 03/30] xfs: add an inode item lock
  2020-06-22  8:15 ` [PATCH 03/30] xfs: add an inode item lock Dave Chinner
@ 2020-06-23  2:30   ` Darrick J. Wong
  0 siblings, 0 replies; 43+ messages in thread
From: Darrick J. Wong @ 2020-06-23  2:30 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Mon, Jun 22, 2020 at 06:15:38PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The inode log item is kind of special in that it can be aggregating
> new changes in memory at the same time that existing changes are
> being written back to disk. This means there are fields in the log
> item that are accessed concurrently from contexts that don't share
> any locking at all.
> 
> e.g. updating ili_last_fields occurs at flush time under both the
> ILOCK_EXCL and the flush lock, under the flush lock alone at IO
> completion time, and is read under the ILOCK_EXCL when the inode is
> logged.  Hence there is no actual serialisation between reading the
> field during logging of the inode in transactions vs clearing the
> field in IO completion.
> 
> We currently get away with this by the fact that we are only
> clearing fields in IO completion, and nothing bad happens if we
> accidentally log more of the inode than we actually modify. Worst
> case is we consume a tiny bit more memory and log bandwidth.
> 
> However, if we want to do more complex state manipulations on the
> log item that requires updates at all three of these potential
> locations, we need to have some mechanism of serialising those
> operations. To do this, introduce a spinlock into the log item to
> serialise internal state.
> 
> This could be done via the xfs_inode i_flags_lock, but this then
> leads to potential lock inversion issues where inode flag updates
> need to occur inside locks that best nest inside the inode log item
> locks (e.g. marking inodes stale during inode cluster freeing).
> Using a separate spinlock avoids these sorts of problems and
> simplifies future code.
> 
> This does not touch the use of ili_fields in the item formatting
> code - that is entirely protected by the ILOCK_EXCL at this point in
> time, so it remains untouched.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Brian Foster <bfoster@redhat.com>

/me thot he'd reviewed this ages ago but apparently, not...

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D
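
Condensed, the lock gives three contexts that otherwise hold disjoint
locks a single serialisation point for the flush-state fields. This is
just a summary of the hunks quoted below, not new code:

        /* dirtying: xfs_trans_log_inode(), ILOCK_EXCL held */
        spin_lock(&iip->ili_lock);
        iip->ili_fsync_fields |= flags;
        iip->ili_fields |= (flags | iip->ili_last_fields | iversion_flags);
        spin_unlock(&iip->ili_lock);

        /* flushing: xfs_iflush_int(), flush lock held */
        spin_lock(&iip->ili_lock);
        iip->ili_last_fields = iip->ili_fields;
        iip->ili_fields = 0;
        iip->ili_fsync_fields = 0;
        spin_unlock(&iip->ili_lock);

        /* IO completion: xfs_iflush_done(), no inode locks held */
        spin_lock(&iip->ili_lock);
        iip->ili_last_fields = 0;
        spin_unlock(&iip->ili_lock);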

> ---
>  fs/xfs/libxfs/xfs_trans_inode.c | 52 ++++++++++++++++-----------------
>  fs/xfs/xfs_file.c               |  9 ++++--
>  fs/xfs/xfs_inode.c              | 20 ++++++++-----
>  fs/xfs/xfs_inode_item.c         |  7 +++++
>  fs/xfs/xfs_inode_item.h         | 18 ++++++++++--
>  5 files changed, 66 insertions(+), 40 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
> index 4504d215cd59..c66d9d1dd58b 100644
> --- a/fs/xfs/libxfs/xfs_trans_inode.c
> +++ b/fs/xfs/libxfs/xfs_trans_inode.c
> @@ -82,16 +82,20 @@ xfs_trans_ichgtime(
>   */
>  void
>  xfs_trans_log_inode(
> -	xfs_trans_t	*tp,
> -	xfs_inode_t	*ip,
> -	uint		flags)
> +	struct xfs_trans	*tp,
> +	struct xfs_inode	*ip,
> +	uint			flags)
>  {
> -	struct inode	*inode = VFS_I(ip);
> +	struct xfs_inode_log_item *iip = ip->i_itemp;
> +	struct inode		*inode = VFS_I(ip);
> +	uint			iversion_flags = 0;
>  
> -	ASSERT(ip->i_itemp != NULL);
> +	ASSERT(iip);
>  	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
>  	ASSERT(!xfs_iflags_test(ip, XFS_ISTALE));
>  
> +	tp->t_flags |= XFS_TRANS_DIRTY;
> +
>  	/*
>  	 * Don't bother with i_lock for the I_DIRTY_TIME check here, as races
>  	 * don't matter - we either will need an extra transaction in 24 hours
> @@ -104,15 +108,6 @@ xfs_trans_log_inode(
>  		spin_unlock(&inode->i_lock);
>  	}
>  
> -	/*
> -	 * Record the specific change for fdatasync optimisation. This
> -	 * allows fdatasync to skip log forces for inodes that are only
> -	 * timestamp dirty. We do this before the change count so that
> -	 * the core being logged in this case does not impact on fdatasync
> -	 * behaviour.
> -	 */
> -	ip->i_itemp->ili_fsync_fields |= flags;
> -
>  	/*
>  	 * First time we log the inode in a transaction, bump the inode change
>  	 * counter if it is configured for this to occur. While we have the
> @@ -122,23 +117,28 @@ xfs_trans_log_inode(
>  	 * set however, then go ahead and bump the i_version counter
>  	 * unconditionally.
>  	 */
> -	if (!test_and_set_bit(XFS_LI_DIRTY, &ip->i_itemp->ili_item.li_flags) &&
> -	    IS_I_VERSION(VFS_I(ip))) {
> -		if (inode_maybe_inc_iversion(VFS_I(ip), flags & XFS_ILOG_CORE))
> -			flags |= XFS_ILOG_CORE;
> +	if (!test_and_set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags)) {
> +		if (IS_I_VERSION(inode) &&
> +		    inode_maybe_inc_iversion(inode, flags & XFS_ILOG_CORE))
> +			iversion_flags = XFS_ILOG_CORE;
>  	}
>  
> -	tp->t_flags |= XFS_TRANS_DIRTY;
> +	/*
> +	 * Record the specific change for fdatasync optimisation. This allows
> +	 * fdatasync to skip log forces for inodes that are only timestamp
> +	 * dirty.
> +	 */
> +	spin_lock(&iip->ili_lock);
> +	iip->ili_fsync_fields |= flags;
>  
>  	/*
> -	 * Always OR in the bits from the ili_last_fields field.
> -	 * This is to coordinate with the xfs_iflush() and xfs_iflush_done()
> -	 * routines in the eventual clearing of the ili_fields bits.
> -	 * See the big comment in xfs_iflush() for an explanation of
> -	 * this coordination mechanism.
> +	 * Always OR in the bits from the ili_last_fields field.  This is to
> +	 * coordinate with the xfs_iflush() and xfs_iflush_done() routines in
> +	 * the eventual clearing of the ili_fields bits.  See the big comment in
> +	 * xfs_iflush() for an explanation of this coordination mechanism.
>  	 */
> -	flags |= ip->i_itemp->ili_last_fields;
> -	ip->i_itemp->ili_fields |= flags;
> +	iip->ili_fields |= (flags | iip->ili_last_fields | iversion_flags);
> +	spin_unlock(&iip->ili_lock);
>  }
>  
>  int
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 00db81eac80d..7b05f8fd7b3d 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -94,6 +94,7 @@ xfs_file_fsync(
>  {
>  	struct inode		*inode = file->f_mapping->host;
>  	struct xfs_inode	*ip = XFS_I(inode);
> +	struct xfs_inode_log_item *iip = ip->i_itemp;
>  	struct xfs_mount	*mp = ip->i_mount;
>  	int			error = 0;
>  	int			log_flushed = 0;
> @@ -137,13 +138,15 @@ xfs_file_fsync(
>  	xfs_ilock(ip, XFS_ILOCK_SHARED);
>  	if (xfs_ipincount(ip)) {
>  		if (!datasync ||
> -		    (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
> -			lsn = ip->i_itemp->ili_last_lsn;
> +		    (iip->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
> +			lsn = iip->ili_last_lsn;
>  	}
>  
>  	if (lsn) {
>  		error = xfs_log_force_lsn(mp, lsn, XFS_LOG_SYNC, &log_flushed);
> -		ip->i_itemp->ili_fsync_fields = 0;
> +		spin_lock(&iip->ili_lock);
> +		iip->ili_fsync_fields = 0;
> +		spin_unlock(&iip->ili_lock);
>  	}
>  	xfs_iunlock(ip, XFS_ILOCK_SHARED);
>  
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 2e5fa6dea83a..e2296f0b7f11 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -2704,9 +2704,11 @@ xfs_ifree_cluster(
>  				continue;
>  
>  			iip = ip->i_itemp;
> +			spin_lock(&iip->ili_lock);
>  			iip->ili_last_fields = iip->ili_fields;
>  			iip->ili_fields = 0;
>  			iip->ili_fsync_fields = 0;
> +			spin_unlock(&iip->ili_lock);
>  			xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,
>  						&iip->ili_item.li_lsn);
>  
> @@ -2742,6 +2744,7 @@ xfs_ifree(
>  {
>  	int			error;
>  	struct xfs_icluster	xic = { 0 };
> +	struct xfs_inode_log_item *iip = ip->i_itemp;
>  
>  	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
>  	ASSERT(VFS_I(ip)->i_nlink == 0);
> @@ -2779,7 +2782,9 @@ xfs_ifree(
>  	ip->i_df.if_format = XFS_DINODE_FMT_EXTENTS;
>  
>  	/* Don't attempt to replay owner changes for a deleted inode */
> -	ip->i_itemp->ili_fields &= ~(XFS_ILOG_AOWNER|XFS_ILOG_DOWNER);
> +	spin_lock(&iip->ili_lock);
> +	iip->ili_fields &= ~(XFS_ILOG_AOWNER | XFS_ILOG_DOWNER);
> +	spin_unlock(&iip->ili_lock);
>  
>  	/*
>  	 * Bump the generation count so no one will be confused
> @@ -3835,20 +3840,19 @@ xfs_iflush_int(
>  	 * know that the information those bits represent is permanently on
>  	 * disk.  As long as the flush completes before the inode is logged
>  	 * again, then both ili_fields and ili_last_fields will be cleared.
> -	 *
> -	 * We can play with the ili_fields bits here, because the inode lock
> -	 * must be held exclusively in order to set bits there and the flush
> -	 * lock protects the ili_last_fields bits.  Store the current LSN of the
> -	 * inode so that we can tell whether the item has moved in the AIL from
> -	 * xfs_iflush_done().  In order to read the lsn we need the AIL lock,
> -	 * because it is a 64 bit value that cannot be read atomically.
>  	 */
>  	error = 0;
>  flush_out:
> +	spin_lock(&iip->ili_lock);
>  	iip->ili_last_fields = iip->ili_fields;
>  	iip->ili_fields = 0;
>  	iip->ili_fsync_fields = 0;
> +	spin_unlock(&iip->ili_lock);
>  
> +	/*
> +	 * Store the current LSN of the inode so that we can tell whether the
> +	 * item has moved in the AIL from xfs_iflush_done().
> +	 */
>  	xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,
>  				&iip->ili_item.li_lsn);
>  
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index b17384aa8df4..6ef9cbcfc94a 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -637,6 +637,7 @@ xfs_inode_item_init(
>  	iip = ip->i_itemp = kmem_zone_zalloc(xfs_ili_zone, 0);
>  
>  	iip->ili_inode = ip;
> +	spin_lock_init(&iip->ili_lock);
>  	xfs_log_item_init(mp, &iip->ili_item, XFS_LI_INODE,
>  						&xfs_inode_item_ops);
>  }
> @@ -738,7 +739,11 @@ xfs_iflush_done(
>  	list_for_each_entry_safe(blip, n, &tmp, li_bio_list) {
>  		list_del_init(&blip->li_bio_list);
>  		iip = INODE_ITEM(blip);
> +
> +		spin_lock(&iip->ili_lock);
>  		iip->ili_last_fields = 0;
> +		spin_unlock(&iip->ili_lock);
> +
>  		xfs_ifunlock(iip->ili_inode);
>  	}
>  	list_del(&tmp);
> @@ -762,9 +767,11 @@ xfs_iflush_abort(
>  		 * Clear the inode logging fields so no more flushes are
>  		 * attempted.
>  		 */
> +		spin_lock(&iip->ili_lock);
>  		iip->ili_last_fields = 0;
>  		iip->ili_fields = 0;
>  		iip->ili_fsync_fields = 0;
> +		spin_unlock(&iip->ili_lock);
>  	}
>  	/*
>  	 * Release the inode's flush lock since we're done with it.
> diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h
> index 4de5070e0765..4a10a1b92ee9 100644
> --- a/fs/xfs/xfs_inode_item.h
> +++ b/fs/xfs/xfs_inode_item.h
> @@ -16,12 +16,24 @@ struct xfs_mount;
>  struct xfs_inode_log_item {
>  	struct xfs_log_item	ili_item;	   /* common portion */
>  	struct xfs_inode	*ili_inode;	   /* inode ptr */
> -	xfs_lsn_t		ili_flush_lsn;	   /* lsn at last flush */
> -	xfs_lsn_t		ili_last_lsn;	   /* lsn at last transaction */
> -	unsigned short		ili_lock_flags;	   /* lock flags */
> +	unsigned short		ili_lock_flags;	   /* inode lock flags */
> +	/*
> +	 * The ili_lock protects the interactions between the dirty state and
> +	 * the flush state of the inode log item. This allows us to do atomic
> +	 * modifications of multiple state fields without having to hold a
> +	 * specific inode lock to serialise them.
> +	 *
> +	 * We need atomic changes between inode dirtying, inode flushing and
> +	 * inode completion, but these all hold different combinations of
> +	 * ILOCK and iflock and hence we need some other method of serialising
> +	 * updates to the flush state.
> +	 */
> +	spinlock_t		ili_lock;	   /* flush state lock */
>  	unsigned int		ili_last_fields;   /* fields when flushed */
>  	unsigned int		ili_fields;	   /* fields to be logged */
>  	unsigned int		ili_fsync_fields;  /* logged since last fsync */
> +	xfs_lsn_t		ili_flush_lsn;	   /* lsn at last flush */
> +	xfs_lsn_t		ili_last_lsn;	   /* lsn at last transaction */
>  };
>  
>  static inline int xfs_inode_clean(xfs_inode_t *ip)
> -- 
> 2.26.2.761.g0e0b3e54be
> 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 04/30] xfs: mark inode buffers in cache
  2020-06-22  8:15 ` [PATCH 04/30] xfs: mark inode buffers in cache Dave Chinner
@ 2020-06-23  2:32   ` Darrick J. Wong
  0 siblings, 0 replies; 43+ messages in thread
From: Darrick J. Wong @ 2020-06-23  2:32 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Mon, Jun 22, 2020 at 06:15:39PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Inode buffers always have write IO callbacks, so by marking them
> directly we can avoid needing to attach ->b_iodone functions to
> them. This avoids an indirect call, and makes future modifications
> much simpler.
> 
> While this is largely a refactor of existing functionality, we
> broaden the scope of the flag to beyond where inodes are explicitly
> attached because future changes need to know what type of log items
> are attached to the buffer. Adding this buffer flag may invoke the
> inode iodone callback in cases where it wouldn't have been
> previously, but this is not a functional change because the callback
> is identical to the normal buffer write iodone callback when inodes
> are not attached.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Brian Foster <bfoster@redhat.com>

Looks like a pretty straightforward removal of indirect iodone calls,

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D
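
The resulting completion dispatch is compact enough to show inline; a
type flag test replaces the indirect call for inode buffers (condensed
from the xfs_buf_ioend() hunk quoted below):

        if (bp->b_flags & _XBF_INODES) {
                xfs_buf_inode_iodone(bp);       /* known type, direct call */
                return;
        }
        if (bp->b_iodone) {
                (*(bp->b_iodone))(bp);          /* legacy indirect callback */
                return;
        }
        xfs_buf_ioend_finish(bp);               /* plain buffer completion */

Later patches in the series extend the same flag-based dispatch to other
buffer types (e.g. dquot buffers), which is why the flag is set on all
inode buffers rather than only where iodone callbacks were attached.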

> ---
>  fs/xfs/xfs_buf.c       | 21 ++++++++++++++++-----
>  fs/xfs/xfs_buf.h       | 38 +++++++++++++++++++++++++-------------
>  fs/xfs/xfs_buf_item.c  | 42 +++++++++++++++++++++++++++++++-----------
>  fs/xfs/xfs_buf_item.h  |  1 +
>  fs/xfs/xfs_inode.c     |  2 +-
>  fs/xfs/xfs_trans_buf.c |  3 +++
>  6 files changed, 77 insertions(+), 30 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 20b748f7e186..ae0c923574df 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -14,6 +14,8 @@
>  #include "xfs_mount.h"
>  #include "xfs_trace.h"
>  #include "xfs_log.h"
> +#include "xfs_trans.h"
> +#include "xfs_buf_item.h"
>  #include "xfs_errortag.h"
>  #include "xfs_error.h"
>  
> @@ -1202,12 +1204,21 @@ xfs_buf_ioend(
>  		bp->b_flags |= XBF_DONE;
>  	}
>  
> -	if (bp->b_iodone)
> +	if (read)
> +		goto out_finish;
> +
> +	if (bp->b_flags & _XBF_INODES) {
> +		xfs_buf_inode_iodone(bp);
> +		return;
> +	}
> +
> +	if (bp->b_iodone) {
>  		(*(bp->b_iodone))(bp);
> -	else if (bp->b_flags & XBF_ASYNC)
> -		xfs_buf_relse(bp);
> -	else
> -		complete(&bp->b_iowait);
> +		return;
> +	}
> +
> +out_finish:
> +	xfs_buf_ioend_finish(bp);
>  }
>  
>  static void
> diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
> index 050c53b739e2..2400cb90a04c 100644
> --- a/fs/xfs/xfs_buf.h
> +++ b/fs/xfs/xfs_buf.h
> @@ -30,15 +30,18 @@
>  #define XBF_STALE	 (1 << 6) /* buffer has been staled, do not find it */
>  #define XBF_WRITE_FAIL	 (1 << 7) /* async writes have failed on this buffer */
>  
> -/* flags used only as arguments to access routines */
> -#define XBF_TRYLOCK	 (1 << 16)/* lock requested, but do not wait */
> -#define XBF_UNMAPPED	 (1 << 17)/* do not map the buffer */
> +/* buffer type flags for write callbacks */
> +#define _XBF_INODES	 (1 << 16)/* inode buffer */
>  
>  /* flags used only internally */
>  #define _XBF_PAGES	 (1 << 20)/* backed by refcounted pages */
>  #define _XBF_KMEM	 (1 << 21)/* backed by heap memory */
>  #define _XBF_DELWRI_Q	 (1 << 22)/* buffer on a delwri queue */
>  
> +/* flags used only as arguments to access routines */
> +#define XBF_TRYLOCK	 (1 << 30)/* lock requested, but do not wait */
> +#define XBF_UNMAPPED	 (1 << 31)/* do not map the buffer */
> +
>  typedef unsigned int xfs_buf_flags_t;
>  
>  #define XFS_BUF_FLAGS \
> @@ -50,12 +53,13 @@ typedef unsigned int xfs_buf_flags_t;
>  	{ XBF_DONE,		"DONE" }, \
>  	{ XBF_STALE,		"STALE" }, \
>  	{ XBF_WRITE_FAIL,	"WRITE_FAIL" }, \
> -	{ XBF_TRYLOCK,		"TRYLOCK" },	/* should never be set */\
> -	{ XBF_UNMAPPED,		"UNMAPPED" },	/* ditto */\
> +	{ _XBF_INODES,		"INODES" }, \
>  	{ _XBF_PAGES,		"PAGES" }, \
>  	{ _XBF_KMEM,		"KMEM" }, \
> -	{ _XBF_DELWRI_Q,	"DELWRI_Q" }
> -
> +	{ _XBF_DELWRI_Q,	"DELWRI_Q" }, \
> +	/* The following interface flags should never be set */ \
> +	{ XBF_TRYLOCK,		"TRYLOCK" }, \
> +	{ XBF_UNMAPPED,		"UNMAPPED" }
>  
>  /*
>   * Internal state flags.
> @@ -257,9 +261,23 @@ extern void xfs_buf_unlock(xfs_buf_t *);
>  #define xfs_buf_islocked(bp) \
>  	((bp)->b_sema.count <= 0)
>  
> +static inline void xfs_buf_relse(xfs_buf_t *bp)
> +{
> +	xfs_buf_unlock(bp);
> +	xfs_buf_rele(bp);
> +}
> +
>  /* Buffer Read and Write Routines */
>  extern int xfs_bwrite(struct xfs_buf *bp);
>  extern void xfs_buf_ioend(struct xfs_buf *bp);
> +static inline void xfs_buf_ioend_finish(struct xfs_buf *bp)
> +{
> +	if (bp->b_flags & XBF_ASYNC)
> +		xfs_buf_relse(bp);
> +	else
> +		complete(&bp->b_iowait);
> +}
> +
>  extern void __xfs_buf_ioerror(struct xfs_buf *bp, int error,
>  		xfs_failaddr_t failaddr);
>  #define xfs_buf_ioerror(bp, err) __xfs_buf_ioerror((bp), (err), __this_address)
> @@ -324,12 +342,6 @@ static inline int xfs_buf_ispinned(struct xfs_buf *bp)
>  	return atomic_read(&bp->b_pin_count);
>  }
>  
> -static inline void xfs_buf_relse(xfs_buf_t *bp)
> -{
> -	xfs_buf_unlock(bp);
> -	xfs_buf_rele(bp);
> -}
> -
>  static inline int
>  xfs_buf_verify_cksum(struct xfs_buf *bp, unsigned long cksum_offset)
>  {
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index 9e75e8d6042e..8659cf4282a6 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
> @@ -1158,20 +1158,15 @@ xfs_buf_iodone_callback_error(
>  	return false;
>  }
>  
> -/*
> - * This is the iodone() function for buffers which have had callbacks attached
> - * to them by xfs_buf_attach_iodone(). We need to iterate the items on the
> - * callback list, mark the buffer as having no more callbacks and then push the
> - * buffer through IO completion processing.
> - */
> -void
> -xfs_buf_iodone_callbacks(
> +static void
> +xfs_buf_run_callbacks(
>  	struct xfs_buf		*bp)
>  {
> +
>  	/*
> -	 * If there is an error, process it. Some errors require us
> -	 * to run callbacks after failure processing is done so we
> -	 * detect that and take appropriate action.
> +	 * If there is an error, process it. Some errors require us to run
> +	 * callbacks after failure processing is done so we detect that and take
> +	 * appropriate action.
>  	 */
>  	if (bp->b_error && xfs_buf_iodone_callback_error(bp))
>  		return;
> @@ -1188,9 +1183,34 @@ xfs_buf_iodone_callbacks(
>  	bp->b_log_item = NULL;
>  	list_del_init(&bp->b_li_list);
>  	bp->b_iodone = NULL;
> +}
> +
> +/*
> + * This is the iodone() function for buffers which have had callbacks attached
> + * to them by xfs_buf_attach_iodone(). We need to iterate the items on the
> + * callback list, mark the buffer as having no more callbacks and then push the
> + * buffer through IO completion processing.
> + */
> +void
> +xfs_buf_iodone_callbacks(
> +	struct xfs_buf		*bp)
> +{
> +	xfs_buf_run_callbacks(bp);
>  	xfs_buf_ioend(bp);
>  }
>  
> +/*
> + * Inode buffer iodone callback function.
> + */
> +void
> +xfs_buf_inode_iodone(
> +	struct xfs_buf		*bp)
> +{
> +	xfs_buf_run_callbacks(bp);
> +	xfs_buf_ioend_finish(bp);
> +}
> +
> +
>  /*
>   * This is the iodone() function for buffers which have been
>   * logged.  It is called when they are eventually flushed out.
> diff --git a/fs/xfs/xfs_buf_item.h b/fs/xfs/xfs_buf_item.h
> index c9c57e2da932..a342933ad9b8 100644
> --- a/fs/xfs/xfs_buf_item.h
> +++ b/fs/xfs/xfs_buf_item.h
> @@ -59,6 +59,7 @@ void	xfs_buf_attach_iodone(struct xfs_buf *,
>  			      struct xfs_log_item *);
>  void	xfs_buf_iodone_callbacks(struct xfs_buf *);
>  void	xfs_buf_iodone(struct xfs_buf *, struct xfs_log_item *);
> +void	xfs_buf_inode_iodone(struct xfs_buf *);
>  bool	xfs_buf_log_check_iovec(struct xfs_log_iovec *iovec);
>  
>  extern kmem_zone_t	*xfs_buf_item_zone;
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index e2296f0b7f11..b0760a900e11 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -3862,13 +3862,13 @@ xfs_iflush_int(
>  	 * completion on the buffer to remove the inode from the AIL and release
>  	 * the flush lock.
>  	 */
> +	bp->b_flags |= _XBF_INODES;
>  	xfs_buf_attach_iodone(bp, xfs_iflush_done, &iip->ili_item);
>  
>  	/* generate the checksum. */
>  	xfs_dinode_calc_crc(mp, dip);
>  
>  	ASSERT(!list_empty(&bp->b_li_list));
> -	ASSERT(bp->b_iodone != NULL);
>  	return error;
>  }
>  
> diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c
> index 08174ffa2118..552d0869aa0f 100644
> --- a/fs/xfs/xfs_trans_buf.c
> +++ b/fs/xfs/xfs_trans_buf.c
> @@ -626,6 +626,7 @@ xfs_trans_inode_buf(
>  	ASSERT(atomic_read(&bip->bli_refcount) > 0);
>  
>  	bip->bli_flags |= XFS_BLI_INODE_BUF;
> +	bp->b_flags |= _XBF_INODES;
>  	xfs_trans_buf_set_type(tp, bp, XFS_BLFT_DINO_BUF);
>  }
>  
> @@ -651,6 +652,7 @@ xfs_trans_stale_inode_buf(
>  
>  	bip->bli_flags |= XFS_BLI_STALE_INODE;
>  	bip->bli_item.li_cb = xfs_buf_iodone;
> +	bp->b_flags |= _XBF_INODES;
>  	xfs_trans_buf_set_type(tp, bp, XFS_BLFT_DINO_BUF);
>  }
>  
> @@ -675,6 +677,7 @@ xfs_trans_inode_alloc_buf(
>  	ASSERT(atomic_read(&bip->bli_refcount) > 0);
>  
>  	bip->bli_flags |= XFS_BLI_INODE_ALLOC_BUF;
> +	bp->b_flags |= _XBF_INODES;
>  	xfs_trans_buf_set_type(tp, bp, XFS_BLFT_DINO_BUF);
>  }
>  
> -- 
> 2.26.2.761.g0e0b3e54be
> 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 13/30] xfs: handle buffer log item IO errors directly
  2020-06-22  8:15 ` [PATCH 13/30] xfs: handle buffer log item IO errors directly Dave Chinner
@ 2020-06-23  2:38   ` Darrick J. Wong
  0 siblings, 0 replies; 43+ messages in thread
From: Darrick J. Wong @ 2020-06-23  2:38 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Mon, Jun 22, 2020 at 06:15:48PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Currently when a buffer with attached log items has an IO error
> it calls ->iop_error for each attached log item. These all call
> xfs_set_li_failed() to handle the error, but we are about to change
> the way log items manage buffers. Hence we first need to remove the
> per-item dependency on buffer handling done by xfs_set_li_failed().
> 
> We already have specific buffer type IO completion routines, so move
> the log item error handling out of the generic error handling and
> into the log item specific functions so we can implement per-type
> error handling easily.
> 
> This requires a more complex return value from the error handling
> code so that we can take the correct action the failure handling
> requires.  This results in some repeated boilerplate in the
> functions, but that can be cleaned up later once all the changes
> cascade through this code.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Brian Foster <bfoster@redhat.com>

Ok, I can more or less follow along with what we're doing now.

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D
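
The decision ladder in xfs_buf_iodone_error() reduces to four outcomes.
As a sketch (the predicate names are descriptive shorthand for the
checks in the patch, not real helpers):

        if (forced_shutdown || sync_write)
                return XBF_IOERROR_FINISH;      /* stale; caller sees b_error */
        if (first_failure_of_this_error)
                return XBF_IOERROR_DONE;        /* write resubmitted once */
        if (retries_exhausted || retry_timed_out || failing_unmount) {
                xfs_force_shutdown(mp, SHUTDOWN_META_IO_ERROR);
                return XBF_IOERROR_FINISH;      /* permanent failure */
        }
        return XBF_IOERROR_FAIL;                /* transient; retried later */

Each buffer type's iodone function then handles the three return values
itself, which is the repeated boilerplate the commit message says can be
cleaned up once the rest of the series lands.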

> ---
>  fs/xfs/xfs_buf_item.c | 214 ++++++++++++++++++++++++++++--------------
>  1 file changed, 144 insertions(+), 70 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index 09bfe9c52dbd..f80fc5bd3bff 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
> @@ -986,21 +986,24 @@ xfs_buf_do_callbacks_fail(
>  	spin_unlock(&ailp->ail_lock);
>  }
>  
> +/*
> + * Decide if we're going to retry the write after a failure, and prepare
> + * the buffer for retrying the write.
> + */
>  static bool
> -xfs_buf_iodone_callback_error(
> +xfs_buf_ioerror_fail_without_retry(
>  	struct xfs_buf		*bp)
>  {
>  	struct xfs_mount	*mp = bp->b_mount;
>  	static ulong		lasttime;
>  	static xfs_buftarg_t	*lasttarg;
> -	struct xfs_error_cfg	*cfg;
>  
>  	/*
>  	 * If we've already decided to shutdown the filesystem because of
>  	 * I/O errors, there's no point in giving this a retry.
>  	 */
>  	if (XFS_FORCED_SHUTDOWN(mp))
> -		goto out_stale;
> +		return true;
>  
>  	if (bp->b_target != lasttarg ||
>  	    time_after(jiffies, (lasttime + 5*HZ))) {
> @@ -1011,91 +1014,114 @@ xfs_buf_iodone_callback_error(
>  
>  	/* synchronous writes will have callers process the error */
>  	if (!(bp->b_flags & XBF_ASYNC))
> -		goto out_stale;
> -
> -	trace_xfs_buf_item_iodone_async(bp, _RET_IP_);
> -
> -	cfg = xfs_error_get_cfg(mp, XFS_ERR_METADATA, bp->b_error);
> +		return true;
> +	return false;
> +}
>  
> -	/*
> -	 * If the write was asynchronous then no one will be looking for the
> -	 * error.  If this is the first failure of this type, clear the error
> -	 * state and write the buffer out again. This means we always retry an
> -	 * async write failure at least once, but we also need to set the buffer
> -	 * up to behave correctly now for repeated failures.
> -	 */
> -	if (!(bp->b_flags & (XBF_STALE | XBF_WRITE_FAIL)) ||
> -	     bp->b_last_error != bp->b_error) {
> -		bp->b_flags |= (XBF_WRITE | XBF_DONE | XBF_WRITE_FAIL);
> -		bp->b_last_error = bp->b_error;
> -		if (cfg->retry_timeout != XFS_ERR_RETRY_FOREVER &&
> -		    !bp->b_first_retry_time)
> -			bp->b_first_retry_time = jiffies;
> +static bool
> +xfs_buf_ioerror_retry(
> +	struct xfs_buf		*bp,
> +	struct xfs_error_cfg	*cfg)
> +{
> +	if ((bp->b_flags & (XBF_STALE | XBF_WRITE_FAIL)) &&
> +	    bp->b_last_error == bp->b_error)
> +		return false;
>  
> -		xfs_buf_ioerror(bp, 0);
> -		xfs_buf_submit(bp);
> -		return true;
> -	}
> +	bp->b_flags |= (XBF_WRITE | XBF_DONE | XBF_WRITE_FAIL);
> +	bp->b_last_error = bp->b_error;
> +	if (cfg->retry_timeout != XFS_ERR_RETRY_FOREVER &&
> +	    !bp->b_first_retry_time)
> +		bp->b_first_retry_time = jiffies;
> +	return true;
> +}
>  
> -	/*
> -	 * Repeated failure on an async write. Take action according to the
> -	 * error configuration we have been set up to use.
> -	 */
> +/*
> + * Account for this latest trip around the retry handler, and decide if
> + * we've failed enough times to constitute a permanent failure.
> + */
> +static bool
> +xfs_buf_ioerror_permanent(
> +	struct xfs_buf		*bp,
> +	struct xfs_error_cfg	*cfg)
> +{
> +	struct xfs_mount	*mp = bp->b_mount;
>  
>  	if (cfg->max_retries != XFS_ERR_RETRY_FOREVER &&
>  	    ++bp->b_retries > cfg->max_retries)
> -			goto permanent_error;
> +		return true;
>  	if (cfg->retry_timeout != XFS_ERR_RETRY_FOREVER &&
>  	    time_after(jiffies, cfg->retry_timeout + bp->b_first_retry_time))
> -			goto permanent_error;
> +		return true;
>  
>  	/* At unmount we may treat errors differently */
>  	if ((mp->m_flags & XFS_MOUNT_UNMOUNTING) && mp->m_fail_unmount)
> -		goto permanent_error;
> -
> -	/*
> -	 * Still a transient error, run IO completion failure callbacks and let
> -	 * the higher layers retry the buffer.
> -	 */
> -	xfs_buf_do_callbacks_fail(bp);
> -	xfs_buf_ioerror(bp, 0);
> -	xfs_buf_relse(bp);
> -	return true;
> +		return true;
>  
> -	/*
> -	 * Permanent error - we need to trigger a shutdown if we haven't already
> -	 * to indicate that inconsistency will result from this action.
> -	 */
> -permanent_error:
> -	xfs_force_shutdown(mp, SHUTDOWN_META_IO_ERROR);
> -out_stale:
> -	xfs_buf_stale(bp);
> -	bp->b_flags |= XBF_DONE;
> -	trace_xfs_buf_error_relse(bp, _RET_IP_);
>  	return false;
>  }
>  
> -static inline bool
> -xfs_buf_had_callback_errors(
> +/*
> + * On a sync write or shutdown we just want to stale the buffer and let the
> + * caller handle the error in bp->b_error appropriately.
> + *
> + * If the write was asynchronous then no one will be looking for the error.  If
> + * this is the first failure of this type, clear the error state and write the
> + * buffer out again. This means we always retry an async write failure at least
> + * once, but we also need to set the buffer up to behave correctly now for
> + * repeated failures.
> + *
> + * If we get repeated async write failures, then we take action according to the
> + * error configuration we have been set up to use.
> + *
> + * Multi-state return value:
> + *
> + * XBF_IOERROR_FINISH: clear IO error retry state and run callback completions
> + * XBF_IOERROR_DONE: resubmitted immediately, do not run any completions
> + * XBF_IOERROR_FAIL: transient error, run failure callback completions and then
> + *    release the buffer
> + */
> +enum {
> +	XBF_IOERROR_FINISH,
> +	XBF_IOERROR_DONE,
> +	XBF_IOERROR_FAIL,
> +};
> +
> +static int
> +xfs_buf_iodone_error(
>  	struct xfs_buf		*bp)
>  {
> +	struct xfs_mount	*mp = bp->b_mount;
> +	struct xfs_error_cfg	*cfg;
>  
> -	/*
> -	 * If there is an error, process it. Some errors require us to run
> -	 * callbacks after failure processing is done so we detect that and take
> -	 * appropriate action.
> -	 */
> -	if (bp->b_error && xfs_buf_iodone_callback_error(bp))
> -		return true;
> +	if (xfs_buf_ioerror_fail_without_retry(bp))
> +		goto out_stale;
> +
> +	trace_xfs_buf_item_iodone_async(bp, _RET_IP_);
> +
> +	cfg = xfs_error_get_cfg(mp, XFS_ERR_METADATA, bp->b_error);
> +	if (xfs_buf_ioerror_retry(bp, cfg)) {
> +		xfs_buf_ioerror(bp, 0);
> +		xfs_buf_submit(bp);
> +		return XBF_IOERROR_DONE;
> +	}
>  
>  	/*
> -	 * Successful IO or permanent error. Either way, we can clear the
> -	 * retry state here in preparation for the next error that may occur.
> +	 * Permanent error - we need to trigger a shutdown if we haven't already
> +	 * to indicate that inconsistency will result from this action.
>  	 */
> -	bp->b_last_error = 0;
> -	bp->b_retries = 0;
> -	bp->b_first_retry_time = 0;
> -	return false;
> +	if (xfs_buf_ioerror_permanent(bp, cfg)) {
> +		xfs_force_shutdown(mp, SHUTDOWN_META_IO_ERROR);
> +		goto out_stale;
> +	}
> +
> +	/* Still considered a transient error. Caller will schedule retries. */
> +	return XBF_IOERROR_FAIL;
> +
> +out_stale:
> +	xfs_buf_stale(bp);
> +	bp->b_flags |= XBF_DONE;
> +	trace_xfs_buf_error_relse(bp, _RET_IP_);
> +	return XBF_IOERROR_FINISH;
>  }
>  
>  static void
> @@ -1122,6 +1148,15 @@ xfs_buf_item_done(
>  	xfs_buf_rele(bp);
>  }
>  
> +static inline void
> +xfs_buf_clear_ioerror_retry_state(
> +	struct xfs_buf		*bp)
> +{
> +	bp->b_last_error = 0;
> +	bp->b_retries = 0;
> +	bp->b_first_retry_time = 0;
> +}
> +
>  /*
>   * Inode buffer iodone callback function.
>   */
> @@ -1129,9 +1164,22 @@ void
>  xfs_buf_inode_iodone(
>  	struct xfs_buf		*bp)
>  {
> -	if (xfs_buf_had_callback_errors(bp))
> +	if (bp->b_error) {
> +		int ret = xfs_buf_iodone_error(bp);
> +
> +		if (ret == XBF_IOERROR_FINISH)
> +			goto finish_iodone;
> +		if (ret == XBF_IOERROR_DONE)
> +			return;
> +		ASSERT(ret == XBF_IOERROR_FAIL);
> +		xfs_buf_do_callbacks_fail(bp);
> +		xfs_buf_ioerror(bp, 0);
> +		xfs_buf_relse(bp);
>  		return;
> +	}
>  
> +finish_iodone:
> +	xfs_buf_clear_ioerror_retry_state(bp);
>  	xfs_buf_item_done(bp);
>  	xfs_iflush_done(bp);
>  	xfs_buf_ioend_finish(bp);
> @@ -1144,9 +1192,22 @@ void
>  xfs_buf_dquot_iodone(
>  	struct xfs_buf		*bp)
>  {
> -	if (xfs_buf_had_callback_errors(bp))
> +	if (bp->b_error) {
> +		int ret = xfs_buf_iodone_error(bp);
> +
> +		if (ret == XBF_IOERROR_FINISH)
> +			goto finish_iodone;
> +		if (ret == XBF_IOERROR_DONE)
> +			return;
> +		ASSERT(ret == XBF_IOERROR_FAIL);
> +		xfs_buf_do_callbacks_fail(bp);
> +		xfs_buf_ioerror(bp, 0);
> +		xfs_buf_relse(bp);
>  		return;
> +	}
>  
> +finish_iodone:
> +	xfs_buf_clear_ioerror_retry_state(bp);
>  	/* a newly allocated dquot buffer might have a log item attached */
>  	xfs_buf_item_done(bp);
>  	xfs_dquot_done(bp);
> @@ -1163,9 +1224,22 @@ void
>  xfs_buf_iodone(
>  	struct xfs_buf		*bp)
>  {
> -	if (xfs_buf_had_callback_errors(bp))
> +	if (bp->b_error) {
> +		int ret = xfs_buf_iodone_error(bp);
> +
> +		if (ret == XBF_IOERROR_FINISH)
> +			goto finish_iodone;
> +		if (ret == XBF_IOERROR_DONE)
> +			return;
> +		ASSERT(ret == XBF_IOERROR_FAIL);
> +		xfs_buf_do_callbacks_fail(bp);
> +		xfs_buf_ioerror(bp, 0);
> +		xfs_buf_relse(bp);
>  		return;
> +	}
>  
> +finish_iodone:
> +	xfs_buf_clear_ioerror_retry_state(bp);
>  	xfs_buf_item_done(bp);
>  	xfs_buf_ioend_finish(bp);
>  }
> -- 
> 2.26.2.761.g0e0b3e54be
> 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 16/30] xfs: pin inode backing buffer to the inode log item
  2020-06-22  8:15 ` [PATCH 16/30] xfs: pin inode backing buffer to the inode log item Dave Chinner
@ 2020-06-23  2:39   ` Darrick J. Wong
  0 siblings, 0 replies; 43+ messages in thread
From: Darrick J. Wong @ 2020-06-23  2:39 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Mon, Jun 22, 2020 at 06:15:51PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When we dirty an inode, we are going to have to write it to disk at
> some point in the near future. This requires the inode cluster
> backing buffer to be present in memory. Unfortunately, under severe
> memory pressure we can reclaim the inode backing buffer while the
> inode is dirty in memory, resulting in stalling the AIL pushing
> because it has to do a read-modify-write cycle on the cluster
> buffer.
> 
> When we have no memory available, the read of the cluster buffer
> blocks the AIL pushing process, and this causes all sorts of issues
> for memory reclaim as it requires inode writeback to make forwards
> progress. Allocating a cluster buffer causes more memory pressure,
> and results in more cluster buffers to be reclaimed, resulting in
> more RMW cycles to be done in the AIL context and everything then
> backs up on AIL progress. Only the synchronous inode cluster
> writeback in the inode reclaim code provides some level of
> forwards progress guarantees that prevent OOM-killer rampages in
> this situation.
> 
> Fix this by pinning the inode backing buffer to the inode log item
> when the inode is first dirtied (i.e. in xfs_trans_log_inode()).
> This may mean the first modification of an inode that has been held
> in cache for a long time may block on a cluster buffer read, but
> we can do that in transaction context and block safely until the
> buffer has been allocated and read.
> 
> Once we have the cluster buffer, the inode log item takes a
> reference to it, pinning it in memory, and attaches it to the log
> item for future reference. This means we can always grab the cluster
> buffer from the inode log item when we need it.
> 
> When the inode is finally cleaned and removed from the AIL, we can
> drop the reference the inode log item holds on the cluster buffer.
> Once all inodes on the cluster buffer are clean, the cluster buffer
> will be unpinned and it will be available for memory reclaim to
> reclaim again.
> 
> This avoids the issues with needing to do RMW cycles in the AIL
> pushing context, and hence allows complete non-blocking inode
> flushing to be performed by the AIL pushing context.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Brian Foster <bfoster@redhat.com>

Looks like a nice improvement in a bunch of stalls...

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D
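
The pin has a simple lifecycle that is easy to lose in the diff; in
outline (paraphrasing the hunks quoted below, with locking and error
handling omitted):

        /* first dirtying, transaction context: blocking read is safe here */
        xfs_imap_to_bp(ip->i_mount, tp, &ip->i_imap, NULL, &bp, 0);
        xfs_buf_hold(bp);               /* reference for the log item */
        xfs_trans_brelse(tp, bp);       /* drop only the transaction ref */
        iip->ili_item.li_buf = bp;

        /* inode cleaned (xfs_iflush_done) or aborted (xfs_iflush_abort) */
        iip->ili_item.li_buf = NULL;
        xfs_buf_rele(bp);               /* buffer reclaimable again once no
                                         * dirty inode holds a reference */

Memory pressure can no longer force a read-modify-write of the cluster
buffer in the AIL or reclaim paths, because a dirty inode guarantees its
buffer is already resident.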

> ---
>  fs/xfs/libxfs/xfs_inode_buf.c   |  3 +-
>  fs/xfs/libxfs/xfs_trans_inode.c | 53 ++++++++++++++++++++++++----
>  fs/xfs/xfs_buf_item.c           |  4 +--
>  fs/xfs/xfs_inode_item.c         | 61 ++++++++++++++++++++++++++-------
>  fs/xfs/xfs_trans_ail.c          |  8 +++--
>  5 files changed, 105 insertions(+), 24 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> index 6f84ea85fdd8..1af97235785c 100644
> --- a/fs/xfs/libxfs/xfs_inode_buf.c
> +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> @@ -176,7 +176,8 @@ xfs_imap_to_bp(
>  	}
>  
>  	*bpp = bp;
> -	*dipp = xfs_buf_offset(bp, imap->im_boffset);
> +	if (dipp)
> +		*dipp = xfs_buf_offset(bp, imap->im_boffset);
>  	return 0;
>  }
>  
> diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
> index c66d9d1dd58b..ad5974365c58 100644
> --- a/fs/xfs/libxfs/xfs_trans_inode.c
> +++ b/fs/xfs/libxfs/xfs_trans_inode.c
> @@ -8,6 +8,8 @@
>  #include "xfs_shared.h"
>  #include "xfs_format.h"
>  #include "xfs_log_format.h"
> +#include "xfs_trans_resv.h"
> +#include "xfs_mount.h"
>  #include "xfs_inode.h"
>  #include "xfs_trans.h"
>  #include "xfs_trans_priv.h"
> @@ -72,13 +74,19 @@ xfs_trans_ichgtime(
>  }
>  
>  /*
> - * This is called to mark the fields indicated in fieldmask as needing
> - * to be logged when the transaction is committed.  The inode must
> - * already be associated with the given transaction.
> + * This is called to mark the fields indicated in fieldmask as needing to be
> + * logged when the transaction is committed.  The inode must already be
> + * associated with the given transaction.
>   *
> - * The values for fieldmask are defined in xfs_inode_item.h.  We always
> - * log all of the core inode if any of it has changed, and we always log
> - * all of the inline data/extents/b-tree root if any of them has changed.
> + * The values for fieldmask are defined in xfs_inode_item.h.  We always log all
> + * of the core inode if any of it has changed, and we always log all of the
> + * inline data/extents/b-tree root if any of them has changed.
> + *
> + * Grab and pin the cluster buffer associated with this inode to avoid RMW
> + * cycles at inode writeback time. Avoid the need to add error handling to every
> + * xfs_trans_log_inode() call by shutting down on read error.  This will cause
> + * transactions to fail and everything to error out, just like if we return a
> + * read error in a dirty transaction and cancel it.
>   */
>  void
>  xfs_trans_log_inode(
> @@ -131,6 +139,39 @@ xfs_trans_log_inode(
>  	spin_lock(&iip->ili_lock);
>  	iip->ili_fsync_fields |= flags;
>  
> +	if (!iip->ili_item.li_buf) {
> +		struct xfs_buf	*bp;
> +		int		error;
> +
> +		/*
> +		 * We hold the ILOCK here, so this inode is not going to be
> +		 * flushed while we are here. Further, because there is no
> +		 * buffer attached to the item, we know that there is no IO in
> +		 * progress, so nothing will clear the ili_fields while we read
> +		 * in the buffer. Hence we can safely drop the spin lock and
> +		 * read the buffer knowing that the state will not change from
> +		 * here.
> +		 */
> +		spin_unlock(&iip->ili_lock);
> +		error = xfs_imap_to_bp(ip->i_mount, tp, &ip->i_imap, NULL,
> +					&bp, 0);
> +		if (error) {
> +			xfs_force_shutdown(ip->i_mount, SHUTDOWN_META_IO_ERROR);
> +			return;
> +		}
> +
> +		/*
> +		 * We need an explicit buffer reference for the log item but
> +		 * don't want the buffer to remain attached to the transaction.
> +		 * Hold the buffer but release the transaction reference.
> +		 */
> +		xfs_buf_hold(bp);
> +		xfs_trans_brelse(tp, bp);
> +
> +		spin_lock(&iip->ili_lock);
> +		iip->ili_item.li_buf = bp;
> +	}
> +
>  	/*
>  	 * Always OR in the bits from the ili_last_fields field.  This is to
>  	 * coordinate with the xfs_iflush() and xfs_iflush_done() routines in
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index d61f20b989cd..ecb3362395af 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
> @@ -1143,11 +1143,9 @@ xfs_buf_inode_iodone(
>  		if (ret == XBF_IOERROR_DONE)
>  			return;
>  		ASSERT(ret == XBF_IOERROR_FAIL);
> -		spin_lock(&bp->b_mount->m_ail->ail_lock);
>  		list_for_each_entry(lip, &bp->b_li_list, li_bio_list) {
> -			xfs_set_li_failed(lip, bp);
> +			set_bit(XFS_LI_FAILED, &lip->li_flags);
>  		}
> -		spin_unlock(&bp->b_mount->m_ail->ail_lock);
>  		xfs_buf_ioerror(bp, 0);
>  		xfs_buf_relse(bp);
>  		return;
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index 0ba75764a8dc..64bdda72f7b2 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -439,6 +439,7 @@ xfs_inode_item_pin(
>  	struct xfs_inode	*ip = INODE_ITEM(lip)->ili_inode;
>  
>  	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
> +	ASSERT(lip->li_buf);
>  
>  	trace_xfs_inode_pin(ip, _RET_IP_);
>  	atomic_inc(&ip->i_pincount);
> @@ -450,6 +451,12 @@ xfs_inode_item_pin(
>   * item which was previously pinned with a call to xfs_inode_item_pin().
>   *
>   * Also wake up anyone in xfs_iunpin_wait() if the count goes to 0.
> + *
> + * Note that unpin can race with inode cluster buffer freeing marking the buffer
> + * stale. In that case, flush completions are run from the buffer unpin call,
> + * which may happen before the inode is unpinned. If we lose the race, there
> + * will be no buffer attached to the log item, but the inode will be marked
> + * XFS_ISTALE.
>   */
>  STATIC void
>  xfs_inode_item_unpin(
> @@ -459,6 +466,7 @@ xfs_inode_item_unpin(
>  	struct xfs_inode	*ip = INODE_ITEM(lip)->ili_inode;
>  
>  	trace_xfs_inode_unpin(ip, _RET_IP_);
> +	ASSERT(lip->li_buf || xfs_iflags_test(ip, XFS_ISTALE));
>  	ASSERT(atomic_read(&ip->i_pincount) > 0);
>  	if (atomic_dec_and_test(&ip->i_pincount))
>  		wake_up_bit(&ip->i_flags, __XFS_IPINNED_BIT);
> @@ -629,10 +637,15 @@ xfs_inode_item_init(
>   */
>  void
>  xfs_inode_item_destroy(
> -	xfs_inode_t	*ip)
> +	struct xfs_inode	*ip)
>  {
> -	kmem_free(ip->i_itemp->ili_item.li_lv_shadow);
> -	kmem_cache_free(xfs_ili_zone, ip->i_itemp);
> +	struct xfs_inode_log_item *iip = ip->i_itemp;
> +
> +	ASSERT(iip->ili_item.li_buf == NULL);
> +
> +	ip->i_itemp = NULL;
> +	kmem_free(iip->ili_item.li_lv_shadow);
> +	kmem_cache_free(xfs_ili_zone, iip);
>  }
>  
>  
> @@ -673,11 +686,10 @@ xfs_iflush_done(
>  		list_move_tail(&lip->li_bio_list, &tmp);
>  
>  		/* Do an unlocked check for needing the AIL lock. */
> -		if (lip->li_lsn == iip->ili_flush_lsn ||
> +		if (iip->ili_flush_lsn == lip->li_lsn ||
>  		    test_bit(XFS_LI_FAILED, &lip->li_flags))
>  			need_ail++;
>  	}
> -	ASSERT(list_empty(&bp->b_li_list));
>  
>  	/*
>  	 * We only want to pull the item from the AIL if it is actually there
> @@ -690,7 +702,7 @@ xfs_iflush_done(
>  		/* this is an opencoded batch version of xfs_trans_ail_delete */
>  		spin_lock(&ailp->ail_lock);
>  		list_for_each_entry(lip, &tmp, li_bio_list) {
> -			xfs_clear_li_failed(lip);
> +			clear_bit(XFS_LI_FAILED, &lip->li_flags);
>  			if (lip->li_lsn == INODE_ITEM(lip)->ili_flush_lsn) {
>  				xfs_lsn_t lsn = xfs_ail_delete_one(ailp, lip);
>  				if (!tail_lsn && lsn)
> @@ -706,14 +718,29 @@ xfs_iflush_done(
>  	 * them is safely on disk.
>  	 */
>  	list_for_each_entry_safe(lip, n, &tmp, li_bio_list) {
> +		bool	drop_buffer = false;
> +
>  		list_del_init(&lip->li_bio_list);
>  		iip = INODE_ITEM(lip);
>  
>  		spin_lock(&iip->ili_lock);
> +
> +		/*
> +		 * Remove the reference to the cluster buffer if the inode is
> +		 * clean in memory. Drop the buffer reference once we've dropped
> +		 * the locks we hold.
> +		 */
> +		ASSERT(iip->ili_item.li_buf == bp);
> +		if (!iip->ili_fields) {
> +			iip->ili_item.li_buf = NULL;
> +			drop_buffer = true;
> +		}
>  		iip->ili_last_fields = 0;
> +		iip->ili_flush_lsn = 0;
>  		spin_unlock(&iip->ili_lock);
> -
>  		xfs_ifunlock(iip->ili_inode);
> +		if (drop_buffer)
> +			xfs_buf_rele(bp);
>  	}
>  }
>  
> @@ -725,12 +752,20 @@ xfs_iflush_done(
>   */
>  void
>  xfs_iflush_abort(
> -	struct xfs_inode		*ip)
> +	struct xfs_inode	*ip)
>  {
> -	struct xfs_inode_log_item	*iip = ip->i_itemp;
> +	struct xfs_inode_log_item *iip = ip->i_itemp;
> +	struct xfs_buf		*bp = NULL;
>  
>  	if (iip) {
> +		/*
> +		 * Clear the failed bit before removing the item from the AIL so
> +		 * xfs_trans_ail_delete() doesn't try to clear and release the
> +		 * buffer attached to the log item before we are done with it.
> +		 */
> +		clear_bit(XFS_LI_FAILED, &iip->ili_item.li_flags);
>  		xfs_trans_ail_delete(&iip->ili_item, 0);
> +
>  		/*
>  		 * Clear the inode logging fields so no more flushes are
>  		 * attempted.
> @@ -739,12 +774,14 @@ xfs_iflush_abort(
>  		iip->ili_last_fields = 0;
>  		iip->ili_fields = 0;
>  		iip->ili_fsync_fields = 0;
> +		iip->ili_flush_lsn = 0;
> +		bp = iip->ili_item.li_buf;
> +		iip->ili_item.li_buf = NULL;
>  		spin_unlock(&iip->ili_lock);
>  	}
> -	/*
> -	 * Release the inode's flush lock since we're done with it.
> -	 */
>  	xfs_ifunlock(ip);
> +	if (bp)
> +		xfs_buf_rele(bp);
>  }
>  
>  /*
> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> index ac33f6393f99..c3be6e440134 100644
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -377,8 +377,12 @@ xfsaild_resubmit_item(
>  	}
>  
>  	/* protected by ail_lock */
> -	list_for_each_entry(lip, &bp->b_li_list, li_bio_list)
> -		xfs_clear_li_failed(lip);
> +	list_for_each_entry(lip, &bp->b_li_list, li_bio_list) {
> +		if (bp->b_flags & _XBF_INODES)
> +			clear_bit(XFS_LI_FAILED, &lip->li_flags);
> +		else
> +			xfs_clear_li_failed(lip);
> +	}
>  
>  	xfs_buf_unlock(bp);
>  	return XFS_ITEM_SUCCESS;
> -- 
> 2.26.2.761.g0e0b3e54be
> 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous
  2020-06-22  8:15 [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
                   ` (29 preceding siblings ...)
  2020-06-22  8:16 ` [PATCH 30/30] xfs: remove xfs_inobp_check() Dave Chinner
@ 2020-06-29 23:01 ` Darrick J. Wong
  2020-06-30 16:52   ` Darrick J. Wong
  30 siblings, 1 reply; 43+ messages in thread
From: Darrick J. Wong @ 2020-06-29 23:01 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Mon, Jun 22, 2020 at 06:15:35PM +1000, Dave Chinner wrote:
> Hi folks,
> 
> Inode flushing requires that we first lock an inode, then check it,
> then lock the underlying buffer, flush the inode to the buffer and
> finally add the inode to the buffer to be unlocked on IO completion.
> We then walk all the other cached inodes in the buffer range and
> optimistically lock and flush them to the buffer without blocking.

Well, I've been banging my head against this patchset for the past
couple of weeks now, and I still can't get it to finish fstests
reliably.

Last week, Dave and I were stymied by a bug in the scheduler that was
fixed in -rc3, but even with that applied I still see weird failures.  I
/think/ there are only two now:

1) If I run xfs/305 (with all three quotas enabled) in a tight loop (and
rmmod xfs after each run), after 20-30 minutes I start to see the
slub cache complaining about leftovers in the xfs_ili (inode log
item) and xfs_inode caches.

Unfortunately, due to the kernel's new security posture of never
allowing kernel pointer values to be logged, the slub complaints are
mostly useless because it no longer prints anything that would enable me
to figure out /which/ inodes are being left behind:

 =============================================================================
 BUG xfs_ili (Tainted: G    B            ): Objects remaining in xfs_ili on __kmem_cache_shutdown()
 -----------------------------------------------------------------------------
 
 INFO: Slab 0x000000007e8837cf objects=31 used=9 fp=0x000000007017e948 flags=0x000000000010200
 CPU: 1 PID: 80614 Comm: rmmod Tainted: G    B             5.8.0-rc3-djw #rc3
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1 04/01/2014
 Call Trace:
  dump_stack+0x78/0xa0
  slab_err+0xb7/0xdc
  ? trace_hardirqs_on+0x1c/0xf0
  __kmem_cache_shutdown.cold+0x3a/0x163
  ? __mutex_unlock_slowpath+0x45/0x2a0
  kmem_cache_destroy+0x55/0x110
  xfs_destroy_zones+0x6a/0xe2 [xfs]
  exit_xfs_fs+0x5f/0xb7b [xfs]
  __x64_sys_delete_module+0x120/0x210
  ? __prepare_exit_to_usermode+0xe4/0x170
  do_syscall_64+0x56/0xa0
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
 RIP: 0033:0x7ff204672a3b
 Code: Bad RIP value.
 RSP: 002b:00007ffe60155378 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
 RAX: ffffffffffffffda RBX: 0000558f2bfa2780 RCX: 00007ff204672a3b
 RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000558f2bfa27e8
 RBP: 00007ffe601553d8 R08: 0000000000000000 R09: 0000000000000000
 R10: 00007ff2046eeac0 R11: 0000000000000206 R12: 00007ffe601555b0
 R13: 00007ffe6015703d R14: 0000558f2bfa12a0 R15: 0000558f2bfa2780
 INFO: Object 0x00000000a92e3c34 @offset=0
 INFO: Object 0x00000000650eb3bf @offset=792
 INFO: Object 0x00000000eabfef0f @offset=1320
 INFO: Object 0x00000000cdaae406 @offset=4224
 INFO: Object 0x000000007d9bbde1 @offset=4488
 INFO: Object 0x00000000e35f4716 @offset=5016
 INFO: Object 0x0000000008e636d2 @offset=5280
 INFO: Object 0x00000000170762ee @offset=5808
 INFO: Object 0x0000000046425f04 @offset=7920

Note all the 64-bit values that have the upper 32 bits set to 0; this
is the pointer hashing safety algorithm at work.  I've patched around
that bit of training-wheels drain bamage, but now I get to wait until it
happens again.
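
For anyone hitting the same wall: %p hashes pointer values before
printing, while the debug-only %px format prints the raw address. A
local debugging patch along these lines makes the report usable again
(the addresses shown are made up):

        pr_info("object at %p\n", obj);         /* hashed: 00000000a92e3c34 */
        pr_info("object at %px\n", obj);        /* raw:    ffff8881234ab000 */

%px intentionally exposes kernel addresses, so it should never ship
outside a debugging tree.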

2) If /that/ doesn't happen, a regular fstests run (again with all three
quotas enabled) will (usually very quickly) wedge in unmount:

[<0>] xfs_qm_dquot_walk+0x19c/0x2b0 [xfs]
[<0>] xfs_qm_dqpurge_all+0x31/0x70 [xfs]
[<0>] xfs_qm_unmount+0x1d/0x30 [xfs]
[<0>] xfs_unmountfs+0xa0/0x1a0 [xfs]
[<0>] xfs_fs_put_super+0x35/0x80 [xfs]
[<0>] generic_shutdown_super+0x67/0x100
[<0>] kill_block_super+0x21/0x50
[<0>] deactivate_locked_super+0x31/0x70
[<0>] cleanup_mnt+0x100/0x160
[<0>] task_work_run+0x5f/0xa0
[<0>] __prepare_exit_to_usermode+0x13d/0x170
[<0>] do_syscall_64+0x62/0xa0
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

AFAICT it's usually the root dquot, and dqpurge won't touch it because
the dquot's nrefs > 0.  Poking around in gdb, I find that whichever
xfs_mount is stalled does not seem to have any vfs inodes attached to
it, so it's clear that we flushed and freed all the incore inode state,
which means that all the dquots should be unreferenced.

Both of these failure cases have been difficult to reproduce, which is
to say that I can't get them to repro reliably.  Turning PREEMPT on
seems to make it reproduce faster, which makes me wonder if something in
this patchset is screwing up concurrency handling?  KASAN
and kmemleak have nothing to say.  I've also noticed that the less
heavily loaded the underlying VM host's storage system, the less likely
it is to happen, though that could be a coincidence.

Anyway, if I figure something out I'll holler, but I thought it was past
time to braindump on the mailing list.

--D

> This cluster write effectively repeats the same code we do with the
> initial inode, except now it has to special case that initial inode
> that is already locked. Hence we have multiple copies of very
> similar code, and it is a result of inode cluster flushing being
> based on a specific inode rather than grabbing the buffer and
> flushing all available inodes to it.
> 
> The problem with this at the moment is that we we can't look up the
> buffer until we have guaranteed that an inode is held exclusively
> and it's not going away while we get the buffer through an imap
> lookup. Hence we are kinda stuck locking an inode before we can look
> up the buffer.
> 
> This is also a result of inodes being detached from the cluster
> buffer except when IO is being done. This has the further problem
> that the cluster buffer can be reclaimed from memory and then the
> inode can be dirtied. At this point cleaning the inode requires a
> read-modify-write cycle on the cluster buffer. If we then are put
> under memory pressure, cleaning that dirty inode to reclaim it
> requires allocating memory for the cluster buffer and this leads to
> all sorts of problems.
> 
> We used synchronous inode writeback in reclaim as a throttle that
> provided a forwards progress mechanism when RMW cycles were required
> to clean inodes. Async writeback of inodes (e.g. via the AIL) would
> immediately exhaust remaining memory reserves trying to allocate
> inode cluster after inode cluster. The synchronous writeback of an
> inode cluster allowed reclaim to release the inode cluster and have
> it freed almost immediately which could then be used to allocate the
> next inode cluster buffer. Hence the IO based throttling mechanism
> largely guaranteed forwards progress in inode reclaim. By removing
> the requirement for require memory allocation for inode writeback
> filesystem level, we can issue writeback asynchrnously and not have
> to worry about the memory exhaustion anymore.
> 
> Another issue is that if we have slow disks, we can build up dirty
> inodes in memory that can then take hours for an operation like
> unmount to flush. A RMW cycle per inode on a slow RAID6 device can
> mean we only clean 50 inodes a second, and when there are hundreds
> of thousands of dirty inodes that need to be cleaned this can take a
> long time. PInning the cluster buffers will greatly speed up inode
> writeback on slow storage systems like this.
> 
> These limitations all stem from the same source: inode writeback is
> inode centric, And they are largely solved by the same architectural
> change: make inode writeback cluster buffer centric.  This series is
> makes that architectural change.
> 
> Firstly, we start by pinning the inode backing buffer in memory
> when an inode is marked dirty (i.e. when it is logged). By tracking
> the number of dirty inodes on a buffer as a counter rather than a
> flag, we avoid the problem of overlapping inode dirtying and buffer
> flushing racing to set/clear the dirty flag. Hence as long as there
> is a dirty inode in memory, the buffer will not be able to be
> reclaimed. We can safely do this inode cluster buffer lookup when we
> dirty an inode as we do not hold the buffer locked - we merely take
> a reference to it and then release it - and hence we don't cause any
> new lock order issues.
> 
> When the inode is finally cleaned, the reference to the buffer can
> be removed from the inode log item and the buffer released. This is
> done from the inode completion callbacks that are attached to the
> buffer when the inode is flushed.
> 
> Pinning the cluster buffer in this way immediately avoids the RMW
> problem in inode writeback and reclaim contexts by moving the memory
> allocation and the blocking buffer read into the transaction context
> that dirties the inode.  This inverts our dirty inode throttling
> mechanism - we now throttle the rate at which we can dirty inodes to
> the rate at which we can allocate memory and read inode cluster buffers
> into memory, rather than throttling reclaim to the rate at which we
> can clean dirty inodes.
> 
> Hence if we are under memory pressure, we'll block on memory
> allocation when trying to dirty the referenced inode, rather than in
> the memory reclaim path where we are trying to clean unreferenced
> inodes to free memory.  Hence we no longer have to guarantee
> forwards progress in inode reclaim as we aren't doing memory
> allocation, and that means we can remove inode writeback from the
> XFS inode shrinker completely without changing the system tolerance
> for low memory operation.
> 
> Tracking the buffers via the inode log item also allows us to
> completely rework the inode flushing mechanism. While the inode log
> item is in the AIL, it is safe for the AIL to access any member of
> the log item. Hence the AIL push mechanisms can access the buffer
> attached to the inode without first having to lock the inode.
> 
> This means we can essentially lock the buffer directly and then
> call xfs_iflush_cluster() without first going through xfs_iflush()
> to find the buffer. Hence we can remove xfs_iflush() altogether,
> because the two places that call it - the inode item push code and
> inode reclaim - no longer need to flush inodes directly.
> 
> This can be further optimised by attaching the inode to the cluster
> buffer when the inode is dirtied. i.e. when we add the buffer
> reference to the inode log item, we also attach the inode to the
> buffer for IO processing. This leads to the dirty inodes always
> being attached to the buffer and hence we no longer need to add them
> when we flush the inode and remove them when IO completes. Instead
> the inodes are attached when the inode log item is dirtied, and
> removed when the inode log item is cleaned.
> 
> With this structure in place, we no longer need to do
> lookups to find the dirty inodes in the cache to attach to the
> buffer in xfs_iflush_cluster() - they are already attached to the
> buffer. Hence when the AIL pushes an inode, we just grab the buffer
> from the log item, and then walk the buffer log item list to lock
> and flush the dirty inodes attached to the buffer.
> 
> This greatly simplifies inode writeback, and removes another memory
> allocation from the inode writeback path (the array used for the
> radix tree gang lookup). And while the radix tree lookups are fast,
> walking the linked list of dirty inodes is faster.
> 
> There is followup work I am doing that uses the inode cluster buffer
> as a replacement in the AIL for tracking dirty inodes. This part of
> the series is not ready yet as it has some intricate locking
> requirements. That is an optimisation, so I've left that out because
> solving the inode reclaim blocking problems is the important part of
> this work.
> 
> In short, this series simplifies inode writeback and fixes the long
> standing inode reclaim blocking issues without requiring any changes
> to the memory reclaim infrastructure.
> 
> Note: dquots should probably be converted to cluster flushing in a
> similar way, as they have many of the same issues as inode flushing.
> 
> Thoughts, comments and improvements welcome.
> 
> -Dave.
> 
> Version 4:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git xfs-async-inode-reclaim-4
> 
> - rebase on 5.8-rc2 + for-next
> - fix buffer retry logic braino (p13)
> - removed unnecessary asserts (p24)
> - removed unnecessary delwri queue checks from
>   xfs_inode_item_push (p24)
> - rework return value from xfs_iflush_cluster to indicate -EAGAIN if
>   no inodes were flushed and handle that case in the caller. (p28)
> - rewrite comment about shutdown case in xfs_iflush_cluster (p28)
> - always clear XFS_LI_FAILED for items requiring AIL processing
>   (p29)
> 
> 
> Version 3
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git xfs-async-inode-reclaim-3
> 
> - rebase on 5.7 + for-next
> - update comments (p3)
> - update commit message (p4)
> - renamed xfs_buf_ioerror_sync() (p13)
> - added enum for return value from xfs_buf_iodone_error() (p13)
> - moved clearing of buffer error to iodone functions (p13)
> - whitespace (p13)
> - rebase p14 (p13 conflicts)
> - rebase p16 (p13 conflicts)
> - removed a superfluous assert (p16)
> - moved comment and check in xfs_iflush_done() from p16 to p25
> - rebase p25 (p16 conflicts)
> 
> 
> 
> Version 2
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git xfs-async-inode-reclaim-2
> 
> - describe ili_lock better (p2)
> - clean up inode logging code some more (p2)
> - move "early read completion" for xfs_buf_ioend() up into p3 from
>   p4.
> - fixed conflicts in p4 due to p3 changes.
> - fixed conflicts in p5 due to p4 changes.
> - s/_XBF_LOGRCVY/_XBF_LOG_RECOVERY/ (p5)
> - renamed the buf log item iodone callback to xfs_buf_item_iodone and
>   reused the xfs_buf_iodone() name for the catch-all buffer write
>   iodone completion. (p6)
> - history update for commit message (p7)
> - subject update for p8
> - rework loop in xfs_dquot_done() (p9)
> - Fixed conflicts in p10 due to p6 changes
> - got rid of entire comments around li_cb (p11)
> - new patch to rework buffer io error callbacks
> - new patch to unwind ->iop_error calls and remove ->iop_error
> - new patch to lift xfs_clear_li_failed() out of
>   xfs_ail_delete_one()
> - rebased p12 on all the prior changes
> - reworked LI_FAILED handling when pinning inodes to the cluster
>   buffer (p12) 
> - fixed comment about holding buffer references in
>   xfs_trans_log_inode() (p12)
> - fixed indenting of xfs_iflush_abort() (p12)
> - added comments explaining "skipped" inode reclaim return value
>   (p14)
> - cleaned up error return stack in xfs_reclaim_inode() (p14)
> - cleaned up skipped return in xfs_reclaim_inodes() (p14)
> - fixed bug where skipped wasn't incremented if reclaim cursor was
>   not zero. This could leave inodes between the start of the AG and
>   the cursor unreclaimed (p15)
> - reinstate the patch removing SYNC_WAIT from xfs_reclaim_inodes().
>   Exposed "skipped" bug in p15.
> - cleaned up inode reclaim comments (p18)
> - split p19 into two - one to change xfs_ifree_cluster(), one
>   for the buffer pinning.
> - xfs_ifree_mark_inode_stale() now takes the cluster buffer and we
>   get the perag from that rather than having to do a lookup in
>   xfs_ifree_cluster().
> - moved extra IO reference for xfs_iflush_cluster() from AIL pushing
>   to initial xfs_iflush_cluster rework (p22 -> p20)
> - fixed static declaration on xfs_iflush() (p22)
> - fixed incorrect EIO return from xfs_iflush_cluster()
> - rebase p23 because it all rejects now.
> - fix INODE_ITEM() usage in p23
> - removed long lines from commit message in p24
> - new patch to fix logging of XFS_ISTALE inodes which pushes dirty
>   inodes through reclaim.
> 
> 
> 
> Dave Chinner (30):
>   xfs: Don't allow logging of XFS_ISTALE inodes
>   xfs: remove logged flag from inode log item
>   xfs: add an inode item lock
>   xfs: mark inode buffers in cache
>   xfs: mark dquot buffers in cache
>   xfs: mark log recovery buffers for completion
>   xfs: call xfs_buf_iodone directly
>   xfs: clean up whacky buffer log item list reinit
>   xfs: make inode IO completion buffer centric
>   xfs: use direct calls for dquot IO completion
>   xfs: clean up the buffer iodone callback functions
>   xfs: get rid of log item callbacks
>   xfs: handle buffer log item IO errors directly
>   xfs: unwind log item error flagging
>   xfs: move xfs_clear_li_failed out of xfs_ail_delete_one()
>   xfs: pin inode backing buffer to the inode log item
>   xfs: make inode reclaim almost non-blocking
>   xfs: remove IO submission from xfs_reclaim_inode()
>   xfs: allow multiple reclaimers per AG
>   xfs: don't block inode reclaim on the ILOCK
>   xfs: remove SYNC_TRYLOCK from inode reclaim
>   xfs: remove SYNC_WAIT from xfs_reclaim_inodes()
>   xfs: clean up inode reclaim comments
>   xfs: rework stale inodes in xfs_ifree_cluster
>   xfs: attach inodes to the cluster buffer when dirtied
>   xfs: xfs_iflush() is no longer necessary
>   xfs: rename xfs_iflush_int()
>   xfs: rework xfs_iflush_cluster() dirty inode iteration
>   xfs: factor xfs_iflush_done
>   xfs: remove xfs_inobp_check()
> 
>  fs/xfs/libxfs/xfs_inode_buf.c   |  27 +-
>  fs/xfs/libxfs/xfs_inode_buf.h   |   6 -
>  fs/xfs/libxfs/xfs_trans_inode.c | 110 +++++--
>  fs/xfs/xfs_buf.c                |  40 ++-
>  fs/xfs/xfs_buf.h                |  48 ++-
>  fs/xfs/xfs_buf_item.c           | 419 +++++++++++------------
>  fs/xfs/xfs_buf_item.h           |   8 +-
>  fs/xfs/xfs_buf_item_recover.c   |   5 +-
>  fs/xfs/xfs_dquot.c              |  29 +-
>  fs/xfs/xfs_dquot.h              |   1 +
>  fs/xfs/xfs_dquot_item.c         |  18 -
>  fs/xfs/xfs_dquot_item_recover.c |   2 +-
>  fs/xfs/xfs_file.c               |   9 +-
>  fs/xfs/xfs_icache.c             | 333 ++++++-------------
>  fs/xfs/xfs_icache.h             |   2 +-
>  fs/xfs/xfs_inode.c              | 567 ++++++++++++--------------------
>  fs/xfs/xfs_inode.h              |   2 +-
>  fs/xfs/xfs_inode_item.c         | 303 +++++++++--------
>  fs/xfs/xfs_inode_item.h         |  24 +-
>  fs/xfs/xfs_inode_item_recover.c |   2 +-
>  fs/xfs/xfs_log_recover.c        |   5 +-
>  fs/xfs/xfs_mount.c              |  15 +-
>  fs/xfs/xfs_mount.h              |   1 -
>  fs/xfs/xfs_super.c              |   3 -
>  fs/xfs/xfs_trans.h              |   5 -
>  fs/xfs/xfs_trans_ail.c          |  10 +-
>  fs/xfs/xfs_trans_buf.c          |  15 +-
>  27 files changed, 889 insertions(+), 1120 deletions(-)
> 
> -- 
> 2.26.2.761.g0e0b3e54be
> 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous
  2020-06-29 23:01 ` [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous Darrick J. Wong
@ 2020-06-30 16:52   ` Darrick J. Wong
  2020-06-30 21:51     ` Dave Chinner
  0 siblings, 1 reply; 43+ messages in thread
From: Darrick J. Wong @ 2020-06-30 16:52 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Mon, Jun 29, 2020 at 04:01:30PM -0700, Darrick J. Wong wrote:
> On Mon, Jun 22, 2020 at 06:15:35PM +1000, Dave Chinner wrote:
> > Hi folks,
> > 
> > Inode flushing requires that we first lock an inode, then check it,
> > then lock the underlying buffer, flush the inode to the buffer and
> > finally add the inode to the buffer to be unlocked on IO completion.
> > We then walk all the other cached inodes in the buffer range and
> > optimistically lock and flush them to the buffer without blocking.
> 
> Well, I've been banging my head against this patchset for the past
> couple of weeks now, and I still can't get it to finish fstests
> reliably.
> 
> Last week, Dave and I were stymied by a bug in the scheduler that was
> fixed in -rc3, but even with that applied I still see weird failures.  I
> /think/ there are only two now:
> 
> 1) If I run xfs/305 (with all three quotas enabled) in a tight loop (and
> rmmod xfs after each run), after 20-30 minutes I will start to see the
> slub cache start complaining about leftovers in the xfs_ili (inode log
> item) and xfs_inode caches.
> 
> Unfortunately, due to the kernel's new security posture of never
> allowing kernel pointer values to be logged, the slub complaints are
> mostly useless because it no longer prints anything that would enable me
> to figure out /which/ inodes are being left behind:
> 
>  =============================================================================
>  BUG xfs_ili (Tainted: G    B            ): Objects remaining in xfs_ili on __kmem_cache_shutdown()
>  -----------------------------------------------------------------------------
>  
>  INFO: Slab 0x000000007e8837cf objects=31 used=9 fp=0x000000007017e948 flags=0x000000000010200
>  CPU: 1 PID: 80614 Comm: rmmod Tainted: G    B             5.8.0-rc3-djw #rc3
>  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1 04/01/2014
>  Call Trace:
>   dump_stack+0x78/0xa0
>   slab_err+0xb7/0xdc
>   ? trace_hardirqs_on+0x1c/0xf0
>   __kmem_cache_shutdown.cold+0x3a/0x163
>   ? __mutex_unlock_slowpath+0x45/0x2a0
>   kmem_cache_destroy+0x55/0x110
>   xfs_destroy_zones+0x6a/0xe2 [xfs]
>   exit_xfs_fs+0x5f/0xb7b [xfs]
>   __x64_sys_delete_module+0x120/0x210
>   ? __prepare_exit_to_usermode+0xe4/0x170
>   do_syscall_64+0x56/0xa0
>   entry_SYSCALL_64_after_hwframe+0x44/0xa9
>  RIP: 0033:0x7ff204672a3b
>  Code: Bad RIP value.
>  RSP: 002b:00007ffe60155378 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
>  RAX: ffffffffffffffda RBX: 0000558f2bfa2780 RCX: 00007ff204672a3b
>  RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000558f2bfa27e8
>  RBP: 00007ffe601553d8 R08: 0000000000000000 R09: 0000000000000000
>  R10: 00007ff2046eeac0 R11: 0000000000000206 R12: 00007ffe601555b0
>  R13: 00007ffe6015703d R14: 0000558f2bfa12a0 R15: 0000558f2bfa2780
>  INFO: Object 0x00000000a92e3c34 @offset=0
>  INFO: Object 0x00000000650eb3bf @offset=792
>  INFO: Object 0x00000000eabfef0f @offset=1320
>  INFO: Object 0x00000000cdaae406 @offset=4224
>  INFO: Object 0x000000007d9bbde1 @offset=4488
>  INFO: Object 0x00000000e35f4716 @offset=5016
>  INFO: Object 0x0000000008e636d2 @offset=5280
>  INFO: Object 0x00000000170762ee @offset=5808
>  INFO: Object 0x0000000046425f04 @offset=7920
> 
> Note all the 64-bit values that have the 32 upper bits set to 0; this
> is the pointer hashing safety algorithm at work.  I've patched around
> that bit of training-wheels drain bamage, but now I get to wait until it
> happens again.
> 
> 2) If /that/ doesn't happen, a regular fstests run (again with all three
> quotas enabled) will (usually very quickly) wedge in unmount:
> 
> [<0>] xfs_qm_dquot_walk+0x19c/0x2b0 [xfs]
> [<0>] xfs_qm_dqpurge_all+0x31/0x70 [xfs]
> [<0>] xfs_qm_unmount+0x1d/0x30 [xfs]
> [<0>] xfs_unmountfs+0xa0/0x1a0 [xfs]
> [<0>] xfs_fs_put_super+0x35/0x80 [xfs]
> [<0>] generic_shutdown_super+0x67/0x100
> [<0>] kill_block_super+0x21/0x50
> [<0>] deactivate_locked_super+0x31/0x70
> [<0>] cleanup_mnt+0x100/0x160
> [<0>] task_work_run+0x5f/0xa0
> [<0>] __prepare_exit_to_usermode+0x13d/0x170
> [<0>] do_syscall_64+0x62/0xa0
> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> AFAICT it's usually the root dquot, and dqpurge won't touch it because
> its nrefs > 0.  Poking around in gdb, I find that whichever
> xfs_mount is stalled does not seem to have any vfs inodes attached to
> it, so it's clear that we flushed and freed all the incore inode state,
> which means that all the dquots should be unreferenced.
> 
> Both of these failure cases have been difficult to reproduce, which is
> to say that I can't get them to repro reliably.  Turning PREEMPT on
> seems to make it reproduce faster, which makes me wonder if something in
> this patchset is screwing up concurrency handling or something?  KASAN
> and kmemleak have nothing to say.  I've also noticed that the less
> heavily loaded the underlying VM host's storage system, the less likely
> it is to happen, though that could be a coincidence.
> 
> Anyway, if I figure something out I'll holler, but I thought it was past
> time to braindump on the mailing list.

Last night, Dave and I did some live debugging of a failed VM test
system, and discovered that the xfs_reclaim_inodes() call does not
actually reclaim all the IRECLAIMABLE inodes.  Because we fail to call
xfs_reclaim_inode() on all the inodes, there are still inodes in the
incore inode xarray, and they still have dquots attached.

This would explain the symptoms I've seen -- since we didn't reclaim the
inodes, we didn't dqdetach them either, and so the dqpurge_all will spin
forever on the still-referenced dquots.  This also explains the slub
complaints about active xfs_inode/xfs_inode_log_item objects if I turn
off quotas, since we didn't clean those up either.
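
To make the spin concrete, the purge walk behaves roughly like this
(a paraphrase of the 5.8-era code from memory, not the literal
source):

	/* Sketch: why a still-referenced dquot makes unmount spin. */
	STATIC int
	xfs_qm_dqpurge(
		struct xfs_dquot	*dqp,
		void			*data)
	{
		xfs_dqlock(dqp);
		if ((dqp->dq_flags & XFS_DQ_FREEING) || dqp->q_nrefs != 0) {
			xfs_dqunlock(dqp);
			return -EAGAIN;	/* skipped - still referenced */
		}
		/* ... otherwise tear the dquot down and free it ... */
	}

xfs_qm_dquot_walk() restarts the whole scan as long as anything
returned -EAGAIN, so a dquot whose reference never goes away keeps
xfs_qm_dqpurge_all() spinning forever.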

Further analysis (aka adding tracepoints) shows xfs_reclaim_inode_grab
deciding to skip some inodes because IFLOCK is set.  Adding code to
cycle the i_flags_lock ahead of the unlocked IFLOCK test didn't make the
symptoms go away, so I instrumented the inode flush "lock" functions to
see what was going on (full version available here [1]):

          umount-10409 [001]    44.117599: console:              [   43.980314] XFS (pmem1): Unmounting Filesystem
          umount-10409 [001]    44.118314: xfs_dquot_dqdetach:   dev 259:0 ino 0x80
<snip>
   xfsaild/pmem1-10315 [002]    44.118395: xfs_iflock_nowait:    dev 259:0 ino 0x83
   xfsaild/pmem1-10315 [002]    44.118407: xfs_iflock_nowait:    dev 259:0 ino 0x80
   xfsaild/pmem1-10315 [002]    44.118416: xfs_iflock_nowait:    dev 259:0 ino 0x84
   xfsaild/pmem1-10315 [002]    44.118421: xfs_iflock_nowait:    dev 259:0 ino 0x85
   xfsaild/pmem1-10315 [002]    44.118426: xfs_iflock_nowait:    dev 259:0 ino 0x86
   xfsaild/pmem1-10315 [002]    44.118430: xfs_iflock_nowait:    dev 259:0 ino 0x87
   xfsaild/pmem1-10315 [002]    44.118435: xfs_iflock_nowait:    dev 259:0 ino 0x88
   xfsaild/pmem1-10315 [002]    44.118440: xfs_iflock_nowait:    dev 259:0 ino 0x89
   xfsaild/pmem1-10315 [002]    44.118445: xfs_iflock_nowait:    dev 259:0 ino 0x8a
   xfsaild/pmem1-10315 [002]    44.118449: xfs_iflock_nowait:    dev 259:0 ino 0x8b
   xfsaild/pmem1-10315 [002]    44.118454: xfs_iflock_nowait:    dev 259:0 ino 0x8c
   xfsaild/pmem1-10315 [002]    44.118458: xfs_iflock_nowait:    dev 259:0 ino 0x8d
   xfsaild/pmem1-10315 [002]    44.118463: xfs_iflock_nowait:    dev 259:0 ino 0x8e
   xfsaild/pmem1-10315 [002]    44.118467: xfs_iflock_nowait:    dev 259:0 ino 0x8f
   xfsaild/pmem1-10315 [002]    44.118472: xfs_iflock_nowait:    dev 259:0 ino 0x90
   xfsaild/pmem1-10315 [002]    44.118477: xfs_iflock_nowait:    dev 259:0 ino 0x91
   xfsaild/pmem1-10315 [002]    44.118481: xfs_iflock_nowait:    dev 259:0 ino 0x92
     kworker/2:1-50    [002]    44.118858: xfs_ifunlock:         dev 259:0 ino 0x83
     kworker/2:1-50    [002]    44.118862: xfs_ifunlock:         dev 259:0 ino 0x80
     kworker/2:1-50    [002]    44.118865: xfs_ifunlock:         dev 259:0 ino 0x84
     kworker/2:1-50    [002]    44.118868: xfs_ifunlock:         dev 259:0 ino 0x85
     kworker/2:1-50    [002]    44.118871: xfs_ifunlock:         dev 259:0 ino 0x86
          umount-10409 [001]    44.118872: xfs_reclaim_ag_inodes: dev 259:0 agno 0 agbno 0 len 0
     kworker/2:1-50    [002]    44.118874: xfs_ifunlock:         dev 259:0 ino 0x87
          umount-10409 [001]    44.118874: xfs_reclaim_inode_grab: dev 259:0 ino 0x80
          umount-10409 [001]    44.118875: xfs_reclaim_inode_grab: dev 259:0 ino 0x81
          umount-10409 [001]    44.118876: xfs_reclaim_inode_grab: dev 259:0 ino 0x82
          umount-10409 [001]    44.118877: xfs_reclaim_inode_grab: dev 259:0 ino 0x83
     kworker/2:1-50    [002]    44.118877: xfs_ifunlock:         dev 259:0 ino 0x88
          umount-10409 [001]    44.118877: xfs_reclaim_inode_grab: dev 259:0 ino 0x84
          umount-10409 [001]    44.118878: xfs_reclaim_inode_grab: dev 259:0 ino 0x85
          umount-10409 [001]    44.118879: xfs_reclaim_inode_grab: dev 259:0 ino 0x86
          umount-10409 [001]    44.118879: xfs_reclaim_inode_grab: dev 259:0 ino 0x87
     kworker/2:1-50    [002]    44.118880: xfs_ifunlock:         dev 259:0 ino 0x89
          umount-10409 [001]    44.118880: xfs_reclaim_inode_grab: dev 259:0 ino 0x88
          umount-10409 [001]    44.118881: xfs_reclaim_inode_grab: dev 259:0 ino 0x89
          umount-10409 [001]    44.118882: xfs_reclaim_inode_grab: dev 259:0 ino 0x8a
          umount-10409 [001]    44.118883: xfs_reclaim_inode_grab_iflock: dev 259:0 ino 0x8a
          umount-10409 [001]    44.118883: xfs_reclaim_inode_grab: dev 259:0 ino 0x8b
     kworker/2:1-50    [002]    44.118883: xfs_ifunlock:         dev 259:0 ino 0x8a

Bingo!  The xfs_ail_push_all_sync in xfs_unmountfs takes a bunch of
inode iflocks, starts the inode cluster buffer write, and since the AIL
> is now empty, returns.  The unmount process moves on to calling
> xfs_reclaim_inodes, and as you can see in the last four lines:

          umount-10409 [001]    44.118882: xfs_reclaim_inode_grab: dev 259:0 ino 0x8a

This ^^^ is logged at the start of xfs_reclaim_inode_grab.

          umount-10409 [001]    44.118883: xfs_reclaim_inode_grab_iflock: dev 259:0 ino 0x8a

This is logged when x_r_i_g observes that the IFLOCK is set and bails out.

     kworker/2:1-50    [002]    44.118883: xfs_ifunlock:         dev 259:0 ino 0x8a

And finally this is the inode cluster buffer IO completion calling
xfs_buf_inode_iodone -> xfs_iflush_done from a workqueue.

So it seems to me that inode reclaim races with the AIL for the IFLOCK,
and when unmount inode reclaim loses, it does the wrong thing.
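
For reference, the unlocked check that loses this race is the one at
the top of xfs_reclaim_inode_grab() (the same check the V2 patch later
in this thread removes):

	/*
	 * If we are asked for non-blocking operation, do unlocked checks to
	 * see if the inode already is being flushed or in reclaim to avoid
	 * lock traffic.
	 */
	if ((flags & SYNC_TRYLOCK) &&
	    __xfs_iflags_test(ip, XFS_IFLOCK | XFS_IRECLAIM))
		return 1;

When that test sees IFLOCK still set between AIL removal and IO
completion dropping it, the inode is silently skipped, and because
grab failures don't bump @skipped the caller thinks reclaim is
complete.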

Questions: Do we need to teach xfs_reclaim_inodes_ag to increment
@skipped if xfs_reclaim_inode_grab rejects an inode?  xfs_reclaim_inodes
is the only consumer of the @skipped value, and elevated skipped will
cause it to rerun the scan, so I think this will work.

Or, do we need to wait for the ail items to complete after xfsaild does
its xfs_buf_delwri_submit_nowait thing?

--D

[1] https://djwong.org/docs/tmp/barf.txt.gz

> --D
> 
> > This cluster write effectively repeats the same code we do with the
> > initial inode, except now it has to special case that initial inode
> > that is already locked. Hence we have multiple copies of very
> > similar code, and it is a result of inode cluster flushing being
> > based on a specific inode rather than grabbing the buffer and
> > flushing all available inodes to it.
> > 
> > The problem with this at the moment is that we can't look up the
> > buffer until we have guaranteed that an inode is held exclusively
> > and it's not going away while we get the buffer through an imap
> > lookup. Hence we are kinda stuck locking an inode before we can look
> > up the buffer.
> > 
> > This is also a result of inodes being detached from the cluster
> > buffer except when IO is being done. This has the further problem
> > that the cluster buffer can be reclaimed from memory and then the
> > inode can be dirtied. At this point cleaning the inode requires a
> > read-modify-write cycle on the cluster buffer. If we then are put
> > under memory pressure, cleaning that dirty inode to reclaim it
> > requires allocating memory for the cluster buffer and this leads to
> > all sorts of problems.
> > 
> > We used synchronous inode writeback in reclaim as a throttle that
> > provided a forwards progress mechanism when RMW cycles were required
> > to clean inodes. Async writeback of inodes (e.g. via the AIL) would
> > immediately exhaust remaining memory reserves trying to allocate
> > inode cluster after inode cluster. The synchronous writeback of an
> > inode cluster allowed reclaim to release the inode cluster and have
> > it freed almost immediately which could then be used to allocate the
> > next inode cluster buffer. Hence the IO based throttling mechanism
> > largely guaranteed forwards progress in inode reclaim. By removing
> > the requirement for memory allocation for inode writeback at the
> > filesystem level, we can issue writeback asynchronously and not have
> > to worry about memory exhaustion anymore.
> > 
> > Another issue is that if we have slow disks, we can build up dirty
> > inodes in memory that can then take hours for an operation like
> > unmount to flush. A RMW cycle per inode on a slow RAID6 device can
> > mean we only clean 50 inodes a second, and when there are hundreds
> > of thousands of dirty inodes that need to be cleaned this can take a
> > long time. Pinning the cluster buffers will greatly speed up inode
> > writeback on slow storage systems like this.
> > 
> > These limitations all stem from the same source: inode writeback is
> > inode centric, and they are largely solved by the same architectural
> > change: make inode writeback cluster buffer centric.  This series
> > makes that architectural change.
> > 
> > Firstly, we start by pinning the inode backing buffer in memory
> > when an inode is marked dirty (i.e. when it is logged). By tracking
> > the number of dirty inodes on a buffer as a counter rather than a
> > flag, we avoid the problem of overlapping inode dirtying and buffer
> > flushing racing to set/clear the dirty flag. Hence as long as there
> > is a dirty inode in memory, the buffer will not be able to be
> > reclaimed. We can safely do this inode cluster buffer lookup when we
> > dirty an inode as we do not hold the buffer locked - we merely take
> > a reference to it and then release it - and hence we don't cause any
> > new lock order issues.
> > 
> > When the inode is finally cleaned, the reference to the buffer can
> > be removed from the inode log item and the buffer released. This is
> > done from the inode completion callbacks that are attached to the
> > buffer when the inode is flushed.
> > 
> > Pinning the cluster buffer in this way immediately avoids the RMW
> > problem in inode writeback and reclaim contexts by moving the memory
> > allocation and the blocking buffer read into the transaction context
> > that dirties the inode.  This inverts our dirty inode throttling
> > mechanism - we now throttle the rate at which we can dirty inodes to
> > the rate at which we can allocate memory and read inode cluster
> > buffers into memory, rather than throttling reclaim to the rate at
> > which we can clean dirty inodes.
> > 
> > Hence if we are under memory pressure, we'll block on memory
> > allocation when trying to dirty the referenced inode, rather than in
> > the memory reclaim path where we are trying to clean unreferenced
> > inodes to free memory.  Hence we no longer have to guarantee
> > forwards progress in inode reclaim as we aren't doing memory
> > allocation, and that means we can remove inode writeback from the
> > XFS inode shrinker completely without changing the system tolerance
> > for low memory operation.
> > 
> > Tracking the buffers via the inode log item also allows us to
> > completely rework the inode flushing mechanism. While the inode log
> > item is in the AIL, it is safe for the AIL to access any member of
> > the log item. Hence the AIL push mechanisms can access the buffer
> > attached to the inode without first having to lock the inode.
> > 
> > This means we can essentially lock the buffer directly and then
> > call xfs_iflush_cluster() without first going through xfs_iflush()
> > to find the buffer. Hence we can remove xfs_iflush() altogether,
> > because the two places that call it - the inode item push code and
> > inode reclaim - no longer need to flush inodes directly.
> > 
> > This can be further optimised by attaching the inode to the cluster
> > buffer when the inode is dirtied. i.e. when we add the buffer
> > reference to the inode log item, we also attach the inode to the
> > buffer for IO processing. This leads to the dirty inodes always
> > being attached to the buffer and hence we no longer need to add them
> > when we flush the inode and remove them when IO completes. Instead
> > the inodes are attached when the inode log item is dirtied, and
> > removed when the inode log item is cleaned.
> > 
> > With this structure in place, we no longer need to do
> > lookups to find the dirty inodes in the cache to attach to the
> > buffer in xfs_iflush_cluster() - they are already attached to the
> > buffer. Hence when the AIL pushes an inode, we just grab the buffer
> > from the log item, and then walk the buffer log item list to lock
> > and flush the dirty inodes attached to the buffer.
> > 
> > This greatly simplifies inode writeback, and removes another memory
> > allocation from the inode writeback path (the array used for the
> > radix tree gang lookup). And while the radix tree lookups are fast,
> > walking the linked list of dirty inodes is faster.
> > 
> > There is followup work I am doing that uses the inode cluster buffer
> > as a replacement in the AIL for tracking dirty inodes. This part of
> > the series is not ready yet as it has some intricate locking
> > requirements. That is an optimisation, so I've left that out because
> > solving the inode reclaim blocking problems is the important part of
> > this work.
> > 
> > In short, this series simplifies inode writeback and fixes the long
> > standing inode reclaim blocking issues without requiring any changes
> > to the memory reclaim infrastructure.
> > 
> > Note: dquots should probably be converted to cluster flushing in a
> > similar way, as they have many of the same issues as inode flushing.
> > 
> > Thoughts, comments and improvements welcome.
> > 
> > -Dave.
> > 
> > Version 4:
> > 
> > git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git xfs-async-inode-reclaim-4
> > 
> > - rebase on 5.8-rc2 + for-next
> > - fix buffer retry logic braino (p13)
> > - removed unnecessary asserts (p24)
> > - removed unnecessary delwri queue checks from
> >   xfs_inode_item_push (p24)
> > - rework return value from xfs_iflush_cluster to indicate -EAGAIN if
> >   no inodes were flushed and handle that case in the caller. (p28)
> > - rewrite comment about shutdown case in xfs_iflush_cluster (p28)
> > - always clear XFS_LI_FAILED for items requiring AIL processing
> >   (p29)
> > 
> > 
> > Version 3
> > 
> > git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git xfs-async-inode-reclaim-3
> > 
> > - rebase on 5.7 + for-next
> > - update comments (p3)
> > - update commit message (p4)
> > - renamed xfs_buf_ioerror_sync() (p13)
> > - added enum for return value from xfs_buf_iodone_error() (p13)
> > - moved clearing of buffer error to iodone functions (p13)
> > - whitespace (p13)
> > - rebase p14 (p13 conflicts)
> > - rebase p16 (p13 conflicts)
> > - removed a superfluous assert (p16)
> > - moved comment and check in xfs_iflush_done() from p16 to p25
> > - rebase p25 (p16 conflicts)
> > 
> > 
> > 
> > Version 2
> > 
> > git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git xfs-async-inode-reclaim-2
> > 
> > - describe ili_lock better (p2)
> > - clean up inode logging code some more (p2)
> > - move "early read completion" for xfs_buf_ioend() up into p3 from
> >   p4.
> > - fixed conflicts in p4 due to p3 changes.
> > - fixed conflicts in p5 due to p4 changes.
> > - s/_XBF_LOGRCVY/_XBF_LOG_RECOVERY/ (p5)
> > - renamed the buf log item iodone callback to xfs_buf_item_iodone and
> >   reused the xfs_buf_iodone() name for the catch-all buffer write
> >   iodone completion. (p6)
> > - history update for commit message (p7)
> > - subject update for p8
> > - rework loop in xfs_dquot_done() (p9)
> > - Fixed conflicts in p10 due to p6 changes
> > - got rid of entire comments around li_cb (p11)
> > - new patch to rework buffer io error callbacks
> > - new patch to unwind ->iop_error calls and remove ->iop_error
> > - new patch to lift xfs_clear_li_failed() out of
> >   xfs_ail_delete_one()
> > - rebased p12 on all the prior changes
> > - reworked LI_FAILED handling when pinning inodes to the cluster
> >   buffer (p12) 
> > - fixed comment about holding buffer references in
> >   xfs_trans_log_inode() (p12)
> > - fixed indenting of xfs_iflush_abort() (p12)
> > - added comments explaining "skipped" inode reclaim return value
> >   (p14)
> > - cleaned up error return stack in xfs_reclaim_inode() (p14)
> > - cleaned up skipped return in xfs_reclaim_inodes() (p14)
> > - fixed bug where skipped wasn't incremented if reclaim cursor was
> >   not zero. This could leave inodes between the start of the AG and
> >   the cursor unreclaimed (p15)
> > - reinstate the patch removing SYNC_WAIT from xfs_reclaim_inodes().
> >   Exposed "skipped" bug in p15.
> > - cleaned up inode reclaim comments (p18)
> > - split p19 into two - one to change xfs_ifree_cluster(), one
> >   for the buffer pinning.
> > - xfs_ifree_mark_inode_stale() now takes the cluster buffer and we
> >   get the perag from that rather than having to do a lookup in
> >   xfs_ifree_cluster().
> > - moved extra IO reference for xfs_iflush_cluster() from AIL pushing
> >   to initial xfs_iflush_cluster rework (p22 -> p20)
> > - fixed static declaration on xfs_iflush() (p22)
> > - fixed incorrect EIO return from xfs_iflush_cluster()
> > - rebase p23 because it all rejects now.
> > - fix INODE_ITEM() usage in p23
> > - removed long lines from commit message in p24
> > - new patch to fix logging of XFS_ISTALE inodes which pushes dirty
> >   inodes through reclaim.
> > 
> > 
> > 
> > Dave Chinner (30):
> >   xfs: Don't allow logging of XFS_ISTALE inodes
> >   xfs: remove logged flag from inode log item
> >   xfs: add an inode item lock
> >   xfs: mark inode buffers in cache
> >   xfs: mark dquot buffers in cache
> >   xfs: mark log recovery buffers for completion
> >   xfs: call xfs_buf_iodone directly
> >   xfs: clean up whacky buffer log item list reinit
> >   xfs: make inode IO completion buffer centric
> >   xfs: use direct calls for dquot IO completion
> >   xfs: clean up the buffer iodone callback functions
> >   xfs: get rid of log item callbacks
> >   xfs: handle buffer log item IO errors directly
> >   xfs: unwind log item error flagging
> >   xfs: move xfs_clear_li_failed out of xfs_ail_delete_one()
> >   xfs: pin inode backing buffer to the inode log item
> >   xfs: make inode reclaim almost non-blocking
> >   xfs: remove IO submission from xfs_reclaim_inode()
> >   xfs: allow multiple reclaimers per AG
> >   xfs: don't block inode reclaim on the ILOCK
> >   xfs: remove SYNC_TRYLOCK from inode reclaim
> >   xfs: remove SYNC_WAIT from xfs_reclaim_inodes()
> >   xfs: clean up inode reclaim comments
> >   xfs: rework stale inodes in xfs_ifree_cluster
> >   xfs: attach inodes to the cluster buffer when dirtied
> >   xfs: xfs_iflush() is no longer necessary
> >   xfs: rename xfs_iflush_int()
> >   xfs: rework xfs_iflush_cluster() dirty inode iteration
> >   xfs: factor xfs_iflush_done
> >   xfs: remove xfs_inobp_check()
> > 
> >  fs/xfs/libxfs/xfs_inode_buf.c   |  27 +-
> >  fs/xfs/libxfs/xfs_inode_buf.h   |   6 -
> >  fs/xfs/libxfs/xfs_trans_inode.c | 110 +++++--
> >  fs/xfs/xfs_buf.c                |  40 ++-
> >  fs/xfs/xfs_buf.h                |  48 ++-
> >  fs/xfs/xfs_buf_item.c           | 419 +++++++++++------------
> >  fs/xfs/xfs_buf_item.h           |   8 +-
> >  fs/xfs/xfs_buf_item_recover.c   |   5 +-
> >  fs/xfs/xfs_dquot.c              |  29 +-
> >  fs/xfs/xfs_dquot.h              |   1 +
> >  fs/xfs/xfs_dquot_item.c         |  18 -
> >  fs/xfs/xfs_dquot_item_recover.c |   2 +-
> >  fs/xfs/xfs_file.c               |   9 +-
> >  fs/xfs/xfs_icache.c             | 333 ++++++-------------
> >  fs/xfs/xfs_icache.h             |   2 +-
> >  fs/xfs/xfs_inode.c              | 567 ++++++++++++--------------------
> >  fs/xfs/xfs_inode.h              |   2 +-
> >  fs/xfs/xfs_inode_item.c         | 303 +++++++++--------
> >  fs/xfs/xfs_inode_item.h         |  24 +-
> >  fs/xfs/xfs_inode_item_recover.c |   2 +-
> >  fs/xfs/xfs_log_recover.c        |   5 +-
> >  fs/xfs/xfs_mount.c              |  15 +-
> >  fs/xfs/xfs_mount.h              |   1 -
> >  fs/xfs/xfs_super.c              |   3 -
> >  fs/xfs/xfs_trans.h              |   5 -
> >  fs/xfs/xfs_trans_ail.c          |  10 +-
> >  fs/xfs/xfs_trans_buf.c          |  15 +-
> >  27 files changed, 889 insertions(+), 1120 deletions(-)
> > 
> > -- 
> > 2.26.2.761.g0e0b3e54be
> > 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous
  2020-06-30 16:52   ` Darrick J. Wong
@ 2020-06-30 21:51     ` Dave Chinner
  0 siblings, 0 replies; 43+ messages in thread
From: Dave Chinner @ 2020-06-30 21:51 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Jun 30, 2020 at 09:52:03AM -0700, Darrick J. Wong wrote:
> On Mon, Jun 29, 2020 at 04:01:30PM -0700, Darrick J. Wong wrote:
> > Both of these failure cases have been difficult to reproduce, which is
> > to say that I can't get them to repro reliably.  Turning PREEMPT on
> > seems to make it reproduce faster, which makes me wonder if something in
> > this patchset is screwing up concurrency handling or something?  KASAN
> > and kmemleak have nothing to say.  I've also noticed that the less
> > heavily loaded the underlying VM host's storage system, the less likely
> > it is to happen, though that could be a coincidence.
> > 
> > Anyway, if I figure something out I'll holler, but I thought it was past
> > time to braindump on the mailing list.
> 
> Last night, Dave and I did some live debugging of a failed VM test
> system, and discovered that the xfs_reclaim_inodes() call does not
> actually reclaim all the IRECLAIMABLE inodes.  Because we fail to call
> xfs_reclaim_inode() on all the inodes, there are still inodes in the
> incore inode xarray, and they still have dquots attached.
> 
> This would explain the symptoms I've seen -- since we didn't reclaim the
> inodes, we didn't dqdetach them either, and so the dqpurge_all will spin
> forever on the still-referenced dquots.  This also explains the slub
> complaints about active xfs_inode/xfs_inode_log_item objects if I turn
> off quotas, since we didn't clean those up either.
> 
> Further analysis (aka adding tracepoints) shows xfs_reclaim_inode_grab
> deciding to skip some inodes because IFLOCK is set.  Adding code to
> cycle the i_flags_lock ahead of the unlocked IFLOCK test didn't make the
> symptoms go away, so I instrumented the inode flush "lock" functions to
> see what was going on (full version available here [1]):

[...]

> Bingo!  The xfs_ail_push_all_sync in xfs_unmountfs takes a bunch of
> inode iflocks, starts the inode cluster buffer write, and since the AIL
> is now empty, returns.  The unmount process moves on to calling
> xfs_reclaim_inodes, which as you can see in the last four lines:
> 
>           umount-10409 [001]    44.118882: xfs_reclaim_inode_grab: dev 259:0 ino 0x8a
> 
> This ^^^ is logged at the start of xfs_reclaim_inode_grab.
> 
>           umount-10409 [001]    44.118883: xfs_reclaim_inode_grab_iflock: dev 259:0 ino 0x8a
> 
> This is logged when x_r_i_g observes that the IFLOCK is set and bails out.
> 
>      kworker/2:1-50    [002]    44.118883: xfs_ifunlock:         dev 259:0 ino 0x8a
> 
> And finally this is the inode cluster buffer IO completion calling
> xfs_buf_inode_iodone -> xfs_iflush_done from a workqueue.
> 
> So it seems to me that inode reclaim races with the AIL for the IFLOCK,
> and when unmount inode reclaim loses, it does the wrong thing.

Yeah, that's what I suspected when I finished up yesterday, but I
couldn't quite connect how the AIL wasn't waiting for the inode
completion.

The moment I looked at it again this morning, I realised that it was
simply that xfs_ail_push_all_sync() is woken when the AIL is
emptied, and that happens about 20 lines of code before the flush
lock is dropped. If the wakeup of the sleeping task is fast enough,
that task can be running before the IO completion processing has
finished.

And with a PREEMPT kernel, we might do preempts on wakeup (that was
the path to the scheduler bug we kept hitting), hence increasing the
chance that the unmount task will run before the IO completion
finishes and drops the inode flush lock.
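
i.e. the ordering in the completion path is roughly this (a sketch of
the ordering only, not the literal code):

	/* inode cluster buffer IO completion */
	xfs_iflush_done(...)
	{
		/*
		 * Remove the inode log items from the AIL. If the AIL
		 * is now empty, this wakes the task sleeping in
		 * xfs_ail_push_all_sync(), and that task can start
		 * running xfs_reclaim_inodes() immediately.
		 */
		xfs_trans_ail_delete(...);

		/* ... a couple of dozen lines of completion processing
		 * in which the woken task can still see IFLOCK set ... */

		/* Only now is the inode flush "lock" dropped. */
		xfs_ifunlock(ip);
	}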

> Questions: Do we need to teach xfs_reclaim_inodes_ag to increment
> @skipped if xfs_reclaim_inode_grab rejects an inode?  xfs_reclaim_inodes
> is the only consumer of the @skipped value, and elevated skipped will
> cause it to rerun the scan, so I think this will work.

No, we just need to get rid of the racy check in
xfs_reclaim_inode_grab(). I'm going to get rid of the whole skipped
thing, too.

> Or, do we need to wait for the ail items to complete after xfsaild does
> its xfs_buf_delwri_submit_nowait thing?

We've already waited for the -AIL items- to complete, and that's
really all we should be doing at the xfs_ail_push_all_sync layer.

The issue is that xfs_ail_push_all_sync() doesn't quite wait for IO
to complete so we've been conflating these two different operations
for a long time (essentially since we moved to logging everything
and tracking all dirty metadata in the AIL). In general, they mean
the same thing, but in this specific corner case the subtle
distinction actually matters.

It's easy enough to avoid - just get rid of what this patchset has,
independently, turned into a questionable optimisation in
xfs_reclaim_inode_grab(). i.e. we no longer block reclaim on locks,
so optimisations that avoid blocking on locks no longer buy us
anything....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH 21/30 V2] xfs: remove SYNC_TRYLOCK from inode reclaim
  2020-06-22  8:15 ` [PATCH 21/30] xfs: remove SYNC_TRYLOCK from inode reclaim Dave Chinner
@ 2020-07-01  4:48   ` Dave Chinner
  0 siblings, 0 replies; 43+ messages in thread
From: Dave Chinner @ 2020-07-01  4:48 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

All background reclaim is SYNC_TRYLOCK already, and even blocking
reclaim (SYNC_WAIT) can use trylock mechanisms as
xfs_reclaim_inodes_ag() will keep cycling until there are no more
reclaimable inodes. Hence we can kill SYNC_TRYLOCK from inode
reclaim and make everything unconditionally non-blocking.

We remove all the optimistic "avoid blocking on locks" checks done
in xfs_reclaim_inode_grab() as nothing blocks on locks anymore.
Further, checking XFS_IFLOCK optimistically can result in detecting
inodes in the process of being cleaned (i.e. between being removed
from the AIL and having the flush lock dropped), so for
xfs_reclaim_inodes() to reliably reclaim all inodes we need to drop
these checks anyway.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
V2
- drop the optimistic unlocked checks from xfs_reclaim_inode_grab()
  because they are now unnecessary and the XFS_IFLOCK check races
  with IO completion on unmount.
- update commit message to reflect changes to
  xfs_reclaim_inode_grab()

 fs/xfs/xfs_icache.c | 63 +++++++++++++++++++++--------------------------------
 1 file changed, 25 insertions(+), 38 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index f387ec21dd35..8d18117242e1 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -174,7 +174,7 @@ xfs_reclaim_worker(
 	struct xfs_mount *mp = container_of(to_delayed_work(work),
 					struct xfs_mount, m_reclaim_work);
 
-	xfs_reclaim_inodes(mp, SYNC_TRYLOCK);
+	xfs_reclaim_inodes(mp, 0);
 	xfs_reclaim_work_queue(mp);
 }
 
@@ -1028,48 +1028,37 @@ xfs_cowblocks_worker(
 
 /*
  * Grab the inode for reclaim exclusively.
- * Return 0 if we grabbed it, non-zero otherwise.
+ *
+ * We have found this inode via a lookup under RCU, so the inode may have
+ * already been freed, or it may be in the process of being recycled by
+ * xfs_iget(). In both cases, the inode will have XFS_IRECLAIM set. If the inode
+ * has been fully recycled by the time we get the i_flags_lock, XFS_IRECLAIMABLE
+ * will not be set. Hence we need to check for both these flag conditions to
+ * avoid inodes that are no longer reclaim candidates.
+ *
+ * Note: checking for other state flags here, under the i_flags_lock or not, is
+ * racy and should be avoided. Those races should be resolved only after we have
+ * ensured that we are able to reclaim this inode and the world can see that we
+ * are going to reclaim it.
+ *
+ * Return true if we grabbed it, false otherwise.
  */
-STATIC int
+static bool
 xfs_reclaim_inode_grab(
-	struct xfs_inode	*ip,
-	int			flags)
+	struct xfs_inode	*ip)
 {
 	ASSERT(rcu_read_lock_held());
 
-	/* quick check for stale RCU freed inode */
-	if (!ip->i_ino)
-		return 1;
-
-	/*
-	 * If we are asked for non-blocking operation, do unlocked checks to
-	 * see if the inode already is being flushed or in reclaim to avoid
-	 * lock traffic.
-	 */
-	if ((flags & SYNC_TRYLOCK) &&
-	    __xfs_iflags_test(ip, XFS_IFLOCK | XFS_IRECLAIM))
-		return 1;
-
-	/*
-	 * The radix tree lock here protects a thread in xfs_iget from racing
-	 * with us starting reclaim on the inode.  Once we have the
-	 * XFS_IRECLAIM flag set it will not touch us.
-	 *
-	 * Due to RCU lookup, we may find inodes that have been freed and only
-	 * have XFS_IRECLAIM set.  Indeed, we may see reallocated inodes that
-	 * aren't candidates for reclaim at all, so we must check the
-	 * XFS_IRECLAIMABLE is set first before proceeding to reclaim.
-	 */
 	spin_lock(&ip->i_flags_lock);
 	if (!__xfs_iflags_test(ip, XFS_IRECLAIMABLE) ||
 	    __xfs_iflags_test(ip, XFS_IRECLAIM)) {
 		/* not a reclaim candidate. */
 		spin_unlock(&ip->i_flags_lock);
-		return 1;
+		return false;
 	}
 	__xfs_iflags_set(ip, XFS_IRECLAIM);
 	spin_unlock(&ip->i_flags_lock);
-	return 0;
+	return true;
 }
 
 /*
@@ -1114,8 +1103,7 @@ xfs_reclaim_inode_grab(
 static bool
 xfs_reclaim_inode(
 	struct xfs_inode	*ip,
-	struct xfs_perag	*pag,
-	int			sync_mode)
+	struct xfs_perag	*pag)
 {
 	xfs_ino_t		ino = ip->i_ino; /* for radix_tree_delete */
 
@@ -1209,7 +1197,6 @@ xfs_reclaim_inode(
 static int
 xfs_reclaim_inodes_ag(
 	struct xfs_mount	*mp,
-	int			flags,
 	int			*nr_to_scan)
 {
 	struct xfs_perag	*pag;
@@ -1254,7 +1241,7 @@ xfs_reclaim_inodes_ag(
 			for (i = 0; i < nr_found; i++) {
 				struct xfs_inode *ip = batch[i];
 
-				if (done || xfs_reclaim_inode_grab(ip, flags))
+				if (done || !xfs_reclaim_inode_grab(ip))
 					batch[i] = NULL;
 
 				/*
@@ -1285,7 +1272,7 @@ xfs_reclaim_inodes_ag(
 			for (i = 0; i < nr_found; i++) {
 				if (!batch[i])
 					continue;
-				if (!xfs_reclaim_inode(batch[i], pag, flags))
+				if (!xfs_reclaim_inode(batch[i], pag))
 					skipped++;
 			}
 
@@ -1311,13 +1298,13 @@ xfs_reclaim_inodes(
 	int		nr_to_scan = INT_MAX;
 	int		skipped;
 
-	xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan);
+	xfs_reclaim_inodes_ag(mp, &nr_to_scan);
 	if (!(mode & SYNC_WAIT))
 		return 0;
 
 	do {
 		xfs_ail_push_all_sync(mp->m_ail);
-		skipped = xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan);
+		skipped = xfs_reclaim_inodes_ag(mp, &nr_to_scan);
 	} while (skipped > 0);
 
 	return 0;
@@ -1341,7 +1328,7 @@ xfs_reclaim_inodes_nr(
 	xfs_reclaim_work_queue(mp);
 	xfs_ail_push_all(mp->m_ail);
 
-	xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK, &nr_to_scan);
+	xfs_reclaim_inodes_ag(mp, &nr_to_scan);
 	return 0;
 }
 

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 22/30 V2] xfs: remove SYNC_WAIT from xfs_reclaim_inodes()
  2020-06-22  8:15 ` [PATCH 22/30] xfs: remove SYNC_WAIT from xfs_reclaim_inodes() Dave Chinner
@ 2020-07-01  4:51   ` Dave Chinner
  0 siblings, 0 replies; 43+ messages in thread
From: Dave Chinner @ 2020-07-01  4:51 UTC (permalink / raw)
  To: linux-xfs


From: Dave Chinner <dchinner@redhat.com>

Clean up xfs_reclaim_inodes() callers. Most callers want blocking
behaviour, so just make the existing SYNC_WAIT behaviour the
default.

For the xfs_reclaim_worker(), just call xfs_reclaim_inodes_ag()
directly because we just want optimistic clean inode reclaim to be
done in the background.

For xfs_quiesce_attr() we can just remove the inode reclaim calls as
they are a historic relic that was required to flush dirty inodes
that contained unlogged changes. We now log all changes to the
inodes, so the sync AIL push from xfs_log_quiesce() called by
xfs_quiesce_attr() will do all the required inode writeback for
freeze.

Seeing as we now want to loop until all reclaimable inodes have been
reclaimed, make xfs_reclaim_inodes() loop on the XFS_ICI_RECLAIM_TAG
tag rather than having xfs_reclaim_inodes_ag() tell it that inodes
were skipped. This is much more reliable and will always loop until
all reclaimable inodes are reclaimed.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
V2
- kill the "skipped inode" checking in xfs_reclaim_inodes_ag() and
  xfs_reclaim_inodes() to trigger looping until the cache is empty
  and replace it with a loop that checks if the XFS_ICI_RECLAIM_TAG is
  set on the perag radix tree. This will now always loop if there
  are still inodes to reclaim.
- update commit message to reflect new looping behaviour.

 fs/xfs/xfs_icache.c | 79 ++++++++++++++++++++---------------------------------
 fs/xfs/xfs_icache.h |  2 +-
 fs/xfs/xfs_mount.c  | 11 ++++----
 fs/xfs/xfs_super.c  |  3 --
 4 files changed, 35 insertions(+), 60 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 8d18117242e1..f4e7b98d9639 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -160,24 +160,6 @@ xfs_reclaim_work_queue(
 	rcu_read_unlock();
 }
 
-/*
- * This is a fast pass over the inode cache to try to get reclaim moving on as
- * many inodes as possible in a short period of time. It kicks itself every few
- * seconds, as well as being kicked by the inode cache shrinker when memory
- * goes low. It scans as quickly as possible avoiding locked inodes or those
- * already being flushed, and once done schedules a future pass.
- */
-void
-xfs_reclaim_worker(
-	struct work_struct *work)
-{
-	struct xfs_mount *mp = container_of(to_delayed_work(work),
-					struct xfs_mount, m_reclaim_work);
-
-	xfs_reclaim_inodes(mp, 0);
-	xfs_reclaim_work_queue(mp);
-}
-
 static void
 xfs_perag_set_reclaim_tag(
 	struct xfs_perag	*pag)
@@ -1100,7 +1082,7 @@ xfs_reclaim_inode_grab(
  *	dirty, async	=> requeue
  *	dirty, sync	=> flush, wait and reclaim
  */
-static bool
+static void
 xfs_reclaim_inode(
 	struct xfs_inode	*ip,
 	struct xfs_perag	*pag)
@@ -1173,7 +1155,7 @@ xfs_reclaim_inode(
 	ASSERT(xfs_inode_clean(ip));
 
 	__xfs_inode_free(ip);
-	return true;
+	return;
 
 out_ifunlock:
 	xfs_ifunlock(ip);
@@ -1181,7 +1163,6 @@ xfs_reclaim_inode(
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 out:
 	xfs_iflags_clear(ip, XFS_IRECLAIM);
-	return false;
 }
 
 /*
@@ -1194,14 +1175,13 @@ xfs_reclaim_inode(
  * so that callers that want to block until all dirty inodes are written back
  * and reclaimed can sanely loop.
  */
-static int
+static void
 xfs_reclaim_inodes_ag(
 	struct xfs_mount	*mp,
 	int			*nr_to_scan)
 {
 	struct xfs_perag	*pag;
 	xfs_agnumber_t		ag = 0;
-	int			skipped = 0;
 
 	while ((pag = xfs_perag_get_tag(mp, ag, XFS_ICI_RECLAIM_TAG))) {
 		unsigned long	first_index = 0;
@@ -1210,14 +1190,7 @@ xfs_reclaim_inodes_ag(
 
 		ag = pag->pag_agno + 1;
 
-		/*
-		 * If the cursor is not zero, we haven't scanned the whole AG
-		 * so we might have skipped inodes here.
-		 */
 		first_index = READ_ONCE(pag->pag_ici_reclaim_cursor);
-		if (first_index)
-			skipped++;
-
 		do {
 			struct xfs_inode *batch[XFS_LOOKUP_BATCH];
 			int	i;
@@ -1270,16 +1243,12 @@ xfs_reclaim_inodes_ag(
 			rcu_read_unlock();
 
 			for (i = 0; i < nr_found; i++) {
-				if (!batch[i])
-					continue;
-				if (!xfs_reclaim_inode(batch[i], pag))
-					skipped++;
+				if (batch[i])
+					xfs_reclaim_inode(batch[i], pag);
 			}
 
 			*nr_to_scan -= XFS_LOOKUP_BATCH;
-
 			cond_resched();
-
 		} while (nr_found && !done && *nr_to_scan > 0);
 
 		if (done)
@@ -1287,27 +1256,18 @@ xfs_reclaim_inodes_ag(
 		WRITE_ONCE(pag->pag_ici_reclaim_cursor, first_index);
 		xfs_perag_put(pag);
 	}
-	return skipped;
 }
 
-int
+void
 xfs_reclaim_inodes(
-	xfs_mount_t	*mp,
-	int		mode)
+	struct xfs_mount	*mp)
 {
 	int		nr_to_scan = INT_MAX;
-	int		skipped;
 
-	xfs_reclaim_inodes_ag(mp, &nr_to_scan);
-	if (!(mode & SYNC_WAIT))
-		return 0;
-
-	do {
+	while (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_RECLAIM_TAG)) {
 		xfs_ail_push_all_sync(mp->m_ail);
-		skipped = xfs_reclaim_inodes_ag(mp, &nr_to_scan);
-	} while (skipped > 0);
-
-	return 0;
+		xfs_reclaim_inodes_ag(mp, &nr_to_scan);
+	};
 }
 
 /*
@@ -1426,6 +1386,25 @@ xfs_inode_matches_eofb(
 	return true;
 }
 
+/*
+ * This is a fast pass over the inode cache to try to get reclaim moving on as
+ * many inodes as possible in a short period of time. It kicks itself every few
+ * seconds, as well as being kicked by the inode cache shrinker when memory
+ * goes low. It scans as quickly as possible avoiding locked inodes or those
+ * already being flushed, and once done schedules a future pass.
+ */
+void
+xfs_reclaim_worker(
+	struct work_struct *work)
+{
+	struct xfs_mount *mp = container_of(to_delayed_work(work),
+					struct xfs_mount, m_reclaim_work);
+	int		nr_to_scan = INT_MAX;
+
+	xfs_reclaim_inodes_ag(mp, &nr_to_scan);
+	xfs_reclaim_work_queue(mp);
+}
+
 STATIC int
 xfs_inode_free_eofblocks(
 	struct xfs_inode	*ip,
diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
index 93b54e7d55f0..ae92ca53de42 100644
--- a/fs/xfs/xfs_icache.h
+++ b/fs/xfs/xfs_icache.h
@@ -51,7 +51,7 @@ void xfs_inode_free(struct xfs_inode *ip);
 
 void xfs_reclaim_worker(struct work_struct *work);
 
-int xfs_reclaim_inodes(struct xfs_mount *mp, int mode);
+void xfs_reclaim_inodes(struct xfs_mount *mp);
 int xfs_reclaim_inodes_count(struct xfs_mount *mp);
 long xfs_reclaim_inodes_nr(struct xfs_mount *mp, int nr_to_scan);
 
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 03158b42a194..c8ae49a1e99c 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -1011,7 +1011,7 @@ xfs_mountfs(
 	 * quota inodes.
 	 */
 	cancel_delayed_work_sync(&mp->m_reclaim_work);
-	xfs_reclaim_inodes(mp, SYNC_WAIT);
+	xfs_reclaim_inodes(mp);
 	xfs_health_unmount(mp);
  out_log_dealloc:
 	mp->m_flags |= XFS_MOUNT_UNMOUNTING;
@@ -1088,13 +1088,12 @@ xfs_unmountfs(
 	xfs_ail_push_all_sync(mp->m_ail);
 
 	/*
-	 * And reclaim all inodes.  At this point there should be no dirty
-	 * inodes and none should be pinned or locked, but use synchronous
-	 * reclaim just to be sure. We can stop background inode reclaim
-	 * here as well if it is still running.
+	 * Reclaim all inodes. At this point there should be no dirty inodes and
+	 * none should be pinned or locked. Stop background inode reclaim here
+	 * if it is still running.
 	 */
 	cancel_delayed_work_sync(&mp->m_reclaim_work);
-	xfs_reclaim_inodes(mp, SYNC_WAIT);
+	xfs_reclaim_inodes(mp);
 	xfs_health_unmount(mp);
 
 	xfs_qm_unmount(mp);
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 379cbff438bc..5a5d9453cf51 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -890,9 +890,6 @@ xfs_quiesce_attr(
 	/* force the log to unpin objects from the now complete transactions */
 	xfs_log_force(mp, XFS_LOG_SYNC);
 
-	/* reclaim inodes to do any IO before the freeze completes */
-	xfs_reclaim_inodes(mp, 0);
-	xfs_reclaim_inodes(mp, SYNC_WAIT);
 
 	/* Push the superblock and write an unmount record */
 	error = xfs_log_sbcount(mp);

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous
@ 2020-06-04  7:45 Dave Chinner
  0 siblings, 0 replies; 43+ messages in thread
From: Dave Chinner @ 2020-06-04  7:45 UTC (permalink / raw)
  To: linux-xfs

Hi folks,

Inode flushing requires that we first lock an inode, then check it,
then lock the underlying buffer, flush the inode to the buffer and
finally add the inode to the buffer to be unlocked on IO completion.
We then walk all the other cached inodes in the buffer range and
optimistically lock and flush them to the buffer without blocking.

This cluster write effectively repeats the same code we do with the
initial inode, except now it has to special case that initial inode
that is already locked. Hence we have multiple copies of very
similar code, and it is a result of inode cluster flushing being
based on a specific inode rather than grabbing the buffer and
flushing all available inodes to it.

The problem with this at the moment is that we can't look up the
buffer until we have guaranteed that an inode is held exclusively
and it's not going away while we get the buffer through an imap
lookup. Hence we are kinda stuck locking an inode before we can look
up the buffer.

This is also a result of inodes being detached from the cluster
buffer except when IO is being done. This has the further problem
that the cluster buffer can be reclaimed from memory and then the
inode can be dirtied. At this point cleaning the inode requires a
read-modify-write cycle on the cluster buffer. If we then are put
under memory pressure, cleaning that dirty inode to reclaim it
requires allocating memory for the cluster buffer and this leads to
all sorts of problems.

We used synchronous inode writeback in reclaim as a throttle that
provided a forwards progress mechanism when RMW cycles were required
to clean inodes. Async writeback of inodes (e.g. via the AIL) would
immediately exhaust remaining memory reserves trying to allocate
inode cluster after inode cluster. The synchronous writeback of an
inode cluster allowed reclaim to release the inode cluster and have
it freed almost immediately which could then be used to allocate the
next inode cluster buffer. Hence the IO based throttling mechanism
largely guaranteed forwards progress in inode reclaim. By removing
the requirement for memory allocation for inode writeback at the
filesystem level, we can issue writeback asynchronously and not have
to worry about memory exhaustion anymore.

Another issue is that if we have slow disks, we can build up dirty
inodes in memory that can then take hours for an operation like
unmount to flush. A RMW cycle per inode on a slow RAID6 device can
mean we only clean 50 inodes a second, and when there are hundreds
of thousands of dirty inodes that need to be cleaned this can take a
long time. Pinning the cluster buffers will greatly speed up inode
writeback on slow storage systems like this.

These limitations all stem from the same source: inode writeback is
inode centric, and they are largely solved by the same architectural
change: make inode writeback cluster buffer centric.  This series
makes that architectural change.

Firstly, we start by pinning the inode backing buffer in memory
when an inode is marked dirty (i.e. when it is logged). By tracking
the number of dirty inodes on a buffer as a counter rather than a
flag, we avoid the problem of overlapping inode dirtying and buffer
flushing racing to set/clear the dirty flag. Hence as long as there
is a dirty inode in memory, the buffer cannot be reclaimed. We can
safely do this inode cluster buffer lookup when we
dirty an inode as we do not hold the buffer locked - we merely take
a reference to it and then release it - and hence we don't cause any
new lock order issues.
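
To make the counter/reference idea concrete, here is a minimal
userspace sketch of pin-on-dirty. All names and types here are
hypothetical stand-ins for the real XFS structures, and locking,
error handling and the buffer cache lookup itself are elided:

/* Simplified, hypothetical stand-ins for the real XFS buffer and
 * inode log item structures. */
struct buf {
        int     b_refs;         /* existence reference: while held, the
                                 * buffer cannot be reclaimed from memory */
        int     b_dirty_inodes; /* a counter, not a flag, so overlapping
                                 * dirtying and flushing cannot race on it */
};

struct inode_log_item {
        struct buf      *ili_buf;       /* reference held while dirty */
        int             ili_dirty;
};

/*
 * Runs in transaction context when an inode is dirtied. This is where
 * the blocking, memory-allocating cluster buffer read now happens -
 * not in the reclaim path.
 */
static void pin_cluster_buffer(struct inode_log_item *iip, struct buf *bp)
{
        if (iip->ili_dirty)
                return;                 /* already pinned */
        iip->ili_dirty = 1;
        bp->b_refs++;                   /* take a reference only - the
                                         * buffer is never locked here */
        bp->b_dirty_inodes++;
        iip->ili_buf = bp;
}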

When the inode is finally cleaned, the reference to the buffer can
be removed from the inode log item and the buffer released. This is
done from the inode completion callbacks that are attached to the
buffer when the inode is flushed.
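
And the release side of that sketch - roughly what the completion
callback attached to the buffer would do once the flushed inode is
clean again, reusing the hypothetical types from above:

#include <stddef.h>

/* Release side: run from the buffer IO completion callback. */
static void unpin_cluster_buffer(struct inode_log_item *iip)
{
        struct buf *bp = iip->ili_buf;

        iip->ili_dirty = 0;
        iip->ili_buf = NULL;
        bp->b_dirty_inodes--;
        bp->b_refs--;           /* at zero references the buffer is
                                 * reclaimable again */
}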

Pinning the cluster buffer in this way immediately avoids the RMW
problem in inode writeback and reclaim contexts by moving the memory
allocation and the blocking buffer read into the transaction context
that dirties the inode.  This inverts our dirty inode throttling
mechanism - we now throttle the rate at which we can dirty inodes
to the rate at which we can allocate memory and read inode cluster
buffers into memory, rather than throttling reclaim to the rate at
which we can clean dirty inodes.

Hence if we are under memory pressure, we'll block on memory
allocation when trying to dirty the referenced inode, rather than in
the memory reclaim path where we are trying to clean unreferenced
inodes to free memory.  Hence we no longer have to guarantee
forwards progress in inode reclaim as we aren't doing memory
allocation, and that means we can remove inode writeback from the
XFS inode shrinker completely without changing the system tolerance
for low memory operation.

Tracking the buffers via the inode log item also allows us to
completely rework the inode flushing mechanism. While the inode log
item is in the AIL, it is safe for the AIL to access any member of
the log item. Hence the AIL push mechanisms can access the buffer
attached to the inode without first having to lock the inode.

This means we can essentially lock the buffer directly and then
call xfs_iflush_cluster() without first going through xfs_iflush()
to find the buffer. Hence we can remove xfs_iflush() altogether,
because the two places that call it - the inode item push code and
inode reclaim - no longer need to flush inodes directly.

This can be further optimised by attaching the inode to the cluster
buffer when the inode is dirtied. i.e. when we add the buffer
reference to the inode log item, we also attach the inode to the
buffer for IO processing. This leads to the dirty inodes always
being attached to the buffer and hence we no longer need to add them
when we flush the inode and remove them when IO completes. Instead
the inodes are attached when the inode log item is dirtied, and
removed when the inode log item is cleaned.
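
A sketch of that attach-at-dirty-time step, again with hypothetical
names and a deliberately naive list in place of whatever list the
real buffer log item code uses:

struct log_item {
        struct log_item *li_bio_next;   /* links dirty items on a buffer */
};

struct cluster_buf {
        struct log_item *li_items;      /* every dirty inode on this
                                         * cluster buffer */
};

/*
 * Done once, when the inode log item is first dirtied; the matching
 * removal happens when the item is cleaned. Flush time then needs no
 * cache lookups at all.
 */
static void attach_item(struct cluster_buf *bp, struct log_item *lip)
{
        lip->li_bio_next = bp->li_items;
        bp->li_items = lip;
}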

With this structure in place, we no longer need to do
lookups to find the dirty inodes in the cache to attach to the
buffer in xfs_iflush_cluster() - they are already attached to the
buffer. Hence when the AIL pushes an inode, we just grab the buffer
from the log item, and then walk the buffer log item list to lock
and flush the dirty inodes attached to the buffer.
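
Pulling the last few paragraphs together, a self-contained model of
what an AIL push then boils down to. This illustrates the scheme,
not the real xfs_iflush_cluster() code; in particular the locking is
reduced to a simple flag:

#include <stdbool.h>

struct item {
        struct item     *next;          /* buffer's dirty item list */
        bool            locked;         /* stand-in for the inode lock */
        bool            dirty;
};

struct cbuf {
        struct item     *items;         /* attached when dirtied: no radix
                                         * tree gang lookup and no lookup
                                         * array allocation needed here */
};

/* Non-blocking lock attempt; the real code uses trylock primitives. */
static bool item_trylock(struct item *ip)
{
        if (ip->locked)
                return false;
        ip->locked = true;
        return true;
}

static int flush_cluster(struct cbuf *bp)
{
        struct item *ip;
        int flushed = 0;

        for (ip = bp->items; ip; ip = ip->next) {
                if (!item_trylock(ip))
                        continue;       /* busy: stays dirty for the
                                         * next push */
                if (ip->dirty) {
                        ip->dirty = false;      /* "flush" to the buffer */
                        flushed++;
                }
                ip->locked = false;     /* the real code keeps the inode
                                         * flush locked until IO completes */
        }
        return flushed;                 /* caller submits the buffer for
                                         * IO if anything was flushed */
}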

This greatly simplifies inode writeback, and removes another memory
allocation from the inode writeback path (the array used for the
radix tree gang lookup). And while the radix tree lookups are fast,
walking the linked list of dirty inodes is faster.

There is followup work I am doing that uses the inode cluster buffer
as a replacement in the AIL for tracking dirty inodes. This part of
the series is not ready yet as it has some intricate locking
requirements. That is an optimisation, so I've left that out because
solving the inode reclaim blocking problems is the important part of
this work.

In short, this series simplifies inode writeback and fixes the long
standing inode reclaim blocking issues without requiring any changes
to the memory reclaim infrastructure.

Note: dquots should probably be converted to cluster flushing in a
similar way, as they have many of the same issues as inode flushing.

Thoughts, comments and improvements welcome.

-Dave.

Version 3

git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git xfs-async-inode-reclaim-3

- rebase on 5.7 + for-next
- update comments (p3)
- update commit message (p4)
- renamed xfs_buf_ioerror_sync() (p13)
- added enum for return value from xfs_buf_iodone_error() (p13)
- moved clearing of buffer error to iodone functions (p13)
- whitespace (p13)
- rebase p14 (p13 conflicts)
- rebase p16 (p13 conflicts)
- removed a superfluous assert (p16)
- moved comment and check in xfs_iflush_done() from p16 to p25
- rebase p25 (p16 conflicts)



Version 2

git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git xfs-async-inode-reclaim-2

- describe ili_lock better (p2)
- clean up inode logging code some more (p2)
- move "early read completion" for xfs_buf_ioend() up into p3 from
  p4.
- fixed conflicts in p4 due to p3 changes.
- fixed conflicts in p5 due to p4 changes.
- s/_XBF_LOGRCVY/_XBF_LOG_RECOVERY/ (p5)
- renamed the buf log item iodone callback to xfs_buf_item_iodone and
  reused the xfs_buf_iodone() name for the catch-all buffer write
  iodone completion. (p6)
- history update for commit message (p7)
- subject update for p8
- rework loop in xfs_dquot_done() (p9)
- Fixed conflicts in p10 due to p6 changes
- got rid of entire comments around li_cb (p11)
- new patch to rework buffer io error callbacks
- new patch to unwind ->iop_error calls and remove ->iop_error
- new patch to lift xfs_clear_li_failed() out of
  xfs_ail_delete_one()
- rebased p12 on all the prior changes
- reworked LI_FAILED handling when pinning inodes to the cluster
  buffer (p12) 
- fixed comment about holding buffer references in
  xfs_trans_log_inode() (p12)
- fixed indenting of xfs_iflush_abort() (p12)
- added comments explaining "skipped" inode reclaim return value
  (p14)
- cleaned up error return stack in xfs_reclaim_inode() (p14)
- cleaned up skipped return in xfs_reclaim_inodes() (p14)
- fixed bug where skipped wasn't incremented if reclaim cursor was
  not zero. This could leave inodes between the start of the AG and
  the cursor unreclaimed (p15)
- reinstate the patch removing SYNC_WAIT from xfs_reclaim_inodes().
  Exposed "skipped" bug in p15.
- cleaned up inode reclaim comments (p18)
- split p19 into two - one to change xfs_ifree_cluster(), one
  for the buffer pinning.
- xfs_ifree_mark_inode_stale() now takes the cluster buffer and we
  get the perag from that rather than having to do a lookup in
  xfs_ifree_cluster().
- moved extra IO reference for xfs_iflush_cluster() from AIL pushing
  to initial xfs_iflush_cluster rework (p22 -> p20)
- fixed static declaration on xfs_iflush() (p22)
- fixed incorrect EIO return from xfs_iflush_cluster()
- rebase p23 because it all rejects now.
- fix INODE_ITEM() usage in p23
- removed long lines from commit message in p24
- new patch to fix logging of XFS_ISTALE inodes which pushes dirty
  inodes through reclaim.



Dave Chinner (30):
  xfs: Don't allow logging of XFS_ISTALE inodes
  xfs: remove logged flag from inode log item
  xfs: add an inode item lock
  xfs: mark inode buffers in cache
  xfs: mark dquot buffers in cache
  xfs: mark log recovery buffers for completion
  xfs: call xfs_buf_iodone directly
  xfs: clean up whacky buffer log item list reinit
  xfs: make inode IO completion buffer centric
  xfs: use direct calls for dquot IO completion
  xfs: clean up the buffer iodone callback functions
  xfs: get rid of log item callbacks
  xfs: handle buffer log item IO errors directly
  xfs: unwind log item error flagging
  xfs: move xfs_clear_li_failed out of xfs_ail_delete_one()
  xfs: pin inode backing buffer to the inode log item
  xfs: make inode reclaim almost non-blocking
  xfs: remove IO submission from xfs_reclaim_inode()
  xfs: allow multiple reclaimers per AG
  xfs: don't block inode reclaim on the ILOCK
  xfs: remove SYNC_TRYLOCK from inode reclaim
  xfs: remove SYNC_WAIT from xfs_reclaim_inodes()
  xfs: clean up inode reclaim comments
  xfs: rework stale inodes in xfs_ifree_cluster
  xfs: attach inodes to the cluster buffer when dirtied
  xfs: xfs_iflush() is no longer necessary
  xfs: rename xfs_iflush_int()
  xfs: rework xfs_iflush_cluster() dirty inode iteration
  xfs: factor xfs_iflush_done
  xfs: remove xfs_inobp_check()

 fs/xfs/libxfs/xfs_inode_buf.c   |  27 +-
 fs/xfs/libxfs/xfs_inode_buf.h   |   6 -
 fs/xfs/libxfs/xfs_trans_inode.c | 110 +++++--
 fs/xfs/xfs_buf.c                |  40 ++-
 fs/xfs/xfs_buf.h                |  48 +--
 fs/xfs/xfs_buf_item.c           | 420 ++++++++++++------------
 fs/xfs/xfs_buf_item.h           |   8 +-
 fs/xfs/xfs_buf_item_recover.c   |   5 +-
 fs/xfs/xfs_dquot.c              |  29 +-
 fs/xfs/xfs_dquot.h              |   1 +
 fs/xfs/xfs_dquot_item.c         |  18 --
 fs/xfs/xfs_dquot_item_recover.c |   2 +-
 fs/xfs/xfs_file.c               |   9 +-
 fs/xfs/xfs_icache.c             | 333 ++++++-------------
 fs/xfs/xfs_icache.h             |   2 +-
 fs/xfs/xfs_inode.c              | 554 ++++++++++++--------------------
 fs/xfs/xfs_inode.h              |   2 +-
 fs/xfs/xfs_inode_item.c         | 301 +++++++++--------
 fs/xfs/xfs_inode_item.h         |  24 +-
 fs/xfs/xfs_inode_item_recover.c |   2 +-
 fs/xfs/xfs_log_recover.c        |   5 +-
 fs/xfs/xfs_mount.c              |  15 +-
 fs/xfs/xfs_mount.h              |   1 -
 fs/xfs/xfs_super.c              |   3 -
 fs/xfs/xfs_trans.h              |   5 -
 fs/xfs/xfs_trans_ail.c          |  10 +-
 fs/xfs/xfs_trans_buf.c          |  15 +-
 27 files changed, 881 insertions(+), 1114 deletions(-)

-- 
2.26.2.761.g0e0b3e54be



end of thread, other threads:[~2020-07-01  4:51 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-22  8:15 [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
2020-06-22  8:15 ` [PATCH 01/30] xfs: Don't allow logging of XFS_ISTALE inodes Dave Chinner
2020-06-22  8:15 ` [PATCH 02/30] xfs: remove logged flag from inode log item Dave Chinner
2020-06-22  8:15 ` [PATCH 03/30] xfs: add an inode item lock Dave Chinner
2020-06-23  2:30   ` Darrick J. Wong
2020-06-22  8:15 ` [PATCH 04/30] xfs: mark inode buffers in cache Dave Chinner
2020-06-23  2:32   ` Darrick J. Wong
2020-06-22  8:15 ` [PATCH 05/30] xfs: mark dquot " Dave Chinner
2020-06-22  8:15 ` [PATCH 06/30] xfs: mark log recovery buffers for completion Dave Chinner
2020-06-22  8:15 ` [PATCH 07/30] xfs: call xfs_buf_iodone directly Dave Chinner
2020-06-22  8:15 ` [PATCH 08/30] xfs: clean up whacky buffer log item list reinit Dave Chinner
2020-06-22  8:15 ` [PATCH 09/30] xfs: make inode IO completion buffer centric Dave Chinner
2020-06-22  8:15 ` [PATCH 10/30] xfs: use direct calls for dquot IO completion Dave Chinner
2020-06-22  8:15 ` [PATCH 11/30] xfs: clean up the buffer iodone callback functions Dave Chinner
2020-06-22  8:15 ` [PATCH 12/30] xfs: get rid of log item callbacks Dave Chinner
2020-06-22  8:15 ` [PATCH 13/30] xfs: handle buffer log item IO errors directly Dave Chinner
2020-06-23  2:38   ` Darrick J. Wong
2020-06-22  8:15 ` [PATCH 14/30] xfs: unwind log item error flagging Dave Chinner
2020-06-22  8:15 ` [PATCH 15/30] xfs: move xfs_clear_li_failed out of xfs_ail_delete_one() Dave Chinner
2020-06-22  8:15 ` [PATCH 16/30] xfs: pin inode backing buffer to the inode log item Dave Chinner
2020-06-23  2:39   ` Darrick J. Wong
2020-06-22  8:15 ` [PATCH 17/30] xfs: make inode reclaim almost non-blocking Dave Chinner
2020-06-22  8:15 ` [PATCH 18/30] xfs: remove IO submission from xfs_reclaim_inode() Dave Chinner
2020-06-22  8:15 ` [PATCH 19/30] xfs: allow multiple reclaimers per AG Dave Chinner
2020-06-22  8:15 ` [PATCH 20/30] xfs: don't block inode reclaim on the ILOCK Dave Chinner
2020-06-22  8:15 ` [PATCH 21/30] xfs: remove SYNC_TRYLOCK from inode reclaim Dave Chinner
2020-07-01  4:48   ` [PATCH 21/30 V2] " Dave Chinner
2020-06-22  8:15 ` [PATCH 22/30] xfs: remove SYNC_WAIT from xfs_reclaim_inodes() Dave Chinner
2020-07-01  4:51   ` [PATCH 22/30 V2] " Dave Chinner
2020-06-22  8:15 ` [PATCH 23/30] xfs: clean up inode reclaim comments Dave Chinner
2020-06-22  8:15 ` [PATCH 24/30] xfs: rework stale inodes in xfs_ifree_cluster Dave Chinner
2020-06-22  8:16 ` [PATCH 25/30] xfs: attach inodes to the cluster buffer when dirtied Dave Chinner
2020-06-22  8:16 ` [PATCH 26/30] xfs: xfs_iflush() is no longer necessary Dave Chinner
2020-06-22  8:16 ` [PATCH 27/30] xfs: rename xfs_iflush_int() Dave Chinner
2020-06-22  8:16 ` [PATCH 28/30] xfs: rework xfs_iflush_cluster() dirty inode iteration Dave Chinner
2020-06-22  8:16 ` [PATCH 29/30] xfs: factor xfs_iflush_done Dave Chinner
2020-06-22 22:16   ` [PATCH 29/30 V2] " Dave Chinner
2020-06-22  8:16 ` [PATCH 30/30] xfs: remove xfs_inobp_check() Dave Chinner
2020-06-29 23:01 ` [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous Darrick J. Wong
2020-06-30 16:52   ` Darrick J. Wong
2020-06-30 21:51     ` Dave Chinner
  -- strict thread matches above, loose matches on Subject: below --
2020-06-04  7:45 Dave Chinner
2020-06-01 21:42 Dave Chinner
