All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/9 v3] xfs: in-memory iunlink items
@ 2022-06-27  0:43 Dave Chinner
  2022-06-27  0:43 ` [PATCH 1/9] xfs: factor the xfs_iunlink functions Dave Chinner
                   ` (8 more replies)
  0 siblings, 9 replies; 35+ messages in thread
From: Dave Chinner @ 2022-06-27  0:43 UTC (permalink / raw)
  To: linux-xfs

Hi folks,

Time to revist a patchset that has been sitting in my queue for
a couple of years now - inode cluster buffer lock ordering. The
description that follows is from the original RFC - nothing has
really changed in the past couple of years - this is still relevant
to fixing some of the AIL list scalability issues we have.

This all works, passes fstests in many configurations, doesn't cause
performance regressions in unlink perf scalability tests, etc, so I
think it's largely ready for review and merge.

-Dave.

----

Inode cluster buffer pinning by dirty inodes could allow us to
improve dirty inode tracking efficiency in the AIL by using the
inode cluster buffer item rather than individual inode items.
However, this brings with it some new issues, namely the order in
which we lock inode cluster buffers. While the dirty buffer pinning
doesn't currently result in nested cluster buffer locking in
xfs_trans_log_inode() and so there are no current problems, future
changes will need to nest cluster buffer locking in this context.

That is, transactions that dirty and commit multiple inodes in a
transaction will now need to locking multiple inode cluster buffers
in each transaction (e.g. create, rename, etc). This introduces new
lock ordering constraints in these operations. It also introduces
lock ordering constraints between the AGI and inode cluster buffers
as a result of allocation/freeing being serialised by the AGI
buffer lock. And then there is unlinked inode list logging, which
currently has no fixed order of inode cluster buffer locking.

It's all a bit Wild West right now.

Locking pure inode modifications in order is relatively easy. We
don't actually need to attach and log the buffer to the transaction
until the last moment. We have all the inodes locked, so nothing
other than unlinked inode list modification can race with the
transaction modifying inodes. Hence we can safely move the
attachment of the inodes to the cluster buffer from when we first
dirty them in xfs_trans_log_inode to just before we commit the
transaction.

At this point, all the inodes that have been dirtied in the
transaction have already been locked, modified, logged and attached
to the transaction. Hence if we add a hook into xfs_trans_commit()
to run a "precommit" operation on these log items, we can use this
operation to attach the inodes to the cluster buffer at commit time
instead of in xfs_trans_log_inode().

This, by itself, doesn't solve the lock ordering problem. What it
does do, however, is give us a place where we can -order- all the
dirty items in the transaction list. Hence before we call the
precommit operation on each log item, we sort them. This allows us
to sort all the inode items so that the pre-commit functions that
locks and logs the cluster buffers are run in a deterministic order.
This solves the lock order problem for pure inode modifications.

The unlinked inode list buffer locking is more complex. The unlinked
list is unordered - we add to the tail, remove from where-ever the
inode is in the list. Hence we might need to lock two inode buffers
here (previous inode in list and the one being removed). While we
can order the locking of these buffers correctly within the confines
of the unlinked list, there may be other inodes that need buffer
locking in the same transaction. e.g. O_TMPFILE being linked into a
directory also modifies the directory inode.

Hence we need a mechanism for defering unlinked inode list updates
to the pre-commit operation where it can be sorted into the correct
order. We can do this by first observing that we serialise unlinked
list modifications by holding the AGI buffer lock. IOWs, the AGI is
going to be locked until the transaction commits any time we modify
the unlinked list. Hence it doesn't matter when in the unlink
transactions that we actually load, lock and modify the inode
cluster buffer.

IOWs, what we need is an unlinked inode log item to defer the inode
cluster buffer update to transaction commit time where it can be
ordered with all the other inode cluster operations. Essentially all
we need to do is record the inodes that need to have their unlinked
list pointer updated in a new log item that we attached to the
transaction.

This log item exists purely for the purpose of delaying the update
of the unlinked list pointer until the inode cluster buffer can be
locked in the correct order around the other inode cluster buffers.
It plays no part in the actual commit, and there's no change to
anything that is written to the log. i.e. the inode cluster buffers
still have to be fully logged here (not just ordered) as log
recovery depedends on this to replay mods to the unlinked inode
list.

To make this unlinked inode list processing simpler and easier to
implement as a log item, we need to change the way we track the
unlinked list in memory. Starting from the observation that an inode
on the unlinked list is pinned in memory by the VFS, we can use the
xfs_inode itself to track the unlinked list. To do this efficiently,
we want the unlinked list to be a double linked list. The problem
here is that we need a list per AGI unlinked list, and there are 64
of these per AGI. The approach taken in this patchset is to shadow
the AGI unlinked list heads in the perag, and link inodes by agino,
hence requiring only 8 extra bytes per inode to track this state.

We can then use the agino pointers for lockless inode cache lookups
to retreive the inode. The aginos in the inode are modified only
under the AGI lock, just like the cluster buffer pointers, so we
don't need any extra locking here.  The i_next_unlinked field tracks
the on-disk value of the unlinked list, and the i_prev_unlinked is a
purely in-memory pointer that enables us to efficiently remove
inodes from the middle of the list.

This results in moving a lot of the unlink modification work into
the precommit operations on the unlink log item. tracking all the
unlinked inodes in the inodes themselves also gets rid of the
unlinked list reference hash table that is used to track this back
pointer relationship. This greatly simplifies the the unlinked list
modification code, and removes memory allocations in this hot path
to track back pointers. This, overall, slightly reduces the CPU
overhead of the unlink path.

The result of this log item means that we move all the actual
manipulation of objects to be logged out of the iunlink path and
into the iunlink item. This allows for future optimisation of this
mechanism without needing changes to high level unlink path, as
well as making the unlink lock ordering predictable and synchronised
with other operations that may require inode cluster locking.

Changelog:

Version 3:
- rebase on 5.19-rc2
- fixed log recovery bugs that arose from interactions with
  background inode inactivation being added since the original RFC
  patches.
- Note: has significant, non-trivial merge conflicts with
  xfs-perag-conv-5.20.  Should probably be rebased on that branch to
  avoid these. Need to know what order these branches are going to
  get merged first.

Version 2:
- unpublished
- rebase on 5.15-rc2 + xlog-intent-whiteouts
- reverts changes made in V1 that moved to a list_head based double
  linked list.  This requires on-disk format changes to allow the
  AGI to use just a single bucket and so is not generically
  applicable to all XFS filesystem formats. This version goes back
  to using prev/next agino values and radix tree lookups as the
  original RFC used.

Version 1:
- https://lore.kernel.org/linux-xfs/20200812092556.2567285-1-david@fromorbit.com/
- split up into many smaller patches
- includes Xiang's single unlinked list bucket modification
- uses a list_head for the in memory double unlinked inode list
  rather than aginos and lockless inode lookups
- much simpler as it doesn't need to look up inodes from agino
  values
- iunlink log item changed to take an xfs_inode pointer rather than
  an imap and agino values
- a handful of small cleanups that breaking up into small patches
  allowed.

[RFC]
- https://lore.kernel.org/linux-xfs/20200623095015.1934171-1-david@fromorbit.com/



^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 1/9] xfs: factor the xfs_iunlink functions
  2022-06-27  0:43 [PATCH 0/9 v3] xfs: in-memory iunlink items Dave Chinner
@ 2022-06-27  0:43 ` Dave Chinner
  2022-06-29  7:05   ` Christoph Hellwig
  2022-06-29 20:48   ` Darrick J. Wong
  2022-06-27  0:43 ` [PATCH 2/9] xfs: track the iunlink list pointer in the xfs_inode Dave Chinner
                   ` (7 subsequent siblings)
  8 siblings, 2 replies; 35+ messages in thread
From: Dave Chinner @ 2022-06-27  0:43 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Prep work that separates the locking that protects the unlinked list
from the actual operations being performed. This also helps document
the fact they are performing list insert  and remove operations. No
functional code change.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_inode.c | 113 +++++++++++++++++++++++++++------------------
 1 file changed, 68 insertions(+), 45 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 52d6f2c7d58b..2a371c3431c9 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2129,39 +2129,20 @@ xfs_iunlink_update_inode(
 	return error;
 }
 
-/*
- * This is called when the inode's link count has gone to 0 or we are creating
- * a tmpfile via O_TMPFILE.  The inode @ip must have nlink == 0.
- *
- * We place the on-disk inode on a list in the AGI.  It will be pulled from this
- * list when the inode is freed.
- */
-STATIC int
-xfs_iunlink(
+static int
+xfs_iunlink_insert_inode(
 	struct xfs_trans	*tp,
+	struct xfs_perag	*pag,
+	struct xfs_buf		*agibp,
 	struct xfs_inode	*ip)
-{
+ {
 	struct xfs_mount	*mp = tp->t_mountp;
-	struct xfs_perag	*pag;
-	struct xfs_agi		*agi;
-	struct xfs_buf		*agibp;
+	struct xfs_agi		*agi = agibp->b_addr;
 	xfs_agino_t		next_agino;
 	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
 	short			bucket_index = agino % XFS_AGI_UNLINKED_BUCKETS;
 	int			error;
 
-	ASSERT(VFS_I(ip)->i_nlink == 0);
-	ASSERT(VFS_I(ip)->i_mode != 0);
-	trace_xfs_iunlink(ip);
-
-	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
-
-	/* Get the agi buffer first.  It ensures lock ordering on the list. */
-	error = xfs_read_agi(mp, tp, pag->pag_agno, &agibp);
-	if (error)
-		goto out;
-	agi = agibp->b_addr;
-
 	/*
 	 * Get the index into the agi hash table for the list this inode will
 	 * go on.  Make sure the pointer isn't garbage and that this inode
@@ -2171,8 +2152,7 @@ xfs_iunlink(
 	if (next_agino == agino ||
 	    !xfs_verify_agino_or_null(mp, pag->pag_agno, next_agino)) {
 		xfs_buf_mark_corrupt(agibp);
-		error = -EFSCORRUPTED;
-		goto out;
+		return -EFSCORRUPTED;
 	}
 
 	if (next_agino != NULLAGINO) {
@@ -2185,7 +2165,7 @@ xfs_iunlink(
 		error = xfs_iunlink_update_inode(tp, ip, pag, next_agino,
 				&old_agino);
 		if (error)
-			goto out;
+			return error;
 		ASSERT(old_agino == NULLAGINO);
 
 		/*
@@ -2194,11 +2174,42 @@ xfs_iunlink(
 		 */
 		error = xfs_iunlink_add_backref(pag, agino, next_agino);
 		if (error)
-			goto out;
+			return error;
 	}
 
 	/* Point the head of the list to point to this inode. */
-	error = xfs_iunlink_update_bucket(tp, pag, agibp, bucket_index, agino);
+	return xfs_iunlink_update_bucket(tp, pag, agibp, bucket_index, agino);
+}
+
+/*
+ * This is called when the inode's link count has gone to 0 or we are creating
+ * a tmpfile via O_TMPFILE.  The inode @ip must have nlink == 0.
+ *
+ * We place the on-disk inode on a list in the AGI.  It will be pulled from this
+ * list when the inode is freed.
+ */
+STATIC int
+xfs_iunlink(
+	struct xfs_trans	*tp,
+	struct xfs_inode	*ip)
+{
+	struct xfs_mount	*mp = tp->t_mountp;
+	struct xfs_perag	*pag;
+	struct xfs_buf		*agibp;
+	int			error;
+
+	ASSERT(VFS_I(ip)->i_nlink == 0);
+	ASSERT(VFS_I(ip)->i_mode != 0);
+	trace_xfs_iunlink(ip);
+
+	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
+
+	/* Get the agi buffer first.  It ensures lock ordering on the list. */
+	error = xfs_read_agi(mp, tp, pag->pag_agno, &agibp);
+	if (error)
+		goto out;
+
+	error = xfs_iunlink_insert_inode(tp, pag, agibp, ip);
 out:
 	xfs_perag_put(pag);
 	return error;
@@ -2319,18 +2330,15 @@ xfs_iunlink_map_prev(
 	return 0;
 }
 
-/*
- * Pull the on-disk inode from the AGI unlinked list.
- */
-STATIC int
-xfs_iunlink_remove(
+static int
+xfs_iunlink_remove_inode(
 	struct xfs_trans	*tp,
 	struct xfs_perag	*pag,
+	struct xfs_buf		*agibp,
 	struct xfs_inode	*ip)
 {
 	struct xfs_mount	*mp = tp->t_mountp;
-	struct xfs_agi		*agi;
-	struct xfs_buf		*agibp;
+	struct xfs_agi		*agi = agibp->b_addr;
 	struct xfs_buf		*last_ibp;
 	struct xfs_dinode	*last_dip = NULL;
 	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
@@ -2339,14 +2347,6 @@ xfs_iunlink_remove(
 	short			bucket_index = agino % XFS_AGI_UNLINKED_BUCKETS;
 	int			error;
 
-	trace_xfs_iunlink_remove(ip);
-
-	/* Get the agi buffer first.  It ensures lock ordering on the list. */
-	error = xfs_read_agi(mp, tp, pag->pag_agno, &agibp);
-	if (error)
-		return error;
-	agi = agibp->b_addr;
-
 	/*
 	 * Get the index into the agi hash table for the list this inode will
 	 * go on.  Make sure the head pointer isn't garbage.
@@ -2411,6 +2411,29 @@ xfs_iunlink_remove(
 			next_agino);
 }
 
+/*
+ * Pull the on-disk inode from the AGI unlinked list.
+ */
+STATIC int
+xfs_iunlink_remove(
+	struct xfs_trans	*tp,
+	struct xfs_perag	*pag,
+	struct xfs_inode	*ip)
+{
+	struct xfs_mount	*mp = tp->t_mountp;
+	struct xfs_buf		*agibp;
+	int			error;
+
+	trace_xfs_iunlink_remove(ip);
+
+	/* Get the agi buffer first.  It ensures lock ordering on the list. */
+	error = xfs_read_agi(mp, tp, pag->pag_agno, &agibp);
+	if (error)
+		return error;
+
+	return xfs_iunlink_remove_inode(tp, pag, agibp, ip);
+}
+
 /*
  * Look up the inode number specified and if it is not already marked XFS_ISTALE
  * mark it stale. We should only find clean inodes in this lookup that aren't
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 2/9] xfs: track the iunlink list pointer in the xfs_inode
  2022-06-27  0:43 [PATCH 0/9 v3] xfs: in-memory iunlink items Dave Chinner
  2022-06-27  0:43 ` [PATCH 1/9] xfs: factor the xfs_iunlink functions Dave Chinner
@ 2022-06-27  0:43 ` Dave Chinner
  2022-06-29  7:08   ` Christoph Hellwig
  2022-06-29 20:50   ` Darrick J. Wong
  2022-06-27  0:43 ` [PATCH 3/9] xfs: refactor xlog_recover_process_iunlinks() Dave Chinner
                   ` (6 subsequent siblings)
  8 siblings, 2 replies; 35+ messages in thread
From: Dave Chinner @ 2022-06-27  0:43 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Having direct access to the i_next_unlinked pointer in unlinked
inodes greatly simplifies the processing of inodes on the unlinked
list. We no longer need to look up the inode buffer just to find
next inode in the list if the xfs_inode is in memory. These
improvements will be realised over upcoming patches as other
dependencies on the inode buffer for unlinked list processing are
removed.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_inode_buf.c |  3 ++-
 fs/xfs/xfs_inode.c            |  5 ++++-
 fs/xfs/xfs_inode.h            |  3 +++
 fs/xfs/xfs_log_recover.c      | 16 +---------------
 4 files changed, 10 insertions(+), 17 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index 3b1b63f9d886..d05a3294020a 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -230,7 +230,8 @@ xfs_inode_from_disk(
 	ip->i_nblocks = be64_to_cpu(from->di_nblocks);
 	ip->i_extsize = be32_to_cpu(from->di_extsize);
 	ip->i_forkoff = from->di_forkoff;
-	ip->i_diflags	= be16_to_cpu(from->di_flags);
+	ip->i_diflags = be16_to_cpu(from->di_flags);
+	ip->i_next_unlinked = be32_to_cpu(from->di_next_unlinked);
 
 	if (from->di_dmevmask || from->di_dmstate)
 		xfs_iflags_set(ip, XFS_IPRESERVE_DM_FIELDS);
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 2a371c3431c9..c507370bd885 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2098,7 +2098,8 @@ xfs_iunlink_update_inode(
 
 	/* Make sure the old pointer isn't garbage. */
 	old_value = be32_to_cpu(dip->di_next_unlinked);
-	if (!xfs_verify_agino_or_null(mp, pag->pag_agno, old_value)) {
+	if (old_value != ip->i_next_unlinked ||
+	    !xfs_verify_agino_or_null(mp, pag->pag_agno, old_value)) {
 		xfs_inode_verifier_error(ip, -EFSCORRUPTED, __func__, dip,
 				sizeof(*dip), __this_address);
 		error = -EFSCORRUPTED;
@@ -2167,6 +2168,7 @@ xfs_iunlink_insert_inode(
 		if (error)
 			return error;
 		ASSERT(old_agino == NULLAGINO);
+		ip->i_next_unlinked = next_agino;
 
 		/*
 		 * agino has been unlinked, add a backref from the next inode
@@ -2366,6 +2368,7 @@ xfs_iunlink_remove_inode(
 	error = xfs_iunlink_update_inode(tp, ip, pag, NULLAGINO, &next_agino);
 	if (error)
 		return error;
+	ip->i_next_unlinked = NULLAGINO;
 
 	/*
 	 * If there was a backref pointing from the next inode back to this
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 7be6f8e705ab..8e2a33c6cbe2 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -68,6 +68,9 @@ typedef struct xfs_inode {
 	uint64_t		i_diflags2;	/* XFS_DIFLAG2_... */
 	struct timespec64	i_crtime;	/* time created */
 
+	/* unlinked list pointers */
+	xfs_agino_t		i_next_unlinked;
+
 	/* VFS inode */
 	struct inode		i_vnode;	/* embedded VFS inode */
 
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 5f7e4e6e33ce..f360b46533a6 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -2673,8 +2673,6 @@ xlog_recover_process_one_iunlink(
 	xfs_agino_t			agino,
 	int				bucket)
 {
-	struct xfs_buf			*ibp;
-	struct xfs_dinode		*dip;
 	struct xfs_inode		*ip;
 	xfs_ino_t			ino;
 	int				error;
@@ -2684,27 +2682,15 @@ xlog_recover_process_one_iunlink(
 	if (error)
 		goto fail;
 
-	/*
-	 * Get the on disk inode to find the next inode in the bucket.
-	 */
-	error = xfs_imap_to_bp(mp, NULL, &ip->i_imap, &ibp);
-	if (error)
-		goto fail_iput;
-	dip = xfs_buf_offset(ibp, ip->i_imap.im_boffset);
-
 	xfs_iflags_clear(ip, XFS_IRECOVERY);
 	ASSERT(VFS_I(ip)->i_nlink == 0);
 	ASSERT(VFS_I(ip)->i_mode != 0);
 
 	/* setup for the next pass */
-	agino = be32_to_cpu(dip->di_next_unlinked);
-	xfs_buf_relse(ibp);
-
+	agino = ip->i_next_unlinked;
 	xfs_irele(ip);
 	return agino;
 
- fail_iput:
-	xfs_irele(ip);
  fail:
 	/*
 	 * We can't read in the inode this bucket points to, or this inode
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 3/9] xfs: refactor xlog_recover_process_iunlinks()
  2022-06-27  0:43 [PATCH 0/9 v3] xfs: in-memory iunlink items Dave Chinner
  2022-06-27  0:43 ` [PATCH 1/9] xfs: factor the xfs_iunlink functions Dave Chinner
  2022-06-27  0:43 ` [PATCH 2/9] xfs: track the iunlink list pointer in the xfs_inode Dave Chinner
@ 2022-06-27  0:43 ` Dave Chinner
  2022-06-29  7:12   ` Christoph Hellwig
  2022-06-29 20:56   ` Darrick J. Wong
  2022-06-27  0:43 ` [PATCH 4/9] xfs: introduce xfs_iunlink_lookup Dave Chinner
                   ` (5 subsequent siblings)
  8 siblings, 2 replies; 35+ messages in thread
From: Dave Chinner @ 2022-06-27  0:43 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

For upcoming changes to the way inode unlinked list processing is
done, the structure of recovery needs to change slightly. We also
really need to untangle the messy error handling in list recovery
so that actions like emptying the bucket on inode lookup failure
are associated with the bucket list walk failing, not failing
to look up the inode.

Refactor the recovery code now to keep the re-organisation seperate
to the algorithm changes.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log_recover.c | 162 ++++++++++++++++++++-------------------
 1 file changed, 84 insertions(+), 78 deletions(-)

diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index f360b46533a6..7d0f530d7e3c 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -2627,23 +2627,23 @@ xlog_recover_cancel_intents(
  * This routine performs a transaction to null out a bad inode pointer
  * in an agi unlinked inode hash bucket.
  */
-STATIC void
+static void
 xlog_recover_clear_agi_bucket(
-	xfs_mount_t	*mp,
-	xfs_agnumber_t	agno,
-	int		bucket)
+	struct xfs_mount	*mp,
+	struct xfs_perag	*pag,
+	int			bucket)
 {
-	xfs_trans_t	*tp;
-	xfs_agi_t	*agi;
-	struct xfs_buf	*agibp;
-	int		offset;
-	int		error;
+	struct xfs_trans	*tp;
+	struct xfs_agi		*agi;
+	struct xfs_buf		*agibp;
+	int			offset;
+	int			error;
 
 	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_clearagi, 0, 0, 0, &tp);
 	if (error)
 		goto out_error;
 
-	error = xfs_read_agi(mp, tp, agno, &agibp);
+	error = xfs_read_agi(mp, tp, pag->pag_agno, &agibp);
 	if (error)
 		goto out_abort;
 
@@ -2662,46 +2662,40 @@ xlog_recover_clear_agi_bucket(
 out_abort:
 	xfs_trans_cancel(tp);
 out_error:
-	xfs_warn(mp, "%s: failed to clear agi %d. Continuing.", __func__, agno);
+	xfs_warn(mp, "%s: failed to clear agi %d. Continuing.",
+		__func__, pag->pag_agno);
 	return;
 }
 
-STATIC xfs_agino_t
-xlog_recover_process_one_iunlink(
-	struct xfs_mount		*mp,
-	xfs_agnumber_t			agno,
-	xfs_agino_t			agino,
-	int				bucket)
+static int
+xlog_recover_iunlink_bucket(
+	struct xfs_mount	*mp,
+	struct xfs_perag	*pag,
+	struct xfs_agi		*agi,
+	int			bucket)
 {
-	struct xfs_inode		*ip;
-	xfs_ino_t			ino;
-	int				error;
+	struct xfs_inode	*ip;
+	xfs_agino_t		agino;
 
-	ino = XFS_AGINO_TO_INO(mp, agno, agino);
-	error = xfs_iget(mp, NULL, ino, 0, 0, &ip);
-	if (error)
-		goto fail;
+	agino = be32_to_cpu(agi->agi_unlinked[bucket]);
+	while (agino != NULLAGINO) {
+		int	error;
 
-	xfs_iflags_clear(ip, XFS_IRECOVERY);
-	ASSERT(VFS_I(ip)->i_nlink == 0);
-	ASSERT(VFS_I(ip)->i_mode != 0);
+		error = xfs_iget(mp, NULL,
+				XFS_AGINO_TO_INO(mp, pag->pag_agno, agino),
+				0, 0, &ip);
+		if (error)
+			return error;;
 
-	/* setup for the next pass */
-	agino = ip->i_next_unlinked;
-	xfs_irele(ip);
-	return agino;
+		ASSERT(VFS_I(ip)->i_nlink == 0);
+		ASSERT(VFS_I(ip)->i_mode != 0);
+		xfs_iflags_clear(ip, XFS_IRECOVERY);
+		agino = ip->i_next_unlinked;
 
- fail:
-	/*
-	 * We can't read in the inode this bucket points to, or this inode
-	 * is messed up.  Just ditch this bucket of inodes.  We will lose
-	 * some inodes and space, but at least we won't hang.
-	 *
-	 * Call xlog_recover_clear_agi_bucket() to perform a transaction to
-	 * clear the inode pointer in the bucket.
-	 */
-	xlog_recover_clear_agi_bucket(mp, agno, bucket);
-	return NULLAGINO;
+		xfs_irele(ip);
+		cond_resched();
+	}
+	return 0;
 }
 
 /*
@@ -2727,51 +2721,63 @@ xlog_recover_process_one_iunlink(
  * scheduled on this CPU to ensure other scheduled work can run without undue
  * latency.
  */
-STATIC void
-xlog_recover_process_iunlinks(
-	struct xlog	*log)
+static void
+xlog_recover_iunlink_ag(
+	struct xfs_mount	*mp,
+	struct xfs_perag	*pag)
 {
-	struct xfs_mount	*mp = log->l_mp;
-	struct xfs_perag	*pag;
-	xfs_agnumber_t		agno;
 	struct xfs_agi		*agi;
 	struct xfs_buf		*agibp;
-	xfs_agino_t		agino;
 	int			bucket;
 	int			error;
 
-	for_each_perag(mp, agno, pag) {
-		error = xfs_read_agi(mp, NULL, pag->pag_agno, &agibp);
+	error = xfs_read_agi(mp, NULL, pag->pag_agno, &agibp);
+	if (error) {
+		/*
+		 * AGI is b0rked. Don't process it.
+		 *
+		 * We should probably mark the filesystem as corrupt after we've
+		 * recovered all the ag's we can....
+		 */
+		return;
+	}
+
+	/*
+	 * Unlock the buffer so that it can be acquired in the normal course of
+	 * the transaction to truncate and free each inode.  Because we are not
+	 * racing with anyone else here for the AGI buffer, we don't even need
+	 * to hold it locked to read the initial unlinked bucket entries out of
+	 * the buffer. We keep buffer reference though, so that it stays pinned
+	 * in memory while we need the buffer.
+	 */
+	agi = agibp->b_addr;
+	xfs_buf_unlock(agibp);
+
+	for (bucket = 0; bucket < XFS_AGI_UNLINKED_BUCKETS; bucket++) {
+		error = xlog_recover_iunlink_bucket(mp, pag, agi, bucket);
 		if (error) {
 			/*
-			 * AGI is b0rked. Don't process it.
-			 *
-			 * We should probably mark the filesystem as corrupt
-			 * after we've recovered all the ag's we can....
+			 * Bucket is unrecoverable, so only a repair scan can
+			 * free the remaining unlinked inodes. Just empty the
+			 * bucket and remaining inodes on it unreferenced and
+			 * unfreeable.
 			 */
-			continue;
+			xlog_recover_clear_agi_bucket(mp, pag, bucket);
 		}
-		/*
-		 * Unlock the buffer so that it can be acquired in the normal
-		 * course of the transaction to truncate and free each inode.
-		 * Because we are not racing with anyone else here for the AGI
-		 * buffer, we don't even need to hold it locked to read the
-		 * initial unlinked bucket entries out of the buffer. We keep
-		 * buffer reference though, so that it stays pinned in memory
-		 * while we need the buffer.
-		 */
-		agi = agibp->b_addr;
-		xfs_buf_unlock(agibp);
-
-		for (bucket = 0; bucket < XFS_AGI_UNLINKED_BUCKETS; bucket++) {
-			agino = be32_to_cpu(agi->agi_unlinked[bucket]);
-			while (agino != NULLAGINO) {
-				agino = xlog_recover_process_one_iunlink(mp,
-						pag->pag_agno, agino, bucket);
-				cond_resched();
-			}
-		}
-		xfs_buf_rele(agibp);
+	}
+
+	xfs_buf_rele(agibp);
+}
+
+static void
+xlog_recover_process_iunlinks(
+	struct xlog	*log)
+{
+	struct xfs_perag	*pag;
+	xfs_agnumber_t		agno;
+
+	for_each_perag(log->l_mp, agno, pag) {
+		xlog_recover_iunlink_ag(log->l_mp, pag);
 	}
 
 	/*
@@ -2779,7 +2785,7 @@ xlog_recover_process_iunlinks(
 	 * are fully completed on disk and the incore inodes can be reclaimed
 	 * before we signal that recovery is complete.
 	 */
-	xfs_inodegc_flush(mp);
+	xfs_inodegc_flush(log->l_mp);
 }
 
 STATIC void
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 4/9] xfs: introduce xfs_iunlink_lookup
  2022-06-27  0:43 [PATCH 0/9 v3] xfs: in-memory iunlink items Dave Chinner
                   ` (2 preceding siblings ...)
  2022-06-27  0:43 ` [PATCH 3/9] xfs: refactor xlog_recover_process_iunlinks() Dave Chinner
@ 2022-06-27  0:43 ` Dave Chinner
  2022-06-29  7:17   ` Christoph Hellwig
  2022-06-29 21:01   ` Darrick J. Wong
  2022-06-27  0:43 ` [PATCH 5/9] xfs: double link the unlinked inode list Dave Chinner
                   ` (4 subsequent siblings)
  8 siblings, 2 replies; 35+ messages in thread
From: Dave Chinner @ 2022-06-27  0:43 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

When an inode is on an unlinked list during normal operation, it is
guaranteed to be pinned in memory as it is either referenced by the
current unlink operation or it has a open file descriptor that
references it and has it pinned in memory. Hence to look up an inode
on the unlinked list, we can do a direct inode cache lookup and
always expect the lookup to succeed.

Add a function to do this lookup based on the agino that we use to
link the chain of unlinked inodes together so we can begin the
conversion the unlinked list manipulations to use in-memory inodes
rather than inode cluster buffers and remove the backref cache.

Use this lookup function to replace the on-disk inode buffer walk
when removing inodes from the unlinked list with an in-core inode
unlinked list walk.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_inode.c | 165 +++++++++++++++++++--------------------------
 1 file changed, 71 insertions(+), 94 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index c507370bd885..e6ac13174062 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2006,6 +2006,39 @@ xfs_iunlink_destroy(
 	ASSERT(freed_anything == false || xfs_is_shutdown(pag->pag_mount));
 }
 
+/*
+ * Find an inode on the unlinked list. This does not take references to the
+ * inode, we have existence guarantees by holding the AGI buffer lock and that
+ * only unlinked, referenced inodes can be on the unlinked inode list.  If we
+ * don't find the inode in cache, then let the caller handle the situation.
+ */
+static struct xfs_inode *
+xfs_iunlink_lookup(
+	struct xfs_perag	*pag,
+	xfs_agino_t		agino)
+{
+	struct xfs_mount	*mp = pag->pag_mount;
+	struct xfs_inode	*ip;
+
+	rcu_read_lock();
+	ip = radix_tree_lookup(&pag->pag_ici_root, agino);
+
+	/* Inode not in memory, nothing to do */
+	if (!ip) {
+		rcu_read_unlock();
+		return NULL;
+	}
+	spin_lock(&ip->i_flags_lock);
+	if (ip->i_ino != XFS_AGINO_TO_INO(mp, pag->pag_agno, agino) ||
+	    (ip->i_flags & (XFS_IRECLAIMABLE | XFS_IRECLAIM))) {
+		/* Uh-oh! */
+		ip = NULL;
+	}
+	spin_unlock(&ip->i_flags_lock);
+	rcu_read_unlock();
+	return ip;
+}
+
 /*
  * Point the AGI unlinked bucket at an inode and log the results.  The caller
  * is responsible for validating the old value.
@@ -2111,7 +2144,8 @@ xfs_iunlink_update_inode(
 	 * current pointer is the same as the new value, unless we're
 	 * terminating the list.
 	 */
-	*old_next_agino = old_value;
+	if (old_next_agino)
+		*old_next_agino = old_value;
 	if (old_value == next_agino) {
 		if (next_agino != NULLAGINO) {
 			xfs_inode_verifier_error(ip, -EFSCORRUPTED, __func__,
@@ -2217,38 +2251,6 @@ xfs_iunlink(
 	return error;
 }
 
-/* Return the imap, dinode pointer, and buffer for an inode. */
-STATIC int
-xfs_iunlink_map_ino(
-	struct xfs_trans	*tp,
-	xfs_agnumber_t		agno,
-	xfs_agino_t		agino,
-	struct xfs_imap		*imap,
-	struct xfs_dinode	**dipp,
-	struct xfs_buf		**bpp)
-{
-	struct xfs_mount	*mp = tp->t_mountp;
-	int			error;
-
-	imap->im_blkno = 0;
-	error = xfs_imap(mp, tp, XFS_AGINO_TO_INO(mp, agno, agino), imap, 0);
-	if (error) {
-		xfs_warn(mp, "%s: xfs_imap returned error %d.",
-				__func__, error);
-		return error;
-	}
-
-	error = xfs_imap_to_bp(mp, tp, imap, bpp);
-	if (error) {
-		xfs_warn(mp, "%s: xfs_imap_to_bp returned error %d.",
-				__func__, error);
-		return error;
-	}
-
-	*dipp = xfs_buf_offset(*bpp, imap->im_boffset);
-	return 0;
-}
-
 /*
  * Walk the unlinked chain from @head_agino until we find the inode that
  * points to @target_agino.  Return the inode number, map, dinode pointer,
@@ -2259,77 +2261,51 @@ xfs_iunlink_map_ino(
  *
  * Do not call this function if @target_agino is the head of the list.
  */
-STATIC int
-xfs_iunlink_map_prev(
-	struct xfs_trans	*tp,
+static int
+xfs_iunlink_lookup_prev(
 	struct xfs_perag	*pag,
 	xfs_agino_t		head_agino,
 	xfs_agino_t		target_agino,
-	xfs_agino_t		*agino,
-	struct xfs_imap		*imap,
-	struct xfs_dinode	**dipp,
-	struct xfs_buf		**bpp)
+	struct xfs_inode	**ipp)
 {
-	struct xfs_mount	*mp = tp->t_mountp;
+	struct xfs_mount	*mp = pag->pag_mount;
+	struct xfs_inode	*ip;
 	xfs_agino_t		next_agino;
-	int			error;
 
-	ASSERT(head_agino != target_agino);
-	*bpp = NULL;
+	*ipp = NULL;
 
 	/* See if our backref cache can find it faster. */
-	*agino = xfs_iunlink_lookup_backref(pag, target_agino);
-	if (*agino != NULLAGINO) {
-		error = xfs_iunlink_map_ino(tp, pag->pag_agno, *agino, imap,
-				dipp, bpp);
-		if (error)
-			return error;
-
-		if (be32_to_cpu((*dipp)->di_next_unlinked) == target_agino)
+	next_agino = xfs_iunlink_lookup_backref(pag, target_agino);
+	if (next_agino != NULLAGINO) {
+		ip = xfs_iunlink_lookup(pag, next_agino);
+		if (ip && ip->i_next_unlinked == target_agino) {
+			*ipp = ip;
 			return 0;
-
-		/*
-		 * If we get here the cache contents were corrupt, so drop the
-		 * buffer and fall back to walking the bucket list.
-		 */
-		xfs_trans_brelse(tp, *bpp);
-		*bpp = NULL;
-		WARN_ON_ONCE(1);
+		}
 	}
 
-	trace_xfs_iunlink_map_prev_fallback(mp, pag->pag_agno);
-
 	/* Otherwise, walk the entire bucket until we find it. */
 	next_agino = head_agino;
-	while (next_agino != target_agino) {
-		xfs_agino_t	unlinked_agino;
+	while (next_agino != NULLAGINO) {
+		ip = xfs_iunlink_lookup(pag, next_agino);
+		if (!ip)
+			return -EFSCORRUPTED;
 
-		if (*bpp)
-			xfs_trans_brelse(tp, *bpp);
-
-		*agino = next_agino;
-		error = xfs_iunlink_map_ino(tp, pag->pag_agno, next_agino, imap,
-				dipp, bpp);
-		if (error)
-			return error;
-
-		unlinked_agino = be32_to_cpu((*dipp)->di_next_unlinked);
 		/*
 		 * Make sure this pointer is valid and isn't an obvious
 		 * infinite loop.
 		 */
-		if (!xfs_verify_agino(mp, pag->pag_agno, unlinked_agino) ||
-		    next_agino == unlinked_agino) {
-			XFS_CORRUPTION_ERROR(__func__,
-					XFS_ERRLEVEL_LOW, mp,
-					*dipp, sizeof(**dipp));
-			error = -EFSCORRUPTED;
-			return error;
+		if (!xfs_verify_agino(mp, pag->pag_agno, ip->i_next_unlinked) ||
+		    next_agino == ip->i_next_unlinked)
+			return -EFSCORRUPTED;
+
+		if (ip->i_next_unlinked == target_agino) {
+			*ipp = ip;
+			return 0;
 		}
-		next_agino = unlinked_agino;
+		next_agino = ip->i_next_unlinked;
 	}
-
-	return 0;
+	return -EFSCORRUPTED;
 }
 
 static int
@@ -2341,8 +2317,6 @@ xfs_iunlink_remove_inode(
 {
 	struct xfs_mount	*mp = tp->t_mountp;
 	struct xfs_agi		*agi = agibp->b_addr;
-	struct xfs_buf		*last_ibp;
-	struct xfs_dinode	*last_dip = NULL;
 	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
 	xfs_agino_t		next_agino;
 	xfs_agino_t		head_agino;
@@ -2368,7 +2342,6 @@ xfs_iunlink_remove_inode(
 	error = xfs_iunlink_update_inode(tp, ip, pag, NULLAGINO, &next_agino);
 	if (error)
 		return error;
-	ip->i_next_unlinked = NULLAGINO;
 
 	/*
 	 * If there was a backref pointing from the next inode back to this
@@ -2384,18 +2357,21 @@ xfs_iunlink_remove_inode(
 	}
 
 	if (head_agino != agino) {
-		struct xfs_imap	imap;
-		xfs_agino_t	prev_agino;
+		struct xfs_inode	*prev_ip;
 
-		/* We need to search the list for the inode being freed. */
-		error = xfs_iunlink_map_prev(tp, pag, head_agino, agino,
-				&prev_agino, &imap, &last_dip, &last_ibp);
+		error = xfs_iunlink_lookup_prev(pag, head_agino, agino,
+				&prev_ip);
 		if (error)
 			return error;
 
 		/* Point the previous inode on the list to the next inode. */
-		xfs_iunlink_update_dinode(tp, pag, prev_agino, last_ibp,
-				last_dip, &imap, next_agino);
+		error = xfs_iunlink_update_inode(tp, prev_ip, pag, next_agino,
+				NULL);
+		if (error)
+			return error;
+
+		prev_ip->i_next_unlinked = ip->i_next_unlinked;
+		ip->i_next_unlinked = NULLAGINO;
 
 		/*
 		 * Now we deal with the backref for this inode.  If this inode
@@ -2410,6 +2386,7 @@ xfs_iunlink_remove_inode(
 	}
 
 	/* Point the head of the list to the next unlinked inode. */
+	ip->i_next_unlinked = NULLAGINO;
 	return xfs_iunlink_update_bucket(tp, pag, agibp, bucket_index,
 			next_agino);
 }
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 5/9] xfs: double link the unlinked inode list
  2022-06-27  0:43 [PATCH 0/9 v3] xfs: in-memory iunlink items Dave Chinner
                   ` (3 preceding siblings ...)
  2022-06-27  0:43 ` [PATCH 4/9] xfs: introduce xfs_iunlink_lookup Dave Chinner
@ 2022-06-27  0:43 ` Dave Chinner
  2022-06-29  7:20   ` Christoph Hellwig
  2022-06-29 21:06   ` Darrick J. Wong
  2022-06-27  0:43 ` [PATCH 6/9] xfs: clean up xfs_iunlink_update_inode() Dave Chinner
                   ` (3 subsequent siblings)
  8 siblings, 2 replies; 35+ messages in thread
From: Dave Chinner @ 2022-06-27  0:43 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Now we have forwards traversal via the incore inode in place, we now
need to add back pointers to the incore inode to entirely replace
the back reference cache. We use the same lookup semantics and
constraints as for the forwards pointer lookups during unlinks, and
so we can look up any inode in the unlinked list directly and update
the list pointers, forwards or backwards, at any time.

The only wrinkle in converting the unlinked list manipulations to
use in-core previous pointers is that log recovery doesn't have the
incore inode state built up so it can't just read in an inode and
release it to finish off the unlink. Hence we need to modify the
traversal in recovery to read one inode ahead before we
release the inode at the head of the list. This populates the
next->prev relationship sufficient to be able to replay the unlinked
list and hence greatly simplify the runtime code.

This recovery algorithm also requires that we actually remove inodes
from the unlinked list one at a time as background inode
inactivation will result in unlinked list removal racing with the
building of the in-memory unlinked list state. We could serialise
this by holding the AGI buffer lock when constructing the in memory
state, but all that does is lockstep background processing with list
building. It is much simpler to flush the inodegc immediately after
releasing the inode so that it is unlinked immediately and there is
no races present at all.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_ag.c   |   8 -
 fs/xfs/libxfs/xfs_ag.h   |   6 -
 fs/xfs/xfs_icache.c      |   2 +
 fs/xfs/xfs_inode.c       | 348 +++++++--------------------------------
 fs/xfs/xfs_inode.h       |   4 +-
 fs/xfs/xfs_log_recover.c |  34 +++-
 6 files changed, 90 insertions(+), 312 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_ag.c b/fs/xfs/libxfs/xfs_ag.c
index 3e920cf1b454..2de67389fe60 100644
--- a/fs/xfs/libxfs/xfs_ag.c
+++ b/fs/xfs/libxfs/xfs_ag.c
@@ -194,7 +194,6 @@ xfs_free_perag(
 		XFS_IS_CORRUPT(pag->pag_mount, atomic_read(&pag->pag_ref) != 0);
 
 		cancel_delayed_work_sync(&pag->pag_blockgc_work);
-		xfs_iunlink_destroy(pag);
 		xfs_buf_hash_destroy(pag);
 
 		call_rcu(&pag->rcu_head, __xfs_free_perag);
@@ -263,10 +262,6 @@ xfs_initialize_perag(
 		if (error)
 			goto out_remove_pag;
 
-		error = xfs_iunlink_init(pag);
-		if (error)
-			goto out_hash_destroy;
-
 		/* first new pag is fully initialized */
 		if (first_initialised == NULLAGNUMBER)
 			first_initialised = index;
@@ -280,8 +275,6 @@ xfs_initialize_perag(
 	mp->m_ag_prealloc_blocks = xfs_prealloc_blocks(mp);
 	return 0;
 
-out_hash_destroy:
-	xfs_buf_hash_destroy(pag);
 out_remove_pag:
 	radix_tree_delete(&mp->m_perag_tree, index);
 out_free_pag:
@@ -293,7 +286,6 @@ xfs_initialize_perag(
 		if (!pag)
 			break;
 		xfs_buf_hash_destroy(pag);
-		xfs_iunlink_destroy(pag);
 		kmem_free(pag);
 	}
 	return error;
diff --git a/fs/xfs/libxfs/xfs_ag.h b/fs/xfs/libxfs/xfs_ag.h
index e411d51c2589..bd5f715d116d 100644
--- a/fs/xfs/libxfs/xfs_ag.h
+++ b/fs/xfs/libxfs/xfs_ag.h
@@ -97,12 +97,6 @@ struct xfs_perag {
 	/* background prealloc block trimming */
 	struct delayed_work	pag_blockgc_work;
 
-	/*
-	 * Unlinked inode information.  This incore information reflects
-	 * data stored in the AGI, so callers must hold the AGI buffer lock
-	 * or have some other means to control concurrency.
-	 */
-	struct rhashtable	pagi_unlinked_hash;
 #endif /* __KERNEL__ */
 };
 
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 5269354b1b69..9fc324a29535 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -111,6 +111,8 @@ xfs_inode_alloc(
 	INIT_WORK(&ip->i_ioend_work, xfs_end_io);
 	INIT_LIST_HEAD(&ip->i_ioend_list);
 	spin_lock_init(&ip->i_ioend_lock);
+	ip->i_next_unlinked = NULLAGINO;
+	ip->i_prev_unlinked = NULLAGINO;
 
 	return ip;
 }
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index e6ac13174062..f390a91243bf 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1815,196 +1815,21 @@ xfs_inactive(
  * because we must walk that list to find the inode that points to the inode
  * being removed from the unlinked hash bucket list.
  *
- * What if we modelled the unlinked list as a collection of records capturing
- * "X.next_unlinked = Y" relations?  If we indexed those records on Y, we'd
- * have a fast way to look up unlinked list predecessors, which avoids the
- * slow list walk.  That's exactly what we do here (in-core) with a per-AG
- * rhashtable.
+ * Hence we keep an in-memory double linked list to link each inode on an
+ * unlinked list. Because there are 64 unlinked lists per AGI, keeping pointer
+ * based lists qould require having 64 list heads in the perag, one for each
+ * list. This is expensive in terms of memory (think millions of AGs) and cache
+ * misses on lookups. Instead, use the fact that inodes on the unlinked list
+ * must be referenced at the VFS level to keep them on the list and hence we
+ * have an existence guarantee for inodes on the unlinked list.
  *
- * Because this is a backref cache, we ignore operational failures since the
- * iunlink code can fall back to the slow bucket walk.  The only errors that
- * should bubble out are for obviously incorrect situations.
- *
- * All users of the backref cache MUST hold the AGI buffer lock to serialize
- * access or have otherwise provided for concurrency control.
- */
-
-/* Capture a "X.next_unlinked = Y" relationship. */
-struct xfs_iunlink {
-	struct rhash_head	iu_rhash_head;
-	xfs_agino_t		iu_agino;		/* X */
-	xfs_agino_t		iu_next_unlinked;	/* Y */
-};
-
-/* Unlinked list predecessor lookup hashtable construction */
-static int
-xfs_iunlink_obj_cmpfn(
-	struct rhashtable_compare_arg	*arg,
-	const void			*obj)
-{
-	const xfs_agino_t		*key = arg->key;
-	const struct xfs_iunlink	*iu = obj;
-
-	if (iu->iu_next_unlinked != *key)
-		return 1;
-	return 0;
-}
-
-static const struct rhashtable_params xfs_iunlink_hash_params = {
-	.min_size		= XFS_AGI_UNLINKED_BUCKETS,
-	.key_len		= sizeof(xfs_agino_t),
-	.key_offset		= offsetof(struct xfs_iunlink,
-					   iu_next_unlinked),
-	.head_offset		= offsetof(struct xfs_iunlink, iu_rhash_head),
-	.automatic_shrinking	= true,
-	.obj_cmpfn		= xfs_iunlink_obj_cmpfn,
-};
-
-/*
- * Return X, where X.next_unlinked == @agino.  Returns NULLAGINO if no such
- * relation is found.
- */
-static xfs_agino_t
-xfs_iunlink_lookup_backref(
-	struct xfs_perag	*pag,
-	xfs_agino_t		agino)
-{
-	struct xfs_iunlink	*iu;
-
-	iu = rhashtable_lookup_fast(&pag->pagi_unlinked_hash, &agino,
-			xfs_iunlink_hash_params);
-	return iu ? iu->iu_agino : NULLAGINO;
-}
-
-/*
- * Take ownership of an iunlink cache entry and insert it into the hash table.
- * If successful, the entry will be owned by the cache; if not, it is freed.
- * Either way, the caller does not own @iu after this call.
- */
-static int
-xfs_iunlink_insert_backref(
-	struct xfs_perag	*pag,
-	struct xfs_iunlink	*iu)
-{
-	int			error;
-
-	error = rhashtable_insert_fast(&pag->pagi_unlinked_hash,
-			&iu->iu_rhash_head, xfs_iunlink_hash_params);
-	/*
-	 * Fail loudly if there already was an entry because that's a sign of
-	 * corruption of in-memory data.  Also fail loudly if we see an error
-	 * code we didn't anticipate from the rhashtable code.  Currently we
-	 * only anticipate ENOMEM.
-	 */
-	if (error) {
-		WARN(error != -ENOMEM, "iunlink cache insert error %d", error);
-		kmem_free(iu);
-	}
-	/*
-	 * Absorb any runtime errors that aren't a result of corruption because
-	 * this is a cache and we can always fall back to bucket list scanning.
-	 */
-	if (error != 0 && error != -EEXIST)
-		error = 0;
-	return error;
-}
-
-/* Remember that @prev_agino.next_unlinked = @this_agino. */
-static int
-xfs_iunlink_add_backref(
-	struct xfs_perag	*pag,
-	xfs_agino_t		prev_agino,
-	xfs_agino_t		this_agino)
-{
-	struct xfs_iunlink	*iu;
-
-	if (XFS_TEST_ERROR(false, pag->pag_mount, XFS_ERRTAG_IUNLINK_FALLBACK))
-		return 0;
-
-	iu = kmem_zalloc(sizeof(*iu), KM_NOFS);
-	iu->iu_agino = prev_agino;
-	iu->iu_next_unlinked = this_agino;
-
-	return xfs_iunlink_insert_backref(pag, iu);
-}
-
-/*
- * Replace X.next_unlinked = @agino with X.next_unlinked = @next_unlinked.
- * If @next_unlinked is NULLAGINO, we drop the backref and exit.  If there
- * wasn't any such entry then we don't bother.
+ * Given we have an existence guarantee, we can use lockless inode cache lookups
+ * to resolve aginos to xfs inodes. This means we only need 8 bytes per inode
+ * for the double linked unlinked list, and we don't need any extra locking to
+ * keep the list safe as all manipulations are done under the AGI buffer lock.
+ * Keeping the list up to date does not require memory allocation, just finding
+ * the XFS inode and updating the next/prev unlinked list aginos.
  */
-static int
-xfs_iunlink_change_backref(
-	struct xfs_perag	*pag,
-	xfs_agino_t		agino,
-	xfs_agino_t		next_unlinked)
-{
-	struct xfs_iunlink	*iu;
-	int			error;
-
-	/* Look up the old entry; if there wasn't one then exit. */
-	iu = rhashtable_lookup_fast(&pag->pagi_unlinked_hash, &agino,
-			xfs_iunlink_hash_params);
-	if (!iu)
-		return 0;
-
-	/*
-	 * Remove the entry.  This shouldn't ever return an error, but if we
-	 * couldn't remove the old entry we don't want to add it again to the
-	 * hash table, and if the entry disappeared on us then someone's
-	 * violated the locking rules and we need to fail loudly.  Either way
-	 * we cannot remove the inode because internal state is or would have
-	 * been corrupt.
-	 */
-	error = rhashtable_remove_fast(&pag->pagi_unlinked_hash,
-			&iu->iu_rhash_head, xfs_iunlink_hash_params);
-	if (error)
-		return error;
-
-	/* If there is no new next entry just free our item and return. */
-	if (next_unlinked == NULLAGINO) {
-		kmem_free(iu);
-		return 0;
-	}
-
-	/* Update the entry and re-add it to the hash table. */
-	iu->iu_next_unlinked = next_unlinked;
-	return xfs_iunlink_insert_backref(pag, iu);
-}
-
-/* Set up the in-core predecessor structures. */
-int
-xfs_iunlink_init(
-	struct xfs_perag	*pag)
-{
-	return rhashtable_init(&pag->pagi_unlinked_hash,
-			&xfs_iunlink_hash_params);
-}
-
-/* Free the in-core predecessor structures. */
-static void
-xfs_iunlink_free_item(
-	void			*ptr,
-	void			*arg)
-{
-	struct xfs_iunlink	*iu = ptr;
-	bool			*freed_anything = arg;
-
-	*freed_anything = true;
-	kmem_free(iu);
-}
-
-void
-xfs_iunlink_destroy(
-	struct xfs_perag	*pag)
-{
-	bool			freed_anything = false;
-
-	rhashtable_free_and_destroy(&pag->pagi_unlinked_hash,
-			xfs_iunlink_free_item, &freed_anything);
-
-	ASSERT(freed_anything == false || xfs_is_shutdown(pag->pag_mount));
-}
 
 /*
  * Find an inode on the unlinked list. This does not take references to the
@@ -2039,6 +1864,26 @@ xfs_iunlink_lookup(
 	return ip;
 }
 
+/* Update the prev pointer of the next agino. */
+static int
+xfs_iunlink_update_backref(
+	struct xfs_perag	*pag,
+	xfs_agino_t		prev_agino,
+	xfs_agino_t		next_agino)
+{
+	struct xfs_inode	*ip;
+
+	/* No update necessary if we are at the end of the list. */
+	if (next_agino == NULLAGINO)
+		return 0;
+
+	ip = xfs_iunlink_lookup(pag, next_agino);
+	if (!ip)
+		return -EFSCORRUPTED;
+	ip->i_prev_unlinked = prev_agino;
+	return 0;
+}
+
 /*
  * Point the AGI unlinked bucket at an inode and log the results.  The caller
  * is responsible for validating the old value.
@@ -2170,7 +2015,7 @@ xfs_iunlink_insert_inode(
 	struct xfs_perag	*pag,
 	struct xfs_buf		*agibp,
 	struct xfs_inode	*ip)
- {
+{
 	struct xfs_mount	*mp = tp->t_mountp;
 	struct xfs_agi		*agi = agibp->b_addr;
 	xfs_agino_t		next_agino;
@@ -2190,6 +2035,14 @@ xfs_iunlink_insert_inode(
 		return -EFSCORRUPTED;
 	}
 
+	/*
+	 * Update the prev pointer in the next inode to point back to this
+	 * inode.
+	 */
+	error = xfs_iunlink_update_backref(pag, agino, next_agino);
+	if (error)
+		return error;
+
 	if (next_agino != NULLAGINO) {
 		xfs_agino_t		old_agino;
 
@@ -2203,14 +2056,6 @@ xfs_iunlink_insert_inode(
 			return error;
 		ASSERT(old_agino == NULLAGINO);
 		ip->i_next_unlinked = next_agino;
-
-		/*
-		 * agino has been unlinked, add a backref from the next inode
-		 * back to agino.
-		 */
-		error = xfs_iunlink_add_backref(pag, agino, next_agino);
-		if (error)
-			return error;
 	}
 
 	/* Point the head of the list to point to this inode. */
@@ -2251,63 +2096,6 @@ xfs_iunlink(
 	return error;
 }
 
-/*
- * Walk the unlinked chain from @head_agino until we find the inode that
- * points to @target_agino.  Return the inode number, map, dinode pointer,
- * and inode cluster buffer of that inode as @agino, @imap, @dipp, and @bpp.
- *
- * @tp, @pag, @head_agino, and @target_agino are input parameters.
- * @agino, @imap, @dipp, and @bpp are all output parameters.
- *
- * Do not call this function if @target_agino is the head of the list.
- */
-static int
-xfs_iunlink_lookup_prev(
-	struct xfs_perag	*pag,
-	xfs_agino_t		head_agino,
-	xfs_agino_t		target_agino,
-	struct xfs_inode	**ipp)
-{
-	struct xfs_mount	*mp = pag->pag_mount;
-	struct xfs_inode	*ip;
-	xfs_agino_t		next_agino;
-
-	*ipp = NULL;
-
-	/* See if our backref cache can find it faster. */
-	next_agino = xfs_iunlink_lookup_backref(pag, target_agino);
-	if (next_agino != NULLAGINO) {
-		ip = xfs_iunlink_lookup(pag, next_agino);
-		if (ip && ip->i_next_unlinked == target_agino) {
-			*ipp = ip;
-			return 0;
-		}
-	}
-
-	/* Otherwise, walk the entire bucket until we find it. */
-	next_agino = head_agino;
-	while (next_agino != NULLAGINO) {
-		ip = xfs_iunlink_lookup(pag, next_agino);
-		if (!ip)
-			return -EFSCORRUPTED;
-
-		/*
-		 * Make sure this pointer is valid and isn't an obvious
-		 * infinite loop.
-		 */
-		if (!xfs_verify_agino(mp, pag->pag_agno, ip->i_next_unlinked) ||
-		    next_agino == ip->i_next_unlinked)
-			return -EFSCORRUPTED;
-
-		if (ip->i_next_unlinked == target_agino) {
-			*ipp = ip;
-			return 0;
-		}
-		next_agino = ip->i_next_unlinked;
-	}
-	return -EFSCORRUPTED;
-}
-
 static int
 xfs_iunlink_remove_inode(
 	struct xfs_trans	*tp,
@@ -2344,51 +2132,33 @@ xfs_iunlink_remove_inode(
 		return error;
 
 	/*
-	 * If there was a backref pointing from the next inode back to this
-	 * one, remove it because we've removed this inode from the list.
-	 *
-	 * Later, if this inode was in the middle of the list we'll update
-	 * this inode's backref to point from the next inode.
+	 * Update the prev pointer in the next inode to point back to previous
+	 * inode in the chain.
 	 */
-	if (next_agino != NULLAGINO) {
-		error = xfs_iunlink_change_backref(pag, next_agino, NULLAGINO);
-		if (error)
-			return error;
-	}
+	error = xfs_iunlink_update_backref(pag, ip->i_prev_unlinked,
+			ip->i_next_unlinked);
+	if (error)
+		return error;
 
 	if (head_agino != agino) {
 		struct xfs_inode	*prev_ip;
 
-		error = xfs_iunlink_lookup_prev(pag, head_agino, agino,
-				&prev_ip);
-		if (error)
-			return error;
-
-		/* Point the previous inode on the list to the next inode. */
-		error = xfs_iunlink_update_inode(tp, prev_ip, pag, next_agino,
-				NULL);
-		if (error)
-			return error;
+		prev_ip = xfs_iunlink_lookup(pag, ip->i_prev_unlinked);
+		if (!prev_ip)
+			return -EFSCORRUPTED;
 
+		error = xfs_iunlink_update_inode(tp, prev_ip, pag,
+				ip->i_next_unlinked, NULL);
 		prev_ip->i_next_unlinked = ip->i_next_unlinked;
-		ip->i_next_unlinked = NULLAGINO;
-
-		/*
-		 * Now we deal with the backref for this inode.  If this inode
-		 * pointed at a real inode, change the backref that pointed to
-		 * us to point to our old next.  If this inode was the end of
-		 * the list, delete the backref that pointed to us.  Note that
-		 * change_backref takes care of deleting the backref if
-		 * next_agino is NULLAGINO.
-		 */
-		return xfs_iunlink_change_backref(agibp->b_pag, agino,
-				next_agino);
+	} else {
+		/* Point the head of the list to the next unlinked inode. */
+		error = xfs_iunlink_update_bucket(tp, pag, agibp, bucket_index,
+				ip->i_next_unlinked);
 	}
 
-	/* Point the head of the list to the next unlinked inode. */
 	ip->i_next_unlinked = NULLAGINO;
-	return xfs_iunlink_update_bucket(tp, pag, agibp, bucket_index,
-			next_agino);
+	ip->i_prev_unlinked = NULLAGINO;
+	return error;
 }
 
 /*
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 8e2a33c6cbe2..8d8cce61e5ba 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -70,6 +70,7 @@ typedef struct xfs_inode {
 
 	/* unlinked list pointers */
 	xfs_agino_t		i_next_unlinked;
+	xfs_agino_t		i_prev_unlinked;
 
 	/* VFS inode */
 	struct inode		i_vnode;	/* embedded VFS inode */
@@ -508,9 +509,6 @@ extern struct kmem_cache	*xfs_inode_cache;
 
 bool xfs_inode_needs_inactive(struct xfs_inode *ip);
 
-int xfs_iunlink_init(struct xfs_perag *pag);
-void xfs_iunlink_destroy(struct xfs_perag *pag);
-
 void xfs_end_io(struct work_struct *work);
 
 int xfs_ilock2_io_mmap(struct xfs_inode *ip1, struct xfs_inode *ip2);
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 7d0f530d7e3c..8e97f4240b93 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -2674,28 +2674,50 @@ xlog_recover_iunlink_bucket(
 	struct xfs_agi		*agi,
 	int			bucket)
 {
+	struct xfs_inode	*prev_ip = NULL;
 	struct xfs_inode	*ip;
-	xfs_agino_t		agino;
+	xfs_agino_t		prev_agino, agino;
+	int			error = 0;
 
 	agino = be32_to_cpu(agi->agi_unlinked[bucket]);
 	while (agino != NULLAGINO) {
-		int	error;
 
 		error = xfs_iget(mp, NULL,
 				XFS_AGINO_TO_INO(mp, pag->pag_agno, agino),
 				0, 0, &ip);
 		if (error)
-			return error;;
+			break;
 
 		ASSERT(VFS_I(ip)->i_nlink == 0);
 		ASSERT(VFS_I(ip)->i_mode != 0);
 		xfs_iflags_clear(ip, XFS_IRECOVERY);
 		agino = ip->i_next_unlinked;
 
-		xfs_irele(ip);
-		cond_resched();
+		if (prev_ip) {
+			ip->i_prev_unlinked = prev_agino;
+			xfs_irele(prev_ip);
+
+			/*
+			 * Ensure the inode is removed from the unlinked list
+			 * before we continue so that it race with building the
+			 * in-memory list here. THis could be serialised with
+			 * the agibp lock, but that just serialises via
+			 * lockstepping and it's much simpler just to flush the
+			 * inodegc queue and wait for it to complete.
+			 */
+			xfs_inodegc_flush(mp);
+		}
+
+		prev_agino = agino;
+		prev_ip = ip;
 	}
-	return 0;
+
+	if (prev_ip) {
+		ip->i_prev_unlinked = prev_agino;
+		xfs_irele(prev_ip);
+	}
+	xfs_inodegc_flush(mp);
+	return error;
 }
 
 /*
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 6/9] xfs: clean up xfs_iunlink_update_inode()
  2022-06-27  0:43 [PATCH 0/9 v3] xfs: in-memory iunlink items Dave Chinner
                   ` (4 preceding siblings ...)
  2022-06-27  0:43 ` [PATCH 5/9] xfs: double link the unlinked inode list Dave Chinner
@ 2022-06-27  0:43 ` Dave Chinner
  2022-06-29  7:21   ` Christoph Hellwig
  2022-06-29 21:07   ` Darrick J. Wong
  2022-06-27  0:43 ` [PATCH 7/9] xfs: combine iunlink inode update functions Dave Chinner
                   ` (2 subsequent siblings)
  8 siblings, 2 replies; 35+ messages in thread
From: Dave Chinner @ 2022-06-27  0:43 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

We no longer need to have this function return the previous next
agino value from the on-disk inode as we have it in the in-core
inode now.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_inode.c | 18 +++++-------------
 1 file changed, 5 insertions(+), 13 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index f390a91243bf..8d4edb8129b5 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1953,13 +1953,12 @@ xfs_iunlink_update_dinode(
 }
 
 /* Set an in-core inode's unlinked pointer and return the old value. */
-STATIC int
+static int
 xfs_iunlink_update_inode(
 	struct xfs_trans	*tp,
 	struct xfs_inode	*ip,
 	struct xfs_perag	*pag,
-	xfs_agino_t		next_agino,
-	xfs_agino_t		*old_next_agino)
+	xfs_agino_t		next_agino)
 {
 	struct xfs_mount	*mp = tp->t_mountp;
 	struct xfs_dinode	*dip;
@@ -1989,8 +1988,6 @@ xfs_iunlink_update_inode(
 	 * current pointer is the same as the new value, unless we're
 	 * terminating the list.
 	 */
-	if (old_next_agino)
-		*old_next_agino = old_value;
 	if (old_value == next_agino) {
 		if (next_agino != NULLAGINO) {
 			xfs_inode_verifier_error(ip, -EFSCORRUPTED, __func__,
@@ -2044,17 +2041,13 @@ xfs_iunlink_insert_inode(
 		return error;
 
 	if (next_agino != NULLAGINO) {
-		xfs_agino_t		old_agino;
-
 		/*
 		 * There is already another inode in the bucket, so point this
 		 * inode to the current head of the list.
 		 */
-		error = xfs_iunlink_update_inode(tp, ip, pag, next_agino,
-				&old_agino);
+		error = xfs_iunlink_update_inode(tp, ip, pag, next_agino);
 		if (error)
 			return error;
-		ASSERT(old_agino == NULLAGINO);
 		ip->i_next_unlinked = next_agino;
 	}
 
@@ -2106,7 +2099,6 @@ xfs_iunlink_remove_inode(
 	struct xfs_mount	*mp = tp->t_mountp;
 	struct xfs_agi		*agi = agibp->b_addr;
 	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
-	xfs_agino_t		next_agino;
 	xfs_agino_t		head_agino;
 	short			bucket_index = agino % XFS_AGI_UNLINKED_BUCKETS;
 	int			error;
@@ -2127,7 +2119,7 @@ xfs_iunlink_remove_inode(
 	 * the old pointer value so that we can update whatever was previous
 	 * to us in the list to point to whatever was next in the list.
 	 */
-	error = xfs_iunlink_update_inode(tp, ip, pag, NULLAGINO, &next_agino);
+	error = xfs_iunlink_update_inode(tp, ip, pag, NULLAGINO);
 	if (error)
 		return error;
 
@@ -2148,7 +2140,7 @@ xfs_iunlink_remove_inode(
 			return -EFSCORRUPTED;
 
 		error = xfs_iunlink_update_inode(tp, prev_ip, pag,
-				ip->i_next_unlinked, NULL);
+				ip->i_next_unlinked);
 		prev_ip->i_next_unlinked = ip->i_next_unlinked;
 	} else {
 		/* Point the head of the list to the next unlinked inode. */
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 7/9] xfs: combine iunlink inode update functions
  2022-06-27  0:43 [PATCH 0/9 v3] xfs: in-memory iunlink items Dave Chinner
                   ` (5 preceding siblings ...)
  2022-06-27  0:43 ` [PATCH 6/9] xfs: clean up xfs_iunlink_update_inode() Dave Chinner
@ 2022-06-27  0:43 ` Dave Chinner
  2022-06-29  7:21   ` Christoph Hellwig
  2022-06-29 21:07   ` Darrick J. Wong
  2022-06-27  0:43 ` [PATCH 8/9] xfs: add log item precommit operation Dave Chinner
  2022-06-27  0:43 ` [PATCH 9/9] xfs: add in-memory iunlink log item Dave Chinner
  8 siblings, 2 replies; 35+ messages in thread
From: Dave Chinner @ 2022-06-27  0:43 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Combine the logging of the inode unlink list update into the
calling function that looks up the buffer we end up logging. These
do not need to be separate functions as they are both short, simple
operations and there's only a single call path through them. This
new function will end up being the core of the iunlink log item
processing...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_inode.c | 52 ++++++++++++++--------------------------------
 1 file changed, 16 insertions(+), 36 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 8d4edb8129b5..d6c88a27f29d 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1923,38 +1923,9 @@ xfs_iunlink_update_bucket(
 	return 0;
 }
 
-/* Set an on-disk inode's next_unlinked pointer. */
-STATIC void
-xfs_iunlink_update_dinode(
-	struct xfs_trans	*tp,
-	struct xfs_perag	*pag,
-	xfs_agino_t		agino,
-	struct xfs_buf		*ibp,
-	struct xfs_dinode	*dip,
-	struct xfs_imap		*imap,
-	xfs_agino_t		next_agino)
-{
-	struct xfs_mount	*mp = tp->t_mountp;
-	int			offset;
-
-	ASSERT(xfs_verify_agino_or_null(mp, pag->pag_agno, next_agino));
-
-	trace_xfs_iunlink_update_dinode(mp, pag->pag_agno, agino,
-			be32_to_cpu(dip->di_next_unlinked), next_agino);
-
-	dip->di_next_unlinked = cpu_to_be32(next_agino);
-	offset = imap->im_boffset +
-			offsetof(struct xfs_dinode, di_next_unlinked);
-
-	/* need to recalc the inode CRC if appropriate */
-	xfs_dinode_calc_crc(mp, dip);
-	xfs_trans_inode_buf(tp, ibp);
-	xfs_trans_log_buf(tp, ibp, offset, offset + sizeof(xfs_agino_t) - 1);
-}
-
 /* Set an in-core inode's unlinked pointer and return the old value. */
 static int
-xfs_iunlink_update_inode(
+xfs_iunlink_log_inode(
 	struct xfs_trans	*tp,
 	struct xfs_inode	*ip,
 	struct xfs_perag	*pag,
@@ -1964,6 +1935,7 @@ xfs_iunlink_update_inode(
 	struct xfs_dinode	*dip;
 	struct xfs_buf		*ibp;
 	xfs_agino_t		old_value;
+	int			offset;
 	int			error;
 
 	ASSERT(xfs_verify_agino_or_null(mp, pag->pag_agno, next_agino));
@@ -1997,9 +1969,17 @@ xfs_iunlink_update_inode(
 		goto out;
 	}
 
-	/* Ok, update the new pointer. */
-	xfs_iunlink_update_dinode(tp, pag, XFS_INO_TO_AGINO(mp, ip->i_ino),
-			ibp, dip, &ip->i_imap, next_agino);
+	trace_xfs_iunlink_update_dinode(mp, pag->pag_agno,
+			XFS_INO_TO_AGINO(mp, ip->i_ino),
+			be32_to_cpu(dip->di_next_unlinked), next_agino);
+
+	dip->di_next_unlinked = cpu_to_be32(next_agino);
+	offset = ip->i_imap.im_boffset +
+			offsetof(struct xfs_dinode, di_next_unlinked);
+
+	xfs_dinode_calc_crc(mp, dip);
+	xfs_trans_inode_buf(tp, ibp);
+	xfs_trans_log_buf(tp, ibp, offset, offset + sizeof(xfs_agino_t) - 1);
 	return 0;
 out:
 	xfs_trans_brelse(tp, ibp);
@@ -2045,7 +2025,7 @@ xfs_iunlink_insert_inode(
 		 * There is already another inode in the bucket, so point this
 		 * inode to the current head of the list.
 		 */
-		error = xfs_iunlink_update_inode(tp, ip, pag, next_agino);
+		error = xfs_iunlink_log_inode(tp, ip, pag, next_agino);
 		if (error)
 			return error;
 		ip->i_next_unlinked = next_agino;
@@ -2119,7 +2099,7 @@ xfs_iunlink_remove_inode(
 	 * the old pointer value so that we can update whatever was previous
 	 * to us in the list to point to whatever was next in the list.
 	 */
-	error = xfs_iunlink_update_inode(tp, ip, pag, NULLAGINO);
+	error = xfs_iunlink_log_inode(tp, ip, pag, NULLAGINO);
 	if (error)
 		return error;
 
@@ -2139,7 +2119,7 @@ xfs_iunlink_remove_inode(
 		if (!prev_ip)
 			return -EFSCORRUPTED;
 
-		error = xfs_iunlink_update_inode(tp, prev_ip, pag,
+		error = xfs_iunlink_log_inode(tp, prev_ip, pag,
 				ip->i_next_unlinked);
 		prev_ip->i_next_unlinked = ip->i_next_unlinked;
 	} else {
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 8/9] xfs: add log item precommit operation
  2022-06-27  0:43 [PATCH 0/9 v3] xfs: in-memory iunlink items Dave Chinner
                   ` (6 preceding siblings ...)
  2022-06-27  0:43 ` [PATCH 7/9] xfs: combine iunlink inode update functions Dave Chinner
@ 2022-06-27  0:43 ` Dave Chinner
  2022-06-29 21:16   ` Darrick J. Wong
  2022-06-27  0:43 ` [PATCH 9/9] xfs: add in-memory iunlink log item Dave Chinner
  8 siblings, 1 reply; 35+ messages in thread
From: Dave Chinner @ 2022-06-27  0:43 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

For inodes that are dirty, we have an attached cluster buffer that
we want to use to track the dirty inode through the AIL.
Unfortunately, locking the cluster buffer and adding it to the
transaction when the inode is first logged in a transaction leads to
buffer lock ordering inversions.

The specific problem is ordering against the AGI buffer. When
modifying unlinked lists, the buffer lock order is AGI -> inode
cluster buffer as the AGI buffer lock serialises all access to the
unlinked lists. Unfortunately, functionality like xfs_droplink()
logs the inode before calling xfs_iunlink(), as do various directory
manipulation functions. The inode can be logged way down in the
stack as far as the bmapi routines and hence, without a major
rewrite of lots of APIs there's no way we can avoid the inode being
logged by something until after the AGI has been logged.

As we are going to be using ordered buffers for inode AIL tracking,
there isn't a need to actually lock that buffer against modification
as all the modifications are captured by logging the inode item
itself. Hence we don't actually need to join the cluster buffer into
the transaction until just before it is committed. This means we do
not perturb any of the existing buffer lock orders in transactions,
and the inode cluster buffer is always locked last in a transaction
that doesn't otherwise touch inode cluster buffers.

We do this by introducing a precommit log item method. A log item
method is used because it is likely dquots will be moved to this
same ordered buffer tracking scheme and hence will need a similar
callout. This commit just introduces the mechanism; the inode item
implementation is in followup commits.

The precommit items need to be sorted into consistent order as we
may be locking multiple items here. Hence if we have two dirty
inodes in cluster buffers A and B, and some other transaction has
two separate dirty inodes in the same cluster buffers, locking them
in different orders opens us up to ABBA deadlocks. Hence we sort the
items on the transaction based on the presence of a sort log item
method.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_icache.c |  1 +
 fs/xfs/xfs_trans.c  | 91 +++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_trans.h  |  6 ++-
 3 files changed, 96 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 9fc324a29535..374b3bafaeb0 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -914,6 +914,7 @@ xfs_reclaim_inode(
 	ip->i_checked = 0;
 	spin_unlock(&ip->i_flags_lock);
 
+	ASSERT(!ip->i_itemp || ip->i_itemp->ili_item.li_buf == NULL);
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 
 	XFS_STATS_INC(ip->i_mount, xs_ig_reclaims);
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 82cf0189c0db..0acb31093d9f 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -844,6 +844,90 @@ xfs_trans_committed_bulk(
 	spin_unlock(&ailp->ail_lock);
 }
 
+/*
+ * Sort transaction items prior to running precommit operations. This will
+ * attempt to order the items such that they will always be locked in the same
+ * order. Items that have no sort function are moved to the end of the list
+ * and so are locked last (XXX: need to check the logic matches the comment).
+ *
+ * This may need refinement as different types of objects add sort functions.
+ *
+ * Function is more complex than it needs to be because we are comparing 64 bit
+ * values and the function only returns 32 bit values.
+ */
+static int
+xfs_trans_precommit_sort(
+	void			*unused_arg,
+	const struct list_head	*a,
+	const struct list_head	*b)
+{
+	struct xfs_log_item	*lia = container_of(a,
+					struct xfs_log_item, li_trans);
+	struct xfs_log_item	*lib = container_of(b,
+					struct xfs_log_item, li_trans);
+	int64_t			diff;
+
+	/*
+	 * If both items are non-sortable, leave them alone. If only one is
+	 * sortable, move the non-sortable item towards the end of the list.
+	 */
+	if (!lia->li_ops->iop_sort && !lib->li_ops->iop_sort)
+		return 0;
+	if (!lia->li_ops->iop_sort)
+		return 1;
+	if (!lib->li_ops->iop_sort)
+		return -1;
+
+	diff = lia->li_ops->iop_sort(lia) - lib->li_ops->iop_sort(lib);
+	if (diff < 0)
+		return -1;
+	if (diff > 0)
+		return 1;
+	return 0;
+}
+
+/*
+ * Run transaction precommit functions.
+ *
+ * If there is an error in any of the callouts, then stop immediately and
+ * trigger a shutdown to abort the transaction. There is no recovery possible
+ * from errors at this point as the transaction is dirty....
+ */
+static int
+xfs_trans_run_precommits(
+	struct xfs_trans	*tp)
+{
+	struct xfs_mount	*mp = tp->t_mountp;
+	struct xfs_log_item	*lip, *n;
+	int			error = 0;
+
+	/*
+	 * Sort the item list to avoid ABBA deadlocks with other transactions
+	 * running precommit operations that lock multiple shared items such as
+	 * inode cluster buffers.
+	 */
+	list_sort(NULL, &tp->t_items, xfs_trans_precommit_sort);
+
+	/*
+	 * Precommit operations can remove the log item from the transaction
+	 * if the log item exists purely to delay modifications until they
+	 * can be ordered against other operations. Hence we have to use
+	 * list_for_each_entry_safe() here.
+	 */
+	list_for_each_entry_safe(lip, n, &tp->t_items, li_trans) {
+		if (!test_bit(XFS_LI_DIRTY, &lip->li_flags))
+			continue;
+		if (lip->li_ops->iop_precommit) {
+			error = lip->li_ops->iop_precommit(tp, lip);
+			if (error)
+				break;
+		}
+	}
+	if (error)
+		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
+	return error;
+}
+
 /*
  * Commit the given transaction to the log.
  *
@@ -869,6 +953,13 @@ __xfs_trans_commit(
 
 	trace_xfs_trans_commit(tp, _RET_IP_);
 
+	error = xfs_trans_run_precommits(tp);
+	if (error) {
+		if (tp->t_flags & XFS_TRANS_PERM_LOG_RES)
+			xfs_defer_cancel(tp);
+		goto out_unreserve;
+	}
+
 	/*
 	 * Finish deferred items on final commit. Only permanent transactions
 	 * should ever have deferred ops.
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index 9561f193e7e1..64062e3b788b 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -71,10 +71,12 @@ struct xfs_item_ops {
 	void (*iop_format)(struct xfs_log_item *, struct xfs_log_vec *);
 	void (*iop_pin)(struct xfs_log_item *);
 	void (*iop_unpin)(struct xfs_log_item *, int remove);
-	uint (*iop_push)(struct xfs_log_item *, struct list_head *);
+	uint64_t (*iop_sort)(struct xfs_log_item *lip);
+	int (*iop_precommit)(struct xfs_trans *tp, struct xfs_log_item *lip);
 	void (*iop_committing)(struct xfs_log_item *lip, xfs_csn_t seq);
-	void (*iop_release)(struct xfs_log_item *);
 	xfs_lsn_t (*iop_committed)(struct xfs_log_item *, xfs_lsn_t);
+	uint (*iop_push)(struct xfs_log_item *, struct list_head *);
+	void (*iop_release)(struct xfs_log_item *);
 	int (*iop_recover)(struct xfs_log_item *lip,
 			   struct list_head *capture_list);
 	bool (*iop_match)(struct xfs_log_item *item, uint64_t id);
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 9/9] xfs: add in-memory iunlink log item
  2022-06-27  0:43 [PATCH 0/9 v3] xfs: in-memory iunlink items Dave Chinner
                   ` (7 preceding siblings ...)
  2022-06-27  0:43 ` [PATCH 8/9] xfs: add log item precommit operation Dave Chinner
@ 2022-06-27  0:43 ` Dave Chinner
  2022-06-29  7:25   ` Christoph Hellwig
  2022-06-29 21:21   ` Darrick J. Wong
  8 siblings, 2 replies; 35+ messages in thread
From: Dave Chinner @ 2022-06-27  0:43 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Now that we have a clean operation to update the di_next_unlinked
field of inode cluster buffers, we can easily defer this operation
to transaction commit time so we can order the inode cluster buffer
locking consistently.

TO do this, we introduce a new in-memory log item to track the
unlinked list item modification that we are going to make. This
follows the same observations as the in-memory double linked list
used to track unlinked inodes in that the inodes on the list are
pinned in memory and cannot go away, and hence we can simply
reference them for the duration of the transaction without needing
to take active references or pin them or look them up.

This allows us to pass the xfs_inode to the transaction commit code
along with the modification to be made, and then order the logged
modifications via the ->iop_sort and ->iop_precommit operations
for the new log item type. As this is an in-memory log item, it
doesn't have formatting, CIL or AIL operational hooks - it exists
purely to run the inode unlink modifications and is then removed
from the transaction item list and freed once the precommit
operation has run.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/Makefile           |   1 +
 fs/xfs/xfs_inode.c        |  64 +-------------
 fs/xfs/xfs_iunlink_item.c | 180 ++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_iunlink_item.h |  27 ++++++
 fs/xfs/xfs_super.c        |  10 +++
 5 files changed, 219 insertions(+), 63 deletions(-)
 create mode 100644 fs/xfs/xfs_iunlink_item.c
 create mode 100644 fs/xfs/xfs_iunlink_item.h

diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index b056cfc6398e..1131dd01e4fe 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -106,6 +106,7 @@ xfs-y				+= xfs_log.o \
 				   xfs_icreate_item.o \
 				   xfs_inode_item.o \
 				   xfs_inode_item_recover.o \
+				   xfs_iunlink_item.o \
 				   xfs_refcount_item.o \
 				   xfs_rmap_item.o \
 				   xfs_log_recover.o \
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index d6c88a27f29d..ef1dae9dbaa4 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -20,6 +20,7 @@
 #include "xfs_trans.h"
 #include "xfs_buf_item.h"
 #include "xfs_inode_item.h"
+#include "xfs_iunlink_item.h"
 #include "xfs_ialloc.h"
 #include "xfs_bmap.h"
 #include "xfs_bmap_util.h"
@@ -1923,69 +1924,6 @@ xfs_iunlink_update_bucket(
 	return 0;
 }
 
-/* Set an in-core inode's unlinked pointer and return the old value. */
-static int
-xfs_iunlink_log_inode(
-	struct xfs_trans	*tp,
-	struct xfs_inode	*ip,
-	struct xfs_perag	*pag,
-	xfs_agino_t		next_agino)
-{
-	struct xfs_mount	*mp = tp->t_mountp;
-	struct xfs_dinode	*dip;
-	struct xfs_buf		*ibp;
-	xfs_agino_t		old_value;
-	int			offset;
-	int			error;
-
-	ASSERT(xfs_verify_agino_or_null(mp, pag->pag_agno, next_agino));
-
-	error = xfs_imap_to_bp(mp, tp, &ip->i_imap, &ibp);
-	if (error)
-		return error;
-	dip = xfs_buf_offset(ibp, ip->i_imap.im_boffset);
-
-	/* Make sure the old pointer isn't garbage. */
-	old_value = be32_to_cpu(dip->di_next_unlinked);
-	if (old_value != ip->i_next_unlinked ||
-	    !xfs_verify_agino_or_null(mp, pag->pag_agno, old_value)) {
-		xfs_inode_verifier_error(ip, -EFSCORRUPTED, __func__, dip,
-				sizeof(*dip), __this_address);
-		error = -EFSCORRUPTED;
-		goto out;
-	}
-
-	/*
-	 * Since we're updating a linked list, we should never find that the
-	 * current pointer is the same as the new value, unless we're
-	 * terminating the list.
-	 */
-	if (old_value == next_agino) {
-		if (next_agino != NULLAGINO) {
-			xfs_inode_verifier_error(ip, -EFSCORRUPTED, __func__,
-					dip, sizeof(*dip), __this_address);
-			error = -EFSCORRUPTED;
-		}
-		goto out;
-	}
-
-	trace_xfs_iunlink_update_dinode(mp, pag->pag_agno,
-			XFS_INO_TO_AGINO(mp, ip->i_ino),
-			be32_to_cpu(dip->di_next_unlinked), next_agino);
-
-	dip->di_next_unlinked = cpu_to_be32(next_agino);
-	offset = ip->i_imap.im_boffset +
-			offsetof(struct xfs_dinode, di_next_unlinked);
-
-	xfs_dinode_calc_crc(mp, dip);
-	xfs_trans_inode_buf(tp, ibp);
-	xfs_trans_log_buf(tp, ibp, offset, offset + sizeof(xfs_agino_t) - 1);
-	return 0;
-out:
-	xfs_trans_brelse(tp, ibp);
-	return error;
-}
-
 static int
 xfs_iunlink_insert_inode(
 	struct xfs_trans	*tp,
diff --git a/fs/xfs/xfs_iunlink_item.c b/fs/xfs/xfs_iunlink_item.c
new file mode 100644
index 000000000000..fe38fc61f79e
--- /dev/null
+++ b/fs/xfs/xfs_iunlink_item.c
@@ -0,0 +1,180 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2020, Red Hat, Inc.
+ * All Rights Reserved.
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_inode.h"
+#include "xfs_trans.h"
+#include "xfs_trans_priv.h"
+#include "xfs_ag.h"
+#include "xfs_iunlink_item.h"
+#include "xfs_trace.h"
+#include "xfs_error.h"
+
+struct kmem_cache	*xfs_iunlink_cache;
+
+static inline struct xfs_iunlink_item *IUL_ITEM(struct xfs_log_item *lip)
+{
+	return container_of(lip, struct xfs_iunlink_item, item);
+}
+
+static void
+xfs_iunlink_item_release(
+	struct xfs_log_item	*lip)
+{
+	struct xfs_iunlink_item	*iup = IUL_ITEM(lip);
+
+	xfs_perag_put(iup->pag);
+	kmem_cache_free(xfs_iunlink_cache, IUL_ITEM(lip));
+}
+
+
+static uint64_t
+xfs_iunlink_item_sort(
+	struct xfs_log_item	*lip)
+{
+	return IUL_ITEM(lip)->ip->i_ino;
+}
+
+/*
+ * Look up the inode cluster buffer and log the on-disk unlinked inode change
+ * we need to make.
+ */
+static int
+xfs_iunlink_log_dinode(
+	struct xfs_trans	*tp,
+	struct xfs_iunlink_item	*iup)
+{
+	struct xfs_mount	*mp = tp->t_mountp;
+	struct xfs_inode	*ip = iup->ip;
+	struct xfs_dinode	*dip;
+	struct xfs_buf		*ibp;
+	int			offset;
+	int			error;
+
+	error = xfs_imap_to_bp(mp, tp, &ip->i_imap, &ibp);
+	if (error)
+		return error;
+	/*
+	 * Don't log the unlinked field on stale buffers as this may be the
+	 * transaction that frees the inode cluster and relogging the buffer
+	 * here will incorrectly remove the stale state.
+	 */
+	if (ibp->b_flags & XBF_STALE)
+		goto out;
+
+	dip = xfs_buf_offset(ibp, ip->i_imap.im_boffset);
+
+	/* Make sure the old pointer isn't garbage. */
+	if (be32_to_cpu(dip->di_next_unlinked) != iup->old_agino) {
+		xfs_inode_verifier_error(ip, -EFSCORRUPTED, __func__, dip,
+				sizeof(*dip), __this_address);
+		error = -EFSCORRUPTED;
+		goto out;
+	}
+
+	trace_xfs_iunlink_update_dinode(mp, iup->pag->pag_agno,
+			XFS_INO_TO_AGINO(mp, ip->i_ino),
+			be32_to_cpu(dip->di_next_unlinked), iup->next_agino);
+
+	dip->di_next_unlinked = cpu_to_be32(iup->next_agino);
+	offset = ip->i_imap.im_boffset +
+			offsetof(struct xfs_dinode, di_next_unlinked);
+
+	xfs_dinode_calc_crc(mp, dip);
+	xfs_trans_inode_buf(tp, ibp);
+	xfs_trans_log_buf(tp, ibp, offset, offset + sizeof(xfs_agino_t) - 1);
+	return 0;
+out:
+	xfs_trans_brelse(tp, ibp);
+	return error;
+}
+
+/*
+ * On precommit, we grab the inode cluster buffer for the inode number we were
+ * passed, then update the next unlinked field for that inode in the buffer and
+ * log the buffer. This ensures that the inode cluster buffer was logged in the
+ * correct order w.r.t. other inode cluster buffers. We can then remove the
+ * iunlink item from the transaction and release it as it is has now served it's
+ * purpose.
+ */
+static int
+xfs_iunlink_item_precommit(
+	struct xfs_trans	*tp,
+	struct xfs_log_item	*lip)
+{
+	struct xfs_iunlink_item	*iup = IUL_ITEM(lip);
+	int			error;
+
+	error = xfs_iunlink_log_dinode(tp, iup);
+	list_del(&lip->li_trans);
+	xfs_iunlink_item_release(lip);
+	return error;
+}
+
+static const struct xfs_item_ops xfs_iunlink_item_ops = {
+	.iop_release	= xfs_iunlink_item_release,
+	.iop_sort	= xfs_iunlink_item_sort,
+	.iop_precommit	= xfs_iunlink_item_precommit,
+};
+
+
+/*
+ * Initialize the inode log item for a newly allocated (in-core) inode.
+ *
+ * Inode extents can only reside within an AG. Hence specify the starting
+ * block for the inode chunk by offset within an AG as well as the
+ * length of the allocated extent.
+ *
+ * This joins the item to the transaction and marks it dirty so
+ * that we don't need a separate call to do this, nor does the
+ * caller need to know anything about the iunlink item.
+ */
+int
+xfs_iunlink_log_inode(
+	struct xfs_trans	*tp,
+	struct xfs_inode	*ip,
+	struct xfs_perag	*pag,
+	xfs_agino_t		next_agino)
+{
+	struct xfs_mount	*mp = tp->t_mountp;
+	struct xfs_iunlink_item	*iup;
+
+	ASSERT(xfs_verify_agino_or_null(mp, pag->pag_agno, next_agino));
+	ASSERT(xfs_verify_agino_or_null(mp, pag->pag_agno, ip->i_next_unlinked));
+
+	/*
+	 * Since we're updating a linked list, we should never find that the
+	 * current pointer is the same as the new value, unless we're
+	 * terminating the list.
+	 */
+	if (ip->i_next_unlinked == next_agino) {
+		if (next_agino != NULLAGINO)
+			return -EFSCORRUPTED;
+		return 0;
+	}
+
+	iup = kmem_cache_zalloc(xfs_iunlink_cache, GFP_KERNEL | __GFP_NOFAIL);
+	xfs_log_item_init(mp, &iup->item, XFS_LI_IUNLINK,
+			  &xfs_iunlink_item_ops);
+
+	iup->ip = ip;
+	iup->next_agino = next_agino;
+	iup->old_agino = ip->i_next_unlinked;
+
+	atomic_inc(&pag->pag_ref);
+	iup->pag = pag;
+
+	xfs_trans_add_item(tp, &iup->item);
+	tp->t_flags |= XFS_TRANS_DIRTY;
+	set_bit(XFS_LI_DIRTY, &iup->item.li_flags);
+	return 0;
+}
+
diff --git a/fs/xfs/xfs_iunlink_item.h b/fs/xfs/xfs_iunlink_item.h
new file mode 100644
index 000000000000..280dbf533989
--- /dev/null
+++ b/fs/xfs/xfs_iunlink_item.h
@@ -0,0 +1,27 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2020, Red Hat, Inc.
+ * All Rights Reserved.
+ */
+#ifndef XFS_IUNLINK_ITEM_H
+#define XFS_IUNLINK_ITEM_H	1
+
+struct xfs_trans;
+struct xfs_inode;
+struct xfs_perag;
+
+/* in memory log item structure */
+struct xfs_iunlink_item {
+	struct xfs_log_item	item;
+	struct xfs_inode	*ip;
+	struct xfs_perag	*pag;
+	xfs_agino_t		next_agino;
+	xfs_agino_t		old_agino;
+};
+
+extern struct kmem_cache *xfs_iunlink_cache;
+
+int xfs_iunlink_log_inode(struct xfs_trans *tp, struct xfs_inode *ip,
+			struct xfs_perag *pag, xfs_agino_t next_agino);
+
+#endif	/* XFS_IUNLINK_ITEM_H */
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index ed18160e6181..a7e6b2d7686b 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -40,6 +40,7 @@
 #include "xfs_defer.h"
 #include "xfs_attr_item.h"
 #include "xfs_xattr.h"
+#include "xfs_iunlink_item.h"
 
 #include <linux/magic.h>
 #include <linux/fs_context.h>
@@ -2093,8 +2094,16 @@ xfs_init_caches(void)
 	if (!xfs_attri_cache)
 		goto out_destroy_attrd_cache;
 
+	xfs_iunlink_cache = kmem_cache_create("xfs_iul_item",
+					     sizeof(struct xfs_iunlink_item),
+					     0, 0, NULL);
+	if (!xfs_iunlink_cache)
+		goto out_destroy_attri_cache;
+
 	return 0;
 
+ out_destroy_attri_cache:
+	kmem_cache_destroy(xfs_attri_cache);
  out_destroy_attrd_cache:
 	kmem_cache_destroy(xfs_attrd_cache);
  out_destroy_bui_cache:
@@ -2145,6 +2154,7 @@ xfs_destroy_caches(void)
 	 * destroy caches.
 	 */
 	rcu_barrier();
+	kmem_cache_destroy(xfs_iunlink_cache);
 	kmem_cache_destroy(xfs_attri_cache);
 	kmem_cache_destroy(xfs_attrd_cache);
 	kmem_cache_destroy(xfs_bui_cache);
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/9] xfs: factor the xfs_iunlink functions
  2022-06-27  0:43 ` [PATCH 1/9] xfs: factor the xfs_iunlink functions Dave Chinner
@ 2022-06-29  7:05   ` Christoph Hellwig
  2022-06-29 20:48   ` Darrick J. Wong
  1 sibling, 0 replies; 35+ messages in thread
From: Christoph Hellwig @ 2022-06-29  7:05 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

> +xfs_iunlink_insert_inode(
>  	struct xfs_trans	*tp,
> +	struct xfs_perag	*pag,
> +	struct xfs_buf		*agibp,
>  	struct xfs_inode	*ip)
> -{
> + {

Whitespace damage here.

Otherwise looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/9] xfs: track the iunlink list pointer in the xfs_inode
  2022-06-27  0:43 ` [PATCH 2/9] xfs: track the iunlink list pointer in the xfs_inode Dave Chinner
@ 2022-06-29  7:08   ` Christoph Hellwig
  2022-06-29 20:50   ` Darrick J. Wong
  1 sibling, 0 replies; 35+ messages in thread
From: Christoph Hellwig @ 2022-06-29  7:08 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Mon, Jun 27, 2022 at 10:43:29AM +1000, Dave Chinner wrote:
> @@ -68,6 +68,9 @@ typedef struct xfs_inode {
>  	uint64_t		i_diflags2;	/* XFS_DIFLAG2_... */
>  	struct timespec64	i_crtime;	/* time created */
>  
> +	/* unlinked list pointers */
> +	xfs_agino_t		i_next_unlinked;

The placement here seems unfortunate as it grows the inode vs just
filling holes.  I think placing it before i_ioend_lock might work
better, but a little pahole evaluation might be a good idea.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 3/9] xfs: refactor xlog_recover_process_iunlinks()
  2022-06-27  0:43 ` [PATCH 3/9] xfs: refactor xlog_recover_process_iunlinks() Dave Chinner
@ 2022-06-29  7:12   ` Christoph Hellwig
  2022-06-29 20:56   ` Darrick J. Wong
  1 sibling, 0 replies; 35+ messages in thread
From: Christoph Hellwig @ 2022-06-29  7:12 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 4/9] xfs: introduce xfs_iunlink_lookup
  2022-06-27  0:43 ` [PATCH 4/9] xfs: introduce xfs_iunlink_lookup Dave Chinner
@ 2022-06-29  7:17   ` Christoph Hellwig
  2022-06-29 21:01   ` Darrick J. Wong
  1 sibling, 0 replies; 35+ messages in thread
From: Christoph Hellwig @ 2022-06-29  7:17 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Mon, Jun 27, 2022 at 10:43:31AM +1000, Dave Chinner wrote:
> When an inode is on an unlinked list during normal operation, it is
> guaranteed to be pinned in memory as it is either referenced by the
> current unlink operation or it has a open file descriptor that
> references it and has it pinned in memory. Hence to look up an inode
> on the unlinked list, we can do a direct inode cache lookup and
> always expect the lookup to succeed.

> +	rcu_read_lock();
> +	ip = radix_tree_lookup(&pag->pag_ici_root, agino);
> +
> +	/* Inode not in memory, nothing to do */
> +	if (!ip) {
> +		rcu_read_unlock();
> +		return NULL;
> +	}

If the above commit log is true (which I think it is), the comment here
is wrong, and the check should probably grow a WARN_ON as well.

> +	spin_lock(&ip->i_flags_lock);
> +	if (ip->i_ino != XFS_AGINO_TO_INO(mp, pag->pag_agno, agino) ||
> +	    (ip->i_flags & (XFS_IRECLAIMABLE | XFS_IRECLAIM))) {
> +		/* Uh-oh! */
> +		ip = NULL;
> +	}

And this should not happen either and could use an assert.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 5/9] xfs: double link the unlinked inode list
  2022-06-27  0:43 ` [PATCH 5/9] xfs: double link the unlinked inode list Dave Chinner
@ 2022-06-29  7:20   ` Christoph Hellwig
  2022-06-29 20:02     ` Darrick J. Wong
  2022-06-29 21:06   ` Darrick J. Wong
  1 sibling, 1 reply; 35+ messages in thread
From: Christoph Hellwig @ 2022-06-29  7:20 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -70,6 +70,7 @@ typedef struct xfs_inode {
>  
>  	/* unlinked list pointers */
>  	xfs_agino_t		i_next_unlinked;
> +	xfs_agino_t		i_prev_unlinked;

Ok, this somewhat invalidates my previous comment, unless we find
another hole in the xfs_inode layout.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 6/9] xfs: clean up xfs_iunlink_update_inode()
  2022-06-27  0:43 ` [PATCH 6/9] xfs: clean up xfs_iunlink_update_inode() Dave Chinner
@ 2022-06-29  7:21   ` Christoph Hellwig
  2022-06-29 21:07   ` Darrick J. Wong
  1 sibling, 0 replies; 35+ messages in thread
From: Christoph Hellwig @ 2022-06-29  7:21 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 7/9] xfs: combine iunlink inode update functions
  2022-06-27  0:43 ` [PATCH 7/9] xfs: combine iunlink inode update functions Dave Chinner
@ 2022-06-29  7:21   ` Christoph Hellwig
  2022-06-29 21:07   ` Darrick J. Wong
  1 sibling, 0 replies; 35+ messages in thread
From: Christoph Hellwig @ 2022-06-29  7:21 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 9/9] xfs: add in-memory iunlink log item
  2022-06-27  0:43 ` [PATCH 9/9] xfs: add in-memory iunlink log item Dave Chinner
@ 2022-06-29  7:25   ` Christoph Hellwig
  2022-06-29 21:37     ` Dave Chinner
  2022-06-29 21:21   ` Darrick J. Wong
  1 sibling, 1 reply; 35+ messages in thread
From: Christoph Hellwig @ 2022-06-29  7:25 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Mon, Jun 27, 2022 at 10:43:36AM +1000, Dave Chinner wrote:
> TO do this, we introduce a new in-memory log item to track the

s/To/To/

> This allows us to pass the xfs_inode to the transaction commit code
> along with the modification to be made, and then order the logged
> modifications via the ->iop_sort and ->iop_precommit operations
> for the new log item type. As this is an in-memory log item, it
> doesn't have formatting, CIL or AIL operational hooks - it exists
> purely to run the inode unlink modifications and is then removed
> from the transaction item list and freed once the precommit
> operation has run.

Do we need to document the fact that we now have purely in-memory log
items better somewhere?  I see how this works, but it its a little
non-obvious.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 5/9] xfs: double link the unlinked inode list
  2022-06-29  7:20   ` Christoph Hellwig
@ 2022-06-29 20:02     ` Darrick J. Wong
  0 siblings, 0 replies; 35+ messages in thread
From: Darrick J. Wong @ 2022-06-29 20:02 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Dave Chinner, linux-xfs

On Wed, Jun 29, 2022 at 12:20:53AM -0700, Christoph Hellwig wrote:
> > --- a/fs/xfs/xfs_inode.h
> > +++ b/fs/xfs/xfs_inode.h
> > @@ -70,6 +70,7 @@ typedef struct xfs_inode {
> >  
> >  	/* unlinked list pointers */
> >  	xfs_agino_t		i_next_unlinked;
> > +	xfs_agino_t		i_prev_unlinked;
> 
> Ok, this somewhat invalidates my previous comment, unless we find
> another hole in the xfs_inode layout.

I think there's a 4-byte hole after i_df, and there's about to be
another one after i_af once I send my series to make the attr ifork a
permanent part of the inode.

--D

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/9] xfs: factor the xfs_iunlink functions
  2022-06-27  0:43 ` [PATCH 1/9] xfs: factor the xfs_iunlink functions Dave Chinner
  2022-06-29  7:05   ` Christoph Hellwig
@ 2022-06-29 20:48   ` Darrick J. Wong
  1 sibling, 0 replies; 35+ messages in thread
From: Darrick J. Wong @ 2022-06-29 20:48 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Mon, Jun 27, 2022 at 10:43:28AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Prep work that separates the locking that protects the unlinked list
> from the actual operations being performed. This also helps document
> the fact they are performing list insert  and remove operations. No
> functional code change.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

With the whitespace damage fixed,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_inode.c | 113 +++++++++++++++++++++++++++------------------
>  1 file changed, 68 insertions(+), 45 deletions(-)
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 52d6f2c7d58b..2a371c3431c9 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -2129,39 +2129,20 @@ xfs_iunlink_update_inode(
>  	return error;
>  }
>  
> -/*
> - * This is called when the inode's link count has gone to 0 or we are creating
> - * a tmpfile via O_TMPFILE.  The inode @ip must have nlink == 0.
> - *
> - * We place the on-disk inode on a list in the AGI.  It will be pulled from this
> - * list when the inode is freed.
> - */
> -STATIC int
> -xfs_iunlink(
> +static int
> +xfs_iunlink_insert_inode(
>  	struct xfs_trans	*tp,
> +	struct xfs_perag	*pag,
> +	struct xfs_buf		*agibp,
>  	struct xfs_inode	*ip)
> -{
> + {
>  	struct xfs_mount	*mp = tp->t_mountp;
> -	struct xfs_perag	*pag;
> -	struct xfs_agi		*agi;
> -	struct xfs_buf		*agibp;
> +	struct xfs_agi		*agi = agibp->b_addr;
>  	xfs_agino_t		next_agino;
>  	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
>  	short			bucket_index = agino % XFS_AGI_UNLINKED_BUCKETS;
>  	int			error;
>  
> -	ASSERT(VFS_I(ip)->i_nlink == 0);
> -	ASSERT(VFS_I(ip)->i_mode != 0);
> -	trace_xfs_iunlink(ip);
> -
> -	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
> -
> -	/* Get the agi buffer first.  It ensures lock ordering on the list. */
> -	error = xfs_read_agi(mp, tp, pag->pag_agno, &agibp);
> -	if (error)
> -		goto out;
> -	agi = agibp->b_addr;
> -
>  	/*
>  	 * Get the index into the agi hash table for the list this inode will
>  	 * go on.  Make sure the pointer isn't garbage and that this inode
> @@ -2171,8 +2152,7 @@ xfs_iunlink(
>  	if (next_agino == agino ||
>  	    !xfs_verify_agino_or_null(mp, pag->pag_agno, next_agino)) {
>  		xfs_buf_mark_corrupt(agibp);
> -		error = -EFSCORRUPTED;
> -		goto out;
> +		return -EFSCORRUPTED;
>  	}
>  
>  	if (next_agino != NULLAGINO) {
> @@ -2185,7 +2165,7 @@ xfs_iunlink(
>  		error = xfs_iunlink_update_inode(tp, ip, pag, next_agino,
>  				&old_agino);
>  		if (error)
> -			goto out;
> +			return error;
>  		ASSERT(old_agino == NULLAGINO);
>  
>  		/*
> @@ -2194,11 +2174,42 @@ xfs_iunlink(
>  		 */
>  		error = xfs_iunlink_add_backref(pag, agino, next_agino);
>  		if (error)
> -			goto out;
> +			return error;
>  	}
>  
>  	/* Point the head of the list to point to this inode. */
> -	error = xfs_iunlink_update_bucket(tp, pag, agibp, bucket_index, agino);
> +	return xfs_iunlink_update_bucket(tp, pag, agibp, bucket_index, agino);
> +}
> +
> +/*
> + * This is called when the inode's link count has gone to 0 or we are creating
> + * a tmpfile via O_TMPFILE.  The inode @ip must have nlink == 0.
> + *
> + * We place the on-disk inode on a list in the AGI.  It will be pulled from this
> + * list when the inode is freed.
> + */
> +STATIC int
> +xfs_iunlink(
> +	struct xfs_trans	*tp,
> +	struct xfs_inode	*ip)
> +{
> +	struct xfs_mount	*mp = tp->t_mountp;
> +	struct xfs_perag	*pag;
> +	struct xfs_buf		*agibp;
> +	int			error;
> +
> +	ASSERT(VFS_I(ip)->i_nlink == 0);
> +	ASSERT(VFS_I(ip)->i_mode != 0);
> +	trace_xfs_iunlink(ip);
> +
> +	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
> +
> +	/* Get the agi buffer first.  It ensures lock ordering on the list. */
> +	error = xfs_read_agi(mp, tp, pag->pag_agno, &agibp);
> +	if (error)
> +		goto out;
> +
> +	error = xfs_iunlink_insert_inode(tp, pag, agibp, ip);
>  out:
>  	xfs_perag_put(pag);
>  	return error;
> @@ -2319,18 +2330,15 @@ xfs_iunlink_map_prev(
>  	return 0;
>  }
>  
> -/*
> - * Pull the on-disk inode from the AGI unlinked list.
> - */
> -STATIC int
> -xfs_iunlink_remove(
> +static int
> +xfs_iunlink_remove_inode(
>  	struct xfs_trans	*tp,
>  	struct xfs_perag	*pag,
> +	struct xfs_buf		*agibp,
>  	struct xfs_inode	*ip)
>  {
>  	struct xfs_mount	*mp = tp->t_mountp;
> -	struct xfs_agi		*agi;
> -	struct xfs_buf		*agibp;
> +	struct xfs_agi		*agi = agibp->b_addr;
>  	struct xfs_buf		*last_ibp;
>  	struct xfs_dinode	*last_dip = NULL;
>  	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
> @@ -2339,14 +2347,6 @@ xfs_iunlink_remove(
>  	short			bucket_index = agino % XFS_AGI_UNLINKED_BUCKETS;
>  	int			error;
>  
> -	trace_xfs_iunlink_remove(ip);
> -
> -	/* Get the agi buffer first.  It ensures lock ordering on the list. */
> -	error = xfs_read_agi(mp, tp, pag->pag_agno, &agibp);
> -	if (error)
> -		return error;
> -	agi = agibp->b_addr;
> -
>  	/*
>  	 * Get the index into the agi hash table for the list this inode will
>  	 * go on.  Make sure the head pointer isn't garbage.
> @@ -2411,6 +2411,29 @@ xfs_iunlink_remove(
>  			next_agino);
>  }
>  
> +/*
> + * Pull the on-disk inode from the AGI unlinked list.
> + */
> +STATIC int
> +xfs_iunlink_remove(
> +	struct xfs_trans	*tp,
> +	struct xfs_perag	*pag,
> +	struct xfs_inode	*ip)
> +{
> +	struct xfs_mount	*mp = tp->t_mountp;
> +	struct xfs_buf		*agibp;
> +	int			error;
> +
> +	trace_xfs_iunlink_remove(ip);
> +
> +	/* Get the agi buffer first.  It ensures lock ordering on the list. */
> +	error = xfs_read_agi(mp, tp, pag->pag_agno, &agibp);
> +	if (error)
> +		return error;
> +
> +	return xfs_iunlink_remove_inode(tp, pag, agibp, ip);
> +}
> +
>  /*
>   * Look up the inode number specified and if it is not already marked XFS_ISTALE
>   * mark it stale. We should only find clean inodes in this lookup that aren't
> -- 
> 2.36.1
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/9] xfs: track the iunlink list pointer in the xfs_inode
  2022-06-27  0:43 ` [PATCH 2/9] xfs: track the iunlink list pointer in the xfs_inode Dave Chinner
  2022-06-29  7:08   ` Christoph Hellwig
@ 2022-06-29 20:50   ` Darrick J. Wong
  1 sibling, 0 replies; 35+ messages in thread
From: Darrick J. Wong @ 2022-06-29 20:50 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Mon, Jun 27, 2022 at 10:43:29AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Having direct access to the i_next_unlinked pointer in unlinked
> inodes greatly simplifies the processing of inodes on the unlinked
> list. We no longer need to look up the inode buffer just to find
> next inode in the list if the xfs_inode is in memory. These
> improvements will be realised over upcoming patches as other
> dependencies on the inode buffer for unlinked list processing are
> removed.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Pretty straightforward
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/libxfs/xfs_inode_buf.c |  3 ++-
>  fs/xfs/xfs_inode.c            |  5 ++++-
>  fs/xfs/xfs_inode.h            |  3 +++
>  fs/xfs/xfs_log_recover.c      | 16 +---------------
>  4 files changed, 10 insertions(+), 17 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> index 3b1b63f9d886..d05a3294020a 100644
> --- a/fs/xfs/libxfs/xfs_inode_buf.c
> +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> @@ -230,7 +230,8 @@ xfs_inode_from_disk(
>  	ip->i_nblocks = be64_to_cpu(from->di_nblocks);
>  	ip->i_extsize = be32_to_cpu(from->di_extsize);
>  	ip->i_forkoff = from->di_forkoff;
> -	ip->i_diflags	= be16_to_cpu(from->di_flags);
> +	ip->i_diflags = be16_to_cpu(from->di_flags);
> +	ip->i_next_unlinked = be32_to_cpu(from->di_next_unlinked);
>  
>  	if (from->di_dmevmask || from->di_dmstate)
>  		xfs_iflags_set(ip, XFS_IPRESERVE_DM_FIELDS);
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 2a371c3431c9..c507370bd885 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -2098,7 +2098,8 @@ xfs_iunlink_update_inode(
>  
>  	/* Make sure the old pointer isn't garbage. */
>  	old_value = be32_to_cpu(dip->di_next_unlinked);
> -	if (!xfs_verify_agino_or_null(mp, pag->pag_agno, old_value)) {
> +	if (old_value != ip->i_next_unlinked ||
> +	    !xfs_verify_agino_or_null(mp, pag->pag_agno, old_value)) {
>  		xfs_inode_verifier_error(ip, -EFSCORRUPTED, __func__, dip,
>  				sizeof(*dip), __this_address);
>  		error = -EFSCORRUPTED;
> @@ -2167,6 +2168,7 @@ xfs_iunlink_insert_inode(
>  		if (error)
>  			return error;
>  		ASSERT(old_agino == NULLAGINO);
> +		ip->i_next_unlinked = next_agino;
>  
>  		/*
>  		 * agino has been unlinked, add a backref from the next inode
> @@ -2366,6 +2368,7 @@ xfs_iunlink_remove_inode(
>  	error = xfs_iunlink_update_inode(tp, ip, pag, NULLAGINO, &next_agino);
>  	if (error)
>  		return error;
> +	ip->i_next_unlinked = NULLAGINO;
>  
>  	/*
>  	 * If there was a backref pointing from the next inode back to this
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index 7be6f8e705ab..8e2a33c6cbe2 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -68,6 +68,9 @@ typedef struct xfs_inode {
>  	uint64_t		i_diflags2;	/* XFS_DIFLAG2_... */
>  	struct timespec64	i_crtime;	/* time created */
>  
> +	/* unlinked list pointers */
> +	xfs_agino_t		i_next_unlinked;
> +
>  	/* VFS inode */
>  	struct inode		i_vnode;	/* embedded VFS inode */
>  
> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> index 5f7e4e6e33ce..f360b46533a6 100644
> --- a/fs/xfs/xfs_log_recover.c
> +++ b/fs/xfs/xfs_log_recover.c
> @@ -2673,8 +2673,6 @@ xlog_recover_process_one_iunlink(
>  	xfs_agino_t			agino,
>  	int				bucket)
>  {
> -	struct xfs_buf			*ibp;
> -	struct xfs_dinode		*dip;
>  	struct xfs_inode		*ip;
>  	xfs_ino_t			ino;
>  	int				error;
> @@ -2684,27 +2682,15 @@ xlog_recover_process_one_iunlink(
>  	if (error)
>  		goto fail;
>  
> -	/*
> -	 * Get the on disk inode to find the next inode in the bucket.
> -	 */
> -	error = xfs_imap_to_bp(mp, NULL, &ip->i_imap, &ibp);
> -	if (error)
> -		goto fail_iput;
> -	dip = xfs_buf_offset(ibp, ip->i_imap.im_boffset);
> -
>  	xfs_iflags_clear(ip, XFS_IRECOVERY);
>  	ASSERT(VFS_I(ip)->i_nlink == 0);
>  	ASSERT(VFS_I(ip)->i_mode != 0);
>  
>  	/* setup for the next pass */
> -	agino = be32_to_cpu(dip->di_next_unlinked);
> -	xfs_buf_relse(ibp);
> -
> +	agino = ip->i_next_unlinked;
>  	xfs_irele(ip);
>  	return agino;
>  
> - fail_iput:
> -	xfs_irele(ip);
>   fail:
>  	/*
>  	 * We can't read in the inode this bucket points to, or this inode
> -- 
> 2.36.1
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 3/9] xfs: refactor xlog_recover_process_iunlinks()
  2022-06-27  0:43 ` [PATCH 3/9] xfs: refactor xlog_recover_process_iunlinks() Dave Chinner
  2022-06-29  7:12   ` Christoph Hellwig
@ 2022-06-29 20:56   ` Darrick J. Wong
  1 sibling, 0 replies; 35+ messages in thread
From: Darrick J. Wong @ 2022-06-29 20:56 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Mon, Jun 27, 2022 at 10:43:30AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> For upcoming changes to the way inode unlinked list processing is
> done, the structure of recovery needs to change slightly. We also
> really need to untangle the messy error handling in list recovery
> so that actions like emptying the bucket on inode lookup failure
> are associated with the bucket list walk failing, not failing
> to look up the inode.
> 
> Refactor the recovery code now to keep the re-organisation seperate
> to the algorithm changes.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_log_recover.c | 162 ++++++++++++++++++++-------------------
>  1 file changed, 84 insertions(+), 78 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> index f360b46533a6..7d0f530d7e3c 100644
> --- a/fs/xfs/xfs_log_recover.c
> +++ b/fs/xfs/xfs_log_recover.c
> @@ -2627,23 +2627,23 @@ xlog_recover_cancel_intents(
>   * This routine performs a transaction to null out a bad inode pointer
>   * in an agi unlinked inode hash bucket.
>   */
> -STATIC void
> +static void
>  xlog_recover_clear_agi_bucket(
> -	xfs_mount_t	*mp,
> -	xfs_agnumber_t	agno,
> -	int		bucket)
> +	struct xfs_mount	*mp,
> +	struct xfs_perag	*pag,

Why even pass in *mp when you've got *pag?

> +	int			bucket)
>  {
> -	xfs_trans_t	*tp;
> -	xfs_agi_t	*agi;
> -	struct xfs_buf	*agibp;
> -	int		offset;
> -	int		error;
> +	struct xfs_trans	*tp;
> +	struct xfs_agi		*agi;
> +	struct xfs_buf		*agibp;
> +	int			offset;
> +	int			error;
>  
>  	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_clearagi, 0, 0, 0, &tp);
>  	if (error)
>  		goto out_error;
>  
> -	error = xfs_read_agi(mp, tp, agno, &agibp);
> +	error = xfs_read_agi(mp, tp, pag->pag_agno, &agibp);
>  	if (error)
>  		goto out_abort;
>  
> @@ -2662,46 +2662,40 @@ xlog_recover_clear_agi_bucket(
>  out_abort:
>  	xfs_trans_cancel(tp);
>  out_error:
> -	xfs_warn(mp, "%s: failed to clear agi %d. Continuing.", __func__, agno);
> +	xfs_warn(mp, "%s: failed to clear agi %d. Continuing.",
> +		__func__, pag->pag_agno);
>  	return;

Nit: Don't need the return here.

Otherwise looks good.

--D

>  }
>  
> -STATIC xfs_agino_t
> -xlog_recover_process_one_iunlink(
> -	struct xfs_mount		*mp,
> -	xfs_agnumber_t			agno,
> -	xfs_agino_t			agino,
> -	int				bucket)
> +static int
> +xlog_recover_iunlink_bucket(
> +	struct xfs_mount	*mp,
> +	struct xfs_perag	*pag,
> +	struct xfs_agi		*agi,
> +	int			bucket)
>  {
> -	struct xfs_inode		*ip;
> -	xfs_ino_t			ino;
> -	int				error;
> +	struct xfs_inode	*ip;
> +	xfs_agino_t		agino;
>  
> -	ino = XFS_AGINO_TO_INO(mp, agno, agino);
> -	error = xfs_iget(mp, NULL, ino, 0, 0, &ip);
> -	if (error)
> -		goto fail;
> +	agino = be32_to_cpu(agi->agi_unlinked[bucket]);
> +	while (agino != NULLAGINO) {
> +		int	error;
>  
> -	xfs_iflags_clear(ip, XFS_IRECOVERY);
> -	ASSERT(VFS_I(ip)->i_nlink == 0);
> -	ASSERT(VFS_I(ip)->i_mode != 0);
> +		error = xfs_iget(mp, NULL,
> +				XFS_AGINO_TO_INO(mp, pag->pag_agno, agino),
> +				0, 0, &ip);
> +		if (error)
> +			return error;;
>  
> -	/* setup for the next pass */
> -	agino = ip->i_next_unlinked;
> -	xfs_irele(ip);
> -	return agino;
> +		ASSERT(VFS_I(ip)->i_nlink == 0);
> +		ASSERT(VFS_I(ip)->i_mode != 0);
> +		xfs_iflags_clear(ip, XFS_IRECOVERY);
> +		agino = ip->i_next_unlinked;
>  
> - fail:
> -	/*
> -	 * We can't read in the inode this bucket points to, or this inode
> -	 * is messed up.  Just ditch this bucket of inodes.  We will lose
> -	 * some inodes and space, but at least we won't hang.
> -	 *
> -	 * Call xlog_recover_clear_agi_bucket() to perform a transaction to
> -	 * clear the inode pointer in the bucket.
> -	 */
> -	xlog_recover_clear_agi_bucket(mp, agno, bucket);
> -	return NULLAGINO;
> +		xfs_irele(ip);
> +		cond_resched();
> +	}
> +	return 0;
>  }
>  
>  /*
> @@ -2727,51 +2721,63 @@ xlog_recover_process_one_iunlink(
>   * scheduled on this CPU to ensure other scheduled work can run without undue
>   * latency.
>   */
> -STATIC void
> -xlog_recover_process_iunlinks(
> -	struct xlog	*log)
> +static void
> +xlog_recover_iunlink_ag(
> +	struct xfs_mount	*mp,
> +	struct xfs_perag	*pag)
>  {
> -	struct xfs_mount	*mp = log->l_mp;
> -	struct xfs_perag	*pag;
> -	xfs_agnumber_t		agno;
>  	struct xfs_agi		*agi;
>  	struct xfs_buf		*agibp;
> -	xfs_agino_t		agino;
>  	int			bucket;
>  	int			error;
>  
> -	for_each_perag(mp, agno, pag) {
> -		error = xfs_read_agi(mp, NULL, pag->pag_agno, &agibp);
> +	error = xfs_read_agi(mp, NULL, pag->pag_agno, &agibp);
> +	if (error) {
> +		/*
> +		 * AGI is b0rked. Don't process it.
> +		 *
> +		 * We should probably mark the filesystem as corrupt after we've
> +		 * recovered all the ag's we can....
> +		 */
> +		return;
> +	}
> +
> +	/*
> +	 * Unlock the buffer so that it can be acquired in the normal course of
> +	 * the transaction to truncate and free each inode.  Because we are not
> +	 * racing with anyone else here for the AGI buffer, we don't even need
> +	 * to hold it locked to read the initial unlinked bucket entries out of
> +	 * the buffer. We keep buffer reference though, so that it stays pinned
> +	 * in memory while we need the buffer.
> +	 */
> +	agi = agibp->b_addr;
> +	xfs_buf_unlock(agibp);
> +
> +	for (bucket = 0; bucket < XFS_AGI_UNLINKED_BUCKETS; bucket++) {
> +		error = xlog_recover_iunlink_bucket(mp, pag, agi, bucket);
>  		if (error) {
>  			/*
> -			 * AGI is b0rked. Don't process it.
> -			 *
> -			 * We should probably mark the filesystem as corrupt
> -			 * after we've recovered all the ag's we can....
> +			 * Bucket is unrecoverable, so only a repair scan can
> +			 * free the remaining unlinked inodes. Just empty the
> +			 * bucket and remaining inodes on it unreferenced and
> +			 * unfreeable.
>  			 */
> -			continue;
> +			xlog_recover_clear_agi_bucket(mp, pag, bucket);
>  		}
> -		/*
> -		 * Unlock the buffer so that it can be acquired in the normal
> -		 * course of the transaction to truncate and free each inode.
> -		 * Because we are not racing with anyone else here for the AGI
> -		 * buffer, we don't even need to hold it locked to read the
> -		 * initial unlinked bucket entries out of the buffer. We keep
> -		 * buffer reference though, so that it stays pinned in memory
> -		 * while we need the buffer.
> -		 */
> -		agi = agibp->b_addr;
> -		xfs_buf_unlock(agibp);
> -
> -		for (bucket = 0; bucket < XFS_AGI_UNLINKED_BUCKETS; bucket++) {
> -			agino = be32_to_cpu(agi->agi_unlinked[bucket]);
> -			while (agino != NULLAGINO) {
> -				agino = xlog_recover_process_one_iunlink(mp,
> -						pag->pag_agno, agino, bucket);
> -				cond_resched();
> -			}
> -		}
> -		xfs_buf_rele(agibp);
> +	}
> +
> +	xfs_buf_rele(agibp);
> +}
> +
> +static void
> +xlog_recover_process_iunlinks(
> +	struct xlog	*log)
> +{
> +	struct xfs_perag	*pag;
> +	xfs_agnumber_t		agno;
> +
> +	for_each_perag(log->l_mp, agno, pag) {
> +		xlog_recover_iunlink_ag(log->l_mp, pag);
>  	}
>  
>  	/*
> @@ -2779,7 +2785,7 @@ xlog_recover_process_iunlinks(
>  	 * are fully completed on disk and the incore inodes can be reclaimed
>  	 * before we signal that recovery is complete.
>  	 */
> -	xfs_inodegc_flush(mp);
> +	xfs_inodegc_flush(log->l_mp);
>  }
>  
>  STATIC void
> -- 
> 2.36.1
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 4/9] xfs: introduce xfs_iunlink_lookup
  2022-06-27  0:43 ` [PATCH 4/9] xfs: introduce xfs_iunlink_lookup Dave Chinner
  2022-06-29  7:17   ` Christoph Hellwig
@ 2022-06-29 21:01   ` Darrick J. Wong
  1 sibling, 0 replies; 35+ messages in thread
From: Darrick J. Wong @ 2022-06-29 21:01 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Mon, Jun 27, 2022 at 10:43:31AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When an inode is on an unlinked list during normal operation, it is
> guaranteed to be pinned in memory as it is either referenced by the
> current unlink operation or it has a open file descriptor that
> references it and has it pinned in memory. Hence to look up an inode
> on the unlinked list, we can do a direct inode cache lookup and
> always expect the lookup to succeed.
> 
> Add a function to do this lookup based on the agino that we use to
> link the chain of unlinked inodes together so we can begin the
> conversion the unlinked list manipulations to use in-memory inodes
> rather than inode cluster buffers and remove the backref cache.
> 
> Use this lookup function to replace the on-disk inode buffer walk
> when removing inodes from the unlinked list with an in-core inode
> unlinked list walk.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_inode.c | 165 +++++++++++++++++++--------------------------
>  1 file changed, 71 insertions(+), 94 deletions(-)
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index c507370bd885..e6ac13174062 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -2006,6 +2006,39 @@ xfs_iunlink_destroy(
>  	ASSERT(freed_anything == false || xfs_is_shutdown(pag->pag_mount));
>  }
>  
> +/*
> + * Find an inode on the unlinked list. This does not take references to the
> + * inode, we have existence guarantees by holding the AGI buffer lock and that
> + * only unlinked, referenced inodes can be on the unlinked inode list.  If we
> + * don't find the inode in cache, then let the caller handle the situation.
> + */
> +static struct xfs_inode *
> +xfs_iunlink_lookup(
> +	struct xfs_perag	*pag,
> +	xfs_agino_t		agino)
> +{
> +	struct xfs_mount	*mp = pag->pag_mount;
> +	struct xfs_inode	*ip;
> +
> +	rcu_read_lock();
> +	ip = radix_tree_lookup(&pag->pag_ici_root, agino);
> +
> +	/* Inode not in memory, nothing to do */
> +	if (!ip) {
> +		rcu_read_unlock();
> +		return NULL;

Does this mean that someone else already removed the inode from the
unlink list?  Which I guess happens if ... what?  We're slowly
processing the unlinked list and someone else reclaims something that we
thought was on that list?

(Oh, I see hch already asked about this.)

> +	}
> +	spin_lock(&ip->i_flags_lock);
> +	if (ip->i_ino != XFS_AGINO_TO_INO(mp, pag->pag_agno, agino) ||
> +	    (ip->i_flags & (XFS_IRECLAIMABLE | XFS_IRECLAIM))) {
> +		/* Uh-oh! */
> +		ip = NULL;
> +	}
> +	spin_unlock(&ip->i_flags_lock);
> +	rcu_read_unlock();
> +	return ip;
> +}
> +
>  /*
>   * Point the AGI unlinked bucket at an inode and log the results.  The caller
>   * is responsible for validating the old value.
> @@ -2111,7 +2144,8 @@ xfs_iunlink_update_inode(
>  	 * current pointer is the same as the new value, unless we're
>  	 * terminating the list.
>  	 */
> -	*old_next_agino = old_value;
> +	if (old_next_agino)
> +		*old_next_agino = old_value;
>  	if (old_value == next_agino) {
>  		if (next_agino != NULLAGINO) {
>  			xfs_inode_verifier_error(ip, -EFSCORRUPTED, __func__,
> @@ -2217,38 +2251,6 @@ xfs_iunlink(
>  	return error;
>  }
>  
> -/* Return the imap, dinode pointer, and buffer for an inode. */
> -STATIC int
> -xfs_iunlink_map_ino(
> -	struct xfs_trans	*tp,
> -	xfs_agnumber_t		agno,
> -	xfs_agino_t		agino,
> -	struct xfs_imap		*imap,
> -	struct xfs_dinode	**dipp,
> -	struct xfs_buf		**bpp)
> -{
> -	struct xfs_mount	*mp = tp->t_mountp;
> -	int			error;
> -
> -	imap->im_blkno = 0;
> -	error = xfs_imap(mp, tp, XFS_AGINO_TO_INO(mp, agno, agino), imap, 0);
> -	if (error) {
> -		xfs_warn(mp, "%s: xfs_imap returned error %d.",
> -				__func__, error);
> -		return error;
> -	}
> -
> -	error = xfs_imap_to_bp(mp, tp, imap, bpp);
> -	if (error) {
> -		xfs_warn(mp, "%s: xfs_imap_to_bp returned error %d.",
> -				__func__, error);
> -		return error;
> -	}
> -
> -	*dipp = xfs_buf_offset(*bpp, imap->im_boffset);
> -	return 0;
> -}
> -
>  /*
>   * Walk the unlinked chain from @head_agino until we find the inode that
>   * points to @target_agino.  Return the inode number, map, dinode pointer,
> @@ -2259,77 +2261,51 @@ xfs_iunlink_map_ino(
>   *
>   * Do not call this function if @target_agino is the head of the list.
>   */
> -STATIC int
> -xfs_iunlink_map_prev(
> -	struct xfs_trans	*tp,
> +static int
> +xfs_iunlink_lookup_prev(
>  	struct xfs_perag	*pag,
>  	xfs_agino_t		head_agino,
>  	xfs_agino_t		target_agino,
> -	xfs_agino_t		*agino,
> -	struct xfs_imap		*imap,
> -	struct xfs_dinode	**dipp,
> -	struct xfs_buf		**bpp)
> +	struct xfs_inode	**ipp)
>  {
> -	struct xfs_mount	*mp = tp->t_mountp;
> +	struct xfs_mount	*mp = pag->pag_mount;
> +	struct xfs_inode	*ip;
>  	xfs_agino_t		next_agino;
> -	int			error;
>  
> -	ASSERT(head_agino != target_agino);
> -	*bpp = NULL;
> +	*ipp = NULL;
>  
>  	/* See if our backref cache can find it faster. */
> -	*agino = xfs_iunlink_lookup_backref(pag, target_agino);
> -	if (*agino != NULLAGINO) {
> -		error = xfs_iunlink_map_ino(tp, pag->pag_agno, *agino, imap,
> -				dipp, bpp);
> -		if (error)
> -			return error;
> -
> -		if (be32_to_cpu((*dipp)->di_next_unlinked) == target_agino)
> +	next_agino = xfs_iunlink_lookup_backref(pag, target_agino);
> +	if (next_agino != NULLAGINO) {
> +		ip = xfs_iunlink_lookup(pag, next_agino);
> +		if (ip && ip->i_next_unlinked == target_agino) {
> +			*ipp = ip;
>  			return 0;
> -
> -		/*
> -		 * If we get here the cache contents were corrupt, so drop the
> -		 * buffer and fall back to walking the bucket list.
> -		 */
> -		xfs_trans_brelse(tp, *bpp);
> -		*bpp = NULL;
> -		WARN_ON_ONCE(1);
> +		}
>  	}
>  
> -	trace_xfs_iunlink_map_prev_fallback(mp, pag->pag_agno);

Might want to remove this tracepoint from xfs_trace.h.  The rest looks
ok though.

--D

> -
>  	/* Otherwise, walk the entire bucket until we find it. */
>  	next_agino = head_agino;
> -	while (next_agino != target_agino) {
> -		xfs_agino_t	unlinked_agino;
> +	while (next_agino != NULLAGINO) {
> +		ip = xfs_iunlink_lookup(pag, next_agino);
> +		if (!ip)
> +			return -EFSCORRUPTED;
>  
> -		if (*bpp)
> -			xfs_trans_brelse(tp, *bpp);
> -
> -		*agino = next_agino;
> -		error = xfs_iunlink_map_ino(tp, pag->pag_agno, next_agino, imap,
> -				dipp, bpp);
> -		if (error)
> -			return error;
> -
> -		unlinked_agino = be32_to_cpu((*dipp)->di_next_unlinked);
>  		/*
>  		 * Make sure this pointer is valid and isn't an obvious
>  		 * infinite loop.
>  		 */
> -		if (!xfs_verify_agino(mp, pag->pag_agno, unlinked_agino) ||
> -		    next_agino == unlinked_agino) {
> -			XFS_CORRUPTION_ERROR(__func__,
> -					XFS_ERRLEVEL_LOW, mp,
> -					*dipp, sizeof(**dipp));
> -			error = -EFSCORRUPTED;
> -			return error;
> +		if (!xfs_verify_agino(mp, pag->pag_agno, ip->i_next_unlinked) ||
> +		    next_agino == ip->i_next_unlinked)
> +			return -EFSCORRUPTED;
> +
> +		if (ip->i_next_unlinked == target_agino) {
> +			*ipp = ip;
> +			return 0;
>  		}
> -		next_agino = unlinked_agino;
> +		next_agino = ip->i_next_unlinked;
>  	}
> -
> -	return 0;
> +	return -EFSCORRUPTED;
>  }
>  
>  static int
> @@ -2341,8 +2317,6 @@ xfs_iunlink_remove_inode(
>  {
>  	struct xfs_mount	*mp = tp->t_mountp;
>  	struct xfs_agi		*agi = agibp->b_addr;
> -	struct xfs_buf		*last_ibp;
> -	struct xfs_dinode	*last_dip = NULL;
>  	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
>  	xfs_agino_t		next_agino;
>  	xfs_agino_t		head_agino;
> @@ -2368,7 +2342,6 @@ xfs_iunlink_remove_inode(
>  	error = xfs_iunlink_update_inode(tp, ip, pag, NULLAGINO, &next_agino);
>  	if (error)
>  		return error;
> -	ip->i_next_unlinked = NULLAGINO;
>  
>  	/*
>  	 * If there was a backref pointing from the next inode back to this
> @@ -2384,18 +2357,21 @@ xfs_iunlink_remove_inode(
>  	}
>  
>  	if (head_agino != agino) {
> -		struct xfs_imap	imap;
> -		xfs_agino_t	prev_agino;
> +		struct xfs_inode	*prev_ip;
>  
> -		/* We need to search the list for the inode being freed. */
> -		error = xfs_iunlink_map_prev(tp, pag, head_agino, agino,
> -				&prev_agino, &imap, &last_dip, &last_ibp);
> +		error = xfs_iunlink_lookup_prev(pag, head_agino, agino,
> +				&prev_ip);
>  		if (error)
>  			return error;
>  
>  		/* Point the previous inode on the list to the next inode. */
> -		xfs_iunlink_update_dinode(tp, pag, prev_agino, last_ibp,
> -				last_dip, &imap, next_agino);
> +		error = xfs_iunlink_update_inode(tp, prev_ip, pag, next_agino,
> +				NULL);
> +		if (error)
> +			return error;
> +
> +		prev_ip->i_next_unlinked = ip->i_next_unlinked;
> +		ip->i_next_unlinked = NULLAGINO;
>  
>  		/*
>  		 * Now we deal with the backref for this inode.  If this inode
> @@ -2410,6 +2386,7 @@ xfs_iunlink_remove_inode(
>  	}
>  
>  	/* Point the head of the list to the next unlinked inode. */
> +	ip->i_next_unlinked = NULLAGINO;
>  	return xfs_iunlink_update_bucket(tp, pag, agibp, bucket_index,
>  			next_agino);
>  }
> -- 
> 2.36.1
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 5/9] xfs: double link the unlinked inode list
  2022-06-27  0:43 ` [PATCH 5/9] xfs: double link the unlinked inode list Dave Chinner
  2022-06-29  7:20   ` Christoph Hellwig
@ 2022-06-29 21:06   ` Darrick J. Wong
  1 sibling, 0 replies; 35+ messages in thread
From: Darrick J. Wong @ 2022-06-29 21:06 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Mon, Jun 27, 2022 at 10:43:32AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Now we have forwards traversal via the incore inode in place, we now
> need to add back pointers to the incore inode to entirely replace
> the back reference cache. We use the same lookup semantics and
> constraints as for the forwards pointer lookups during unlinks, and
> so we can look up any inode in the unlinked list directly and update
> the list pointers, forwards or backwards, at any time.
> 
> The only wrinkle in converting the unlinked list manipulations to
> use in-core previous pointers is that log recovery doesn't have the
> incore inode state built up so it can't just read in an inode and
> release it to finish off the unlink. Hence we need to modify the
> traversal in recovery to read one inode ahead before we
> release the inode at the head of the list. This populates the
> next->prev relationship sufficient to be able to replay the unlinked
> list and hence greatly simplify the runtime code.
> 
> This recovery algorithm also requires that we actually remove inodes
> from the unlinked list one at a time as background inode
> inactivation will result in unlinked list removal racing with the
> building of the in-memory unlinked list state. We could serialise
> this by holding the AGI buffer lock when constructing the in memory
> state, but all that does is lockstep background processing with list
> building. It is much simpler to flush the inodegc immediately after
> releasing the inode so that it is unlinked immediately and there is
> no races present at all.

...and probably eliminates the secondary issue of iunlinks processing
OOMing the box if you put a few million inodes in the hash buckets and
recover on a crap system with like 560MB of memory, right? ;)

> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_ag.c   |   8 -
>  fs/xfs/libxfs/xfs_ag.h   |   6 -
>  fs/xfs/xfs_icache.c      |   2 +
>  fs/xfs/xfs_inode.c       | 348 +++++++--------------------------------
>  fs/xfs/xfs_inode.h       |   4 +-
>  fs/xfs/xfs_log_recover.c |  34 +++-
>  6 files changed, 90 insertions(+), 312 deletions(-)

Ahh my first rhashtable bites the dust.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D


> 
> diff --git a/fs/xfs/libxfs/xfs_ag.c b/fs/xfs/libxfs/xfs_ag.c
> index 3e920cf1b454..2de67389fe60 100644
> --- a/fs/xfs/libxfs/xfs_ag.c
> +++ b/fs/xfs/libxfs/xfs_ag.c
> @@ -194,7 +194,6 @@ xfs_free_perag(
>  		XFS_IS_CORRUPT(pag->pag_mount, atomic_read(&pag->pag_ref) != 0);
>  
>  		cancel_delayed_work_sync(&pag->pag_blockgc_work);
> -		xfs_iunlink_destroy(pag);
>  		xfs_buf_hash_destroy(pag);
>  
>  		call_rcu(&pag->rcu_head, __xfs_free_perag);
> @@ -263,10 +262,6 @@ xfs_initialize_perag(
>  		if (error)
>  			goto out_remove_pag;
>  
> -		error = xfs_iunlink_init(pag);
> -		if (error)
> -			goto out_hash_destroy;
> -
>  		/* first new pag is fully initialized */
>  		if (first_initialised == NULLAGNUMBER)
>  			first_initialised = index;
> @@ -280,8 +275,6 @@ xfs_initialize_perag(
>  	mp->m_ag_prealloc_blocks = xfs_prealloc_blocks(mp);
>  	return 0;
>  
> -out_hash_destroy:
> -	xfs_buf_hash_destroy(pag);
>  out_remove_pag:
>  	radix_tree_delete(&mp->m_perag_tree, index);
>  out_free_pag:
> @@ -293,7 +286,6 @@ xfs_initialize_perag(
>  		if (!pag)
>  			break;
>  		xfs_buf_hash_destroy(pag);
> -		xfs_iunlink_destroy(pag);
>  		kmem_free(pag);
>  	}
>  	return error;
> diff --git a/fs/xfs/libxfs/xfs_ag.h b/fs/xfs/libxfs/xfs_ag.h
> index e411d51c2589..bd5f715d116d 100644
> --- a/fs/xfs/libxfs/xfs_ag.h
> +++ b/fs/xfs/libxfs/xfs_ag.h
> @@ -97,12 +97,6 @@ struct xfs_perag {
>  	/* background prealloc block trimming */
>  	struct delayed_work	pag_blockgc_work;
>  
> -	/*
> -	 * Unlinked inode information.  This incore information reflects
> -	 * data stored in the AGI, so callers must hold the AGI buffer lock
> -	 * or have some other means to control concurrency.
> -	 */
> -	struct rhashtable	pagi_unlinked_hash;
>  #endif /* __KERNEL__ */
>  };
>  
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 5269354b1b69..9fc324a29535 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -111,6 +111,8 @@ xfs_inode_alloc(
>  	INIT_WORK(&ip->i_ioend_work, xfs_end_io);
>  	INIT_LIST_HEAD(&ip->i_ioend_list);
>  	spin_lock_init(&ip->i_ioend_lock);
> +	ip->i_next_unlinked = NULLAGINO;
> +	ip->i_prev_unlinked = NULLAGINO;
>  
>  	return ip;
>  }
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index e6ac13174062..f390a91243bf 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -1815,196 +1815,21 @@ xfs_inactive(
>   * because we must walk that list to find the inode that points to the inode
>   * being removed from the unlinked hash bucket list.
>   *
> - * What if we modelled the unlinked list as a collection of records capturing
> - * "X.next_unlinked = Y" relations?  If we indexed those records on Y, we'd
> - * have a fast way to look up unlinked list predecessors, which avoids the
> - * slow list walk.  That's exactly what we do here (in-core) with a per-AG
> - * rhashtable.
> + * Hence we keep an in-memory double linked list to link each inode on an
> + * unlinked list. Because there are 64 unlinked lists per AGI, keeping pointer
> + * based lists qould require having 64 list heads in the perag, one for each
> + * list. This is expensive in terms of memory (think millions of AGs) and cache
> + * misses on lookups. Instead, use the fact that inodes on the unlinked list
> + * must be referenced at the VFS level to keep them on the list and hence we
> + * have an existence guarantee for inodes on the unlinked list.
>   *
> - * Because this is a backref cache, we ignore operational failures since the
> - * iunlink code can fall back to the slow bucket walk.  The only errors that
> - * should bubble out are for obviously incorrect situations.
> - *
> - * All users of the backref cache MUST hold the AGI buffer lock to serialize
> - * access or have otherwise provided for concurrency control.
> - */
> -
> -/* Capture a "X.next_unlinked = Y" relationship. */
> -struct xfs_iunlink {
> -	struct rhash_head	iu_rhash_head;
> -	xfs_agino_t		iu_agino;		/* X */
> -	xfs_agino_t		iu_next_unlinked;	/* Y */
> -};
> -
> -/* Unlinked list predecessor lookup hashtable construction */
> -static int
> -xfs_iunlink_obj_cmpfn(
> -	struct rhashtable_compare_arg	*arg,
> -	const void			*obj)
> -{
> -	const xfs_agino_t		*key = arg->key;
> -	const struct xfs_iunlink	*iu = obj;
> -
> -	if (iu->iu_next_unlinked != *key)
> -		return 1;
> -	return 0;
> -}
> -
> -static const struct rhashtable_params xfs_iunlink_hash_params = {
> -	.min_size		= XFS_AGI_UNLINKED_BUCKETS,
> -	.key_len		= sizeof(xfs_agino_t),
> -	.key_offset		= offsetof(struct xfs_iunlink,
> -					   iu_next_unlinked),
> -	.head_offset		= offsetof(struct xfs_iunlink, iu_rhash_head),
> -	.automatic_shrinking	= true,
> -	.obj_cmpfn		= xfs_iunlink_obj_cmpfn,
> -};
> -
> -/*
> - * Return X, where X.next_unlinked == @agino.  Returns NULLAGINO if no such
> - * relation is found.
> - */
> -static xfs_agino_t
> -xfs_iunlink_lookup_backref(
> -	struct xfs_perag	*pag,
> -	xfs_agino_t		agino)
> -{
> -	struct xfs_iunlink	*iu;
> -
> -	iu = rhashtable_lookup_fast(&pag->pagi_unlinked_hash, &agino,
> -			xfs_iunlink_hash_params);
> -	return iu ? iu->iu_agino : NULLAGINO;
> -}
> -
> -/*
> - * Take ownership of an iunlink cache entry and insert it into the hash table.
> - * If successful, the entry will be owned by the cache; if not, it is freed.
> - * Either way, the caller does not own @iu after this call.
> - */
> -static int
> -xfs_iunlink_insert_backref(
> -	struct xfs_perag	*pag,
> -	struct xfs_iunlink	*iu)
> -{
> -	int			error;
> -
> -	error = rhashtable_insert_fast(&pag->pagi_unlinked_hash,
> -			&iu->iu_rhash_head, xfs_iunlink_hash_params);
> -	/*
> -	 * Fail loudly if there already was an entry because that's a sign of
> -	 * corruption of in-memory data.  Also fail loudly if we see an error
> -	 * code we didn't anticipate from the rhashtable code.  Currently we
> -	 * only anticipate ENOMEM.
> -	 */
> -	if (error) {
> -		WARN(error != -ENOMEM, "iunlink cache insert error %d", error);
> -		kmem_free(iu);
> -	}
> -	/*
> -	 * Absorb any runtime errors that aren't a result of corruption because
> -	 * this is a cache and we can always fall back to bucket list scanning.
> -	 */
> -	if (error != 0 && error != -EEXIST)
> -		error = 0;
> -	return error;
> -}
> -
> -/* Remember that @prev_agino.next_unlinked = @this_agino. */
> -static int
> -xfs_iunlink_add_backref(
> -	struct xfs_perag	*pag,
> -	xfs_agino_t		prev_agino,
> -	xfs_agino_t		this_agino)
> -{
> -	struct xfs_iunlink	*iu;
> -
> -	if (XFS_TEST_ERROR(false, pag->pag_mount, XFS_ERRTAG_IUNLINK_FALLBACK))
> -		return 0;
> -
> -	iu = kmem_zalloc(sizeof(*iu), KM_NOFS);
> -	iu->iu_agino = prev_agino;
> -	iu->iu_next_unlinked = this_agino;
> -
> -	return xfs_iunlink_insert_backref(pag, iu);
> -}
> -
> -/*
> - * Replace X.next_unlinked = @agino with X.next_unlinked = @next_unlinked.
> - * If @next_unlinked is NULLAGINO, we drop the backref and exit.  If there
> - * wasn't any such entry then we don't bother.
> + * Given we have an existence guarantee, we can use lockless inode cache lookups
> + * to resolve aginos to xfs inodes. This means we only need 8 bytes per inode
> + * for the double linked unlinked list, and we don't need any extra locking to
> + * keep the list safe as all manipulations are done under the AGI buffer lock.
> + * Keeping the list up to date does not require memory allocation, just finding
> + * the XFS inode and updating the next/prev unlinked list aginos.
>   */
> -static int
> -xfs_iunlink_change_backref(
> -	struct xfs_perag	*pag,
> -	xfs_agino_t		agino,
> -	xfs_agino_t		next_unlinked)
> -{
> -	struct xfs_iunlink	*iu;
> -	int			error;
> -
> -	/* Look up the old entry; if there wasn't one then exit. */
> -	iu = rhashtable_lookup_fast(&pag->pagi_unlinked_hash, &agino,
> -			xfs_iunlink_hash_params);
> -	if (!iu)
> -		return 0;
> -
> -	/*
> -	 * Remove the entry.  This shouldn't ever return an error, but if we
> -	 * couldn't remove the old entry we don't want to add it again to the
> -	 * hash table, and if the entry disappeared on us then someone's
> -	 * violated the locking rules and we need to fail loudly.  Either way
> -	 * we cannot remove the inode because internal state is or would have
> -	 * been corrupt.
> -	 */
> -	error = rhashtable_remove_fast(&pag->pagi_unlinked_hash,
> -			&iu->iu_rhash_head, xfs_iunlink_hash_params);
> -	if (error)
> -		return error;
> -
> -	/* If there is no new next entry just free our item and return. */
> -	if (next_unlinked == NULLAGINO) {
> -		kmem_free(iu);
> -		return 0;
> -	}
> -
> -	/* Update the entry and re-add it to the hash table. */
> -	iu->iu_next_unlinked = next_unlinked;
> -	return xfs_iunlink_insert_backref(pag, iu);
> -}
> -
> -/* Set up the in-core predecessor structures. */
> -int
> -xfs_iunlink_init(
> -	struct xfs_perag	*pag)
> -{
> -	return rhashtable_init(&pag->pagi_unlinked_hash,
> -			&xfs_iunlink_hash_params);
> -}
> -
> -/* Free the in-core predecessor structures. */
> -static void
> -xfs_iunlink_free_item(
> -	void			*ptr,
> -	void			*arg)
> -{
> -	struct xfs_iunlink	*iu = ptr;
> -	bool			*freed_anything = arg;
> -
> -	*freed_anything = true;
> -	kmem_free(iu);
> -}
> -
> -void
> -xfs_iunlink_destroy(
> -	struct xfs_perag	*pag)
> -{
> -	bool			freed_anything = false;
> -
> -	rhashtable_free_and_destroy(&pag->pagi_unlinked_hash,
> -			xfs_iunlink_free_item, &freed_anything);
> -
> -	ASSERT(freed_anything == false || xfs_is_shutdown(pag->pag_mount));
> -}
>  
>  /*
>   * Find an inode on the unlinked list. This does not take references to the
> @@ -2039,6 +1864,26 @@ xfs_iunlink_lookup(
>  	return ip;
>  }
>  
> +/* Update the prev pointer of the next agino. */
> +static int
> +xfs_iunlink_update_backref(
> +	struct xfs_perag	*pag,
> +	xfs_agino_t		prev_agino,
> +	xfs_agino_t		next_agino)
> +{
> +	struct xfs_inode	*ip;
> +
> +	/* No update necessary if we are at the end of the list. */
> +	if (next_agino == NULLAGINO)
> +		return 0;
> +
> +	ip = xfs_iunlink_lookup(pag, next_agino);
> +	if (!ip)
> +		return -EFSCORRUPTED;
> +	ip->i_prev_unlinked = prev_agino;
> +	return 0;
> +}
> +
>  /*
>   * Point the AGI unlinked bucket at an inode and log the results.  The caller
>   * is responsible for validating the old value.
> @@ -2170,7 +2015,7 @@ xfs_iunlink_insert_inode(
>  	struct xfs_perag	*pag,
>  	struct xfs_buf		*agibp,
>  	struct xfs_inode	*ip)
> - {
> +{
>  	struct xfs_mount	*mp = tp->t_mountp;
>  	struct xfs_agi		*agi = agibp->b_addr;
>  	xfs_agino_t		next_agino;
> @@ -2190,6 +2035,14 @@ xfs_iunlink_insert_inode(
>  		return -EFSCORRUPTED;
>  	}
>  
> +	/*
> +	 * Update the prev pointer in the next inode to point back to this
> +	 * inode.
> +	 */
> +	error = xfs_iunlink_update_backref(pag, agino, next_agino);
> +	if (error)
> +		return error;
> +
>  	if (next_agino != NULLAGINO) {
>  		xfs_agino_t		old_agino;
>  
> @@ -2203,14 +2056,6 @@ xfs_iunlink_insert_inode(
>  			return error;
>  		ASSERT(old_agino == NULLAGINO);
>  		ip->i_next_unlinked = next_agino;
> -
> -		/*
> -		 * agino has been unlinked, add a backref from the next inode
> -		 * back to agino.
> -		 */
> -		error = xfs_iunlink_add_backref(pag, agino, next_agino);
> -		if (error)
> -			return error;
>  	}
>  
>  	/* Point the head of the list to point to this inode. */
> @@ -2251,63 +2096,6 @@ xfs_iunlink(
>  	return error;
>  }
>  
> -/*
> - * Walk the unlinked chain from @head_agino until we find the inode that
> - * points to @target_agino.  Return the inode number, map, dinode pointer,
> - * and inode cluster buffer of that inode as @agino, @imap, @dipp, and @bpp.
> - *
> - * @tp, @pag, @head_agino, and @target_agino are input parameters.
> - * @agino, @imap, @dipp, and @bpp are all output parameters.
> - *
> - * Do not call this function if @target_agino is the head of the list.
> - */
> -static int
> -xfs_iunlink_lookup_prev(
> -	struct xfs_perag	*pag,
> -	xfs_agino_t		head_agino,
> -	xfs_agino_t		target_agino,
> -	struct xfs_inode	**ipp)
> -{
> -	struct xfs_mount	*mp = pag->pag_mount;
> -	struct xfs_inode	*ip;
> -	xfs_agino_t		next_agino;
> -
> -	*ipp = NULL;
> -
> -	/* See if our backref cache can find it faster. */
> -	next_agino = xfs_iunlink_lookup_backref(pag, target_agino);
> -	if (next_agino != NULLAGINO) {
> -		ip = xfs_iunlink_lookup(pag, next_agino);
> -		if (ip && ip->i_next_unlinked == target_agino) {
> -			*ipp = ip;
> -			return 0;
> -		}
> -	}
> -
> -	/* Otherwise, walk the entire bucket until we find it. */
> -	next_agino = head_agino;
> -	while (next_agino != NULLAGINO) {
> -		ip = xfs_iunlink_lookup(pag, next_agino);
> -		if (!ip)
> -			return -EFSCORRUPTED;
> -
> -		/*
> -		 * Make sure this pointer is valid and isn't an obvious
> -		 * infinite loop.
> -		 */
> -		if (!xfs_verify_agino(mp, pag->pag_agno, ip->i_next_unlinked) ||
> -		    next_agino == ip->i_next_unlinked)
> -			return -EFSCORRUPTED;
> -
> -		if (ip->i_next_unlinked == target_agino) {
> -			*ipp = ip;
> -			return 0;
> -		}
> -		next_agino = ip->i_next_unlinked;
> -	}
> -	return -EFSCORRUPTED;
> -}
> -
>  static int
>  xfs_iunlink_remove_inode(
>  	struct xfs_trans	*tp,
> @@ -2344,51 +2132,33 @@ xfs_iunlink_remove_inode(
>  		return error;
>  
>  	/*
> -	 * If there was a backref pointing from the next inode back to this
> -	 * one, remove it because we've removed this inode from the list.
> -	 *
> -	 * Later, if this inode was in the middle of the list we'll update
> -	 * this inode's backref to point from the next inode.
> +	 * Update the prev pointer in the next inode to point back to previous
> +	 * inode in the chain.
>  	 */
> -	if (next_agino != NULLAGINO) {
> -		error = xfs_iunlink_change_backref(pag, next_agino, NULLAGINO);
> -		if (error)
> -			return error;
> -	}
> +	error = xfs_iunlink_update_backref(pag, ip->i_prev_unlinked,
> +			ip->i_next_unlinked);
> +	if (error)
> +		return error;
>  
>  	if (head_agino != agino) {
>  		struct xfs_inode	*prev_ip;
>  
> -		error = xfs_iunlink_lookup_prev(pag, head_agino, agino,
> -				&prev_ip);
> -		if (error)
> -			return error;
> -
> -		/* Point the previous inode on the list to the next inode. */
> -		error = xfs_iunlink_update_inode(tp, prev_ip, pag, next_agino,
> -				NULL);
> -		if (error)
> -			return error;
> +		prev_ip = xfs_iunlink_lookup(pag, ip->i_prev_unlinked);
> +		if (!prev_ip)
> +			return -EFSCORRUPTED;
>  
> +		error = xfs_iunlink_update_inode(tp, prev_ip, pag,
> +				ip->i_next_unlinked, NULL);
>  		prev_ip->i_next_unlinked = ip->i_next_unlinked;
> -		ip->i_next_unlinked = NULLAGINO;
> -
> -		/*
> -		 * Now we deal with the backref for this inode.  If this inode
> -		 * pointed at a real inode, change the backref that pointed to
> -		 * us to point to our old next.  If this inode was the end of
> -		 * the list, delete the backref that pointed to us.  Note that
> -		 * change_backref takes care of deleting the backref if
> -		 * next_agino is NULLAGINO.
> -		 */
> -		return xfs_iunlink_change_backref(agibp->b_pag, agino,
> -				next_agino);
> +	} else {
> +		/* Point the head of the list to the next unlinked inode. */
> +		error = xfs_iunlink_update_bucket(tp, pag, agibp, bucket_index,
> +				ip->i_next_unlinked);
>  	}
>  
> -	/* Point the head of the list to the next unlinked inode. */
>  	ip->i_next_unlinked = NULLAGINO;
> -	return xfs_iunlink_update_bucket(tp, pag, agibp, bucket_index,
> -			next_agino);
> +	ip->i_prev_unlinked = NULLAGINO;
> +	return error;
>  }
>  
>  /*
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index 8e2a33c6cbe2..8d8cce61e5ba 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -70,6 +70,7 @@ typedef struct xfs_inode {
>  
>  	/* unlinked list pointers */
>  	xfs_agino_t		i_next_unlinked;
> +	xfs_agino_t		i_prev_unlinked;
>  
>  	/* VFS inode */
>  	struct inode		i_vnode;	/* embedded VFS inode */
> @@ -508,9 +509,6 @@ extern struct kmem_cache	*xfs_inode_cache;
>  
>  bool xfs_inode_needs_inactive(struct xfs_inode *ip);
>  
> -int xfs_iunlink_init(struct xfs_perag *pag);
> -void xfs_iunlink_destroy(struct xfs_perag *pag);
> -
>  void xfs_end_io(struct work_struct *work);
>  
>  int xfs_ilock2_io_mmap(struct xfs_inode *ip1, struct xfs_inode *ip2);
> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> index 7d0f530d7e3c..8e97f4240b93 100644
> --- a/fs/xfs/xfs_log_recover.c
> +++ b/fs/xfs/xfs_log_recover.c
> @@ -2674,28 +2674,50 @@ xlog_recover_iunlink_bucket(
>  	struct xfs_agi		*agi,
>  	int			bucket)
>  {
> +	struct xfs_inode	*prev_ip = NULL;
>  	struct xfs_inode	*ip;
> -	xfs_agino_t		agino;
> +	xfs_agino_t		prev_agino, agino;
> +	int			error = 0;
>  
>  	agino = be32_to_cpu(agi->agi_unlinked[bucket]);
>  	while (agino != NULLAGINO) {
> -		int	error;
>  
>  		error = xfs_iget(mp, NULL,
>  				XFS_AGINO_TO_INO(mp, pag->pag_agno, agino),
>  				0, 0, &ip);
>  		if (error)
> -			return error;;
> +			break;
>  
>  		ASSERT(VFS_I(ip)->i_nlink == 0);
>  		ASSERT(VFS_I(ip)->i_mode != 0);
>  		xfs_iflags_clear(ip, XFS_IRECOVERY);
>  		agino = ip->i_next_unlinked;
>  
> -		xfs_irele(ip);
> -		cond_resched();
> +		if (prev_ip) {
> +			ip->i_prev_unlinked = prev_agino;
> +			xfs_irele(prev_ip);
> +
> +			/*
> +			 * Ensure the inode is removed from the unlinked list
> +			 * before we continue so that it race with building the
> +			 * in-memory list here. THis could be serialised with
> +			 * the agibp lock, but that just serialises via
> +			 * lockstepping and it's much simpler just to flush the
> +			 * inodegc queue and wait for it to complete.
> +			 */
> +			xfs_inodegc_flush(mp);
> +		}
> +
> +		prev_agino = agino;
> +		prev_ip = ip;
>  	}
> -	return 0;
> +
> +	if (prev_ip) {
> +		ip->i_prev_unlinked = prev_agino;
> +		xfs_irele(prev_ip);
> +	}
> +	xfs_inodegc_flush(mp);
> +	return error;
>  }
>  
>  /*
> -- 
> 2.36.1
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 6/9] xfs: clean up xfs_iunlink_update_inode()
  2022-06-27  0:43 ` [PATCH 6/9] xfs: clean up xfs_iunlink_update_inode() Dave Chinner
  2022-06-29  7:21   ` Christoph Hellwig
@ 2022-06-29 21:07   ` Darrick J. Wong
  1 sibling, 0 replies; 35+ messages in thread
From: Darrick J. Wong @ 2022-06-29 21:07 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Mon, Jun 27, 2022 at 10:43:33AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> We no longer need to have this function return the previous next
> agino value from the on-disk inode as we have it in the in-core
> inode now.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_inode.c | 18 +++++-------------
>  1 file changed, 5 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index f390a91243bf..8d4edb8129b5 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -1953,13 +1953,12 @@ xfs_iunlink_update_dinode(
>  }
>  
>  /* Set an in-core inode's unlinked pointer and return the old value. */
> -STATIC int
> +static int
>  xfs_iunlink_update_inode(
>  	struct xfs_trans	*tp,
>  	struct xfs_inode	*ip,
>  	struct xfs_perag	*pag,
> -	xfs_agino_t		next_agino,
> -	xfs_agino_t		*old_next_agino)
> +	xfs_agino_t		next_agino)
>  {
>  	struct xfs_mount	*mp = tp->t_mountp;
>  	struct xfs_dinode	*dip;
> @@ -1989,8 +1988,6 @@ xfs_iunlink_update_inode(
>  	 * current pointer is the same as the new value, unless we're
>  	 * terminating the list.
>  	 */
> -	if (old_next_agino)
> -		*old_next_agino = old_value;
>  	if (old_value == next_agino) {
>  		if (next_agino != NULLAGINO) {
>  			xfs_inode_verifier_error(ip, -EFSCORRUPTED, __func__,
> @@ -2044,17 +2041,13 @@ xfs_iunlink_insert_inode(
>  		return error;
>  
>  	if (next_agino != NULLAGINO) {
> -		xfs_agino_t		old_agino;
> -
>  		/*
>  		 * There is already another inode in the bucket, so point this
>  		 * inode to the current head of the list.
>  		 */
> -		error = xfs_iunlink_update_inode(tp, ip, pag, next_agino,
> -				&old_agino);
> +		error = xfs_iunlink_update_inode(tp, ip, pag, next_agino);
>  		if (error)
>  			return error;
> -		ASSERT(old_agino == NULLAGINO);
>  		ip->i_next_unlinked = next_agino;
>  	}
>  
> @@ -2106,7 +2099,6 @@ xfs_iunlink_remove_inode(
>  	struct xfs_mount	*mp = tp->t_mountp;
>  	struct xfs_agi		*agi = agibp->b_addr;
>  	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
> -	xfs_agino_t		next_agino;
>  	xfs_agino_t		head_agino;
>  	short			bucket_index = agino % XFS_AGI_UNLINKED_BUCKETS;
>  	int			error;
> @@ -2127,7 +2119,7 @@ xfs_iunlink_remove_inode(
>  	 * the old pointer value so that we can update whatever was previous
>  	 * to us in the list to point to whatever was next in the list.
>  	 */
> -	error = xfs_iunlink_update_inode(tp, ip, pag, NULLAGINO, &next_agino);
> +	error = xfs_iunlink_update_inode(tp, ip, pag, NULLAGINO);
>  	if (error)
>  		return error;
>  
> @@ -2148,7 +2140,7 @@ xfs_iunlink_remove_inode(
>  			return -EFSCORRUPTED;
>  
>  		error = xfs_iunlink_update_inode(tp, prev_ip, pag,
> -				ip->i_next_unlinked, NULL);
> +				ip->i_next_unlinked);
>  		prev_ip->i_next_unlinked = ip->i_next_unlinked;
>  	} else {
>  		/* Point the head of the list to the next unlinked inode. */
> -- 
> 2.36.1
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 7/9] xfs: combine iunlink inode update functions
  2022-06-27  0:43 ` [PATCH 7/9] xfs: combine iunlink inode update functions Dave Chinner
  2022-06-29  7:21   ` Christoph Hellwig
@ 2022-06-29 21:07   ` Darrick J. Wong
  1 sibling, 0 replies; 35+ messages in thread
From: Darrick J. Wong @ 2022-06-29 21:07 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Mon, Jun 27, 2022 at 10:43:34AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Combine the logging of the inode unlink list update into the
> calling function that looks up the buffer we end up logging. These
> do not need to be separate functions as they are both short, simple
> operations and there's only a single call path through them. This
> new function will end up being the core of the iunlink log item
> processing...
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Looks ok,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_inode.c | 52 ++++++++++++++--------------------------------
>  1 file changed, 16 insertions(+), 36 deletions(-)
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 8d4edb8129b5..d6c88a27f29d 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -1923,38 +1923,9 @@ xfs_iunlink_update_bucket(
>  	return 0;
>  }
>  
> -/* Set an on-disk inode's next_unlinked pointer. */
> -STATIC void
> -xfs_iunlink_update_dinode(
> -	struct xfs_trans	*tp,
> -	struct xfs_perag	*pag,
> -	xfs_agino_t		agino,
> -	struct xfs_buf		*ibp,
> -	struct xfs_dinode	*dip,
> -	struct xfs_imap		*imap,
> -	xfs_agino_t		next_agino)
> -{
> -	struct xfs_mount	*mp = tp->t_mountp;
> -	int			offset;
> -
> -	ASSERT(xfs_verify_agino_or_null(mp, pag->pag_agno, next_agino));
> -
> -	trace_xfs_iunlink_update_dinode(mp, pag->pag_agno, agino,
> -			be32_to_cpu(dip->di_next_unlinked), next_agino);
> -
> -	dip->di_next_unlinked = cpu_to_be32(next_agino);
> -	offset = imap->im_boffset +
> -			offsetof(struct xfs_dinode, di_next_unlinked);
> -
> -	/* need to recalc the inode CRC if appropriate */
> -	xfs_dinode_calc_crc(mp, dip);
> -	xfs_trans_inode_buf(tp, ibp);
> -	xfs_trans_log_buf(tp, ibp, offset, offset + sizeof(xfs_agino_t) - 1);
> -}
> -
>  /* Set an in-core inode's unlinked pointer and return the old value. */
>  static int
> -xfs_iunlink_update_inode(
> +xfs_iunlink_log_inode(
>  	struct xfs_trans	*tp,
>  	struct xfs_inode	*ip,
>  	struct xfs_perag	*pag,
> @@ -1964,6 +1935,7 @@ xfs_iunlink_update_inode(
>  	struct xfs_dinode	*dip;
>  	struct xfs_buf		*ibp;
>  	xfs_agino_t		old_value;
> +	int			offset;
>  	int			error;
>  
>  	ASSERT(xfs_verify_agino_or_null(mp, pag->pag_agno, next_agino));
> @@ -1997,9 +1969,17 @@ xfs_iunlink_update_inode(
>  		goto out;
>  	}
>  
> -	/* Ok, update the new pointer. */
> -	xfs_iunlink_update_dinode(tp, pag, XFS_INO_TO_AGINO(mp, ip->i_ino),
> -			ibp, dip, &ip->i_imap, next_agino);
> +	trace_xfs_iunlink_update_dinode(mp, pag->pag_agno,
> +			XFS_INO_TO_AGINO(mp, ip->i_ino),
> +			be32_to_cpu(dip->di_next_unlinked), next_agino);
> +
> +	dip->di_next_unlinked = cpu_to_be32(next_agino);
> +	offset = ip->i_imap.im_boffset +
> +			offsetof(struct xfs_dinode, di_next_unlinked);
> +
> +	xfs_dinode_calc_crc(mp, dip);
> +	xfs_trans_inode_buf(tp, ibp);
> +	xfs_trans_log_buf(tp, ibp, offset, offset + sizeof(xfs_agino_t) - 1);
>  	return 0;
>  out:
>  	xfs_trans_brelse(tp, ibp);
> @@ -2045,7 +2025,7 @@ xfs_iunlink_insert_inode(
>  		 * There is already another inode in the bucket, so point this
>  		 * inode to the current head of the list.
>  		 */
> -		error = xfs_iunlink_update_inode(tp, ip, pag, next_agino);
> +		error = xfs_iunlink_log_inode(tp, ip, pag, next_agino);
>  		if (error)
>  			return error;
>  		ip->i_next_unlinked = next_agino;
> @@ -2119,7 +2099,7 @@ xfs_iunlink_remove_inode(
>  	 * the old pointer value so that we can update whatever was previous
>  	 * to us in the list to point to whatever was next in the list.
>  	 */
> -	error = xfs_iunlink_update_inode(tp, ip, pag, NULLAGINO);
> +	error = xfs_iunlink_log_inode(tp, ip, pag, NULLAGINO);
>  	if (error)
>  		return error;
>  
> @@ -2139,7 +2119,7 @@ xfs_iunlink_remove_inode(
>  		if (!prev_ip)
>  			return -EFSCORRUPTED;
>  
> -		error = xfs_iunlink_update_inode(tp, prev_ip, pag,
> +		error = xfs_iunlink_log_inode(tp, prev_ip, pag,
>  				ip->i_next_unlinked);
>  		prev_ip->i_next_unlinked = ip->i_next_unlinked;
>  	} else {
> -- 
> 2.36.1
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 8/9] xfs: add log item precommit operation
  2022-06-27  0:43 ` [PATCH 8/9] xfs: add log item precommit operation Dave Chinner
@ 2022-06-29 21:16   ` Darrick J. Wong
  2022-06-29 21:34     ` Dave Chinner
  0 siblings, 1 reply; 35+ messages in thread
From: Darrick J. Wong @ 2022-06-29 21:16 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Mon, Jun 27, 2022 at 10:43:35AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> For inodes that are dirty, we have an attached cluster buffer that
> we want to use to track the dirty inode through the AIL.
> Unfortunately, locking the cluster buffer and adding it to the
> transaction when the inode is first logged in a transaction leads to
> buffer lock ordering inversions.
> 
> The specific problem is ordering against the AGI buffer. When
> modifying unlinked lists, the buffer lock order is AGI -> inode
> cluster buffer as the AGI buffer lock serialises all access to the
> unlinked lists. Unfortunately, functionality like xfs_droplink()
> logs the inode before calling xfs_iunlink(), as do various directory
> manipulation functions. The inode can be logged way down in the
> stack as far as the bmapi routines and hence, without a major
> rewrite of lots of APIs there's no way we can avoid the inode being
> logged by something until after the AGI has been logged.
> 
> As we are going to be using ordered buffers for inode AIL tracking,
> there isn't a need to actually lock that buffer against modification
> as all the modifications are captured by logging the inode item
> itself. Hence we don't actually need to join the cluster buffer into
> the transaction until just before it is committed. This means we do
> not perturb any of the existing buffer lock orders in transactions,
> and the inode cluster buffer is always locked last in a transaction
> that doesn't otherwise touch inode cluster buffers.
> 
> We do this by introducing a precommit log item method. A log item
> method is used because it is likely dquots will be moved to this
> same ordered buffer tracking scheme and hence will need a similar

Oh?

> callout. This commit just introduces the mechanism; the inode item
> implementation is in followup commits.
> 
> The precommit items need to be sorted into consistent order as we
> may be locking multiple items here. Hence if we have two dirty
> inodes in cluster buffers A and B, and some other transaction has
> two separate dirty inodes in the same cluster buffers, locking them
> in different orders opens us up to ABBA deadlocks. Hence we sort the
> items on the transaction based on the presence of a sort log item
> method.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_icache.c |  1 +
>  fs/xfs/xfs_trans.c  | 91 +++++++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_trans.h  |  6 ++-
>  3 files changed, 96 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 9fc324a29535..374b3bafaeb0 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -914,6 +914,7 @@ xfs_reclaim_inode(
>  	ip->i_checked = 0;
>  	spin_unlock(&ip->i_flags_lock);
>  
> +	ASSERT(!ip->i_itemp || ip->i_itemp->ili_item.li_buf == NULL);
>  	xfs_iunlock(ip, XFS_ILOCK_EXCL);
>  
>  	XFS_STATS_INC(ip->i_mount, xs_ig_reclaims);
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index 82cf0189c0db..0acb31093d9f 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -844,6 +844,90 @@ xfs_trans_committed_bulk(
>  	spin_unlock(&ailp->ail_lock);
>  }
>  
> +/*
> + * Sort transaction items prior to running precommit operations. This will
> + * attempt to order the items such that they will always be locked in the same
> + * order. Items that have no sort function are moved to the end of the list
> + * and so are locked last (XXX: need to check the logic matches the comment).
> + *
> + * This may need refinement as different types of objects add sort functions.
> + *
> + * Function is more complex than it needs to be because we are comparing 64 bit
> + * values and the function only returns 32 bit values.
> + */
> +static int
> +xfs_trans_precommit_sort(
> +	void			*unused_arg,
> +	const struct list_head	*a,
> +	const struct list_head	*b)
> +{
> +	struct xfs_log_item	*lia = container_of(a,
> +					struct xfs_log_item, li_trans);
> +	struct xfs_log_item	*lib = container_of(b,
> +					struct xfs_log_item, li_trans);
> +	int64_t			diff;
> +
> +	/*
> +	 * If both items are non-sortable, leave them alone. If only one is
> +	 * sortable, move the non-sortable item towards the end of the list.
> +	 */
> +	if (!lia->li_ops->iop_sort && !lib->li_ops->iop_sort)
> +		return 0;
> +	if (!lia->li_ops->iop_sort)
> +		return 1;
> +	if (!lib->li_ops->iop_sort)
> +		return -1;
> +
> +	diff = lia->li_ops->iop_sort(lia) - lib->li_ops->iop_sort(lib);

I'm kinda surprised the iop_sort method doesn't take both log item
pointers, like most sorting-comparator functions?  But I'll see, maybe
you're doing something clever wrt ordering of log items of differing
types, and hence the ->iop_sort implementations are required to return
some absolute priority or something.

--D

> +	if (diff < 0)
> +		return -1;
> +	if (diff > 0)
> +		return 1;
> +	return 0;
> +}
> +
> +/*
> + * Run transaction precommit functions.
> + *
> + * If there is an error in any of the callouts, then stop immediately and
> + * trigger a shutdown to abort the transaction. There is no recovery possible
> + * from errors at this point as the transaction is dirty....
> + */
> +static int
> +xfs_trans_run_precommits(
> +	struct xfs_trans	*tp)
> +{
> +	struct xfs_mount	*mp = tp->t_mountp;
> +	struct xfs_log_item	*lip, *n;
> +	int			error = 0;
> +
> +	/*
> +	 * Sort the item list to avoid ABBA deadlocks with other transactions
> +	 * running precommit operations that lock multiple shared items such as
> +	 * inode cluster buffers.
> +	 */
> +	list_sort(NULL, &tp->t_items, xfs_trans_precommit_sort);
> +
> +	/*
> +	 * Precommit operations can remove the log item from the transaction
> +	 * if the log item exists purely to delay modifications until they
> +	 * can be ordered against other operations. Hence we have to use
> +	 * list_for_each_entry_safe() here.
> +	 */
> +	list_for_each_entry_safe(lip, n, &tp->t_items, li_trans) {
> +		if (!test_bit(XFS_LI_DIRTY, &lip->li_flags))
> +			continue;
> +		if (lip->li_ops->iop_precommit) {
> +			error = lip->li_ops->iop_precommit(tp, lip);
> +			if (error)
> +				break;
> +		}
> +	}
> +	if (error)
> +		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> +	return error;
> +}
> +
>  /*
>   * Commit the given transaction to the log.
>   *
> @@ -869,6 +953,13 @@ __xfs_trans_commit(
>  
>  	trace_xfs_trans_commit(tp, _RET_IP_);
>  
> +	error = xfs_trans_run_precommits(tp);
> +	if (error) {
> +		if (tp->t_flags & XFS_TRANS_PERM_LOG_RES)
> +			xfs_defer_cancel(tp);
> +		goto out_unreserve;
> +	}
> +
>  	/*
>  	 * Finish deferred items on final commit. Only permanent transactions
>  	 * should ever have deferred ops.
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index 9561f193e7e1..64062e3b788b 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -71,10 +71,12 @@ struct xfs_item_ops {
>  	void (*iop_format)(struct xfs_log_item *, struct xfs_log_vec *);
>  	void (*iop_pin)(struct xfs_log_item *);
>  	void (*iop_unpin)(struct xfs_log_item *, int remove);
> -	uint (*iop_push)(struct xfs_log_item *, struct list_head *);
> +	uint64_t (*iop_sort)(struct xfs_log_item *lip);
> +	int (*iop_precommit)(struct xfs_trans *tp, struct xfs_log_item *lip);
>  	void (*iop_committing)(struct xfs_log_item *lip, xfs_csn_t seq);
> -	void (*iop_release)(struct xfs_log_item *);
>  	xfs_lsn_t (*iop_committed)(struct xfs_log_item *, xfs_lsn_t);
> +	uint (*iop_push)(struct xfs_log_item *, struct list_head *);
> +	void (*iop_release)(struct xfs_log_item *);

Why did these get moved?

--D

>  	int (*iop_recover)(struct xfs_log_item *lip,
>  			   struct list_head *capture_list);
>  	bool (*iop_match)(struct xfs_log_item *item, uint64_t id);
> -- 
> 2.36.1
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 9/9] xfs: add in-memory iunlink log item
  2022-06-27  0:43 ` [PATCH 9/9] xfs: add in-memory iunlink log item Dave Chinner
  2022-06-29  7:25   ` Christoph Hellwig
@ 2022-06-29 21:21   ` Darrick J. Wong
  2022-06-29 21:44     ` Dave Chinner
  1 sibling, 1 reply; 35+ messages in thread
From: Darrick J. Wong @ 2022-06-29 21:21 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Mon, Jun 27, 2022 at 10:43:36AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Now that we have a clean operation to update the di_next_unlinked
> field of inode cluster buffers, we can easily defer this operation
> to transaction commit time so we can order the inode cluster buffer
> locking consistently.
> 
> TO do this, we introduce a new in-memory log item to track the
> unlinked list item modification that we are going to make. This
> follows the same observations as the in-memory double linked list
> used to track unlinked inodes in that the inodes on the list are
> pinned in memory and cannot go away, and hence we can simply
> reference them for the duration of the transaction without needing
> to take active references or pin them or look them up.
> 
> This allows us to pass the xfs_inode to the transaction commit code
> along with the modification to be made, and then order the logged
> modifications via the ->iop_sort and ->iop_precommit operations
> for the new log item type. As this is an in-memory log item, it
> doesn't have formatting, CIL or AIL operational hooks - it exists
> purely to run the inode unlink modifications and is then removed
> from the transaction item list and freed once the precommit
> operation has run.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/Makefile           |   1 +
>  fs/xfs/xfs_inode.c        |  64 +-------------
>  fs/xfs/xfs_iunlink_item.c | 180 ++++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_iunlink_item.h |  27 ++++++
>  fs/xfs/xfs_super.c        |  10 +++
>  5 files changed, 219 insertions(+), 63 deletions(-)
>  create mode 100644 fs/xfs/xfs_iunlink_item.c
>  create mode 100644 fs/xfs/xfs_iunlink_item.h
> 
> diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
> index b056cfc6398e..1131dd01e4fe 100644
> --- a/fs/xfs/Makefile
> +++ b/fs/xfs/Makefile
> @@ -106,6 +106,7 @@ xfs-y				+= xfs_log.o \
>  				   xfs_icreate_item.o \
>  				   xfs_inode_item.o \
>  				   xfs_inode_item_recover.o \
> +				   xfs_iunlink_item.o \
>  				   xfs_refcount_item.o \
>  				   xfs_rmap_item.o \
>  				   xfs_log_recover.o \
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index d6c88a27f29d..ef1dae9dbaa4 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -20,6 +20,7 @@
>  #include "xfs_trans.h"
>  #include "xfs_buf_item.h"
>  #include "xfs_inode_item.h"
> +#include "xfs_iunlink_item.h"
>  #include "xfs_ialloc.h"
>  #include "xfs_bmap.h"
>  #include "xfs_bmap_util.h"
> @@ -1923,69 +1924,6 @@ xfs_iunlink_update_bucket(
>  	return 0;
>  }
>  
> -/* Set an in-core inode's unlinked pointer and return the old value. */
> -static int
> -xfs_iunlink_log_inode(
> -	struct xfs_trans	*tp,
> -	struct xfs_inode	*ip,
> -	struct xfs_perag	*pag,
> -	xfs_agino_t		next_agino)
> -{
> -	struct xfs_mount	*mp = tp->t_mountp;
> -	struct xfs_dinode	*dip;
> -	struct xfs_buf		*ibp;
> -	xfs_agino_t		old_value;
> -	int			offset;
> -	int			error;
> -
> -	ASSERT(xfs_verify_agino_or_null(mp, pag->pag_agno, next_agino));
> -
> -	error = xfs_imap_to_bp(mp, tp, &ip->i_imap, &ibp);
> -	if (error)
> -		return error;
> -	dip = xfs_buf_offset(ibp, ip->i_imap.im_boffset);
> -
> -	/* Make sure the old pointer isn't garbage. */
> -	old_value = be32_to_cpu(dip->di_next_unlinked);
> -	if (old_value != ip->i_next_unlinked ||
> -	    !xfs_verify_agino_or_null(mp, pag->pag_agno, old_value)) {
> -		xfs_inode_verifier_error(ip, -EFSCORRUPTED, __func__, dip,
> -				sizeof(*dip), __this_address);
> -		error = -EFSCORRUPTED;
> -		goto out;
> -	}
> -
> -	/*
> -	 * Since we're updating a linked list, we should never find that the
> -	 * current pointer is the same as the new value, unless we're
> -	 * terminating the list.
> -	 */
> -	if (old_value == next_agino) {
> -		if (next_agino != NULLAGINO) {
> -			xfs_inode_verifier_error(ip, -EFSCORRUPTED, __func__,
> -					dip, sizeof(*dip), __this_address);
> -			error = -EFSCORRUPTED;
> -		}
> -		goto out;
> -	}
> -
> -	trace_xfs_iunlink_update_dinode(mp, pag->pag_agno,
> -			XFS_INO_TO_AGINO(mp, ip->i_ino),
> -			be32_to_cpu(dip->di_next_unlinked), next_agino);
> -
> -	dip->di_next_unlinked = cpu_to_be32(next_agino);
> -	offset = ip->i_imap.im_boffset +
> -			offsetof(struct xfs_dinode, di_next_unlinked);
> -
> -	xfs_dinode_calc_crc(mp, dip);
> -	xfs_trans_inode_buf(tp, ibp);
> -	xfs_trans_log_buf(tp, ibp, offset, offset + sizeof(xfs_agino_t) - 1);
> -	return 0;
> -out:
> -	xfs_trans_brelse(tp, ibp);
> -	return error;
> -}
> -
>  static int
>  xfs_iunlink_insert_inode(
>  	struct xfs_trans	*tp,
> diff --git a/fs/xfs/xfs_iunlink_item.c b/fs/xfs/xfs_iunlink_item.c
> new file mode 100644
> index 000000000000..fe38fc61f79e
> --- /dev/null
> +++ b/fs/xfs/xfs_iunlink_item.c
> @@ -0,0 +1,180 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (c) 2020, Red Hat, Inc.

2022?

> + * All Rights Reserved.
> + */
> +#include "xfs.h"
> +#include "xfs_fs.h"
> +#include "xfs_shared.h"
> +#include "xfs_format.h"
> +#include "xfs_log_format.h"
> +#include "xfs_trans_resv.h"
> +#include "xfs_mount.h"
> +#include "xfs_inode.h"
> +#include "xfs_trans.h"
> +#include "xfs_trans_priv.h"
> +#include "xfs_ag.h"
> +#include "xfs_iunlink_item.h"
> +#include "xfs_trace.h"
> +#include "xfs_error.h"
> +
> +struct kmem_cache	*xfs_iunlink_cache;
> +
> +static inline struct xfs_iunlink_item *IUL_ITEM(struct xfs_log_item *lip)
> +{
> +	return container_of(lip, struct xfs_iunlink_item, item);
> +}
> +
> +static void
> +xfs_iunlink_item_release(
> +	struct xfs_log_item	*lip)
> +{
> +	struct xfs_iunlink_item	*iup = IUL_ITEM(lip);
> +
> +	xfs_perag_put(iup->pag);
> +	kmem_cache_free(xfs_iunlink_cache, IUL_ITEM(lip));
> +}
> +
> +
> +static uint64_t
> +xfs_iunlink_item_sort(
> +	struct xfs_log_item	*lip)
> +{
> +	return IUL_ITEM(lip)->ip->i_ino;
> +}

Since you mentioned in-memory log items for dquots -- how should
iunlinks and dquot log items be sorted?

(On the off chance the dquot comment was made off the cuff and you don't
have a patchset ready to go in your dev tree -- I probably wouldn't have
said anything if this looked like the usual comparator function.)

> +
> +/*
> + * Look up the inode cluster buffer and log the on-disk unlinked inode change
> + * we need to make.
> + */
> +static int
> +xfs_iunlink_log_dinode(
> +	struct xfs_trans	*tp,
> +	struct xfs_iunlink_item	*iup)
> +{
> +	struct xfs_mount	*mp = tp->t_mountp;
> +	struct xfs_inode	*ip = iup->ip;
> +	struct xfs_dinode	*dip;
> +	struct xfs_buf		*ibp;
> +	int			offset;
> +	int			error;
> +
> +	error = xfs_imap_to_bp(mp, tp, &ip->i_imap, &ibp);
> +	if (error)
> +		return error;
> +	/*
> +	 * Don't log the unlinked field on stale buffers as this may be the
> +	 * transaction that frees the inode cluster and relogging the buffer
> +	 * here will incorrectly remove the stale state.
> +	 */
> +	if (ibp->b_flags & XBF_STALE)
> +		goto out;
> +
> +	dip = xfs_buf_offset(ibp, ip->i_imap.im_boffset);
> +
> +	/* Make sure the old pointer isn't garbage. */
> +	if (be32_to_cpu(dip->di_next_unlinked) != iup->old_agino) {
> +		xfs_inode_verifier_error(ip, -EFSCORRUPTED, __func__, dip,
> +				sizeof(*dip), __this_address);
> +		error = -EFSCORRUPTED;
> +		goto out;
> +	}
> +
> +	trace_xfs_iunlink_update_dinode(mp, iup->pag->pag_agno,
> +			XFS_INO_TO_AGINO(mp, ip->i_ino),
> +			be32_to_cpu(dip->di_next_unlinked), iup->next_agino);
> +
> +	dip->di_next_unlinked = cpu_to_be32(iup->next_agino);
> +	offset = ip->i_imap.im_boffset +
> +			offsetof(struct xfs_dinode, di_next_unlinked);
> +
> +	xfs_dinode_calc_crc(mp, dip);
> +	xfs_trans_inode_buf(tp, ibp);
> +	xfs_trans_log_buf(tp, ibp, offset, offset + sizeof(xfs_agino_t) - 1);
> +	return 0;
> +out:
> +	xfs_trans_brelse(tp, ibp);
> +	return error;
> +}
> +
> +/*
> + * On precommit, we grab the inode cluster buffer for the inode number we were
> + * passed, then update the next unlinked field for that inode in the buffer and
> + * log the buffer. This ensures that the inode cluster buffer was logged in the
> + * correct order w.r.t. other inode cluster buffers. We can then remove the
> + * iunlink item from the transaction and release it as it is has now served it's
> + * purpose.
> + */
> +static int
> +xfs_iunlink_item_precommit(
> +	struct xfs_trans	*tp,
> +	struct xfs_log_item	*lip)
> +{
> +	struct xfs_iunlink_item	*iup = IUL_ITEM(lip);
> +	int			error;
> +
> +	error = xfs_iunlink_log_dinode(tp, iup);

Hmm, so does this imply that log items can create new log items now?

(I /think/ this all looks ok.)

--D

> +	list_del(&lip->li_trans);
> +	xfs_iunlink_item_release(lip);
> +	return error;
> +}
> +
> +static const struct xfs_item_ops xfs_iunlink_item_ops = {
> +	.iop_release	= xfs_iunlink_item_release,
> +	.iop_sort	= xfs_iunlink_item_sort,
> +	.iop_precommit	= xfs_iunlink_item_precommit,
> +};
> +
> +
> +/*
> + * Initialize the inode log item for a newly allocated (in-core) inode.
> + *
> + * Inode extents can only reside within an AG. Hence specify the starting
> + * block for the inode chunk by offset within an AG as well as the
> + * length of the allocated extent.
> + *
> + * This joins the item to the transaction and marks it dirty so
> + * that we don't need a separate call to do this, nor does the
> + * caller need to know anything about the iunlink item.
> + */
> +int
> +xfs_iunlink_log_inode(
> +	struct xfs_trans	*tp,
> +	struct xfs_inode	*ip,
> +	struct xfs_perag	*pag,
> +	xfs_agino_t		next_agino)
> +{
> +	struct xfs_mount	*mp = tp->t_mountp;
> +	struct xfs_iunlink_item	*iup;
> +
> +	ASSERT(xfs_verify_agino_or_null(mp, pag->pag_agno, next_agino));
> +	ASSERT(xfs_verify_agino_or_null(mp, pag->pag_agno, ip->i_next_unlinked));
> +
> +	/*
> +	 * Since we're updating a linked list, we should never find that the
> +	 * current pointer is the same as the new value, unless we're
> +	 * terminating the list.
> +	 */
> +	if (ip->i_next_unlinked == next_agino) {
> +		if (next_agino != NULLAGINO)
> +			return -EFSCORRUPTED;
> +		return 0;
> +	}
> +
> +	iup = kmem_cache_zalloc(xfs_iunlink_cache, GFP_KERNEL | __GFP_NOFAIL);
> +	xfs_log_item_init(mp, &iup->item, XFS_LI_IUNLINK,
> +			  &xfs_iunlink_item_ops);
> +
> +	iup->ip = ip;
> +	iup->next_agino = next_agino;
> +	iup->old_agino = ip->i_next_unlinked;
> +
> +	atomic_inc(&pag->pag_ref);
> +	iup->pag = pag;
> +
> +	xfs_trans_add_item(tp, &iup->item);
> +	tp->t_flags |= XFS_TRANS_DIRTY;
> +	set_bit(XFS_LI_DIRTY, &iup->item.li_flags);
> +	return 0;
> +}
> +
> diff --git a/fs/xfs/xfs_iunlink_item.h b/fs/xfs/xfs_iunlink_item.h
> new file mode 100644
> index 000000000000..280dbf533989
> --- /dev/null
> +++ b/fs/xfs/xfs_iunlink_item.h
> @@ -0,0 +1,27 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (c) 2020, Red Hat, Inc.
> + * All Rights Reserved.
> + */
> +#ifndef XFS_IUNLINK_ITEM_H
> +#define XFS_IUNLINK_ITEM_H	1
> +
> +struct xfs_trans;
> +struct xfs_inode;
> +struct xfs_perag;
> +
> +/* in memory log item structure */
> +struct xfs_iunlink_item {
> +	struct xfs_log_item	item;
> +	struct xfs_inode	*ip;
> +	struct xfs_perag	*pag;
> +	xfs_agino_t		next_agino;
> +	xfs_agino_t		old_agino;
> +};
> +
> +extern struct kmem_cache *xfs_iunlink_cache;
> +
> +int xfs_iunlink_log_inode(struct xfs_trans *tp, struct xfs_inode *ip,
> +			struct xfs_perag *pag, xfs_agino_t next_agino);
> +
> +#endif	/* XFS_IUNLINK_ITEM_H */
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index ed18160e6181..a7e6b2d7686b 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -40,6 +40,7 @@
>  #include "xfs_defer.h"
>  #include "xfs_attr_item.h"
>  #include "xfs_xattr.h"
> +#include "xfs_iunlink_item.h"
>  
>  #include <linux/magic.h>
>  #include <linux/fs_context.h>
> @@ -2093,8 +2094,16 @@ xfs_init_caches(void)
>  	if (!xfs_attri_cache)
>  		goto out_destroy_attrd_cache;
>  
> +	xfs_iunlink_cache = kmem_cache_create("xfs_iul_item",
> +					     sizeof(struct xfs_iunlink_item),
> +					     0, 0, NULL);
> +	if (!xfs_iunlink_cache)
> +		goto out_destroy_attri_cache;
> +
>  	return 0;
>  
> + out_destroy_attri_cache:
> +	kmem_cache_destroy(xfs_attri_cache);
>   out_destroy_attrd_cache:
>  	kmem_cache_destroy(xfs_attrd_cache);
>   out_destroy_bui_cache:
> @@ -2145,6 +2154,7 @@ xfs_destroy_caches(void)
>  	 * destroy caches.
>  	 */
>  	rcu_barrier();
> +	kmem_cache_destroy(xfs_iunlink_cache);
>  	kmem_cache_destroy(xfs_attri_cache);
>  	kmem_cache_destroy(xfs_attrd_cache);
>  	kmem_cache_destroy(xfs_bui_cache);
> -- 
> 2.36.1
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 8/9] xfs: add log item precommit operation
  2022-06-29 21:16   ` Darrick J. Wong
@ 2022-06-29 21:34     ` Dave Chinner
  2022-06-29 21:42       ` Darrick J. Wong
  0 siblings, 1 reply; 35+ messages in thread
From: Dave Chinner @ 2022-06-29 21:34 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Jun 29, 2022 at 02:16:03PM -0700, Darrick J. Wong wrote:
> On Mon, Jun 27, 2022 at 10:43:35AM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > For inodes that are dirty, we have an attached cluster buffer that
> > we want to use to track the dirty inode through the AIL.
> > Unfortunately, locking the cluster buffer and adding it to the
> > transaction when the inode is first logged in a transaction leads to
> > buffer lock ordering inversions.
> > 
> > The specific problem is ordering against the AGI buffer. When
> > modifying unlinked lists, the buffer lock order is AGI -> inode
> > cluster buffer as the AGI buffer lock serialises all access to the
> > unlinked lists. Unfortunately, functionality like xfs_droplink()
> > logs the inode before calling xfs_iunlink(), as do various directory
> > manipulation functions. The inode can be logged way down in the
> > stack as far as the bmapi routines and hence, without a major
> > rewrite of lots of APIs there's no way we can avoid the inode being
> > logged by something until after the AGI has been logged.
> > 
> > As we are going to be using ordered buffers for inode AIL tracking,
> > there isn't a need to actually lock that buffer against modification
> > as all the modifications are captured by logging the inode item
> > itself. Hence we don't actually need to join the cluster buffer into
> > the transaction until just before it is committed. This means we do
> > not perturb any of the existing buffer lock orders in transactions,
> > and the inode cluster buffer is always locked last in a transaction
> > that doesn't otherwise touch inode cluster buffers.
> > 
> > We do this by introducing a precommit log item method. A log item
> > method is used because it is likely dquots will be moved to this
> > same ordered buffer tracking scheme and hence will need a similar
> 
> Oh?

Stale comment from the original series a couple of year ago, I
think.

> > diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> > index 82cf0189c0db..0acb31093d9f 100644
> > --- a/fs/xfs/xfs_trans.c
> > +++ b/fs/xfs/xfs_trans.c
> > @@ -844,6 +844,90 @@ xfs_trans_committed_bulk(
> >  	spin_unlock(&ailp->ail_lock);
> >  }
> >  
> > +/*
> > + * Sort transaction items prior to running precommit operations. This will
> > + * attempt to order the items such that they will always be locked in the same
> > + * order. Items that have no sort function are moved to the end of the list
> > + * and so are locked last (XXX: need to check the logic matches the comment).
> > + *
> > + * This may need refinement as different types of objects add sort functions.
> > + *
> > + * Function is more complex than it needs to be because we are comparing 64 bit
> > + * values and the function only returns 32 bit values.
> > + */
> > +static int
> > +xfs_trans_precommit_sort(
> > +	void			*unused_arg,
> > +	const struct list_head	*a,
> > +	const struct list_head	*b)
> > +{
> > +	struct xfs_log_item	*lia = container_of(a,
> > +					struct xfs_log_item, li_trans);
> > +	struct xfs_log_item	*lib = container_of(b,
> > +					struct xfs_log_item, li_trans);
> > +	int64_t			diff;
> > +
> > +	/*
> > +	 * If both items are non-sortable, leave them alone. If only one is
> > +	 * sortable, move the non-sortable item towards the end of the list.
> > +	 */
> > +	if (!lia->li_ops->iop_sort && !lib->li_ops->iop_sort)
> > +		return 0;
> > +	if (!lia->li_ops->iop_sort)
> > +		return 1;
> > +	if (!lib->li_ops->iop_sort)
> > +		return -1;
> > +
> > +	diff = lia->li_ops->iop_sort(lia) - lib->li_ops->iop_sort(lib);
> 
> I'm kinda surprised the iop_sort method doesn't take both log item
> pointers, like most sorting-comparator functions?  But I'll see, maybe
> you're doing something clever wrt ordering of log items of differing
> types, and hence the ->iop_sort implementations are required to return
> some absolute priority or something.

Nope, we have to order item locking based on an unchanging
characteristic of the object. log items can come and go, we want to
lock items in consistent ascending order, so it has to be based on
some kind of physical characteristic, like inode number, block
address, etc.

e.g. If all objects are ordered by the physical location, we naturally
get a lock order that can be applied sanely across differing object
types e.g. AG headers will naturally sort and lock before buffers
in the AG itself. e.g. inode cluster buffers for unlinked list
manipulations will always get locked after the AGI....

Cheers,

Dave.
> --D
> 
> > +	if (diff < 0)
> > +		return -1;
> > +	if (diff > 0)
> > +		return 1;
> > +	return 0;
> > +}
> > +
> > +/*
> > + * Run transaction precommit functions.
> > + *
> > + * If there is an error in any of the callouts, then stop immediately and
> > + * trigger a shutdown to abort the transaction. There is no recovery possible
> > + * from errors at this point as the transaction is dirty....
> > + */
> > +static int
> > +xfs_trans_run_precommits(
> > +	struct xfs_trans	*tp)
> > +{
> > +	struct xfs_mount	*mp = tp->t_mountp;
> > +	struct xfs_log_item	*lip, *n;
> > +	int			error = 0;
> > +
> > +	/*
> > +	 * Sort the item list to avoid ABBA deadlocks with other transactions
> > +	 * running precommit operations that lock multiple shared items such as
> > +	 * inode cluster buffers.
> > +	 */
> > +	list_sort(NULL, &tp->t_items, xfs_trans_precommit_sort);
> > +
> > +	/*
> > +	 * Precommit operations can remove the log item from the transaction
> > +	 * if the log item exists purely to delay modifications until they
> > +	 * can be ordered against other operations. Hence we have to use
> > +	 * list_for_each_entry_safe() here.
> > +	 */
> > +	list_for_each_entry_safe(lip, n, &tp->t_items, li_trans) {
> > +		if (!test_bit(XFS_LI_DIRTY, &lip->li_flags))
> > +			continue;
> > +		if (lip->li_ops->iop_precommit) {
> > +			error = lip->li_ops->iop_precommit(tp, lip);
> > +			if (error)
> > +				break;
> > +		}
> > +	}
> > +	if (error)
> > +		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> > +	return error;
> > +}
> > +
> >  /*
> >   * Commit the given transaction to the log.
> >   *
> > @@ -869,6 +953,13 @@ __xfs_trans_commit(
> >  
> >  	trace_xfs_trans_commit(tp, _RET_IP_);
> >  
> > +	error = xfs_trans_run_precommits(tp);
> > +	if (error) {
> > +		if (tp->t_flags & XFS_TRANS_PERM_LOG_RES)
> > +			xfs_defer_cancel(tp);
> > +		goto out_unreserve;
> > +	}
> > +
> >  	/*
> >  	 * Finish deferred items on final commit. Only permanent transactions
> >  	 * should ever have deferred ops.
> > diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> > index 9561f193e7e1..64062e3b788b 100644
> > --- a/fs/xfs/xfs_trans.h
> > +++ b/fs/xfs/xfs_trans.h
> > @@ -71,10 +71,12 @@ struct xfs_item_ops {
> >  	void (*iop_format)(struct xfs_log_item *, struct xfs_log_vec *);
> >  	void (*iop_pin)(struct xfs_log_item *);
> >  	void (*iop_unpin)(struct xfs_log_item *, int remove);
> > -	uint (*iop_push)(struct xfs_log_item *, struct list_head *);
> > +	uint64_t (*iop_sort)(struct xfs_log_item *lip);
> > +	int (*iop_precommit)(struct xfs_trans *tp, struct xfs_log_item *lip);
> >  	void (*iop_committing)(struct xfs_log_item *lip, xfs_csn_t seq);
> > -	void (*iop_release)(struct xfs_log_item *);
> >  	xfs_lsn_t (*iop_committed)(struct xfs_log_item *, xfs_lsn_t);
> > +	uint (*iop_push)(struct xfs_log_item *, struct list_head *);
> > +	void (*iop_release)(struct xfs_log_item *);
> 
> Why did these get moved?

<shrug> reasons lost in the mist of times. Probably because I wanted
the sort->precommit->committing->committed to be listed in the order
they are actually called by a transaciton commit?

-Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 9/9] xfs: add in-memory iunlink log item
  2022-06-29  7:25   ` Christoph Hellwig
@ 2022-06-29 21:37     ` Dave Chinner
  0 siblings, 0 replies; 35+ messages in thread
From: Dave Chinner @ 2022-06-29 21:37 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Wed, Jun 29, 2022 at 12:25:53AM -0700, Christoph Hellwig wrote:
> On Mon, Jun 27, 2022 at 10:43:36AM +1000, Dave Chinner wrote:
> > TO do this, we introduce a new in-memory log item to track the
> 
> s/To/To/
> 
> > This allows us to pass the xfs_inode to the transaction commit code
> > along with the modification to be made, and then order the logged
> > modifications via the ->iop_sort and ->iop_precommit operations
> > for the new log item type. As this is an in-memory log item, it
> > doesn't have formatting, CIL or AIL operational hooks - it exists
> > purely to run the inode unlink modifications and is then removed
> > from the transaction item list and freed once the precommit
> > operation has run.
> 
> Do we need to document the fact that we now have purely in-memory log
> items better somewhere?  I see how this works, but it its a little
> non-obvious.

We already have in-memory log items, just not explicitly named. That
is, the transaction delta structures for the superblock and dquot
modifications.

I eventually want to move them to in-memory log items to get all the
special case code for them out of the transaction commit path, and
then add in-memory log items for AGF and AGI header modifications so
that we can move away from AG header buffer locking to serialise AG
specific operations...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 8/9] xfs: add log item precommit operation
  2022-06-29 21:34     ` Dave Chinner
@ 2022-06-29 21:42       ` Darrick J. Wong
  2022-06-29 21:48         ` Dave Chinner
  0 siblings, 1 reply; 35+ messages in thread
From: Darrick J. Wong @ 2022-06-29 21:42 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Jun 30, 2022 at 07:34:37AM +1000, Dave Chinner wrote:
> On Wed, Jun 29, 2022 at 02:16:03PM -0700, Darrick J. Wong wrote:
> > On Mon, Jun 27, 2022 at 10:43:35AM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > For inodes that are dirty, we have an attached cluster buffer that
> > > we want to use to track the dirty inode through the AIL.
> > > Unfortunately, locking the cluster buffer and adding it to the
> > > transaction when the inode is first logged in a transaction leads to
> > > buffer lock ordering inversions.
> > > 
> > > The specific problem is ordering against the AGI buffer. When
> > > modifying unlinked lists, the buffer lock order is AGI -> inode
> > > cluster buffer as the AGI buffer lock serialises all access to the
> > > unlinked lists. Unfortunately, functionality like xfs_droplink()
> > > logs the inode before calling xfs_iunlink(), as do various directory
> > > manipulation functions. The inode can be logged way down in the
> > > stack as far as the bmapi routines and hence, without a major
> > > rewrite of lots of APIs there's no way we can avoid the inode being
> > > logged by something until after the AGI has been logged.
> > > 
> > > As we are going to be using ordered buffers for inode AIL tracking,
> > > there isn't a need to actually lock that buffer against modification
> > > as all the modifications are captured by logging the inode item
> > > itself. Hence we don't actually need to join the cluster buffer into
> > > the transaction until just before it is committed. This means we do
> > > not perturb any of the existing buffer lock orders in transactions,
> > > and the inode cluster buffer is always locked last in a transaction
> > > that doesn't otherwise touch inode cluster buffers.
> > > 
> > > We do this by introducing a precommit log item method. A log item
> > > method is used because it is likely dquots will be moved to this
> > > same ordered buffer tracking scheme and hence will need a similar
> > 
> > Oh?
> 
> Stale comment from the original series a couple of year ago, I
> think.

aha!

> > > diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> > > index 82cf0189c0db..0acb31093d9f 100644
> > > --- a/fs/xfs/xfs_trans.c
> > > +++ b/fs/xfs/xfs_trans.c
> > > @@ -844,6 +844,90 @@ xfs_trans_committed_bulk(
> > >  	spin_unlock(&ailp->ail_lock);
> > >  }
> > >  
> > > +/*
> > > + * Sort transaction items prior to running precommit operations. This will
> > > + * attempt to order the items such that they will always be locked in the same
> > > + * order. Items that have no sort function are moved to the end of the list
> > > + * and so are locked last (XXX: need to check the logic matches the comment).
> > > + *
> > > + * This may need refinement as different types of objects add sort functions.
> > > + *
> > > + * Function is more complex than it needs to be because we are comparing 64 bit
> > > + * values and the function only returns 32 bit values.
> > > + */
> > > +static int
> > > +xfs_trans_precommit_sort(
> > > +	void			*unused_arg,
> > > +	const struct list_head	*a,
> > > +	const struct list_head	*b)
> > > +{
> > > +	struct xfs_log_item	*lia = container_of(a,
> > > +					struct xfs_log_item, li_trans);
> > > +	struct xfs_log_item	*lib = container_of(b,
> > > +					struct xfs_log_item, li_trans);
> > > +	int64_t			diff;
> > > +
> > > +	/*
> > > +	 * If both items are non-sortable, leave them alone. If only one is
> > > +	 * sortable, move the non-sortable item towards the end of the list.
> > > +	 */
> > > +	if (!lia->li_ops->iop_sort && !lib->li_ops->iop_sort)
> > > +		return 0;
> > > +	if (!lia->li_ops->iop_sort)
> > > +		return 1;
> > > +	if (!lib->li_ops->iop_sort)
> > > +		return -1;
> > > +
> > > +	diff = lia->li_ops->iop_sort(lia) - lib->li_ops->iop_sort(lib);
> > 
> > I'm kinda surprised the iop_sort method doesn't take both log item
> > pointers, like most sorting-comparator functions?  But I'll see, maybe
> > you're doing something clever wrt ordering of log items of differing
> > types, and hence the ->iop_sort implementations are required to return
> > some absolute priority or something.
> 
> Nope, we have to order item locking based on an unchanging
> characteristic of the object. log items can come and go, we want to
> lock items in consistent ascending order, so it has to be based on
> some kind of physical characteristic, like inode number, block
> address, etc.
> 
> e.g. If all objects are ordered by the physical location, we naturally
> get a lock order that can be applied sanely across differing object
> types e.g. AG headers will naturally sort and lock before buffers
> in the AG itself. e.g. inode cluster buffers for unlinked list
> manipulations will always get locked after the AGI....

<nod> So if (say) we were going to add dquots to this scheme, we'd
probably want to shift all the iop_sort functions to return (say) the
xfs_daddr_t of the associated item?

(Practically speaking, I don't know that I'd want to tie things down to
the disk address quite this soon, and since it's all incore code anyway
I don't think the precise type of the return values matter.)

Anyway, I'm satisfied--
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> Cheers,
> 
> Dave.
> > --D
> > 
> > > +	if (diff < 0)
> > > +		return -1;
> > > +	if (diff > 0)
> > > +		return 1;
> > > +	return 0;
> > > +}
> > > +
> > > +/*
> > > + * Run transaction precommit functions.
> > > + *
> > > + * If there is an error in any of the callouts, then stop immediately and
> > > + * trigger a shutdown to abort the transaction. There is no recovery possible
> > > + * from errors at this point as the transaction is dirty....
> > > + */
> > > +static int
> > > +xfs_trans_run_precommits(
> > > +	struct xfs_trans	*tp)
> > > +{
> > > +	struct xfs_mount	*mp = tp->t_mountp;
> > > +	struct xfs_log_item	*lip, *n;
> > > +	int			error = 0;
> > > +
> > > +	/*
> > > +	 * Sort the item list to avoid ABBA deadlocks with other transactions
> > > +	 * running precommit operations that lock multiple shared items such as
> > > +	 * inode cluster buffers.
> > > +	 */
> > > +	list_sort(NULL, &tp->t_items, xfs_trans_precommit_sort);
> > > +
> > > +	/*
> > > +	 * Precommit operations can remove the log item from the transaction
> > > +	 * if the log item exists purely to delay modifications until they
> > > +	 * can be ordered against other operations. Hence we have to use
> > > +	 * list_for_each_entry_safe() here.
> > > +	 */
> > > +	list_for_each_entry_safe(lip, n, &tp->t_items, li_trans) {
> > > +		if (!test_bit(XFS_LI_DIRTY, &lip->li_flags))
> > > +			continue;
> > > +		if (lip->li_ops->iop_precommit) {
> > > +			error = lip->li_ops->iop_precommit(tp, lip);
> > > +			if (error)
> > > +				break;
> > > +		}
> > > +	}
> > > +	if (error)
> > > +		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> > > +	return error;
> > > +}
> > > +
> > >  /*
> > >   * Commit the given transaction to the log.
> > >   *
> > > @@ -869,6 +953,13 @@ __xfs_trans_commit(
> > >  
> > >  	trace_xfs_trans_commit(tp, _RET_IP_);
> > >  
> > > +	error = xfs_trans_run_precommits(tp);
> > > +	if (error) {
> > > +		if (tp->t_flags & XFS_TRANS_PERM_LOG_RES)
> > > +			xfs_defer_cancel(tp);
> > > +		goto out_unreserve;
> > > +	}
> > > +
> > >  	/*
> > >  	 * Finish deferred items on final commit. Only permanent transactions
> > >  	 * should ever have deferred ops.
> > > diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> > > index 9561f193e7e1..64062e3b788b 100644
> > > --- a/fs/xfs/xfs_trans.h
> > > +++ b/fs/xfs/xfs_trans.h
> > > @@ -71,10 +71,12 @@ struct xfs_item_ops {
> > >  	void (*iop_format)(struct xfs_log_item *, struct xfs_log_vec *);
> > >  	void (*iop_pin)(struct xfs_log_item *);
> > >  	void (*iop_unpin)(struct xfs_log_item *, int remove);
> > > -	uint (*iop_push)(struct xfs_log_item *, struct list_head *);
> > > +	uint64_t (*iop_sort)(struct xfs_log_item *lip);
> > > +	int (*iop_precommit)(struct xfs_trans *tp, struct xfs_log_item *lip);
> > >  	void (*iop_committing)(struct xfs_log_item *lip, xfs_csn_t seq);
> > > -	void (*iop_release)(struct xfs_log_item *);
> > >  	xfs_lsn_t (*iop_committed)(struct xfs_log_item *, xfs_lsn_t);
> > > +	uint (*iop_push)(struct xfs_log_item *, struct list_head *);
> > > +	void (*iop_release)(struct xfs_log_item *);
> > 
> > Why did these get moved?
> 
> <shrug> reasons lost in the mist of times. Probably because I wanted
> the sort->precommit->committing->committed to be listed in the order
> they are actually called by a transaciton commit?
> 
> -Dave.
> 
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 9/9] xfs: add in-memory iunlink log item
  2022-06-29 21:21   ` Darrick J. Wong
@ 2022-06-29 21:44     ` Dave Chinner
  2022-06-29 21:49       ` Darrick J. Wong
  0 siblings, 1 reply; 35+ messages in thread
From: Dave Chinner @ 2022-06-29 21:44 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Jun 29, 2022 at 02:21:30PM -0700, Darrick J. Wong wrote:
> On Mon, Jun 27, 2022 at 10:43:36AM +1000, Dave Chinner wrote:
> > diff --git a/fs/xfs/xfs_iunlink_item.c b/fs/xfs/xfs_iunlink_item.c
> > new file mode 100644
> > index 000000000000..fe38fc61f79e
> > --- /dev/null
> > +++ b/fs/xfs/xfs_iunlink_item.c
> > @@ -0,0 +1,180 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * Copyright (c) 2020, Red Hat, Inc.
> 
> 2022?

2020 is correct - that's when I originally wrote this and first published it.

> > + * All Rights Reserved.
> > + */
> > +#include "xfs.h"
> > +#include "xfs_fs.h"
> > +#include "xfs_shared.h"
> > +#include "xfs_format.h"
> > +#include "xfs_log_format.h"
> > +#include "xfs_trans_resv.h"
> > +#include "xfs_mount.h"
> > +#include "xfs_inode.h"
> > +#include "xfs_trans.h"
> > +#include "xfs_trans_priv.h"
> > +#include "xfs_ag.h"
> > +#include "xfs_iunlink_item.h"
> > +#include "xfs_trace.h"
> > +#include "xfs_error.h"
> > +
> > +struct kmem_cache	*xfs_iunlink_cache;
> > +
> > +static inline struct xfs_iunlink_item *IUL_ITEM(struct xfs_log_item *lip)
> > +{
> > +	return container_of(lip, struct xfs_iunlink_item, item);
> > +}
> > +
> > +static void
> > +xfs_iunlink_item_release(
> > +	struct xfs_log_item	*lip)
> > +{
> > +	struct xfs_iunlink_item	*iup = IUL_ITEM(lip);
> > +
> > +	xfs_perag_put(iup->pag);
> > +	kmem_cache_free(xfs_iunlink_cache, IUL_ITEM(lip));
> > +}
> > +
> > +
> > +static uint64_t
> > +xfs_iunlink_item_sort(
> > +	struct xfs_log_item	*lip)
> > +{
> > +	return IUL_ITEM(lip)->ip->i_ino;
> > +}
> 
> Since you mentioned in-memory log items for dquots -- how should
> iunlinks and dquot log items be sorted?

ip->i_ino is the physical location of the inode - I'd use the
physical location of the dquot buffer if that was being logged.

> (On the off chance the dquot comment was made off the cuff and you don't
> have a patchset ready to go in your dev tree -- I probably wouldn't have
> said anything if this looked like the usual comparator function.)

No, there's nothing coming down the line for dquots right now.

> > +/*
> > + * On precommit, we grab the inode cluster buffer for the inode number we were
> > + * passed, then update the next unlinked field for that inode in the buffer and
> > + * log the buffer. This ensures that the inode cluster buffer was logged in the
> > + * correct order w.r.t. other inode cluster buffers. We can then remove the
> > + * iunlink item from the transaction and release it as it is has now served it's
> > + * purpose.
> > + */
> > +static int
> > +xfs_iunlink_item_precommit(
> > +	struct xfs_trans	*tp,
> > +	struct xfs_log_item	*lip)
> > +{
> > +	struct xfs_iunlink_item	*iup = IUL_ITEM(lip);
> > +	int			error;
> > +
> > +	error = xfs_iunlink_log_dinode(tp, iup);
> 
> Hmm, so does this imply that log items can create new log items now?

Yup, now it's been sorted, we can lock the buffer, modify the
unlinked list and log the buffer, adding the new buffer log item to
the transaction.

That's the whole point of the in-memory log item - it records the
change to be made, then delays the physical change until it is safe
to lock the object we need to change.

This minimises the length of time we have to hold the object locked
during a transaction by dissociating the in-memory change from the
on-disk format changes. I plan to use this technique a lot more in
future...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 8/9] xfs: add log item precommit operation
  2022-06-29 21:42       ` Darrick J. Wong
@ 2022-06-29 21:48         ` Dave Chinner
  2022-07-07  2:34           ` Darrick J. Wong
  0 siblings, 1 reply; 35+ messages in thread
From: Dave Chinner @ 2022-06-29 21:48 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Jun 29, 2022 at 02:42:43PM -0700, Darrick J. Wong wrote:
> On Thu, Jun 30, 2022 at 07:34:37AM +1000, Dave Chinner wrote:
> > On Wed, Jun 29, 2022 at 02:16:03PM -0700, Darrick J. Wong wrote:
> > > On Mon, Jun 27, 2022 at 10:43:35AM +1000, Dave Chinner wrote:
> > > > diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> > > > index 82cf0189c0db..0acb31093d9f 100644
> > > > --- a/fs/xfs/xfs_trans.c
> > > > +++ b/fs/xfs/xfs_trans.c
> > > > @@ -844,6 +844,90 @@ xfs_trans_committed_bulk(
> > > >  	spin_unlock(&ailp->ail_lock);
> > > >  }
> > > >  
> > > > +/*
> > > > + * Sort transaction items prior to running precommit operations. This will
> > > > + * attempt to order the items such that they will always be locked in the same
> > > > + * order. Items that have no sort function are moved to the end of the list
> > > > + * and so are locked last (XXX: need to check the logic matches the comment).
> > > > + *
> > > > + * This may need refinement as different types of objects add sort functions.
> > > > + *
> > > > + * Function is more complex than it needs to be because we are comparing 64 bit
> > > > + * values and the function only returns 32 bit values.
> > > > + */
> > > > +static int
> > > > +xfs_trans_precommit_sort(
> > > > +	void			*unused_arg,
> > > > +	const struct list_head	*a,
> > > > +	const struct list_head	*b)
> > > > +{
> > > > +	struct xfs_log_item	*lia = container_of(a,
> > > > +					struct xfs_log_item, li_trans);
> > > > +	struct xfs_log_item	*lib = container_of(b,
> > > > +					struct xfs_log_item, li_trans);
> > > > +	int64_t			diff;
> > > > +
> > > > +	/*
> > > > +	 * If both items are non-sortable, leave them alone. If only one is
> > > > +	 * sortable, move the non-sortable item towards the end of the list.
> > > > +	 */
> > > > +	if (!lia->li_ops->iop_sort && !lib->li_ops->iop_sort)
> > > > +		return 0;
> > > > +	if (!lia->li_ops->iop_sort)
> > > > +		return 1;
> > > > +	if (!lib->li_ops->iop_sort)
> > > > +		return -1;
> > > > +
> > > > +	diff = lia->li_ops->iop_sort(lia) - lib->li_ops->iop_sort(lib);
> > > 
> > > I'm kinda surprised the iop_sort method doesn't take both log item
> > > pointers, like most sorting-comparator functions?  But I'll see, maybe
> > > you're doing something clever wrt ordering of log items of differing
> > > types, and hence the ->iop_sort implementations are required to return
> > > some absolute priority or something.
> > 
> > Nope, we have to order item locking based on an unchanging
> > characteristic of the object. log items can come and go, we want to
> > lock items in consistent ascending order, so it has to be based on
> > some kind of physical characteristic, like inode number, block
> > address, etc.
> > 
> > e.g. If all objects are ordered by the physical location, we naturally
> > get a lock order that can be applied sanely across differing object
> > types e.g. AG headers will naturally sort and lock before buffers
> > in the AG itself. e.g. inode cluster buffers for unlinked list
> > manipulations will always get locked after the AGI....
> 
> <nod> So if (say) we were going to add dquots to this scheme, we'd
> probably want to shift all the iop_sort functions to return (say) the
> xfs_daddr_t of the associated item?

No, I don't want to tie it specifically to physical address.
ip->i_ino is not a daddr, but it would order just fine against one.

> (Practically speaking, I don't know that I'd want to tie things down to
> the disk address quite this soon, and since it's all incore code anyway
> I don't think the precise type of the return values matter.)

Indeed, I don't even want to tie this specifically to a 64 bit
value; it's intended that objects will return a sort key, and as
we add more object types we'll have to think harder about the
specific key values we use.

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 9/9] xfs: add in-memory iunlink log item
  2022-06-29 21:44     ` Dave Chinner
@ 2022-06-29 21:49       ` Darrick J. Wong
  0 siblings, 0 replies; 35+ messages in thread
From: Darrick J. Wong @ 2022-06-29 21:49 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Jun 30, 2022 at 07:44:58AM +1000, Dave Chinner wrote:
> On Wed, Jun 29, 2022 at 02:21:30PM -0700, Darrick J. Wong wrote:
> > On Mon, Jun 27, 2022 at 10:43:36AM +1000, Dave Chinner wrote:
> > > diff --git a/fs/xfs/xfs_iunlink_item.c b/fs/xfs/xfs_iunlink_item.c
> > > new file mode 100644
> > > index 000000000000..fe38fc61f79e
> > > --- /dev/null
> > > +++ b/fs/xfs/xfs_iunlink_item.c
> > > @@ -0,0 +1,180 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +/*
> > > + * Copyright (c) 2020, Red Hat, Inc.
> > 
> > 2022?
> 
> 2020 is correct - that's when I originally wrote this and first published it.
> 
> > > + * All Rights Reserved.
> > > + */
> > > +#include "xfs.h"
> > > +#include "xfs_fs.h"
> > > +#include "xfs_shared.h"
> > > +#include "xfs_format.h"
> > > +#include "xfs_log_format.h"
> > > +#include "xfs_trans_resv.h"
> > > +#include "xfs_mount.h"
> > > +#include "xfs_inode.h"
> > > +#include "xfs_trans.h"
> > > +#include "xfs_trans_priv.h"
> > > +#include "xfs_ag.h"
> > > +#include "xfs_iunlink_item.h"
> > > +#include "xfs_trace.h"
> > > +#include "xfs_error.h"
> > > +
> > > +struct kmem_cache	*xfs_iunlink_cache;
> > > +
> > > +static inline struct xfs_iunlink_item *IUL_ITEM(struct xfs_log_item *lip)
> > > +{
> > > +	return container_of(lip, struct xfs_iunlink_item, item);
> > > +}
> > > +
> > > +static void
> > > +xfs_iunlink_item_release(
> > > +	struct xfs_log_item	*lip)
> > > +{
> > > +	struct xfs_iunlink_item	*iup = IUL_ITEM(lip);
> > > +
> > > +	xfs_perag_put(iup->pag);
> > > +	kmem_cache_free(xfs_iunlink_cache, IUL_ITEM(lip));
> > > +}
> > > +
> > > +
> > > +static uint64_t
> > > +xfs_iunlink_item_sort(
> > > +	struct xfs_log_item	*lip)
> > > +{
> > > +	return IUL_ITEM(lip)->ip->i_ino;
> > > +}
> > 
> > Since you mentioned in-memory log items for dquots -- how should
> > iunlinks and dquot log items be sorted?
> 
> ip->i_ino is the physical location of the inode - I'd use the
> physical location of the dquot buffer if that was being logged.
> 
> > (On the off chance the dquot comment was made off the cuff and you don't
> > have a patchset ready to go in your dev tree -- I probably wouldn't have
> > said anything if this looked like the usual comparator function.)
> 
> No, there's nothing coming down the line for dquots right now.

Ok.

> > > +/*
> > > + * On precommit, we grab the inode cluster buffer for the inode number we were
> > > + * passed, then update the next unlinked field for that inode in the buffer and
> > > + * log the buffer. This ensures that the inode cluster buffer was logged in the
> > > + * correct order w.r.t. other inode cluster buffers. We can then remove the
> > > + * iunlink item from the transaction and release it as it is has now served it's
> > > + * purpose.
> > > + */
> > > +static int
> > > +xfs_iunlink_item_precommit(
> > > +	struct xfs_trans	*tp,
> > > +	struct xfs_log_item	*lip)
> > > +{
> > > +	struct xfs_iunlink_item	*iup = IUL_ITEM(lip);
> > > +	int			error;
> > > +
> > > +	error = xfs_iunlink_log_dinode(tp, iup);
> > 
> > Hmm, so does this imply that log items can create new log items now?
> 
> Yup, now it's been sorted, we can lock the buffer, modify the
> unlinked list and log the buffer, adding the new buffer log item to
> the transaction.
> 
> That's the whole point of the in-memory log item - it records the
> change to be made, then delays the physical change until it is safe
> to lock the object we need to change.

Wheeee :)

> This minimises the length of time we have to hold the object locked
> during a transaction by dissociating the in-memory change from the
> on-disk format changes. I plan to use this technique a lot more in
> future...

Cool.  All I can think of is matrixpeople transmogrifying into Agent
Smiths :P

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 8/9] xfs: add log item precommit operation
  2022-06-29 21:48         ` Dave Chinner
@ 2022-07-07  2:34           ` Darrick J. Wong
  0 siblings, 0 replies; 35+ messages in thread
From: Darrick J. Wong @ 2022-07-07  2:34 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Jun 30, 2022 at 07:48:21AM +1000, Dave Chinner wrote:
> On Wed, Jun 29, 2022 at 02:42:43PM -0700, Darrick J. Wong wrote:
> > On Thu, Jun 30, 2022 at 07:34:37AM +1000, Dave Chinner wrote:
> > > On Wed, Jun 29, 2022 at 02:16:03PM -0700, Darrick J. Wong wrote:
> > > > On Mon, Jun 27, 2022 at 10:43:35AM +1000, Dave Chinner wrote:
> > > > > diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> > > > > index 82cf0189c0db..0acb31093d9f 100644
> > > > > --- a/fs/xfs/xfs_trans.c
> > > > > +++ b/fs/xfs/xfs_trans.c
> > > > > @@ -844,6 +844,90 @@ xfs_trans_committed_bulk(
> > > > >  	spin_unlock(&ailp->ail_lock);
> > > > >  }
> > > > >  
> > > > > +/*
> > > > > + * Sort transaction items prior to running precommit operations. This will
> > > > > + * attempt to order the items such that they will always be locked in the same
> > > > > + * order. Items that have no sort function are moved to the end of the list
> > > > > + * and so are locked last (XXX: need to check the logic matches the comment).
> > > > > + *
> > > > > + * This may need refinement as different types of objects add sort functions.
> > > > > + *
> > > > > + * Function is more complex than it needs to be because we are comparing 64 bit
> > > > > + * values and the function only returns 32 bit values.
> > > > > + */
> > > > > +static int
> > > > > +xfs_trans_precommit_sort(
> > > > > +	void			*unused_arg,
> > > > > +	const struct list_head	*a,
> > > > > +	const struct list_head	*b)
> > > > > +{
> > > > > +	struct xfs_log_item	*lia = container_of(a,
> > > > > +					struct xfs_log_item, li_trans);
> > > > > +	struct xfs_log_item	*lib = container_of(b,
> > > > > +					struct xfs_log_item, li_trans);
> > > > > +	int64_t			diff;
> > > > > +
> > > > > +	/*
> > > > > +	 * If both items are non-sortable, leave them alone. If only one is
> > > > > +	 * sortable, move the non-sortable item towards the end of the list.
> > > > > +	 */
> > > > > +	if (!lia->li_ops->iop_sort && !lib->li_ops->iop_sort)
> > > > > +		return 0;
> > > > > +	if (!lia->li_ops->iop_sort)
> > > > > +		return 1;
> > > > > +	if (!lib->li_ops->iop_sort)
> > > > > +		return -1;
> > > > > +
> > > > > +	diff = lia->li_ops->iop_sort(lia) - lib->li_ops->iop_sort(lib);
> > > > 
> > > > I'm kinda surprised the iop_sort method doesn't take both log item
> > > > pointers, like most sorting-comparator functions?  But I'll see, maybe
> > > > you're doing something clever wrt ordering of log items of differing
> > > > types, and hence the ->iop_sort implementations are required to return
> > > > some absolute priority or something.
> > > 
> > > Nope, we have to order item locking based on an unchanging
> > > characteristic of the object. log items can come and go, we want to
> > > lock items in consistent ascending order, so it has to be based on
> > > some kind of physical characteristic, like inode number, block
> > > address, etc.
> > > 
> > > e.g. If all objects are ordered by the physical location, we naturally
> > > get a lock order that can be applied sanely across differing object
> > > types e.g. AG headers will naturally sort and lock before buffers
> > > in the AG itself. e.g. inode cluster buffers for unlinked list
> > > manipulations will always get locked after the AGI....
> > 
> > <nod> So if (say) we were going to add dquots to this scheme, we'd
> > probably want to shift all the iop_sort functions to return (say) the
> > xfs_daddr_t of the associated item?
> 
> No, I don't want to tie it specifically to physical address.
> ip->i_ino is not a daddr, but it would order just fine against one.
> 
> > (Practically speaking, I don't know that I'd want to tie things down to
> > the disk address quite this soon, and since it's all incore code anyway
> > I don't think the precise type of the return values matter.)
> 
> Indeed, I don't even want to tie this specifically to a 64 bit
> value; it's intended that objects will return a sort key, and as
> we add more object types we'll have to think harder about the
> specific key values we use.

Ok.  As there currently are no other ->iop_sort implementations and the
return values aren't encoded on disk, I'll go along with this:

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

That just leaves the hch's comments scattered throughout, and my own
comments about patch 3 and 4.  Any chance this series will get a new rev
in time for 5.20?

--D

> Cheers,
> 
> Dave.
> 
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2022-07-07  2:34 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-27  0:43 [PATCH 0/9 v3] xfs: in-memory iunlink items Dave Chinner
2022-06-27  0:43 ` [PATCH 1/9] xfs: factor the xfs_iunlink functions Dave Chinner
2022-06-29  7:05   ` Christoph Hellwig
2022-06-29 20:48   ` Darrick J. Wong
2022-06-27  0:43 ` [PATCH 2/9] xfs: track the iunlink list pointer in the xfs_inode Dave Chinner
2022-06-29  7:08   ` Christoph Hellwig
2022-06-29 20:50   ` Darrick J. Wong
2022-06-27  0:43 ` [PATCH 3/9] xfs: refactor xlog_recover_process_iunlinks() Dave Chinner
2022-06-29  7:12   ` Christoph Hellwig
2022-06-29 20:56   ` Darrick J. Wong
2022-06-27  0:43 ` [PATCH 4/9] xfs: introduce xfs_iunlink_lookup Dave Chinner
2022-06-29  7:17   ` Christoph Hellwig
2022-06-29 21:01   ` Darrick J. Wong
2022-06-27  0:43 ` [PATCH 5/9] xfs: double link the unlinked inode list Dave Chinner
2022-06-29  7:20   ` Christoph Hellwig
2022-06-29 20:02     ` Darrick J. Wong
2022-06-29 21:06   ` Darrick J. Wong
2022-06-27  0:43 ` [PATCH 6/9] xfs: clean up xfs_iunlink_update_inode() Dave Chinner
2022-06-29  7:21   ` Christoph Hellwig
2022-06-29 21:07   ` Darrick J. Wong
2022-06-27  0:43 ` [PATCH 7/9] xfs: combine iunlink inode update functions Dave Chinner
2022-06-29  7:21   ` Christoph Hellwig
2022-06-29 21:07   ` Darrick J. Wong
2022-06-27  0:43 ` [PATCH 8/9] xfs: add log item precommit operation Dave Chinner
2022-06-29 21:16   ` Darrick J. Wong
2022-06-29 21:34     ` Dave Chinner
2022-06-29 21:42       ` Darrick J. Wong
2022-06-29 21:48         ` Dave Chinner
2022-07-07  2:34           ` Darrick J. Wong
2022-06-27  0:43 ` [PATCH 9/9] xfs: add in-memory iunlink log item Dave Chinner
2022-06-29  7:25   ` Christoph Hellwig
2022-06-29 21:37     ` Dave Chinner
2022-06-29 21:21   ` Darrick J. Wong
2022-06-29 21:44     ` Dave Chinner
2022-06-29 21:49       ` Darrick J. Wong

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.