linux-xfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/4] [RFC] xfs: in memory inode unlink log items
@ 2020-06-23  9:50 Dave Chinner
  2020-06-23  9:50 ` [PATCH 1/4] xfs: xfs_iflock is no longer a completion Dave Chinner
                   ` (3 more replies)
  0 siblings, 4 replies; 22+ messages in thread
From: Dave Chinner @ 2020-06-23  9:50 UTC (permalink / raw)
  To: linux-xfs

Hi folks,

Inode cluster buffer pinning by dirty inodes allows us to improve
dirty inode tracking efficiency in the log by logging the inode
cluster buffer as an ordered transaction. However, this brings with
it some new issues, namely the order in which we lock  inode cluster
buffers.

That is, transactions that dirty and commit multiple inodes in a
transaction will now need to locking multiple inode cluster buffers
in each transaction (e.g. create, rename, etc). This introduces new 
lock ordering constraints in these operations. It also introduces
lock ordering constraints between the AGI and inode cluster buffers
as a result of allocation/freeing being serialised by the AGI
buffer lock. And then there is unlinked inode list logging, which
currently has no fixed order of inode cluster buffer locking.

It's a bit messy.

Locking pure inode modifications in order is relatively easy. We
don't actually need to attach and log the buffer to the transaction
until the last moment. We have all the inodes locked, so nothing
other than unlinked inode list modification can race with the
transaction modifying inodes. Hence we can safely move the
attachment of the inodes to the cluster buffer from when we first
dirty them in xfs_trans_log_inode to just before we commit the
transaction.

At this point, all the inodes that have been dirtied in the
transaction have already been locked, modified, logged and attached
to the transaction. Hence if we add a hook into xfs_trans_commit()
to run a "precommit" operation on these log items, we can use this
operation to attach the inodes to the cluster buffer at commit time
instead of in xfs_trans_log_inode().

This, by itself, doesn't solve the lock ordering problem. What it
does do, however, is give us a place where we can -order- all the
dirty items in the transaction list. Hence before we call the
precommit operation on each log item, we sort them. This allows us
to sort all the inode items so that the pre-commit functions that
locks and logs the cluster buffers are run in a deterministic order.
This solves the lock order problem for pure inode modifications.

The unlinked inode list buffer locking is more complex. The unlinked
list is unordered - we add to the tail, remove from where-ever the
inode is in the list. Hence we might need to lock two inode buffers
here (previous inode in list and the one being removed). While we
can order the locking of these buffers correctly within the confines
of the unlinked list, there may be other inodes that need buffer
locking in the same transaction. e.g. O_TMPFILE being linked into a
directory also modifies the directory inode.

Hence we need a mechanism for defering unlinked inode list updates
to the pre-commit operation where it can be sorted into the correct
order. We can do this by first observing that we serialise unlinked
list modifications by holding the AGI buffer lock. IOWs, the AGI is
going to be locked until the transaction commits any time we modify
the unlinked list. Hence it doesn't matter when in the transaction
we actually load, lock and modify the inode cluster buffer.

IOWs, what we need is an unlinked inode log item to defer the inode
cluster buffer update to transaction commit time where it can be
ordered with all the other inode cluster operations. Essentially all
we need to do is record the inodes that need to have their unlinked
list pointer updated in a new log item that we attached to the
transaction.

This log item exists purely for the purpose of delaying the update
of the unlinked list pointer until the inode cluster buffer can be
locked in the correct order around the other inode cluster buffers.
It plays no part in the actual commit, and there's no change to
anything that is written to the log. i.e. the inode cluster buffers
still have to be fully logged here (not just ordered) as log
recovery depedends on this to replay mods to the unlinked inode
list.

To make this unlinked inode list processing simpler and easier to
implement as a log item, we need to change the way we track the
unlinked list in memory. Starting from the observation that an inode
on the unlinked list is pinned in memory by the VFS, we can use the
xfs_inode itself to track the unlinked list. To do this efficiently,
we want the unlinked list to be a double linked list. The current
implementation takes the approach of minimising the memory footprint
of this list in case we don't want to burn 16 bytes of memory per
inode for a largely unused list head. [*]

We can get this down to 8 bytes per inode because the unlinked list
is per-ag, and hence we only need to store the agino portion of the
inode number as list pointers. We can then use these for lockless
inode cache lookups to retreive the inode. The aginos in the inode
are modified only under the AGI lock, just like the cluster buffer
pointers, so we don't need any extra locking here.  The
i_next_unlinked field tracks the on-disk value of the unlinked list,
and the i_prev_unlinked is a purely in-memory pointer that enables
us to efficiently remove inodes from the middle of the list.

IOWs, we burn a bit more CPU to resolve the unlinked list pointers
to save 8 bytes of memory per inode. If we decide that 8 bytes of
memory isn't a big code, we can convert this to a list_head and just
link the inodes directly to a unlinked list head in the perag.[**]

This gets rid of the entire unlinked list reference hash table that
is used to track this back pointer relationship, greatly simplifying
the unlinked list modification code.

Comments, flames, thoughts all welcome.

-Dave.

[*] An in-memory double linked list removes the need for keeping
lists short to minimise previous inode lookup overhead when removing
from the list. The current backref hash has this function, but it's
not obvious that it can do this and it's a kinda complex way of
implementing a double linked list.

Once we've removed the need for keeping the lists short, we no
longer need the on-disk hash for unlinked lists, so we can put all
the inodes in a single list....

[**] A single unlinked list in the per-ag then leads to a mutex in
the per-ag to protect the list, removing the AGI lock from needing
to be held to modify the unlinked list unless the head of the list
is being modified. We can then add to the tail of the list instead
of the head, hence largely removing the AGI from the unlinked list
processing entirely when there is more than one inode on the
unlinked list.[***]

This is another advantage of moving to single unlinked list - we are
much more likely to have multiple inodes on a single unlinked list
than when they are spread across 64 lists. Hence we are more likely
to be able to elide AGI locking for the unlinked list modifications
the more pressure we put on the unlinked list...

[***] Taking the AGI out of the unlinked list processing means the
only thing it "protects" is the contents of the AGI itself. This is
basically updating accounting and tracking btree root pointers. We
could add another in-memory log item for AGI updates such that the
AGI only needs to be locked, updated and logged in the precommit
function, greatly reducing the time it spends locked for inode
unlink processing [*^4. This will improve performance of inode
alloc/freeing on AG constrained filessytems as we spend less time
serialising on the AGI lock.....

[*^4] This is how superblock updates work, except it's not by a
generic in-memory SB log item - the changes to accounting are stored
directly in the struct xfs_trans as deltas and then applied in
xfs_trans_commit() via xfs_trans_apply_sb_deltas() which locks,
applies and logs the superblock buffer. This could be converted to a
precommit operation, too. [*^5]

Note that this superblock locking is elided for the freespace and
inode accounting when lazy superblock updates are enabled. This
prevents the superblock buffer lock for transactional accounting
update from being a major global contention point.

[*^5] dquots also use a delta accounting structure hard coded into
the struct xfs_trans - the xfs_dquot_acct structure. This gets
allocated when dquot modifications are reserved, and then updated
with each quota modification that is made in the transaction.

Then, in xfs_trans_commit(), it calls xfs_trans_apply_dquot_deltas()
which then orders the locking of the dquots correct, reads, loads
and locks the dquots, modifies the in-memory on-disk dquots and logs
them. This could also be converted to pre-commit operations. [*^6]

[*^6] It should be obvious by now that the pattern of "pre-commit
processing" for "delayed object modification" is not a new idea.
It's been in the code for 25-odd years and copy-pasta'd through the
ages as needed. It's never been turned into a useful, formalised
infrastructure mechanism - that's what this patchset starts us down
the path of. It kinda reminds me of the btree infrastructure
abstraction I did years ago to get rid fo the the 15,000 lines of
copy-pastad btree code and set us on the path to the (relatively)
easy addition of more btrees....



Dave Chinner (4):
  xfs: xfs_iflock is no longer a completion
  xfs: add log item precommit operation
  xfs: track unlinked inodes in core inode
  xfs: introduce inode unlink log item

 fs/xfs/Makefile               |   1 +
 fs/xfs/libxfs/xfs_inode_buf.c |   1 +
 fs/xfs/xfs_icache.c           |  22 +-
 fs/xfs/xfs_inode.c            | 719 +++++++++-------------------------
 fs/xfs/xfs_inode.h            |  38 +-
 fs/xfs/xfs_inode_item.c       |   9 +-
 fs/xfs/xfs_inode_item.h       |   4 +-
 fs/xfs/xfs_iunlink_item.c     | 141 +++++++
 fs/xfs/xfs_iunlink_item.h     |  24 ++
 fs/xfs/xfs_log_recover.c      | 166 ++++----
 fs/xfs/xfs_mount.c            |   5 -
 fs/xfs/xfs_super.c            |  10 +
 fs/xfs/xfs_trans.c            |  90 +++++
 fs/xfs/xfs_trans.h            |   6 +-
 14 files changed, 575 insertions(+), 661 deletions(-)
 create mode 100644 fs/xfs/xfs_iunlink_item.c
 create mode 100644 fs/xfs/xfs_iunlink_item.h

-- 
2.26.2.761.g0e0b3e54be


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 1/4] xfs: xfs_iflock is no longer a completion
  2020-06-23  9:50 [PATCH 0/4] [RFC] xfs: in memory inode unlink log items Dave Chinner
@ 2020-06-23  9:50 ` Dave Chinner
  2020-06-24 15:36   ` Brian Foster
  2020-06-23  9:50 ` [PATCH 2/4] xfs: add log item precommit operation Dave Chinner
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 22+ messages in thread
From: Dave Chinner @ 2020-06-23  9:50 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

With the recent rework of the inode cluster flushing, we no longer
ever wait on the the inode flush "lock". It was never a lock in the
first place, just a completion to allow callers to wait for inode IO
to complete. We now never wait for flush completion as all inode
flushing is non-blocking. Hence we can get rid of all the iflock
infrastructure and instead just set and check a state flag.

Rename the XFS_IFLOCK flag to XFS_IFLUSHING, convert all the
xfs_iflock_nowait() test-and-set operations on that flag, and
replace all the xfs_ifunlock() calls to clear operations.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_icache.c     | 19 ++++++------
 fs/xfs/xfs_inode.c      | 67 +++++++++++++++--------------------------
 fs/xfs/xfs_inode.h      | 33 +-------------------
 fs/xfs/xfs_inode_item.c |  6 ++--
 fs/xfs/xfs_inode_item.h |  4 +--
 5 files changed, 39 insertions(+), 90 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index a973f180c6cd..0d73559f2d58 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -54,7 +54,6 @@ xfs_inode_alloc(
 
 	XFS_STATS_INC(mp, vn_active);
 	ASSERT(atomic_read(&ip->i_pincount) == 0);
-	ASSERT(!xfs_isiflocked(ip));
 	ASSERT(ip->i_ino == 0);
 
 	/* initialise the xfs inode */
@@ -125,7 +124,7 @@ void
 xfs_inode_free(
 	struct xfs_inode	*ip)
 {
-	ASSERT(!xfs_isiflocked(ip));
+	ASSERT(!xfs_iflags_test(ip, XFS_IFLUSHING));
 
 	/*
 	 * Because we use RCU freeing we need to ensure the inode always
@@ -999,7 +998,7 @@ xfs_reclaim_inode_grab(
 	 * Do unlocked checks to see if the inode already is being flushed or in
 	 * reclaim to avoid lock traffic.
 	 */
-	if (__xfs_iflags_test(ip, XFS_IFLOCK | XFS_IRECLAIM))
+	if (__xfs_iflags_test(ip, XFS_IFLUSHING | XFS_IRECLAIM))
 		return 1;
 
 	/*
@@ -1045,23 +1044,23 @@ xfs_reclaim_inode(
 
 	if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
 		goto out;
-	if (!xfs_iflock_nowait(ip))
+	if (xfs_iflags_test_and_set(ip, XFS_IFLUSHING))
 		goto out_iunlock;
 
 	if (XFS_FORCED_SHUTDOWN(ip->i_mount)) {
 		xfs_iunpin_wait(ip);
 		/* xfs_iflush_abort() drops the flush lock */
 		xfs_iflush_abort(ip);
+		ASSERT(!xfs_iflags_test(ip, XFS_IFLUSHING));
 		goto reclaim;
 	}
 	if (xfs_ipincount(ip))
-		goto out_ifunlock;
+		goto out_clear_flush;
 	if (!xfs_inode_clean(ip))
-		goto out_ifunlock;
+		goto out_clear_flush;
 
-	xfs_ifunlock(ip);
+	xfs_iflags_clear(ip, XFS_IFLUSHING);
 reclaim:
-	ASSERT(!xfs_isiflocked(ip));
 
 	/*
 	 * Because we use RCU freeing we need to ensure the inode always appears
@@ -1111,8 +1110,8 @@ xfs_reclaim_inode(
 	__xfs_inode_free(ip);
 	return true;
 
-out_ifunlock:
-	xfs_ifunlock(ip);
+out_clear_flush:
+	xfs_iflags_clear(ip, XFS_IFLUSHING);
 out_iunlock:
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 out:
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index bae84b3eeb9a..ba86d27b5226 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -598,22 +598,6 @@ xfs_lock_two_inodes(
 	}
 }
 
-void
-__xfs_iflock(
-	struct xfs_inode	*ip)
-{
-	wait_queue_head_t *wq = bit_waitqueue(&ip->i_flags, __XFS_IFLOCK_BIT);
-	DEFINE_WAIT_BIT(wait, &ip->i_flags, __XFS_IFLOCK_BIT);
-
-	do {
-		prepare_to_wait_exclusive(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
-		if (xfs_isiflocked(ip))
-			io_schedule();
-	} while (!xfs_iflock_nowait(ip));
-
-	finish_wait(wq, &wait.wq_entry);
-}
-
 STATIC uint
 _xfs_dic2xflags(
 	uint16_t		di_flags,
@@ -2547,11 +2531,8 @@ xfs_ifree_mark_inode_stale(
 	 * valid, the wrong inode or stale.
 	 */
 	spin_lock(&ip->i_flags_lock);
-	if (ip->i_ino != inum || __xfs_iflags_test(ip, XFS_ISTALE)) {
-		spin_unlock(&ip->i_flags_lock);
-		rcu_read_unlock();
-		return;
-	}
+	if (ip->i_ino != inum || __xfs_iflags_test(ip, XFS_ISTALE))
+		goto out_iflags_unlock;
 
 	/*
 	 * Don't try to lock/unlock the current inode, but we _cannot_ skip the
@@ -2568,16 +2549,14 @@ xfs_ifree_mark_inode_stale(
 		}
 	}
 	ip->i_flags |= XFS_ISTALE;
-	spin_unlock(&ip->i_flags_lock);
-	rcu_read_unlock();
 
 	/*
-	 * If we can't get the flush lock, the inode is already attached.  All
+	 * If the inode is flushing, it is already attached to the buffer.  All
 	 * we needed to do here is mark the inode stale so buffer IO completion
 	 * will remove it from the AIL.
 	 */
 	iip = ip->i_itemp;
-	if (!xfs_iflock_nowait(ip)) {
+	if (__xfs_iflags_test(ip, XFS_IFLUSHING)) {
 		ASSERT(!list_empty(&iip->ili_item.li_bio_list));
 		ASSERT(iip->ili_last_fields);
 		goto out_iunlock;
@@ -2589,10 +2568,12 @@ xfs_ifree_mark_inode_stale(
 	 * commit as the flock synchronises removal of the inode from the
 	 * cluster buffer against inode reclaim.
 	 */
-	if (!iip || list_empty(&iip->ili_item.li_bio_list)) {
-		xfs_ifunlock(ip);
+	if (!iip || list_empty(&iip->ili_item.li_bio_list))
 		goto out_iunlock;
-	}
+
+	__xfs_iflags_set(ip, XFS_IFLUSHING);
+	spin_unlock(&ip->i_flags_lock);
+	rcu_read_unlock();
 
 	/* we have a dirty inode in memory that has not yet been flushed. */
 	spin_lock(&iip->ili_lock);
@@ -2602,9 +2583,16 @@ xfs_ifree_mark_inode_stale(
 	spin_unlock(&iip->ili_lock);
 	ASSERT(iip->ili_last_fields);
 
+	if (ip != free_ip)
+		xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	return;
+
 out_iunlock:
 	if (ip != free_ip)
 		xfs_iunlock(ip, XFS_ILOCK_EXCL);
+out_iflags_unlock:
+	spin_unlock(&ip->i_flags_lock);
+	rcu_read_unlock();
 }
 
 /*
@@ -3459,7 +3447,7 @@ xfs_iflush(
 	int			error;
 
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL|XFS_ILOCK_SHARED));
-	ASSERT(xfs_isiflocked(ip));
+	ASSERT(xfs_iflags_test(ip, XFS_IFLUSHING));
 	ASSERT(ip->i_df.if_format != XFS_DINODE_FMT_BTREE ||
 	       ip->i_df.if_nextents > XFS_IFORK_MAXEXT(ip, XFS_DATA_FORK));
 	ASSERT(iip->ili_item.li_buf == bp);
@@ -3629,7 +3617,7 @@ xfs_iflush_cluster(
 		/*
 		 * Quick and dirty check to avoid locks if possible.
 		 */
-		if (__xfs_iflags_test(ip, XFS_IRECLAIM | XFS_IFLOCK))
+		if (__xfs_iflags_test(ip, XFS_IRECLAIM | XFS_IFLUSHING))
 			continue;
 		if (xfs_ipincount(ip))
 			continue;
@@ -3643,7 +3631,7 @@ xfs_iflush_cluster(
 		 */
 		spin_lock(&ip->i_flags_lock);
 		ASSERT(!__xfs_iflags_test(ip, XFS_ISTALE));
-		if (__xfs_iflags_test(ip, XFS_IRECLAIM | XFS_IFLOCK)) {
+		if (__xfs_iflags_test(ip, XFS_IRECLAIM | XFS_IFLUSHING)) {
 			spin_unlock(&ip->i_flags_lock);
 			continue;
 		}
@@ -3651,23 +3639,16 @@ xfs_iflush_cluster(
 		/*
 		 * ILOCK will pin the inode against reclaim and prevent
 		 * concurrent transactions modifying the inode while we are
-		 * flushing the inode.
+		 * flushing the inode. If we get the lock, set the flushing
+		 * state before we drop the i_flags_lock.
 		 */
 		if (!xfs_ilock_nowait(ip, XFS_ILOCK_SHARED)) {
 			spin_unlock(&ip->i_flags_lock);
 			continue;
 		}
+		__xfs_iflags_set(ip, XFS_IFLUSHING);
 		spin_unlock(&ip->i_flags_lock);
 
-		/*
-		 * Skip inodes that are already flush locked as they have
-		 * already been written to the buffer.
-		 */
-		if (!xfs_iflock_nowait(ip)) {
-			xfs_iunlock(ip, XFS_ILOCK_SHARED);
-			continue;
-		}
-
 		/*
 		 * Abort flushing this inode if we are shut down because the
 		 * inode may not currently be in the AIL. This can occur when
@@ -3686,7 +3667,7 @@ xfs_iflush_cluster(
 
 		/* don't block waiting on a log force to unpin dirty inodes */
 		if (xfs_ipincount(ip)) {
-			xfs_ifunlock(ip);
+			xfs_iflags_clear(ip, XFS_IFLUSHING);
 			xfs_iunlock(ip, XFS_ILOCK_SHARED);
 			continue;
 		}
@@ -3694,7 +3675,7 @@ xfs_iflush_cluster(
 		if (!xfs_inode_clean(ip))
 			error = xfs_iflush(ip, bp);
 		else
-			xfs_ifunlock(ip);
+			xfs_iflags_clear(ip, XFS_IFLUSHING);
 		xfs_iunlock(ip, XFS_ILOCK_SHARED);
 		if (error)
 			break;
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 7a8adb76c17f..991ef00370d5 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -211,8 +211,7 @@ static inline bool xfs_inode_has_cow_data(struct xfs_inode *ip)
 #define XFS_INEW		(1 << __XFS_INEW_BIT)
 #define XFS_ITRUNCATED		(1 << 5) /* truncated down so flush-on-close */
 #define XFS_IDIRTY_RELEASE	(1 << 6) /* dirty release already seen */
-#define __XFS_IFLOCK_BIT	7	 /* inode is being flushed right now */
-#define XFS_IFLOCK		(1 << __XFS_IFLOCK_BIT)
+#define XFS_IFLUSHING		(1 << 7) /* inode is being flushed */
 #define __XFS_IPINNED_BIT	8	 /* wakeup key for zero pin count */
 #define XFS_IPINNED		(1 << __XFS_IPINNED_BIT)
 #define XFS_IEOFBLOCKS		(1 << 9) /* has the preallocblocks tag set */
@@ -233,36 +232,6 @@ static inline bool xfs_inode_has_cow_data(struct xfs_inode *ip)
 	(XFS_IRECLAIMABLE | XFS_IRECLAIM | \
 	 XFS_IDIRTY_RELEASE | XFS_ITRUNCATED)
 
-/*
- * Synchronize processes attempting to flush the in-core inode back to disk.
- */
-
-static inline int xfs_isiflocked(struct xfs_inode *ip)
-{
-	return xfs_iflags_test(ip, XFS_IFLOCK);
-}
-
-extern void __xfs_iflock(struct xfs_inode *ip);
-
-static inline int xfs_iflock_nowait(struct xfs_inode *ip)
-{
-	return !xfs_iflags_test_and_set(ip, XFS_IFLOCK);
-}
-
-static inline void xfs_iflock(struct xfs_inode *ip)
-{
-	if (!xfs_iflock_nowait(ip))
-		__xfs_iflock(ip);
-}
-
-static inline void xfs_ifunlock(struct xfs_inode *ip)
-{
-	ASSERT(xfs_isiflocked(ip));
-	xfs_iflags_clear(ip, XFS_IFLOCK);
-	smp_mb();
-	wake_up_bit(&ip->i_flags, __XFS_IFLOCK_BIT);
-}
-
 /*
  * Flags for inode locking.
  * Bit ranges:	1<<1  - 1<<16-1 -- iolock/ilock modes (bitfield)
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 3840117f8a5e..0494b907c63d 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -492,7 +492,7 @@ xfs_inode_item_push(
 		return XFS_ITEM_PINNED;
 
 	/* If the inode is already flush locked, we're already flushing. */
-	if (xfs_isiflocked(ip))
+	if (xfs_iflags_test(ip, XFS_IFLUSHING))
 		return XFS_ITEM_FLUSHING;
 
 	if (!xfs_buf_trylock(bp))
@@ -702,7 +702,7 @@ xfs_iflush_finish(
 		iip->ili_last_fields = 0;
 		iip->ili_flush_lsn = 0;
 		spin_unlock(&iip->ili_lock);
-		xfs_ifunlock(iip->ili_inode);
+		xfs_iflags_clear(iip->ili_inode, XFS_IFLUSHING);
 		if (drop_buffer)
 			xfs_buf_rele(bp);
 	}
@@ -789,7 +789,7 @@ xfs_iflush_abort(
 		list_del_init(&iip->ili_item.li_bio_list);
 		spin_unlock(&iip->ili_lock);
 	}
-	xfs_ifunlock(ip);
+	xfs_iflags_clear(ip, XFS_IFLUSHING);
 	if (bp)
 		xfs_buf_rele(bp);
 }
diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h
index 048b5e7dee90..23a7b4928727 100644
--- a/fs/xfs/xfs_inode_item.h
+++ b/fs/xfs/xfs_inode_item.h
@@ -25,8 +25,8 @@ struct xfs_inode_log_item {
 	 *
 	 * We need atomic changes between inode dirtying, inode flushing and
 	 * inode completion, but these all hold different combinations of
-	 * ILOCK and iflock and hence we need some other method of serialising
-	 * updates to the flush state.
+	 * ILOCK and IFLUSHING and hence we need some other method of
+	 * serialising updates to the flush state.
 	 */
 	spinlock_t		ili_lock;	   /* flush state lock */
 	unsigned int		ili_last_fields;   /* fields when flushed */
-- 
2.26.2.761.g0e0b3e54be


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 2/4] xfs: add log item precommit operation
  2020-06-23  9:50 [PATCH 0/4] [RFC] xfs: in memory inode unlink log items Dave Chinner
  2020-06-23  9:50 ` [PATCH 1/4] xfs: xfs_iflock is no longer a completion Dave Chinner
@ 2020-06-23  9:50 ` Dave Chinner
  2020-06-30 18:06   ` Darrick J. Wong
  2020-07-01 14:30   ` Brian Foster
  2020-06-23  9:50 ` [PATCH 3/4] xfs: track unlinked inodes in core inode Dave Chinner
  2020-06-23  9:50 ` [PATCH 4/4] xfs: introduce inode unlink log item Dave Chinner
  3 siblings, 2 replies; 22+ messages in thread
From: Dave Chinner @ 2020-06-23  9:50 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

For inodes that are dirty, we have an attached cluster buffer that
we want to use to track the dirty inode through the AIL.
Unfortunately, locking the cluster buffer and adding it to the
transaction when the inode is first logged in a transaction leads to
buffer lock ordering inversions.

The specific problem is ordering against the AGI buffer. When
modifying unlinked lists, the buffer lock order is AGI -> inode
cluster buffer as the AGI buffer lock serialises all access to the
unlinked lists. Unfortunately, functionality like xfs_droplink()
logs the inode before calling xfs_iunlink(), as do various directory
manipulation functions. The inode can be logged way down in the
stack as far as the bmapi routines and hence, without a major
rewrite of lots of APIs there's no way we can avoid the inode being
logged by something until after the AGI has been logged.

As we are going to be using ordered buffers for inode AIL tracking,
there isn't a need to actually lock that buffer against modification
as all the modifications are captured by logging the inode item
itself. Hence we don't actually need to join the cluster buffer into
the transaction until just before it is committed. This means we do
not perturb any of the existing buffer lock orders in transactions,
and the inode cluster buffer is always locked last in a transaction
that doesn't otherwise touch inode cluster buffers.

We do this by introducing a precommit log item method. A log item
method is used because it is likely dquots will be moved to this
same ordered buffer tracking scheme and hence will need a similar
callout. This commit just introduces the mechanism; the inode item
implementation is in followup commits.

The precommit items need to be sorted into consistent order as we
may be locking multiple items here. Hence if we have two dirty
inodes in cluster buffers A and B, and some other transaction has
two separate dirty inodes in the same cluster buffers, locking them
in different orders opens us up to ABBA deadlocks. Hence we sort the
items on the transaction based on the presence of a sort log item
method.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_icache.c |  1 +
 fs/xfs/xfs_trans.c  | 90 +++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_trans.h  |  6 ++-
 3 files changed, 95 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 0d73559f2d58..1c744dbb313f 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1077,6 +1077,7 @@ xfs_reclaim_inode(
 	ip->i_ino = 0;
 	spin_unlock(&ip->i_flags_lock);
 
+	ASSERT(!ip->i_itemp || ip->i_itemp->ili_item.li_buf == NULL);
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 
 	XFS_STATS_INC(ip->i_mount, xs_ig_reclaims);
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 3c94e5ff4316..6f350490f84b 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -799,6 +799,89 @@ xfs_trans_committed_bulk(
 	spin_unlock(&ailp->ail_lock);
 }
 
+/*
+ * Sort transaction items prior to running precommit operations. This will
+ * attempt to order the items such that they will always be locked in the same
+ * order. Items that have no sort function are moved to the end of the list
+ * and so are locked last (XXX: need to check the logic matches the comment).
+ *
+ * This may need refinement as different types of objects add sort functions.
+ *
+ * Function is more complex than it needs to be because we are comparing 64 bit
+ * values and the function only returns 32 bit values.
+ */
+static int
+xfs_trans_precommit_sort(
+	void			*unused_arg,
+	struct list_head	*a,
+	struct list_head	*b)
+{
+	struct xfs_log_item	*lia = container_of(a,
+					struct xfs_log_item, li_trans);
+	struct xfs_log_item	*lib = container_of(b,
+					struct xfs_log_item, li_trans);
+	int64_t			diff;
+
+	if (!lia->li_ops->iop_sort && !lib->li_ops->iop_sort)
+		return 0;
+	if (!lia->li_ops->iop_sort)
+		return 1;
+	if (!lib->li_ops->iop_sort)
+		return -1;
+
+	diff = lia->li_ops->iop_sort(lia) - lib->li_ops->iop_sort(lib);
+	if (diff < 0)
+		return -1;
+	if (diff > 0)
+		return 1;
+	return 0;
+}
+
+/*
+ * Run transaction precommit functions.
+ *
+ * If there is an error in any of the callouts, then stop immediately and
+ * trigger a shutdown to abort the transaction. There is no recovery possible
+ * from errors at this point as the transaction is dirty....
+ */
+static int
+xfs_trans_run_precommits(
+	struct xfs_trans	*tp)
+{
+	struct xfs_mount	*mp = tp->t_mountp;
+	struct xfs_log_item	*lip, *n;
+	int			error = 0;
+
+	if (XFS_FORCED_SHUTDOWN(mp))
+		return -EIO;
+
+	/*
+	 * Sort the item list to avoid ABBA deadlocks with other transactions
+	 * running precommit operations that lock multiple shared items such as
+	 * inode cluster buffers.
+	 */
+	list_sort(NULL, &tp->t_items, xfs_trans_precommit_sort);
+
+	/*
+	 * Precommit operations can remove the log item from the transaction
+	 * if the log item exists purely to delay modifications until they
+	 * can be ordered against other operations. Hence we have to use
+	 * list_for_each_entry_safe() here.
+	 */
+	list_for_each_entry_safe(lip, n, &tp->t_items, li_trans) {
+		if (!test_bit(XFS_LI_DIRTY, &lip->li_flags))
+			continue;
+		if (lip->li_ops->iop_precommit) {
+			error = lip->li_ops->iop_precommit(tp, lip);
+			if (error)
+				break;
+		}
+	}
+	if (error)
+		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
+	return error;
+}
+
 /*
  * Commit the given transaction to the log.
  *
@@ -823,6 +906,13 @@ __xfs_trans_commit(
 
 	trace_xfs_trans_commit(tp, _RET_IP_);
 
+	error = xfs_trans_run_precommits(tp);
+	if (error) {
+		if (tp->t_flags & XFS_TRANS_PERM_LOG_RES)
+			xfs_defer_cancel(tp);
+		goto out_unreserve;
+	}
+
 	/*
 	 * Finish deferred items on final commit. Only permanent transactions
 	 * should ever have deferred ops.
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index b752501818d2..26ea19bd0621 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -70,10 +70,12 @@ struct xfs_item_ops {
 	void (*iop_format)(struct xfs_log_item *, struct xfs_log_vec *);
 	void (*iop_pin)(struct xfs_log_item *);
 	void (*iop_unpin)(struct xfs_log_item *, int remove);
-	uint (*iop_push)(struct xfs_log_item *, struct list_head *);
+	uint64_t (*iop_sort)(struct xfs_log_item *);
+	int (*iop_precommit)(struct xfs_trans *, struct xfs_log_item *);
 	void (*iop_committing)(struct xfs_log_item *, xfs_lsn_t commit_lsn);
-	void (*iop_release)(struct xfs_log_item *);
 	xfs_lsn_t (*iop_committed)(struct xfs_log_item *, xfs_lsn_t);
+	uint (*iop_push)(struct xfs_log_item *, struct list_head *);
+	void (*iop_release)(struct xfs_log_item *);
 	int (*iop_recover)(struct xfs_log_item *lip, struct xfs_trans *tp);
 	bool (*iop_match)(struct xfs_log_item *item, uint64_t id);
 };
-- 
2.26.2.761.g0e0b3e54be


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 3/4] xfs: track unlinked inodes in core inode
  2020-06-23  9:50 [PATCH 0/4] [RFC] xfs: in memory inode unlink log items Dave Chinner
  2020-06-23  9:50 ` [PATCH 1/4] xfs: xfs_iflock is no longer a completion Dave Chinner
  2020-06-23  9:50 ` [PATCH 2/4] xfs: add log item precommit operation Dave Chinner
@ 2020-06-23  9:50 ` Dave Chinner
  2020-07-01  8:59   ` Gao Xiang
                     ` (2 more replies)
  2020-06-23  9:50 ` [PATCH 4/4] xfs: introduce inode unlink log item Dave Chinner
  3 siblings, 3 replies; 22+ messages in thread
From: Dave Chinner @ 2020-06-23  9:50 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Currently we cache unlinked inode list backrefs through a separate
cache which has to be maintained via memory allocation and a hash
table. When the inode is on the unlinked list, we have an existence
guarantee for the inode in memory.

That is, if the inode is on the unlinked list, there must be a
reference to the inode from the core VFS because dropping the last
reference to the inode causes it to be removed from the unlinked
list. Hence if we hold the AGI locked, we guarantee that any inode
on the unlinked list is pinned in memory. That means we can actually
track the entire unlinked list on the inode itself and use
unreferenced inode cache lookups to update the list pointers as
needed.

However, we don't use this relationship because log recovery has
no in memory state and so has to work directly from buffers.
However, because unlink recovery only removes from the head of the
list, we can easily fake this in memory state as the inode we read
in from the AGI bucket has a pointer to the next inode. Hence we can
play reference leapfrog in the recovery loop always reading the
second inode on the list and updating pointers before dropping the
reference to the first inode. Hence the in-memory state is always
valid for recovery, too.

This means we can tear out the old inode unlinked list cache and
update mechanisms and replace it with a much simpler "insert" and
"remove" functions that use in-memory inode state rather than on
buffer state to track the list. The diffstat speaks for itself.

Food for thought: This obliviates the need for the on-disk AGI
unlinked hash - we because we track as a double linked list in
memory, we don't need to keep hash chains on disk short to minimise
previous inode lookup overhead on list removal. Hence we probably
should just convert the unlinked list code to use a single list
on disk...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_inode_buf.c |   1 +
 fs/xfs/xfs_icache.c           |   2 +
 fs/xfs/xfs_inode.c            | 672 ++++++++++------------------------
 fs/xfs/xfs_inode.h            |   5 +-
 fs/xfs/xfs_log_recover.c      | 166 +++++----
 fs/xfs/xfs_mount.c            |   5 -
 6 files changed, 297 insertions(+), 554 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index 6b6f67595bf4..74f713e4a6b5 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -225,6 +225,7 @@ xfs_inode_from_disk(
 	to->di_dmevmask	= be32_to_cpu(from->di_dmevmask);
 	to->di_dmstate	= be16_to_cpu(from->di_dmstate);
 	to->di_flags	= be16_to_cpu(from->di_flags);
+	ip->i_next_unlinked = be32_to_cpu(from->di_next_unlinked);
 
 	if (xfs_sb_version_has_v3inode(&ip->i_mount->m_sb)) {
 		inode_set_iversion_queried(inode,
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 1c744dbb313f..dadf417fb9b4 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -71,6 +71,8 @@ xfs_inode_alloc(
 	INIT_WORK(&ip->i_ioend_work, xfs_end_io);
 	INIT_LIST_HEAD(&ip->i_ioend_list);
 	spin_lock_init(&ip->i_ioend_lock);
+	ip->i_next_unlinked = NULLAGINO;
+	ip->i_prev_unlinked = NULLAGINO;
 
 	return ip;
 }
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index ba86d27b5226..1f1c8819330b 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1876,211 +1876,123 @@ xfs_inactive(
 }
 
 /*
- * In-Core Unlinked List Lookups
- * =============================
- *
- * Every inode is supposed to be reachable from some other piece of metadata
- * with the exception of the root directory.  Inodes with a connection to a
- * file descriptor but not linked from anywhere in the on-disk directory tree
- * are collectively known as unlinked inodes, though the filesystem itself
- * maintains links to these inodes so that on-disk metadata are consistent.
- *
- * XFS implements a per-AG on-disk hash table of unlinked inodes.  The AGI
- * header contains a number of buckets that point to an inode, and each inode
- * record has a pointer to the next inode in the hash chain.  This
- * singly-linked list causes scaling problems in the iunlink remove function
- * because we must walk that list to find the inode that points to the inode
- * being removed from the unlinked hash bucket list.
- *
- * What if we modelled the unlinked list as a collection of records capturing
- * "X.next_unlinked = Y" relations?  If we indexed those records on Y, we'd
- * have a fast way to look up unlinked list predecessors, which avoids the
- * slow list walk.  That's exactly what we do here (in-core) with a per-AG
- * rhashtable.
- *
- * Because this is a backref cache, we ignore operational failures since the
- * iunlink code can fall back to the slow bucket walk.  The only errors that
- * should bubble out are for obviously incorrect situations.
- *
- * All users of the backref cache MUST hold the AGI buffer lock to serialize
- * access or have otherwise provided for concurrency control.
- */
-
-/* Capture a "X.next_unlinked = Y" relationship. */
-struct xfs_iunlink {
-	struct rhash_head	iu_rhash_head;
-	xfs_agino_t		iu_agino;		/* X */
-	xfs_agino_t		iu_next_unlinked;	/* Y */
-};
-
-/* Unlinked list predecessor lookup hashtable construction */
-static int
-xfs_iunlink_obj_cmpfn(
-	struct rhashtable_compare_arg	*arg,
-	const void			*obj)
-{
-	const xfs_agino_t		*key = arg->key;
-	const struct xfs_iunlink	*iu = obj;
-
-	if (iu->iu_next_unlinked != *key)
-		return 1;
-	return 0;
-}
-
-static const struct rhashtable_params xfs_iunlink_hash_params = {
-	.min_size		= XFS_AGI_UNLINKED_BUCKETS,
-	.key_len		= sizeof(xfs_agino_t),
-	.key_offset		= offsetof(struct xfs_iunlink,
-					   iu_next_unlinked),
-	.head_offset		= offsetof(struct xfs_iunlink, iu_rhash_head),
-	.automatic_shrinking	= true,
-	.obj_cmpfn		= xfs_iunlink_obj_cmpfn,
-};
-
-/*
- * Return X, where X.next_unlinked == @agino.  Returns NULLAGINO if no such
- * relation is found.
+ * Find an inode on the unlinked list. This does not take references to the
+ * inode, we have existence guarantees by holding the AGI buffer lock and
+ * that only unlinked, referenced inodes can be on the unlinked inode list.
+ * If we don't find the inode in cache, then let the caller handle the
+ * situation.
  */
-static xfs_agino_t
-xfs_iunlink_lookup_backref(
+static struct xfs_inode *
+xfs_iunlink_ilookup(
 	struct xfs_perag	*pag,
 	xfs_agino_t		agino)
 {
-	struct xfs_iunlink	*iu;
+	struct xfs_mount	*mp = pag->pag_mount;
+	struct xfs_inode	*ip;
 
-	iu = rhashtable_lookup_fast(&pag->pagi_unlinked_hash, &agino,
-			xfs_iunlink_hash_params);
-	return iu ? iu->iu_agino : NULLAGINO;
+	rcu_read_lock();
+	ip = radix_tree_lookup(&pag->pag_ici_root, agino);
+
+	/* Inode not in memory, nothing to do */
+	if (!ip) {
+		rcu_read_unlock();
+		return NULL;
+	}
+	spin_lock(&ip->i_flags_lock);
+	if (ip->i_ino != XFS_AGINO_TO_INO(mp, pag->pag_agno, agino) ||
+	    __xfs_iflags_test(ip, XFS_IRECLAIM)) {
+		/* Uh-oh! */
+		ip = NULL;
+	}
+	spin_unlock(&ip->i_flags_lock);
+	rcu_read_unlock();
+	return ip;
 }
 
 /*
- * Take ownership of an iunlink cache entry and insert it into the hash table.
- * If successful, the entry will be owned by the cache; if not, it is freed.
- * Either way, the caller does not own @iu after this call.
+ * Return the inode before @ip on the unlinked list without taking a reference
+ * to it. Caller must hold the AGI buffer locked to guarantee existence of the
+ * inode. Returns the inode or NULL if at the head of the list.
+ * If a lookup fails, return corruption error.
  */
-static int
-xfs_iunlink_insert_backref(
+static struct xfs_inode *
+xfs_iunlink_lookup_prev(
 	struct xfs_perag	*pag,
-	struct xfs_iunlink	*iu)
+	struct xfs_inode	*ip)
 {
-	int			error;
-
-	error = rhashtable_insert_fast(&pag->pagi_unlinked_hash,
-			&iu->iu_rhash_head, xfs_iunlink_hash_params);
-	/*
-	 * Fail loudly if there already was an entry because that's a sign of
-	 * corruption of in-memory data.  Also fail loudly if we see an error
-	 * code we didn't anticipate from the rhashtable code.  Currently we
-	 * only anticipate ENOMEM.
-	 */
-	if (error) {
-		WARN(error != -ENOMEM, "iunlink cache insert error %d", error);
-		kmem_free(iu);
-	}
-	/*
-	 * Absorb any runtime errors that aren't a result of corruption because
-	 * this is a cache and we can always fall back to bucket list scanning.
-	 */
-	if (error != 0 && error != -EEXIST)
-		error = 0;
-	return error;
+	struct xfs_inode	*prev_ip;
+
+	if (ip->i_prev_unlinked == NULLAGINO)
+		return NULL;
+	prev_ip = xfs_iunlink_ilookup(pag, ip->i_prev_unlinked);
+	if (!prev_ip)
+		return ERR_PTR(-EFSCORRUPTED);
+	return prev_ip;
 }
 
-/* Remember that @prev_agino.next_unlinked = @this_agino. */
-static int
-xfs_iunlink_add_backref(
+static struct xfs_inode *
+xfs_iunlink_lookup_next(
 	struct xfs_perag	*pag,
-	xfs_agino_t		prev_agino,
-	xfs_agino_t		this_agino)
+	struct xfs_inode	*ip)
 {
-	struct xfs_iunlink	*iu;
-
-	if (XFS_TEST_ERROR(false, pag->pag_mount, XFS_ERRTAG_IUNLINK_FALLBACK))
-		return 0;
-
-	iu = kmem_zalloc(sizeof(*iu), KM_NOFS);
-	iu->iu_agino = prev_agino;
-	iu->iu_next_unlinked = this_agino;
-
-	return xfs_iunlink_insert_backref(pag, iu);
+	struct xfs_inode	*next_ip;
+
+	if (ip->i_next_unlinked == NULLAGINO)
+		return NULL;
+	next_ip = xfs_iunlink_ilookup(pag, ip->i_next_unlinked);
+	if (!next_ip)
+		return ERR_PTR(-EFSCORRUPTED);
+	return next_ip;
 }
 
-/*
- * Replace X.next_unlinked = @agino with X.next_unlinked = @next_unlinked.
- * If @next_unlinked is NULLAGINO, we drop the backref and exit.  If there
- * wasn't any such entry then we don't bother.
- */
-static int
-xfs_iunlink_change_backref(
-	struct xfs_perag	*pag,
+/* Set an on-disk inode's next_unlinked pointer. */
+STATIC void
+xfs_iunlink_update_dinode(
+	struct xfs_trans	*tp,
+	xfs_agnumber_t		agno,
 	xfs_agino_t		agino,
-	xfs_agino_t		next_unlinked)
+	struct xfs_buf		*ibp,
+	struct xfs_dinode	*dip,
+	struct xfs_imap		*imap,
+	xfs_agino_t		next_agino)
 {
-	struct xfs_iunlink	*iu;
-	int			error;
-
-	/* Look up the old entry; if there wasn't one then exit. */
-	iu = rhashtable_lookup_fast(&pag->pagi_unlinked_hash, &agino,
-			xfs_iunlink_hash_params);
-	if (!iu)
-		return 0;
-
-	/*
-	 * Remove the entry.  This shouldn't ever return an error, but if we
-	 * couldn't remove the old entry we don't want to add it again to the
-	 * hash table, and if the entry disappeared on us then someone's
-	 * violated the locking rules and we need to fail loudly.  Either way
-	 * we cannot remove the inode because internal state is or would have
-	 * been corrupt.
-	 */
-	error = rhashtable_remove_fast(&pag->pagi_unlinked_hash,
-			&iu->iu_rhash_head, xfs_iunlink_hash_params);
-	if (error)
-		return error;
-
-	/* If there is no new next entry just free our item and return. */
-	if (next_unlinked == NULLAGINO) {
-		kmem_free(iu);
-		return 0;
-	}
+	struct xfs_mount	*mp = tp->t_mountp;
+	int			offset;
 
-	/* Update the entry and re-add it to the hash table. */
-	iu->iu_next_unlinked = next_unlinked;
-	return xfs_iunlink_insert_backref(pag, iu);
-}
+	ASSERT(xfs_verify_agino_or_null(mp, agno, next_agino));
 
-/* Set up the in-core predecessor structures. */
-int
-xfs_iunlink_init(
-	struct xfs_perag	*pag)
-{
-	return rhashtable_init(&pag->pagi_unlinked_hash,
-			&xfs_iunlink_hash_params);
-}
+	trace_xfs_iunlink_update_dinode(mp, agno, agino,
+			be32_to_cpu(dip->di_next_unlinked), next_agino);
 
-/* Free the in-core predecessor structures. */
-static void
-xfs_iunlink_free_item(
-	void			*ptr,
-	void			*arg)
-{
-	struct xfs_iunlink	*iu = ptr;
-	bool			*freed_anything = arg;
+	dip->di_next_unlinked = cpu_to_be32(next_agino);
+	offset = imap->im_boffset +
+			offsetof(struct xfs_dinode, di_next_unlinked);
 
-	*freed_anything = true;
-	kmem_free(iu);
+	/* need to recalc the inode CRC if appropriate */
+	xfs_dinode_calc_crc(mp, dip);
+	xfs_trans_inode_buf(tp, ibp);
+	xfs_trans_log_buf(tp, ibp, offset, offset + sizeof(xfs_agino_t) - 1);
 }
 
-void
-xfs_iunlink_destroy(
-	struct xfs_perag	*pag)
+/* Set an in-core inode's unlinked pointer and return the old value. */
+static int
+xfs_iunlink_update_inode(
+	struct xfs_trans	*tp,
+	struct xfs_inode	*ip)
 {
-	bool			freed_anything = false;
+	struct xfs_mount	*mp = tp->t_mountp;
+	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
+	struct xfs_dinode	*dip;
+	struct xfs_buf		*ibp;
+	int			error;
 
-	rhashtable_free_and_destroy(&pag->pagi_unlinked_hash,
-			xfs_iunlink_free_item, &freed_anything);
+	error = xfs_imap_to_bp(mp, tp, &ip->i_imap, &dip, &ibp, 0);
+	if (error)
+		return error;
 
-	ASSERT(freed_anything == false || XFS_FORCED_SHUTDOWN(pag->pag_mount));
+	xfs_iunlink_update_dinode(tp, agno, XFS_INO_TO_AGINO(mp, ip->i_ino),
+			ibp, dip, &ip->i_imap, ip->i_next_unlinked);
+	return 0;
 }
 
 /*
@@ -2122,120 +2034,25 @@ xfs_iunlink_update_bucket(
 	return 0;
 }
 
-/* Set an on-disk inode's next_unlinked pointer. */
-STATIC void
-xfs_iunlink_update_dinode(
-	struct xfs_trans	*tp,
-	xfs_agnumber_t		agno,
-	xfs_agino_t		agino,
-	struct xfs_buf		*ibp,
-	struct xfs_dinode	*dip,
-	struct xfs_imap		*imap,
-	xfs_agino_t		next_agino)
-{
-	struct xfs_mount	*mp = tp->t_mountp;
-	int			offset;
-
-	ASSERT(xfs_verify_agino_or_null(mp, agno, next_agino));
-
-	trace_xfs_iunlink_update_dinode(mp, agno, agino,
-			be32_to_cpu(dip->di_next_unlinked), next_agino);
-
-	dip->di_next_unlinked = cpu_to_be32(next_agino);
-	offset = imap->im_boffset +
-			offsetof(struct xfs_dinode, di_next_unlinked);
-
-	/* need to recalc the inode CRC if appropriate */
-	xfs_dinode_calc_crc(mp, dip);
-	xfs_trans_inode_buf(tp, ibp);
-	xfs_trans_log_buf(tp, ibp, offset, offset + sizeof(xfs_agino_t) - 1);
-}
-
-/* Set an in-core inode's unlinked pointer and return the old value. */
-STATIC int
-xfs_iunlink_update_inode(
-	struct xfs_trans	*tp,
-	struct xfs_inode	*ip,
-	xfs_agnumber_t		agno,
-	xfs_agino_t		next_agino,
-	xfs_agino_t		*old_next_agino)
-{
-	struct xfs_mount	*mp = tp->t_mountp;
-	struct xfs_dinode	*dip;
-	struct xfs_buf		*ibp;
-	xfs_agino_t		old_value;
-	int			error;
-
-	ASSERT(xfs_verify_agino_or_null(mp, agno, next_agino));
-
-	error = xfs_imap_to_bp(mp, tp, &ip->i_imap, &dip, &ibp, 0);
-	if (error)
-		return error;
-
-	/* Make sure the old pointer isn't garbage. */
-	old_value = be32_to_cpu(dip->di_next_unlinked);
-	if (!xfs_verify_agino_or_null(mp, agno, old_value)) {
-		xfs_inode_verifier_error(ip, -EFSCORRUPTED, __func__, dip,
-				sizeof(*dip), __this_address);
-		error = -EFSCORRUPTED;
-		goto out;
-	}
-
-	/*
-	 * Since we're updating a linked list, we should never find that the
-	 * current pointer is the same as the new value, unless we're
-	 * terminating the list.
-	 */
-	*old_next_agino = old_value;
-	if (old_value == next_agino) {
-		if (next_agino != NULLAGINO) {
-			xfs_inode_verifier_error(ip, -EFSCORRUPTED, __func__,
-					dip, sizeof(*dip), __this_address);
-			error = -EFSCORRUPTED;
-		}
-		goto out;
-	}
-
-	/* Ok, update the new pointer. */
-	xfs_iunlink_update_dinode(tp, agno, XFS_INO_TO_AGINO(mp, ip->i_ino),
-			ibp, dip, &ip->i_imap, next_agino);
-	return 0;
-out:
-	xfs_trans_brelse(tp, ibp);
-	return error;
-}
-
 /*
- * This is called when the inode's link count has gone to 0 or we are creating
- * a tmpfile via O_TMPFILE.  The inode @ip must have nlink == 0.
- *
- * We place the on-disk inode on a list in the AGI.  It will be pulled from this
- * list when the inode is freed.
+ * Always insert at the head, so we only have to do a next inode lookup to
+ * update it's prev pointer. The AGI bucket will point at the one we are
+ * inserting.
  */
-STATIC int
-xfs_iunlink(
+static int
+xfs_iunlink_insert_inode(
 	struct xfs_trans	*tp,
+	struct xfs_buf		*agibp,
 	struct xfs_inode	*ip)
 {
 	struct xfs_mount	*mp = tp->t_mountp;
-	struct xfs_agi		*agi;
-	struct xfs_buf		*agibp;
-	xfs_agino_t		next_agino;
-	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
+	struct xfs_agi		*agi = agibp->b_addr;
+	xfs_agino_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
 	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
+	xfs_agino_t		next_agino;
 	short			bucket_index = agino % XFS_AGI_UNLINKED_BUCKETS;
 	int			error;
 
-	ASSERT(VFS_I(ip)->i_nlink == 0);
-	ASSERT(VFS_I(ip)->i_mode != 0);
-	trace_xfs_iunlink(ip);
-
-	/* Get the agi buffer first.  It ensures lock ordering on the list. */
-	error = xfs_read_agi(mp, tp, agno, &agibp);
-	if (error)
-		return error;
-	agi = agibp->b_addr;
-
 	/*
 	 * Get the index into the agi hash table for the list this inode will
 	 * go on.  Make sure the pointer isn't garbage and that this inode
@@ -2248,27 +2065,24 @@ xfs_iunlink(
 		return -EFSCORRUPTED;
 	}
 
-	if (next_agino != NULLAGINO) {
-		struct xfs_perag	*pag;
-		xfs_agino_t		old_agino;
+	ip->i_prev_unlinked = NULLAGINO;
+	ip->i_next_unlinked = next_agino;
+	if (ip->i_next_unlinked != NULLAGINO) {
+		struct xfs_inode	*nip;
 
-		/*
-		 * There is already another inode in the bucket, so point this
-		 * inode to the current head of the list.
-		 */
-		error = xfs_iunlink_update_inode(tp, ip, agno, next_agino,
-				&old_agino);
-		if (error)
-			return error;
-		ASSERT(old_agino == NULLAGINO);
+		nip = xfs_iunlink_lookup_next(agibp->b_pag, ip);
+		if (IS_ERR_OR_NULL(nip))
+			return -EFSCORRUPTED;
 
-		/*
-		 * agino has been unlinked, add a backref from the next inode
-		 * back to agino.
-		 */
-		pag = xfs_perag_get(mp, agno);
-		error = xfs_iunlink_add_backref(pag, agino, next_agino);
-		xfs_perag_put(pag);
+		if (nip->i_prev_unlinked != NULLAGINO) {
+			xfs_inode_verifier_error(ip, -EFSCORRUPTED, __func__,
+						NULL, 0, __this_address);
+			return -EFSCORRUPTED;
+		}
+		nip->i_prev_unlinked = agino;
+
+		/* update the on disk inode now */
+		error = xfs_iunlink_update_inode(tp, ip);
 		if (error)
 			return error;
 	}
@@ -2277,118 +2091,110 @@ xfs_iunlink(
 	return xfs_iunlink_update_bucket(tp, agno, agibp, bucket_index, agino);
 }
 
-/* Return the imap, dinode pointer, and buffer for an inode. */
-STATIC int
-xfs_iunlink_map_ino(
+/*
+ * Remove can be from anywhere in the list, so we have to do two adjacent inode
+ * lookups here so we can update list pointers. We may be at the head or the
+ * tail of the list, so we have to handle those cases as well.
+ */
+static int
+xfs_iunlink_remove_inode(
 	struct xfs_trans	*tp,
-	xfs_agnumber_t		agno,
-	xfs_agino_t		agino,
-	struct xfs_imap		*imap,
-	struct xfs_dinode	**dipp,
-	struct xfs_buf		**bpp)
+	struct xfs_buf		*agibp,
+	struct xfs_inode	*ip)
 {
 	struct xfs_mount	*mp = tp->t_mountp;
+	struct xfs_agi		*agi = agibp->b_addr;
+	xfs_agino_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
+	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
+	xfs_agino_t		next_agino = ip->i_next_unlinked;
+	short			bucket_index = agino % XFS_AGI_UNLINKED_BUCKETS;
 	int			error;
 
-	imap->im_blkno = 0;
-	error = xfs_imap(mp, tp, XFS_AGINO_TO_INO(mp, agno, agino), imap, 0);
-	if (error) {
-		xfs_warn(mp, "%s: xfs_imap returned error %d.",
-				__func__, error);
-		return error;
+	if (ip->i_prev_unlinked == NULLAGINO) {
+		/* remove from head of list */
+		if (be32_to_cpu(agi->agi_unlinked[bucket_index]) != agino) {
+			xfs_buf_mark_corrupt(agibp);
+			return -EFSCORRUPTED;
+		}
+		if (next_agino == agino ||
+		    !xfs_verify_agino_or_null(mp, agno, next_agino))
+			return -EFSCORRUPTED;
+
+		error = xfs_iunlink_update_bucket(tp, agno, agibp,
+					bucket_index, next_agino);
+		if (error)
+			return -EFSCORRUPTED;
+	} else {
+		/* lookup previous inode and update to point at next */
+		struct xfs_inode	*pip;
+
+		pip = xfs_iunlink_lookup_prev(agibp->b_pag, ip);
+		if (IS_ERR_OR_NULL(pip))
+			return -EFSCORRUPTED;
+
+		if (pip->i_next_unlinked != agino) {
+			xfs_inode_verifier_error(ip, -EFSCORRUPTED, __func__,
+						NULL, 0, __this_address);
+			return -EFSCORRUPTED;
+		}
+
+		/* update the on disk inode now */
+		pip->i_next_unlinked = next_agino;
+		error = xfs_iunlink_update_inode(tp, pip);
+		if (error)
+			return error;
 	}
 
-	error = xfs_imap_to_bp(mp, tp, imap, dipp, bpp, 0);
-	if (error) {
-		xfs_warn(mp, "%s: xfs_imap_to_bp returned error %d.",
-				__func__, error);
-		return error;
+	/* lookup the next inode and update to point at prev */
+	if (ip->i_next_unlinked != NULLAGINO) {
+		struct xfs_inode	*nip;
+
+		nip = xfs_iunlink_lookup_next(agibp->b_pag, ip);
+		if (IS_ERR_OR_NULL(nip))
+			return -EFSCORRUPTED;
+
+		if (nip->i_prev_unlinked != agino) {
+			xfs_inode_verifier_error(ip, -EFSCORRUPTED, __func__,
+						NULL, 0, __this_address);
+			return -EFSCORRUPTED;
+		}
+		/* in memory update only */
+		nip->i_prev_unlinked = ip->i_prev_unlinked;
 	}
 
-	return 0;
+	/* now clear prev/next from this inode and update on disk */
+	ip->i_prev_unlinked = NULLAGINO;
+	ip->i_next_unlinked = NULLAGINO;
+	return xfs_iunlink_update_inode(tp, ip);
 }
 
 /*
- * Walk the unlinked chain from @head_agino until we find the inode that
- * points to @target_agino.  Return the inode number, map, dinode pointer,
- * and inode cluster buffer of that inode as @agino, @imap, @dipp, and @bpp.
- *
- * @tp, @pag, @head_agino, and @target_agino are input parameters.
- * @agino, @imap, @dipp, and @bpp are all output parameters.
+ * This is called when the inode's link count has gone to 0 or we are creating
+ * a tmpfile via O_TMPFILE.  The inode @ip must have nlink == 0.
  *
- * Do not call this function if @target_agino is the head of the list.
+ * We place the on-disk inode on a list in the AGI.  It will be pulled from this
+ * list when the inode is freed.
  */
 STATIC int
-xfs_iunlink_map_prev(
+xfs_iunlink(
 	struct xfs_trans	*tp,
-	xfs_agnumber_t		agno,
-	xfs_agino_t		head_agino,
-	xfs_agino_t		target_agino,
-	xfs_agino_t		*agino,
-	struct xfs_imap		*imap,
-	struct xfs_dinode	**dipp,
-	struct xfs_buf		**bpp,
-	struct xfs_perag	*pag)
+	struct xfs_inode	*ip)
 {
 	struct xfs_mount	*mp = tp->t_mountp;
-	xfs_agino_t		next_agino;
+	struct xfs_buf		*agibp;
+	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
 	int			error;
 
-	ASSERT(head_agino != target_agino);
-	*bpp = NULL;
-
-	/* See if our backref cache can find it faster. */
-	*agino = xfs_iunlink_lookup_backref(pag, target_agino);
-	if (*agino != NULLAGINO) {
-		error = xfs_iunlink_map_ino(tp, agno, *agino, imap, dipp, bpp);
-		if (error)
-			return error;
-
-		if (be32_to_cpu((*dipp)->di_next_unlinked) == target_agino)
-			return 0;
-
-		/*
-		 * If we get here the cache contents were corrupt, so drop the
-		 * buffer and fall back to walking the bucket list.
-		 */
-		xfs_trans_brelse(tp, *bpp);
-		*bpp = NULL;
-		WARN_ON_ONCE(1);
-	}
-
-	trace_xfs_iunlink_map_prev_fallback(mp, agno);
-
-	/* Otherwise, walk the entire bucket until we find it. */
-	next_agino = head_agino;
-	while (next_agino != target_agino) {
-		xfs_agino_t	unlinked_agino;
-
-		if (*bpp)
-			xfs_trans_brelse(tp, *bpp);
-
-		*agino = next_agino;
-		error = xfs_iunlink_map_ino(tp, agno, next_agino, imap, dipp,
-				bpp);
-		if (error)
-			return error;
+	ASSERT(VFS_I(ip)->i_nlink == 0);
+	ASSERT(VFS_I(ip)->i_mode != 0);
+	trace_xfs_iunlink(ip);
 
-		unlinked_agino = be32_to_cpu((*dipp)->di_next_unlinked);
-		/*
-		 * Make sure this pointer is valid and isn't an obvious
-		 * infinite loop.
-		 */
-		if (!xfs_verify_agino(mp, agno, unlinked_agino) ||
-		    next_agino == unlinked_agino) {
-			XFS_CORRUPTION_ERROR(__func__,
-					XFS_ERRLEVEL_LOW, mp,
-					*dipp, sizeof(**dipp));
-			error = -EFSCORRUPTED;
-			return error;
-		}
-		next_agino = unlinked_agino;
-	}
+	/* Get the agi buffer first.  It ensures lock ordering on the list. */
+	error = xfs_read_agi(mp, tp, agno, &agibp);
+	if (error)
+		return error;
 
-	return 0;
+	return xfs_iunlink_insert_inode(tp, agibp, ip);
 }
 
 /*
@@ -2400,16 +2206,8 @@ xfs_iunlink_remove(
 	struct xfs_inode	*ip)
 {
 	struct xfs_mount	*mp = tp->t_mountp;
-	struct xfs_agi		*agi;
 	struct xfs_buf		*agibp;
-	struct xfs_buf		*last_ibp;
-	struct xfs_dinode	*last_dip = NULL;
-	struct xfs_perag	*pag = NULL;
 	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
-	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
-	xfs_agino_t		next_agino;
-	xfs_agino_t		head_agino;
-	short			bucket_index = agino % XFS_AGI_UNLINKED_BUCKETS;
 	int			error;
 
 	trace_xfs_iunlink_remove(ip);
@@ -2418,84 +2216,8 @@ xfs_iunlink_remove(
 	error = xfs_read_agi(mp, tp, agno, &agibp);
 	if (error)
 		return error;
-	agi = agibp->b_addr;
-
-	/*
-	 * Get the index into the agi hash table for the list this inode will
-	 * go on.  Make sure the head pointer isn't garbage.
-	 */
-	head_agino = be32_to_cpu(agi->agi_unlinked[bucket_index]);
-	if (!xfs_verify_agino(mp, agno, head_agino)) {
-		XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp,
-				agi, sizeof(*agi));
-		return -EFSCORRUPTED;
-	}
-
-	/*
-	 * Set our inode's next_unlinked pointer to NULL and then return
-	 * the old pointer value so that we can update whatever was previous
-	 * to us in the list to point to whatever was next in the list.
-	 */
-	error = xfs_iunlink_update_inode(tp, ip, agno, NULLAGINO, &next_agino);
-	if (error)
-		return error;
-
-	/*
-	 * If there was a backref pointing from the next inode back to this
-	 * one, remove it because we've removed this inode from the list.
-	 *
-	 * Later, if this inode was in the middle of the list we'll update
-	 * this inode's backref to point from the next inode.
-	 */
-	if (next_agino != NULLAGINO) {
-		pag = xfs_perag_get(mp, agno);
-		error = xfs_iunlink_change_backref(pag, next_agino,
-				NULLAGINO);
-		if (error)
-			goto out;
-	}
-
-	if (head_agino == agino) {
-		/* Point the head of the list to the next unlinked inode. */
-		error = xfs_iunlink_update_bucket(tp, agno, agibp, bucket_index,
-				next_agino);
-		if (error)
-			goto out;
-	} else {
-		struct xfs_imap	imap;
-		xfs_agino_t	prev_agino;
-
-		if (!pag)
-			pag = xfs_perag_get(mp, agno);
-
-		/* We need to search the list for the inode being freed. */
-		error = xfs_iunlink_map_prev(tp, agno, head_agino, agino,
-				&prev_agino, &imap, &last_dip, &last_ibp,
-				pag);
-		if (error)
-			goto out;
 
-		/* Point the previous inode on the list to the next inode. */
-		xfs_iunlink_update_dinode(tp, agno, prev_agino, last_ibp,
-				last_dip, &imap, next_agino);
-
-		/*
-		 * Now we deal with the backref for this inode.  If this inode
-		 * pointed at a real inode, change the backref that pointed to
-		 * us to point to our old next.  If this inode was the end of
-		 * the list, delete the backref that pointed to us.  Note that
-		 * change_backref takes care of deleting the backref if
-		 * next_agino is NULLAGINO.
-		 */
-		error = xfs_iunlink_change_backref(pag, agino, next_agino);
-		if (error)
-			goto out;
-	}
-
-out:
-	if (pag)
-		xfs_perag_put(pag);
-	return error;
+	return xfs_iunlink_remove_inode(tp, agibp, ip);
 }
 
 /*
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 991ef00370d5..c3fb62fbdeb1 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -56,6 +56,8 @@ typedef struct xfs_inode {
 	uint64_t		i_delayed_blks;	/* count of delay alloc blks */
 
 	struct xfs_icdinode	i_d;		/* most of ondisk inode */
+	xfs_agino_t		i_prev_unlinked;
+	xfs_agino_t		i_next_unlinked;
 
 	/* VFS inode */
 	struct inode		i_vnode;	/* embedded VFS inode */
@@ -463,9 +465,6 @@ extern struct kmem_zone	*xfs_inode_zone;
 /* The default CoW extent size hint. */
 #define XFS_DEFAULT_COWEXTSZ_HINT 32
 
-int xfs_iunlink_init(struct xfs_perag *pag);
-void xfs_iunlink_destroy(struct xfs_perag *pag);
-
 void xfs_end_io(struct work_struct *work);
 
 #endif	/* __XFS_INODE_H__ */
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 52a65a74208f..d47eea31c165 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -2682,15 +2682,13 @@ xlog_recover_clear_agi_bucket(
 	return;
 }
 
-STATIC xfs_agino_t
-xlog_recover_process_one_iunlink(
+static struct xfs_inode *
+xlog_recover_get_one_iunlink(
 	struct xfs_mount		*mp,
 	xfs_agnumber_t			agno,
 	xfs_agino_t			agino,
 	int				bucket)
 {
-	struct xfs_buf			*ibp;
-	struct xfs_dinode		*dip;
 	struct xfs_inode		*ip;
 	xfs_ino_t			ino;
 	int				error;
@@ -2698,45 +2696,20 @@ xlog_recover_process_one_iunlink(
 	ino = XFS_AGINO_TO_INO(mp, agno, agino);
 	error = xfs_iget(mp, NULL, ino, 0, 0, &ip);
 	if (error)
-		goto fail;
-
-	/*
-	 * Get the on disk inode to find the next inode in the bucket.
-	 */
-	error = xfs_imap_to_bp(mp, NULL, &ip->i_imap, &dip, &ibp, 0);
-	if (error)
-		goto fail_iput;
+		return NULL;
 
 	xfs_iflags_clear(ip, XFS_IRECOVERY);
 	ASSERT(VFS_I(ip)->i_nlink == 0);
 	ASSERT(VFS_I(ip)->i_mode != 0);
 
-	/* setup for the next pass */
-	agino = be32_to_cpu(dip->di_next_unlinked);
-	xfs_buf_relse(ibp);
-
 	/*
 	 * Prevent any DMAPI event from being sent when the reference on
 	 * the inode is dropped.
 	 */
 	ip->i_d.di_dmevmask = 0;
 
-	xfs_irele(ip);
-	return agino;
+	return ip;
 
- fail_iput:
-	xfs_irele(ip);
- fail:
-	/*
-	 * We can't read in the inode this bucket points to, or this inode
-	 * is messed up.  Just ditch this bucket of inodes.  We will lose
-	 * some inodes and space, but at least we won't hang.
-	 *
-	 * Call xlog_recover_clear_agi_bucket() to perform a transaction to
-	 * clear the inode pointer in the bucket.
-	 */
-	xlog_recover_clear_agi_bucket(mp, agno, bucket);
-	return NULLAGINO;
 }
 
 /*
@@ -2762,56 +2735,107 @@ xlog_recover_process_one_iunlink(
  * scheduled on this CPU to ensure other scheduled work can run without undue
  * latency.
  */
-STATIC void
-xlog_recover_process_iunlinks(
-	struct xlog	*log)
+static int
+xlog_recover_iunlink_ag(
+	struct xfs_mount	*mp,
+	xfs_agnumber_t		agno)
 {
-	xfs_mount_t	*mp;
-	xfs_agnumber_t	agno;
-	xfs_agi_t	*agi;
-	xfs_buf_t	*agibp;
-	xfs_agino_t	agino;
-	int		bucket;
-	int		error;
+	struct xfs_agi		*agi;
+	struct xfs_buf		*agibp;
+	int			bucket;
+	int			error;
 
-	mp = log->l_mp;
+	/*
+	 * Find the agi for this ag.
+	 */
+	error = xfs_read_agi(mp, NULL, agno, &agibp);
+	if (error) {
 
-	for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
 		/*
-		 * Find the agi for this ag.
+		 * AGI is b0rked. Don't process it.
+		 *
+		 * We should probably mark the filesystem as corrupt after we've
+		 * recovered all the ag's we can....
 		 */
-		error = xfs_read_agi(mp, NULL, agno, &agibp);
+		return 0;
+	}
+
+	/*
+	 * Unlock the buffer so that it can be acquired in the normal course of
+	 * the transaction to truncate and free each inode.  Because we are not
+	 * racing with anyone else here for the AGI buffer, we don't even need
+	 * to hold it locked to read the initial unlinked bucket entries out of
+	 * the buffer. We keep buffer reference though, so that it stays pinned
+	 * in memory while we need the buffer.
+	 */
+	agi = agibp->b_addr;
+	xfs_buf_unlock(agibp);
+
+	/*
+	 * The unlinked inode list is maintained on incore inodes as a double
+	 * linked list. We don't have any of that state in memory, so we have to
+	 * create it as we go. This is simple as we are only removing from the
+	 * head of the list and that means we only need to pull the current
+	 * inode in and the next inode.  Inodes are unlinked when their
+	 * reference count goes to zero, so we can overlap the xfs_iget() and
+	 * xfs_irele() calls so we always have the first two inodes on the list
+	 * in memory. Hence we can fake up the necessary in memory state for the
+	 * unlink to "just work".
+	 */
+	for (bucket = 0; bucket < XFS_AGI_UNLINKED_BUCKETS; bucket++) {
+		struct xfs_inode	*ip, *prev_ip = NULL;
+		xfs_agino_t		agino, prev_agino = NULLAGINO;
+
+		agino = be32_to_cpu(agi->agi_unlinked[bucket]);
+		while (agino != NULLAGINO) {
+			ip = xlog_recover_get_one_iunlink(mp, agno, agino,
+							  bucket);
+			if (!ip) {
+				/*
+				 * something busted, but still got to release
+				 * prev_ip, so make it look like it's at the end
+				 * of the list before it gets released.
+				 */
+				error = -EFSCORRUPTED;
+				if (prev_ip)
+					prev_ip->i_next_unlinked = NULLAGINO;
+				break;
+			}
+			if (prev_ip) {
+				ip->i_prev_unlinked = prev_agino;
+				xfs_irele(prev_ip);
+			}
+			prev_agino = agino;
+			prev_ip = ip;
+			agino = ip->i_next_unlinked;
+			cond_resched();
+		}
+		if (prev_ip)
+			xfs_irele(prev_ip);
 		if (error) {
 			/*
-			 * AGI is b0rked. Don't process it.
-			 *
-			 * We should probably mark the filesystem as corrupt
-			 * after we've recovered all the ag's we can....
+			 * We can't read an inode this bucket points to, or an
+			 * inode is messed up.  Just ditch this bucket of
+			 * inodes.  We will lose some inodes and space, but at
+			 * least we won't hang.
 			 */
-			continue;
-		}
-		/*
-		 * Unlock the buffer so that it can be acquired in the normal
-		 * course of the transaction to truncate and free each inode.
-		 * Because we are not racing with anyone else here for the AGI
-		 * buffer, we don't even need to hold it locked to read the
-		 * initial unlinked bucket entries out of the buffer. We keep
-		 * buffer reference though, so that it stays pinned in memory
-		 * while we need the buffer.
-		 */
-		agi = agibp->b_addr;
-		xfs_buf_unlock(agibp);
-
-		for (bucket = 0; bucket < XFS_AGI_UNLINKED_BUCKETS; bucket++) {
-			agino = be32_to_cpu(agi->agi_unlinked[bucket]);
-			while (agino != NULLAGINO) {
-				agino = xlog_recover_process_one_iunlink(mp,
-							agno, agino, bucket);
-				cond_resched();
-			}
+			xlog_recover_clear_agi_bucket(mp, agno, bucket);
+			break;
 		}
-		xfs_buf_rele(agibp);
 	}
+	xfs_buf_rele(agibp);
+	return error;
+}
+
+STATIC void
+xlog_recover_process_iunlinks(
+	struct xlog		*log)
+{
+	struct xfs_mount	*mp = log->l_mp;
+	xfs_agnumber_t		agno;
+
+	for (agno = 0; agno < mp->m_sb.sb_agcount; agno++)
+		xlog_recover_iunlink_ag(mp, agno);
 }
 
 STATIC void
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index c8ae49a1e99c..031e96ff022d 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -146,7 +146,6 @@ xfs_free_perag(
 		spin_unlock(&mp->m_perag_lock);
 		ASSERT(pag);
 		ASSERT(atomic_read(&pag->pag_ref) == 0);
-		xfs_iunlink_destroy(pag);
 		xfs_buf_hash_destroy(pag);
 		call_rcu(&pag->rcu_head, __xfs_free_perag);
 	}
@@ -223,9 +222,6 @@ xfs_initialize_perag(
 		/* first new pag is fully initialized */
 		if (first_initialised == NULLAGNUMBER)
 			first_initialised = index;
-		error = xfs_iunlink_init(pag);
-		if (error)
-			goto out_hash_destroy;
 		spin_lock_init(&pag->pag_state_lock);
 	}
 
@@ -248,7 +244,6 @@ xfs_initialize_perag(
 		if (!pag)
 			break;
 		xfs_buf_hash_destroy(pag);
-		xfs_iunlink_destroy(pag);
 		kmem_free(pag);
 	}
 	return error;
-- 
2.26.2.761.g0e0b3e54be


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 4/4] xfs: introduce inode unlink log item
  2020-06-23  9:50 [PATCH 0/4] [RFC] xfs: in memory inode unlink log items Dave Chinner
                   ` (2 preceding siblings ...)
  2020-06-23  9:50 ` [PATCH 3/4] xfs: track unlinked inodes in core inode Dave Chinner
@ 2020-06-23  9:50 ` Dave Chinner
  2020-06-30 18:19   ` Darrick J. Wong
  2020-07-01 14:32   ` Brian Foster
  3 siblings, 2 replies; 22+ messages in thread
From: Dave Chinner @ 2020-06-23  9:50 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Tracking dirty inodes via cluster buffers creates lock ordering
issues with logging unlinked inode updates direct to the cluster
buffer. The unlinked inode list is unordered, so we can lock cluster
buffers in random orders and that causes deadlocks.

To solve this problem, we really want to dealy locking the cluster
buffers until the pre-commit phase where we can order the buffers
correctly along with all the other inode cluster buffers that are
locked by the transaction. However, to do this we need to be able to
tell the transaction which inodes need to have there unlinked list
updated and what it should be updated to.

We can delay the buffer update to the pre-commit phase based on the
fact taht all unlinked inode list updates are serialised by the AGI
buffer. It will be locked into the transaction before the list
update starts, and will remain locked until the transaction commits.
Hence we can lock and update the cluster buffers safely any time
during the transaction and we are still safe from other racing
unlinked list updates.

The iunlink log item currently only exists in memory. we need a log
item to attach information to the transaction, but it's context
is completely owned by the transaction. Hence it is never formatted
or inserted into the CIL, nor is it seen by the journal, the AIL or
log recovery.

This makes it a very simple log item, and the changes makes results
in adding addition buffer log items to the transaction. Hence once
the iunlink log item has run it's pre-commit operation, it can be
dropped by the transaction and released.

The creation of this in-memory intent does not prevent us from
extending it in future to the journal to replace buffer based
logging of the unlinked list. Changing the format of the items we
write to the on disk journal is beyond the scope of this patchset,
hence we limit it to being in-memory only.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/Makefile           |   1 +
 fs/xfs/xfs_inode.c        |  70 +++----------------
 fs/xfs/xfs_inode_item.c   |   3 +-
 fs/xfs/xfs_iunlink_item.c | 141 ++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_iunlink_item.h |  24 +++++++
 fs/xfs/xfs_super.c        |  10 +++
 6 files changed, 189 insertions(+), 60 deletions(-)
 create mode 100644 fs/xfs/xfs_iunlink_item.c
 create mode 100644 fs/xfs/xfs_iunlink_item.h

diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 04611a1068b4..febdf034ca94 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -105,6 +105,7 @@ xfs-y				+= xfs_log.o \
 				   xfs_icreate_item.o \
 				   xfs_inode_item.o \
 				   xfs_inode_item_recover.o \
+				   xfs_iunlink_item.o \
 				   xfs_refcount_item.o \
 				   xfs_rmap_item.o \
 				   xfs_log_recover.o \
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 1f1c8819330b..ab288424764c 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -35,6 +35,7 @@
 #include "xfs_log.h"
 #include "xfs_bmap_btree.h"
 #include "xfs_reflink.h"
+#include "xfs_iunlink_item.h"
 
 kmem_zone_t *xfs_inode_zone;
 
@@ -1945,56 +1946,6 @@ xfs_iunlink_lookup_next(
 	return next_ip;
 }
 
-/* Set an on-disk inode's next_unlinked pointer. */
-STATIC void
-xfs_iunlink_update_dinode(
-	struct xfs_trans	*tp,
-	xfs_agnumber_t		agno,
-	xfs_agino_t		agino,
-	struct xfs_buf		*ibp,
-	struct xfs_dinode	*dip,
-	struct xfs_imap		*imap,
-	xfs_agino_t		next_agino)
-{
-	struct xfs_mount	*mp = tp->t_mountp;
-	int			offset;
-
-	ASSERT(xfs_verify_agino_or_null(mp, agno, next_agino));
-
-	trace_xfs_iunlink_update_dinode(mp, agno, agino,
-			be32_to_cpu(dip->di_next_unlinked), next_agino);
-
-	dip->di_next_unlinked = cpu_to_be32(next_agino);
-	offset = imap->im_boffset +
-			offsetof(struct xfs_dinode, di_next_unlinked);
-
-	/* need to recalc the inode CRC if appropriate */
-	xfs_dinode_calc_crc(mp, dip);
-	xfs_trans_inode_buf(tp, ibp);
-	xfs_trans_log_buf(tp, ibp, offset, offset + sizeof(xfs_agino_t) - 1);
-}
-
-/* Set an in-core inode's unlinked pointer and return the old value. */
-static int
-xfs_iunlink_update_inode(
-	struct xfs_trans	*tp,
-	struct xfs_inode	*ip)
-{
-	struct xfs_mount	*mp = tp->t_mountp;
-	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
-	struct xfs_dinode	*dip;
-	struct xfs_buf		*ibp;
-	int			error;
-
-	error = xfs_imap_to_bp(mp, tp, &ip->i_imap, &dip, &ibp, 0);
-	if (error)
-		return error;
-
-	xfs_iunlink_update_dinode(tp, agno, XFS_INO_TO_AGINO(mp, ip->i_ino),
-			ibp, dip, &ip->i_imap, ip->i_next_unlinked);
-	return 0;
-}
-
 /*
  * Point the AGI unlinked bucket at an inode and log the results.  The caller
  * is responsible for validating the old value.
@@ -2051,7 +2002,6 @@ xfs_iunlink_insert_inode(
 	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
 	xfs_agino_t		next_agino;
 	short			bucket_index = agino % XFS_AGI_UNLINKED_BUCKETS;
-	int			error;
 
 	/*
 	 * Get the index into the agi hash table for the list this inode will
@@ -2082,9 +2032,7 @@ xfs_iunlink_insert_inode(
 		nip->i_prev_unlinked = agino;
 
 		/* update the on disk inode now */
-		error = xfs_iunlink_update_inode(tp, ip);
-		if (error)
-			return error;
+		xfs_iunlink_log(tp, ip);
 	}
 
 	/* Point the head of the list to point to this inode. */
@@ -2140,9 +2088,7 @@ xfs_iunlink_remove_inode(
 
 		/* update the on disk inode now */
 		pip->i_next_unlinked = next_agino;
-		error = xfs_iunlink_update_inode(tp, pip);
-		if (error)
-			return error;
+		xfs_iunlink_log(tp, pip);
 	}
 
 	/* lookup the next inode and update to point at prev */
@@ -2162,10 +2108,15 @@ xfs_iunlink_remove_inode(
 		nip->i_prev_unlinked = ip->i_prev_unlinked;
 	}
 
-	/* now clear prev/next from this inode and update on disk */
+	/*
+	 * Now clear prev/next from this inode and update on disk if we
+	 * need to clear the on-disk link.
+	 */
 	ip->i_prev_unlinked = NULLAGINO;
 	ip->i_next_unlinked = NULLAGINO;
-	return xfs_iunlink_update_inode(tp, ip);
+	if (next_agino != NULLAGINO)
+		xfs_iunlink_log(tp, ip);
+	return 0;
 }
 
 /*
@@ -2185,6 +2136,7 @@ xfs_iunlink(
 	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
 	int			error;
 
+	ASSERT(ip->i_next_unlinked == NULLAGINO);
 	ASSERT(VFS_I(ip)->i_nlink == 0);
 	ASSERT(VFS_I(ip)->i_mode != 0);
 	trace_xfs_iunlink(ip);
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 0494b907c63d..bc1970c37edc 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -488,8 +488,9 @@ xfs_inode_item_push(
 	ASSERT(iip->ili_item.li_buf);
 
 	if (xfs_ipincount(ip) > 0 || xfs_buf_ispinned(bp) ||
-	    (ip->i_flags & XFS_ISTALE))
+	    (ip->i_flags & XFS_ISTALE)) {
 		return XFS_ITEM_PINNED;
+	}
 
 	/* If the inode is already flush locked, we're already flushing. */
 	if (xfs_iflags_test(ip, XFS_IFLUSHING))
diff --git a/fs/xfs/xfs_iunlink_item.c b/fs/xfs/xfs_iunlink_item.c
new file mode 100644
index 000000000000..83f1dc81133b
--- /dev/null
+++ b/fs/xfs/xfs_iunlink_item.c
@@ -0,0 +1,141 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2020, Red Hat, Inc.
+ * All Rights Reserved.
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_inode.h"
+#include "xfs_trans.h"
+#include "xfs_trans_priv.h"
+#include "xfs_iunlink_item.h"
+#include "xfs_trace.h"
+
+kmem_zone_t	*xfs_iunlink_zone;
+
+static inline struct xfs_iunlink_item *IUL_ITEM(struct xfs_log_item *lip)
+{
+	return container_of(lip, struct xfs_iunlink_item, iu_item);
+}
+
+static void
+xfs_iunlink_item_release(
+	struct xfs_log_item	*lip)
+{
+	kmem_cache_free(xfs_iunlink_zone, IUL_ITEM(lip));
+}
+
+
+static uint64_t
+xfs_iunlink_item_sort(
+	struct xfs_log_item	*lip)
+{
+	return IUL_ITEM(lip)->iu_ino;
+}
+
+/*
+ * On precommit, we grab the inode cluster buffer for the inode number
+ * we were passed, then update the next unlinked field for that inode in
+ * the buffer and log the buffer. This ensures that the inode cluster buffer
+ * was logged in the correct order w.r.t. other inode cluster buffers.
+ *
+ * Note: if the inode cluster buffer is marked stale, this transaction is
+ * actually freeing the inode cluster. In that case, do not relog the buffer
+ * as this removes the stale state from it. That then causes the post-commit
+ * processing that is dependent on the cluster buffer being stale to go wrong
+ * and we'll leave stale inodes in the AIL that cannot be removed, hanging the
+ * log.
+ */
+static int
+xfs_iunlink_item_precommit(
+	struct xfs_trans	*tp,
+	struct xfs_log_item	*lip)
+{
+	struct xfs_mount	*mp = tp->t_mountp;
+	struct xfs_iunlink_item	*iup = IUL_ITEM(lip);
+	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, iup->iu_ino);
+	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, iup->iu_ino);
+	struct xfs_dinode	*dip;
+	struct xfs_buf		*bp;
+	int			offset;
+	int			error;
+
+	error = xfs_imap_to_bp(mp, tp, &iup->iu_imap, &dip, &bp, 0);
+	if (error)
+		goto out_remove;
+
+	trace_xfs_iunlink_update_dinode(mp, agno, agino,
+			be32_to_cpu(dip->di_next_unlinked),
+			iup->iu_next_unlinked);
+
+	/*
+	 * Don't bother updating the unlinked field on stale buffers as
+	 * it will never get to disk anyway.
+	 */
+	if (bp->b_flags & XBF_STALE)
+		goto out_remove;
+
+	dip->di_next_unlinked = cpu_to_be32(iup->iu_next_unlinked);
+	offset = iup->iu_imap.im_boffset +
+			offsetof(struct xfs_dinode, di_next_unlinked);
+
+	/* need to recalc the inode CRC if appropriate */
+	xfs_dinode_calc_crc(mp, dip);
+	xfs_trans_inode_buf(tp, bp);
+	xfs_trans_log_buf(tp, bp, offset, offset + sizeof(xfs_agino_t) - 1);
+
+out_remove:
+	/*
+	 * This log item only exists to perform this action. We now remove
+	 * it from the transaction and free it as it should never reach the
+	 * CIL.
+	 */
+	list_del(&lip->li_trans);
+	xfs_iunlink_item_release(lip);
+	return error;
+}
+
+static const struct xfs_item_ops xfs_iunlink_item_ops = {
+	.flags		= XFS_ITEM_RELEASE_WHEN_COMMITTED,
+	.iop_release	= xfs_iunlink_item_release,
+	.iop_sort	= xfs_iunlink_item_sort,
+	.iop_precommit	= xfs_iunlink_item_precommit,
+};
+
+
+/*
+ * Initialize the inode log item for a newly allocated (in-core) inode.
+ *
+ * Inode extents can only reside within an AG. Hence specify the starting
+ * block for the inode chunk by offset within an AG as well as the
+ * length of the allocated extent.
+ *
+ * This joins the item to the transaction and marks it dirty so
+ * that we don't need a separate call to do this, nor does the
+ * caller need to know anything about the iunlink item.
+ */
+void
+xfs_iunlink_log(
+	struct xfs_trans	*tp,
+	struct xfs_inode	*ip)
+{
+	struct xfs_iunlink_item	*iup;
+
+	iup = kmem_zone_zalloc(xfs_iunlink_zone, 0);
+
+	xfs_log_item_init(tp->t_mountp, &iup->iu_item, XFS_LI_IUNLINK,
+			  &xfs_iunlink_item_ops);
+
+	iup->iu_ino = ip->i_ino;
+	iup->iu_next_unlinked = ip->i_next_unlinked;
+	iup->iu_imap = ip->i_imap;
+
+	xfs_trans_add_item(tp, &iup->iu_item);
+	tp->t_flags |= XFS_TRANS_DIRTY;
+	set_bit(XFS_LI_DIRTY, &iup->iu_item.li_flags);
+}
diff --git a/fs/xfs/xfs_iunlink_item.h b/fs/xfs/xfs_iunlink_item.h
new file mode 100644
index 000000000000..c9e58acf4ccf
--- /dev/null
+++ b/fs/xfs/xfs_iunlink_item.h
@@ -0,0 +1,24 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2020, Red Hat, Inc.
+ * All Rights Reserved.
+ */
+#ifndef XFS_IUNLINK_ITEM_H
+#define XFS_IUNLINK_ITEM_H	1
+
+struct xfs_trans;
+struct xfs_inode;
+
+/* in memory log item structure */
+struct xfs_iunlink_item {
+	struct xfs_log_item	iu_item;
+	struct xfs_imap		iu_imap;
+	xfs_ino_t		iu_ino;
+	xfs_agino_t		iu_next_unlinked;
+};
+
+extern kmem_zone_t *xfs_iunlink_zone;
+
+void xfs_iunlink_log(struct xfs_trans *tp, struct xfs_inode *ip);
+
+#endif	/* XFS_IUNLINK_ITEM_H */
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 5a5d9453cf51..a36dfb0e7e5b 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -35,6 +35,7 @@
 #include "xfs_refcount_item.h"
 #include "xfs_bmap_item.h"
 #include "xfs_reflink.h"
+#include "xfs_iunlink_item.h"
 
 #include <linux/magic.h>
 #include <linux/fs_context.h>
@@ -1955,8 +1956,16 @@ xfs_init_zones(void)
 	if (!xfs_bui_zone)
 		goto out_destroy_bud_zone;
 
+	xfs_iunlink_zone = kmem_cache_create("xfs_iul_item",
+					     sizeof(struct xfs_iunlink_item),
+					     0, 0, NULL);
+	if (!xfs_iunlink_zone)
+		goto out_destroy_bui_zone;
+
 	return 0;
 
+ out_destroy_bui_zone:
+	kmem_cache_destroy(xfs_bui_zone);
  out_destroy_bud_zone:
 	kmem_cache_destroy(xfs_bud_zone);
  out_destroy_cui_zone:
@@ -2003,6 +2012,7 @@ xfs_destroy_zones(void)
 	 * destroy caches.
 	 */
 	rcu_barrier();
+	kmem_cache_destroy(xfs_iunlink_zone);
 	kmem_cache_destroy(xfs_bui_zone);
 	kmem_cache_destroy(xfs_bud_zone);
 	kmem_cache_destroy(xfs_cui_zone);
-- 
2.26.2.761.g0e0b3e54be


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/4] xfs: xfs_iflock is no longer a completion
  2020-06-23  9:50 ` [PATCH 1/4] xfs: xfs_iflock is no longer a completion Dave Chinner
@ 2020-06-24 15:36   ` Brian Foster
  2020-07-01  5:48     ` Dave Chinner
  0 siblings, 1 reply; 22+ messages in thread
From: Brian Foster @ 2020-06-24 15:36 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Jun 23, 2020 at 07:50:12PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> With the recent rework of the inode cluster flushing, we no longer
> ever wait on the the inode flush "lock". It was never a lock in the
> first place, just a completion to allow callers to wait for inode IO
> to complete. We now never wait for flush completion as all inode
> flushing is non-blocking. Hence we can get rid of all the iflock
> infrastructure and instead just set and check a state flag.
> 
> Rename the XFS_IFLOCK flag to XFS_IFLUSHING, convert all the
> xfs_iflock_nowait() test-and-set operations on that flag, and
> replace all the xfs_ifunlock() calls to clear operations.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

I'd call it IFLUSHED vs. IFLUSHING, but I'm not going to harp on that.
Just a few nits, otherwise looks like a nice cleanup.

>  fs/xfs/xfs_icache.c     | 19 ++++++------
>  fs/xfs/xfs_inode.c      | 67 +++++++++++++++--------------------------
>  fs/xfs/xfs_inode.h      | 33 +-------------------
>  fs/xfs/xfs_inode_item.c |  6 ++--
>  fs/xfs/xfs_inode_item.h |  4 +--
>  5 files changed, 39 insertions(+), 90 deletions(-)
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index a973f180c6cd..0d73559f2d58 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
...
> @@ -1045,23 +1044,23 @@ xfs_reclaim_inode(
>  
>  	if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
>  		goto out;
> -	if (!xfs_iflock_nowait(ip))
> +	if (xfs_iflags_test_and_set(ip, XFS_IFLUSHING))
>  		goto out_iunlock;
>  
>  	if (XFS_FORCED_SHUTDOWN(ip->i_mount)) {
>  		xfs_iunpin_wait(ip);
>  		/* xfs_iflush_abort() drops the flush lock */

The flush what? ;P

>  		xfs_iflush_abort(ip);
> +		ASSERT(!xfs_iflags_test(ip, XFS_IFLUSHING));

Seems a bit superfluous right after the abort.

>  		goto reclaim;
>  	}
>  	if (xfs_ipincount(ip))
> -		goto out_ifunlock;
> +		goto out_clear_flush;
>  	if (!xfs_inode_clean(ip))
> -		goto out_ifunlock;
> +		goto out_clear_flush;
>  
> -	xfs_ifunlock(ip);
> +	xfs_iflags_clear(ip, XFS_IFLUSHING);
>  reclaim:
> -	ASSERT(!xfs_isiflocked(ip));
>  
>  	/*
>  	 * Because we use RCU freeing we need to ensure the inode always appears
...
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index 3840117f8a5e..0494b907c63d 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -492,7 +492,7 @@ xfs_inode_item_push(
>  		return XFS_ITEM_PINNED;
>  
>  	/* If the inode is already flush locked, we're already flushing. */

Another "flush lock" instance. Perhaps:

"If the inode is already flushing, we're already flushing"

(i.e. kill the comment) ;)

Brian

> -	if (xfs_isiflocked(ip))
> +	if (xfs_iflags_test(ip, XFS_IFLUSHING))
>  		return XFS_ITEM_FLUSHING;
>  
>  	if (!xfs_buf_trylock(bp))
> @@ -702,7 +702,7 @@ xfs_iflush_finish(
>  		iip->ili_last_fields = 0;
>  		iip->ili_flush_lsn = 0;
>  		spin_unlock(&iip->ili_lock);
> -		xfs_ifunlock(iip->ili_inode);
> +		xfs_iflags_clear(iip->ili_inode, XFS_IFLUSHING);
>  		if (drop_buffer)
>  			xfs_buf_rele(bp);
>  	}
> @@ -789,7 +789,7 @@ xfs_iflush_abort(
>  		list_del_init(&iip->ili_item.li_bio_list);
>  		spin_unlock(&iip->ili_lock);
>  	}
> -	xfs_ifunlock(ip);
> +	xfs_iflags_clear(ip, XFS_IFLUSHING);
>  	if (bp)
>  		xfs_buf_rele(bp);
>  }
> diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h
> index 048b5e7dee90..23a7b4928727 100644
> --- a/fs/xfs/xfs_inode_item.h
> +++ b/fs/xfs/xfs_inode_item.h
> @@ -25,8 +25,8 @@ struct xfs_inode_log_item {
>  	 *
>  	 * We need atomic changes between inode dirtying, inode flushing and
>  	 * inode completion, but these all hold different combinations of
> -	 * ILOCK and iflock and hence we need some other method of serialising
> -	 * updates to the flush state.
> +	 * ILOCK and IFLUSHING and hence we need some other method of
> +	 * serialising updates to the flush state.
>  	 */
>  	spinlock_t		ili_lock;	   /* flush state lock */
>  	unsigned int		ili_last_fields;   /* fields when flushed */
> -- 
> 2.26.2.761.g0e0b3e54be
> 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/4] xfs: add log item precommit operation
  2020-06-23  9:50 ` [PATCH 2/4] xfs: add log item precommit operation Dave Chinner
@ 2020-06-30 18:06   ` Darrick J. Wong
  2020-07-01 14:30   ` Brian Foster
  1 sibling, 0 replies; 22+ messages in thread
From: Darrick J. Wong @ 2020-06-30 18:06 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Jun 23, 2020 at 07:50:13PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> For inodes that are dirty, we have an attached cluster buffer that
> we want to use to track the dirty inode through the AIL.
> Unfortunately, locking the cluster buffer and adding it to the
> transaction when the inode is first logged in a transaction leads to
> buffer lock ordering inversions.
> 
> The specific problem is ordering against the AGI buffer. When
> modifying unlinked lists, the buffer lock order is AGI -> inode
> cluster buffer as the AGI buffer lock serialises all access to the
> unlinked lists. Unfortunately, functionality like xfs_droplink()
> logs the inode before calling xfs_iunlink(), as do various directory
> manipulation functions. The inode can be logged way down in the
> stack as far as the bmapi routines and hence, without a major
> rewrite of lots of APIs there's no way we can avoid the inode being
> logged by something until after the AGI has been logged.
> 
> As we are going to be using ordered buffers for inode AIL tracking,
> there isn't a need to actually lock that buffer against modification
> as all the modifications are captured by logging the inode item
> itself. Hence we don't actually need to join the cluster buffer into
> the transaction until just before it is committed. This means we do
> not perturb any of the existing buffer lock orders in transactions,
> and the inode cluster buffer is always locked last in a transaction
> that doesn't otherwise touch inode cluster buffers.
> 
> We do this by introducing a precommit log item method. A log item
> method is used because it is likely dquots will be moved to this
> same ordered buffer tracking scheme and hence will need a similar
> callout. This commit just introduces the mechanism; the inode item
> implementation is in followup commits.
> 
> The precommit items need to be sorted into consistent order as we
> may be locking multiple items here. Hence if we have two dirty
> inodes in cluster buffers A and B, and some other transaction has
> two separate dirty inodes in the same cluster buffers, locking them
> in different orders opens us up to ABBA deadlocks. Hence we sort the
> items on the transaction based on the presence of a sort log item
> method.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_icache.c |  1 +
>  fs/xfs/xfs_trans.c  | 90 +++++++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_trans.h  |  6 ++-
>  3 files changed, 95 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 0d73559f2d58..1c744dbb313f 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -1077,6 +1077,7 @@ xfs_reclaim_inode(
>  	ip->i_ino = 0;
>  	spin_unlock(&ip->i_flags_lock);
>  
> +	ASSERT(!ip->i_itemp || ip->i_itemp->ili_item.li_buf == NULL);
>  	xfs_iunlock(ip, XFS_ILOCK_EXCL);
>  
>  	XFS_STATS_INC(ip->i_mount, xs_ig_reclaims);
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index 3c94e5ff4316..6f350490f84b 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -799,6 +799,89 @@ xfs_trans_committed_bulk(
>  	spin_unlock(&ailp->ail_lock);
>  }
>  
> +/*
> + * Sort transaction items prior to running precommit operations. This will
> + * attempt to order the items such that they will always be locked in the same
> + * order. Items that have no sort function are moved to the end of the list
> + * and so are locked last (XXX: need to check the logic matches the comment).
> + *
> + * This may need refinement as different types of objects add sort functions.
> + *
> + * Function is more complex than it needs to be because we are comparing 64 bit
> + * values and the function only returns 32 bit values.
> + */
> +static int
> +xfs_trans_precommit_sort(
> +	void			*unused_arg,
> +	struct list_head	*a,
> +	struct list_head	*b)
> +{
> +	struct xfs_log_item	*lia = container_of(a,
> +					struct xfs_log_item, li_trans);
> +	struct xfs_log_item	*lib = container_of(b,
> +					struct xfs_log_item, li_trans);
> +	int64_t			diff;
> +
> +	if (!lia->li_ops->iop_sort && !lib->li_ops->iop_sort)
> +		return 0;
> +	if (!lia->li_ops->iop_sort)
> +		return 1;
> +	if (!lib->li_ops->iop_sort)
> +		return -1;
> +
> +	diff = lia->li_ops->iop_sort(lia) - lib->li_ops->iop_sort(lib);
> +	if (diff < 0)
> +		return -1;
> +	if (diff > 0)
> +		return 1;
> +	return 0;
> +}
> +
> +/*
> + * Run transaction precommit functions.
> + *
> + * If there is an error in any of the callouts, then stop immediately and
> + * trigger a shutdown to abort the transaction. There is no recovery possible
> + * from errors at this point as the transaction is dirty....
> + */
> +static int
> +xfs_trans_run_precommits(
> +	struct xfs_trans	*tp)
> +{
> +	struct xfs_mount	*mp = tp->t_mountp;
> +	struct xfs_log_item	*lip, *n;
> +	int			error = 0;
> +
> +	if (XFS_FORCED_SHUTDOWN(mp))
> +		return -EIO;
> +
> +	/*
> +	 * Sort the item list to avoid ABBA deadlocks with other transactions
> +	 * running precommit operations that lock multiple shared items such as
> +	 * inode cluster buffers.
> +	 */
> +	list_sort(NULL, &tp->t_items, xfs_trans_precommit_sort);
> +
> +	/*
> +	 * Precommit operations can remove the log item from the transaction
> +	 * if the log item exists purely to delay modifications until they
> +	 * can be ordered against other operations. Hence we have to use
> +	 * list_for_each_entry_safe() here.
> +	 */
> +	list_for_each_entry_safe(lip, n, &tp->t_items, li_trans) {
> +		if (!test_bit(XFS_LI_DIRTY, &lip->li_flags))
> +			continue;
> +		if (lip->li_ops->iop_precommit) {
> +			error = lip->li_ops->iop_precommit(tp, lip);
> +			if (error)
> +				break;
> +		}
> +	}
> +	if (error)
> +		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> +	return error;
> +}
> +
>  /*
>   * Commit the given transaction to the log.
>   *
> @@ -823,6 +906,13 @@ __xfs_trans_commit(
>  
>  	trace_xfs_trans_commit(tp, _RET_IP_);
>  
> +	error = xfs_trans_run_precommits(tp);
> +	if (error) {
> +		if (tp->t_flags & XFS_TRANS_PERM_LOG_RES)
> +			xfs_defer_cancel(tp);
> +		goto out_unreserve;
> +	}
> +
>  	/*
>  	 * Finish deferred items on final commit. Only permanent transactions
>  	 * should ever have deferred ops.
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index b752501818d2..26ea19bd0621 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -70,10 +70,12 @@ struct xfs_item_ops {
>  	void (*iop_format)(struct xfs_log_item *, struct xfs_log_vec *);
>  	void (*iop_pin)(struct xfs_log_item *);
>  	void (*iop_unpin)(struct xfs_log_item *, int remove);
> -	uint (*iop_push)(struct xfs_log_item *, struct list_head *);
> +	uint64_t (*iop_sort)(struct xfs_log_item *);

Shouldn't the return value be signed, since this is a comparison
function...?  Or does the ->iop_sort function itself do sorting??

--D

> +	int (*iop_precommit)(struct xfs_trans *, struct xfs_log_item *);
>  	void (*iop_committing)(struct xfs_log_item *, xfs_lsn_t commit_lsn);
> -	void (*iop_release)(struct xfs_log_item *);
>  	xfs_lsn_t (*iop_committed)(struct xfs_log_item *, xfs_lsn_t);
> +	uint (*iop_push)(struct xfs_log_item *, struct list_head *);
> +	void (*iop_release)(struct xfs_log_item *);
>  	int (*iop_recover)(struct xfs_log_item *lip, struct xfs_trans *tp);
>  	bool (*iop_match)(struct xfs_log_item *item, uint64_t id);
>  };
> -- 
> 2.26.2.761.g0e0b3e54be
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/4] xfs: introduce inode unlink log item
  2020-06-23  9:50 ` [PATCH 4/4] xfs: introduce inode unlink log item Dave Chinner
@ 2020-06-30 18:19   ` Darrick J. Wong
  2020-06-30 22:31     ` Gao Xiang
  2020-07-01 14:32   ` Brian Foster
  1 sibling, 1 reply; 22+ messages in thread
From: Darrick J. Wong @ 2020-06-30 18:19 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Jun 23, 2020 at 07:50:15PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Tracking dirty inodes via cluster buffers creates lock ordering
> issues with logging unlinked inode updates direct to the cluster
> buffer. The unlinked inode list is unordered, so we can lock cluster
> buffers in random orders and that causes deadlocks.
> 
> To solve this problem, we really want to dealy locking the cluster

"delay"

> buffers until the pre-commit phase where we can order the buffers
> correctly along with all the other inode cluster buffers that are
> locked by the transaction. However, to do this we need to be able to
> tell the transaction which inodes need to have there unlinked list

"their"

> updated and what it should be updated to.
> 
> We can delay the buffer update to the pre-commit phase based on the
> fact taht all unlinked inode list updates are serialised by the AGI

"that"

> buffer. It will be locked into the transaction before the list
> update starts, and will remain locked until the transaction commits.
> Hence we can lock and update the cluster buffers safely any time
> during the transaction and we are still safe from other racing
> unlinked list updates.
> 
> The iunlink log item currently only exists in memory. we need a log
> item to attach information to the transaction, but it's context

"its"

> is completely owned by the transaction. Hence it is never formatted
> or inserted into the CIL, nor is it seen by the journal, the AIL or
> log recovery.
> 
> This makes it a very simple log item, and the changes makes results
> in adding addition buffer log items to the transaction. Hence once
> the iunlink log item has run it's pre-commit operation, it can be

"its"

> dropped by the transaction and released.
> 
> The creation of this in-memory intent does not prevent us from
> extending it in future to the journal to replace buffer based
> logging of the unlinked list. Changing the format of the items we
> write to the on disk journal is beyond the scope of this patchset,
> hence we limit it to being in-memory only.

<nod> Looks pretty straightforward...

> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/Makefile           |   1 +
>  fs/xfs/xfs_inode.c        |  70 +++----------------
>  fs/xfs/xfs_inode_item.c   |   3 +-
>  fs/xfs/xfs_iunlink_item.c | 141 ++++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_iunlink_item.h |  24 +++++++
>  fs/xfs/xfs_super.c        |  10 +++
>  6 files changed, 189 insertions(+), 60 deletions(-)
>  create mode 100644 fs/xfs/xfs_iunlink_item.c
>  create mode 100644 fs/xfs/xfs_iunlink_item.h
> 
> diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
> index 04611a1068b4..febdf034ca94 100644
> --- a/fs/xfs/Makefile
> +++ b/fs/xfs/Makefile
> @@ -105,6 +105,7 @@ xfs-y				+= xfs_log.o \
>  				   xfs_icreate_item.o \
>  				   xfs_inode_item.o \
>  				   xfs_inode_item_recover.o \
> +				   xfs_iunlink_item.o \
>  				   xfs_refcount_item.o \
>  				   xfs_rmap_item.o \
>  				   xfs_log_recover.o \
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 1f1c8819330b..ab288424764c 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -35,6 +35,7 @@
>  #include "xfs_log.h"
>  #include "xfs_bmap_btree.h"
>  #include "xfs_reflink.h"
> +#include "xfs_iunlink_item.h"
>  
>  kmem_zone_t *xfs_inode_zone;
>  
> @@ -1945,56 +1946,6 @@ xfs_iunlink_lookup_next(
>  	return next_ip;
>  }
>  
> -/* Set an on-disk inode's next_unlinked pointer. */
> -STATIC void
> -xfs_iunlink_update_dinode(
> -	struct xfs_trans	*tp,
> -	xfs_agnumber_t		agno,
> -	xfs_agino_t		agino,
> -	struct xfs_buf		*ibp,
> -	struct xfs_dinode	*dip,
> -	struct xfs_imap		*imap,
> -	xfs_agino_t		next_agino)
> -{
> -	struct xfs_mount	*mp = tp->t_mountp;
> -	int			offset;
> -
> -	ASSERT(xfs_verify_agino_or_null(mp, agno, next_agino));
> -
> -	trace_xfs_iunlink_update_dinode(mp, agno, agino,
> -			be32_to_cpu(dip->di_next_unlinked), next_agino);
> -
> -	dip->di_next_unlinked = cpu_to_be32(next_agino);
> -	offset = imap->im_boffset +
> -			offsetof(struct xfs_dinode, di_next_unlinked);
> -
> -	/* need to recalc the inode CRC if appropriate */
> -	xfs_dinode_calc_crc(mp, dip);
> -	xfs_trans_inode_buf(tp, ibp);
> -	xfs_trans_log_buf(tp, ibp, offset, offset + sizeof(xfs_agino_t) - 1);
> -}
> -
> -/* Set an in-core inode's unlinked pointer and return the old value. */
> -static int
> -xfs_iunlink_update_inode(
> -	struct xfs_trans	*tp,
> -	struct xfs_inode	*ip)
> -{
> -	struct xfs_mount	*mp = tp->t_mountp;
> -	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
> -	struct xfs_dinode	*dip;
> -	struct xfs_buf		*ibp;
> -	int			error;
> -
> -	error = xfs_imap_to_bp(mp, tp, &ip->i_imap, &dip, &ibp, 0);
> -	if (error)
> -		return error;
> -
> -	xfs_iunlink_update_dinode(tp, agno, XFS_INO_TO_AGINO(mp, ip->i_ino),
> -			ibp, dip, &ip->i_imap, ip->i_next_unlinked);
> -	return 0;
> -}
> -
>  /*
>   * Point the AGI unlinked bucket at an inode and log the results.  The caller
>   * is responsible for validating the old value.
> @@ -2051,7 +2002,6 @@ xfs_iunlink_insert_inode(
>  	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
>  	xfs_agino_t		next_agino;
>  	short			bucket_index = agino % XFS_AGI_UNLINKED_BUCKETS;
> -	int			error;
>  
>  	/*
>  	 * Get the index into the agi hash table for the list this inode will
> @@ -2082,9 +2032,7 @@ xfs_iunlink_insert_inode(
>  		nip->i_prev_unlinked = agino;
>  
>  		/* update the on disk inode now */
> -		error = xfs_iunlink_update_inode(tp, ip);
> -		if (error)
> -			return error;
> +		xfs_iunlink_log(tp, ip);
>  	}
>  
>  	/* Point the head of the list to point to this inode. */
> @@ -2140,9 +2088,7 @@ xfs_iunlink_remove_inode(
>  
>  		/* update the on disk inode now */
>  		pip->i_next_unlinked = next_agino;
> -		error = xfs_iunlink_update_inode(tp, pip);
> -		if (error)
> -			return error;
> +		xfs_iunlink_log(tp, pip);
>  	}
>  
>  	/* lookup the next inode and update to point at prev */
> @@ -2162,10 +2108,15 @@ xfs_iunlink_remove_inode(
>  		nip->i_prev_unlinked = ip->i_prev_unlinked;
>  	}
>  
> -	/* now clear prev/next from this inode and update on disk */
> +	/*
> +	 * Now clear prev/next from this inode and update on disk if we
> +	 * need to clear the on-disk link.
> +	 */
>  	ip->i_prev_unlinked = NULLAGINO;
>  	ip->i_next_unlinked = NULLAGINO;
> -	return xfs_iunlink_update_inode(tp, ip);
> +	if (next_agino != NULLAGINO)
> +		xfs_iunlink_log(tp, ip);
> +	return 0;
>  }
>  
>  /*
> @@ -2185,6 +2136,7 @@ xfs_iunlink(
>  	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
>  	int			error;
>  
> +	ASSERT(ip->i_next_unlinked == NULLAGINO);
>  	ASSERT(VFS_I(ip)->i_nlink == 0);
>  	ASSERT(VFS_I(ip)->i_mode != 0);
>  	trace_xfs_iunlink(ip);
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index 0494b907c63d..bc1970c37edc 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -488,8 +488,9 @@ xfs_inode_item_push(
>  	ASSERT(iip->ili_item.li_buf);
>  
>  	if (xfs_ipincount(ip) > 0 || xfs_buf_ispinned(bp) ||
> -	    (ip->i_flags & XFS_ISTALE))
> +	    (ip->i_flags & XFS_ISTALE)) {
>  		return XFS_ITEM_PINNED;
> +	}
>  
>  	/* If the inode is already flush locked, we're already flushing. */
>  	if (xfs_iflags_test(ip, XFS_IFLUSHING))
> diff --git a/fs/xfs/xfs_iunlink_item.c b/fs/xfs/xfs_iunlink_item.c
> new file mode 100644
> index 000000000000..83f1dc81133b
> --- /dev/null
> +++ b/fs/xfs/xfs_iunlink_item.c
> @@ -0,0 +1,141 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (c) 2020, Red Hat, Inc.
> + * All Rights Reserved.
> + */
> +#include "xfs.h"
> +#include "xfs_fs.h"
> +#include "xfs_shared.h"
> +#include "xfs_format.h"
> +#include "xfs_log_format.h"
> +#include "xfs_trans_resv.h"
> +#include "xfs_mount.h"
> +#include "xfs_inode.h"
> +#include "xfs_trans.h"
> +#include "xfs_trans_priv.h"
> +#include "xfs_iunlink_item.h"
> +#include "xfs_trace.h"
> +
> +kmem_zone_t	*xfs_iunlink_zone;
> +
> +static inline struct xfs_iunlink_item *IUL_ITEM(struct xfs_log_item *lip)
> +{
> +	return container_of(lip, struct xfs_iunlink_item, iu_item);
> +}
> +
> +static void
> +xfs_iunlink_item_release(
> +	struct xfs_log_item	*lip)
> +{
> +	kmem_cache_free(xfs_iunlink_zone, IUL_ITEM(lip));
> +}
> +
> +
> +static uint64_t
> +xfs_iunlink_item_sort(
> +	struct xfs_log_item	*lip)
> +{
> +	return IUL_ITEM(lip)->iu_ino;
> +}

Oh, I see, ->iop_sort is supposed to return a sorting key for each log
item so that we can reorder the iunlink items to take locks in the
correct order.

> +
> +/*
> + * On precommit, we grab the inode cluster buffer for the inode number
> + * we were passed, then update the next unlinked field for that inode in
> + * the buffer and log the buffer. This ensures that the inode cluster buffer
> + * was logged in the correct order w.r.t. other inode cluster buffers.
> + *
> + * Note: if the inode cluster buffer is marked stale, this transaction is
> + * actually freeing the inode cluster. In that case, do not relog the buffer
> + * as this removes the stale state from it. That then causes the post-commit
> + * processing that is dependent on the cluster buffer being stale to go wrong
> + * and we'll leave stale inodes in the AIL that cannot be removed, hanging the
> + * log.
> + */
> +static int
> +xfs_iunlink_item_precommit(
> +	struct xfs_trans	*tp,
> +	struct xfs_log_item	*lip)
> +{
> +	struct xfs_mount	*mp = tp->t_mountp;
> +	struct xfs_iunlink_item	*iup = IUL_ITEM(lip);
> +	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, iup->iu_ino);
> +	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, iup->iu_ino);
> +	struct xfs_dinode	*dip;
> +	struct xfs_buf		*bp;
> +	int			offset;
> +	int			error;
> +
> +	error = xfs_imap_to_bp(mp, tp, &iup->iu_imap, &dip, &bp, 0);
> +	if (error)
> +		goto out_remove;
> +
> +	trace_xfs_iunlink_update_dinode(mp, agno, agino,
> +			be32_to_cpu(dip->di_next_unlinked),
> +			iup->iu_next_unlinked);
> +
> +	/*
> +	 * Don't bother updating the unlinked field on stale buffers as
> +	 * it will never get to disk anyway.
> +	 */
> +	if (bp->b_flags & XBF_STALE)
> +		goto out_remove;
> +
> +	dip->di_next_unlinked = cpu_to_be32(iup->iu_next_unlinked);
> +	offset = iup->iu_imap.im_boffset +
> +			offsetof(struct xfs_dinode, di_next_unlinked);
> +
> +	/* need to recalc the inode CRC if appropriate */
> +	xfs_dinode_calc_crc(mp, dip);
> +	xfs_trans_inode_buf(tp, bp);
> +	xfs_trans_log_buf(tp, bp, offset, offset + sizeof(xfs_agino_t) - 1);
> +
> +out_remove:
> +	/*
> +	 * This log item only exists to perform this action. We now remove
> +	 * it from the transaction and free it as it should never reach the
> +	 * CIL.
> +	 */
> +	list_del(&lip->li_trans);
> +	xfs_iunlink_item_release(lip);
> +	return error;
> +}
> +
> +static const struct xfs_item_ops xfs_iunlink_item_ops = {
> +	.flags		= XFS_ITEM_RELEASE_WHEN_COMMITTED,
> +	.iop_release	= xfs_iunlink_item_release,
> +	.iop_sort	= xfs_iunlink_item_sort,
> +	.iop_precommit	= xfs_iunlink_item_precommit,
> +};
> +
> +

Hm, ok, I see how these pieces fit together. This looks like a clever
combination of the incore inode prev_unlinked field and log items, and I
agree that it's less messy than the backref hashtable.

> +/*
> + * Initialize the inode log item for a newly allocated (in-core) inode.
> + *
> + * Inode extents can only reside within an AG. Hence specify the starting
> + * block for the inode chunk by offset within an AG as well as the
> + * length of the allocated extent.
> + *
> + * This joins the item to the transaction and marks it dirty so
> + * that we don't need a separate call to do this, nor does the
> + * caller need to know anything about the iunlink item.
> + */
> +void
> +xfs_iunlink_log(
> +	struct xfs_trans	*tp,
> +	struct xfs_inode	*ip)
> +{
> +	struct xfs_iunlink_item	*iup;
> +
> +	iup = kmem_zone_zalloc(xfs_iunlink_zone, 0);
> +
> +	xfs_log_item_init(tp->t_mountp, &iup->iu_item, XFS_LI_IUNLINK,
> +			  &xfs_iunlink_item_ops);
> +
> +	iup->iu_ino = ip->i_ino;
> +	iup->iu_next_unlinked = ip->i_next_unlinked;
> +	iup->iu_imap = ip->i_imap;
> +
> +	xfs_trans_add_item(tp, &iup->iu_item);
> +	tp->t_flags |= XFS_TRANS_DIRTY;
> +	set_bit(XFS_LI_DIRTY, &iup->iu_item.li_flags);
> +}
> diff --git a/fs/xfs/xfs_iunlink_item.h b/fs/xfs/xfs_iunlink_item.h
> new file mode 100644
> index 000000000000..c9e58acf4ccf
> --- /dev/null
> +++ b/fs/xfs/xfs_iunlink_item.h
> @@ -0,0 +1,24 @@
> +// SPDX-License-Identifier: GPL-2.0

Nitpick: Insane SPDX rules say that this should be a C-style comment
because it's a header file (whereas C++ style comments are fine for C
source files because IDFGI.

--D

> +/*
> + * Copyright (c) 2020, Red Hat, Inc.
> + * All Rights Reserved.
> + */
> +#ifndef XFS_IUNLINK_ITEM_H
> +#define XFS_IUNLINK_ITEM_H	1
> +
> +struct xfs_trans;
> +struct xfs_inode;
> +
> +/* in memory log item structure */
> +struct xfs_iunlink_item {
> +	struct xfs_log_item	iu_item;
> +	struct xfs_imap		iu_imap;
> +	xfs_ino_t		iu_ino;
> +	xfs_agino_t		iu_next_unlinked;
> +};
> +
> +extern kmem_zone_t *xfs_iunlink_zone;
> +
> +void xfs_iunlink_log(struct xfs_trans *tp, struct xfs_inode *ip);
> +
> +#endif	/* XFS_IUNLINK_ITEM_H */
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 5a5d9453cf51..a36dfb0e7e5b 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -35,6 +35,7 @@
>  #include "xfs_refcount_item.h"
>  #include "xfs_bmap_item.h"
>  #include "xfs_reflink.h"
> +#include "xfs_iunlink_item.h"
>  
>  #include <linux/magic.h>
>  #include <linux/fs_context.h>
> @@ -1955,8 +1956,16 @@ xfs_init_zones(void)
>  	if (!xfs_bui_zone)
>  		goto out_destroy_bud_zone;
>  
> +	xfs_iunlink_zone = kmem_cache_create("xfs_iul_item",
> +					     sizeof(struct xfs_iunlink_item),
> +					     0, 0, NULL);
> +	if (!xfs_iunlink_zone)
> +		goto out_destroy_bui_zone;
> +
>  	return 0;
>  
> + out_destroy_bui_zone:
> +	kmem_cache_destroy(xfs_bui_zone);
>   out_destroy_bud_zone:
>  	kmem_cache_destroy(xfs_bud_zone);
>   out_destroy_cui_zone:
> @@ -2003,6 +2012,7 @@ xfs_destroy_zones(void)
>  	 * destroy caches.
>  	 */
>  	rcu_barrier();
> +	kmem_cache_destroy(xfs_iunlink_zone);
>  	kmem_cache_destroy(xfs_bui_zone);
>  	kmem_cache_destroy(xfs_bud_zone);
>  	kmem_cache_destroy(xfs_cui_zone);
> -- 
> 2.26.2.761.g0e0b3e54be
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/4] xfs: introduce inode unlink log item
  2020-06-30 18:19   ` Darrick J. Wong
@ 2020-06-30 22:31     ` Gao Xiang
  2020-07-01  6:26       ` Dave Chinner
  0 siblings, 1 reply; 22+ messages in thread
From: Gao Xiang @ 2020-06-30 22:31 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs

On Tue, Jun 30, 2020 at 11:19:42AM -0700, Darrick J. Wong wrote:
> On Tue, Jun 23, 2020 at 07:50:15PM +1000, Dave Chinner wrote:

...

> > +
> > +static uint64_t
> > +xfs_iunlink_item_sort(
> > +	struct xfs_log_item	*lip)
> > +{
> > +	return IUL_ITEM(lip)->iu_ino;
> > +}
> 
> Oh, I see, ->iop_sort is supposed to return a sorting key for each log
> item so that we can reorder the iunlink items to take locks in the
> correct order.

Yes, so I'm not sure the naming of ->iop_sort... When I first saw the name,
I thought it would be a compare function. (but after I read the code of
xfs_trans_precommit_sort(), I found I'm wrong...)


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/4] xfs: xfs_iflock is no longer a completion
  2020-06-24 15:36   ` Brian Foster
@ 2020-07-01  5:48     ` Dave Chinner
  0 siblings, 0 replies; 22+ messages in thread
From: Dave Chinner @ 2020-07-01  5:48 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Wed, Jun 24, 2020 at 11:36:52AM -0400, Brian Foster wrote:
> On Tue, Jun 23, 2020 at 07:50:12PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > With the recent rework of the inode cluster flushing, we no longer
> > ever wait on the the inode flush "lock". It was never a lock in the
> > first place, just a completion to allow callers to wait for inode IO
> > to complete. We now never wait for flush completion as all inode
> > flushing is non-blocking. Hence we can get rid of all the iflock
> > infrastructure and instead just set and check a state flag.
> > 
> > Rename the XFS_IFLOCK flag to XFS_IFLUSHING, convert all the
> > xfs_iflock_nowait() test-and-set operations on that flag, and
> > replace all the xfs_ifunlock() calls to clear operations.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> 
> I'd call it IFLUSHED vs. IFLUSHING, but I'm not going to harp on that.
> Just a few nits, otherwise looks like a nice cleanup.

I thought about "flushed" but that implies something that has
alerady happened in the past (e.g. XFS_ITRUNCATED) rather than
something that is currently happening. i.e. we are wanting to know
if a flush is in progress right now, not whether a flush has been
done in the past...

> >  fs/xfs/xfs_icache.c     | 19 ++++++------
> >  fs/xfs/xfs_inode.c      | 67 +++++++++++++++--------------------------
> >  fs/xfs/xfs_inode.h      | 33 +-------------------
> >  fs/xfs/xfs_inode_item.c |  6 ++--
> >  fs/xfs/xfs_inode_item.h |  4 +--
> >  5 files changed, 39 insertions(+), 90 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > index a973f180c6cd..0d73559f2d58 100644
> > --- a/fs/xfs/xfs_icache.c
> > +++ b/fs/xfs/xfs_icache.c
> ...
> > @@ -1045,23 +1044,23 @@ xfs_reclaim_inode(
> >  
> >  	if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
> >  		goto out;
> > -	if (!xfs_iflock_nowait(ip))
> > +	if (xfs_iflags_test_and_set(ip, XFS_IFLUSHING))
> >  		goto out_iunlock;
> >  
> >  	if (XFS_FORCED_SHUTDOWN(ip->i_mount)) {
> >  		xfs_iunpin_wait(ip);
> >  		/* xfs_iflush_abort() drops the flush lock */
> 
> The flush what? ;P
> 
> >  		xfs_iflush_abort(ip);
> > +		ASSERT(!xfs_iflags_test(ip, XFS_IFLUSHING));
> 
> Seems a bit superfluous right after the abort.

Yup, I'll that clean up and the comments the search and replace I
did missed.

Thanks!

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/4] xfs: introduce inode unlink log item
  2020-06-30 22:31     ` Gao Xiang
@ 2020-07-01  6:26       ` Dave Chinner
  0 siblings, 0 replies; 22+ messages in thread
From: Dave Chinner @ 2020-07-01  6:26 UTC (permalink / raw)
  To: Gao Xiang; +Cc: Darrick J. Wong, linux-xfs

On Wed, Jul 01, 2020 at 06:31:59AM +0800, Gao Xiang wrote:
> On Tue, Jun 30, 2020 at 11:19:42AM -0700, Darrick J. Wong wrote:
> > On Tue, Jun 23, 2020 at 07:50:15PM +1000, Dave Chinner wrote:
> 
> ...
> 
> > > +
> > > +static uint64_t
> > > +xfs_iunlink_item_sort(
> > > +	struct xfs_log_item	*lip)
> > > +{
> > > +	return IUL_ITEM(lip)->iu_ino;
> > > +}
> > 
> > Oh, I see, ->iop_sort is supposed to return a sorting key for each log
> > item so that we can reorder the iunlink items to take locks in the
> > correct order.
> 
> Yes, so I'm not sure the naming of ->iop_sort... When I first saw the name,
> I thought it would be a compare function. (but after I read the code of
> xfs_trans_precommit_sort(), I found I'm wrong...)

Yeah, it's intended to return a sorting key, not do a sort. Naming
is hard, and this is an RFC so it's expected that the naming will
weird and need improvement. :)

As it is, I suspect that a generic filesystem object wide sorting
mechanism needs to be more complex than just returning a single 64
bit key. I think it will work generically if the key reference frame
is the same for all object - I kinda just chose the object's location
on disk for the reference frame and hence the use of the inode
number as the sort key value here.

However, I'm not 100% sure how that would end up working if we have
the object we are trying to sort doesn't have a defined physical
location.  That doesn't seem to be an issue right now - all of the
objects I want to sort have phsyical locations - so I've largely
ignored it this side of the problem. If we need to, we can change
how the sorting works when sorting by physical location is no longer
sufficient as this is purely an in-memory ordering issue...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 3/4] xfs: track unlinked inodes in core inode
  2020-06-23  9:50 ` [PATCH 3/4] xfs: track unlinked inodes in core inode Dave Chinner
@ 2020-07-01  8:59   ` Gao Xiang
  2020-07-01 22:06     ` Dave Chinner
  2020-07-01 14:31   ` Brian Foster
  2020-07-07 14:39   ` Gao Xiang
  2 siblings, 1 reply; 22+ messages in thread
From: Gao Xiang @ 2020-07-01  8:59 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Jun 23, 2020 at 07:50:14PM +1000, Dave Chinner wrote:

...

> +static struct xfs_inode *
> +xfs_iunlink_ilookup(
>  	struct xfs_perag	*pag,
>  	xfs_agino_t		agino)
>  {
> -	struct xfs_iunlink	*iu;
> +	struct xfs_mount	*mp = pag->pag_mount;
> +	struct xfs_inode	*ip;
>  
> -	iu = rhashtable_lookup_fast(&pag->pagi_unlinked_hash, &agino,
> -			xfs_iunlink_hash_params);

Just notice that when working on this patchset. Since pagi_unlinked_hash
is unused now, let's kill pagi_unlinked_hash in xfs_perag as well.

typedef struct xfs_perag {
...
struct rhashtable       pagi_unlinked_hash;
...
};

(I will try to add a mutex and a tail xfs_inode for this later
 in this structure...)

Also I noticed xfs_iunlink_insert_inode and xfs_iunlink_remove_inode
are used once now, maybe folding into the caller would be better...
(Just my personal thought...)

Thanks,
Gao Xiang


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/4] xfs: add log item precommit operation
  2020-06-23  9:50 ` [PATCH 2/4] xfs: add log item precommit operation Dave Chinner
  2020-06-30 18:06   ` Darrick J. Wong
@ 2020-07-01 14:30   ` Brian Foster
  2020-07-01 22:02     ` Dave Chinner
  1 sibling, 1 reply; 22+ messages in thread
From: Brian Foster @ 2020-07-01 14:30 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Jun 23, 2020 at 07:50:13PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> For inodes that are dirty, we have an attached cluster buffer that
> we want to use to track the dirty inode through the AIL.
> Unfortunately, locking the cluster buffer and adding it to the
> transaction when the inode is first logged in a transaction leads to
> buffer lock ordering inversions.
> 
> The specific problem is ordering against the AGI buffer. When
> modifying unlinked lists, the buffer lock order is AGI -> inode
> cluster buffer as the AGI buffer lock serialises all access to the
> unlinked lists. Unfortunately, functionality like xfs_droplink()
> logs the inode before calling xfs_iunlink(), as do various directory
> manipulation functions. The inode can be logged way down in the
> stack as far as the bmapi routines and hence, without a major
> rewrite of lots of APIs there's no way we can avoid the inode being
> logged by something until after the AGI has been logged.
> 
> As we are going to be using ordered buffers for inode AIL tracking,
> there isn't a need to actually lock that buffer against modification
> as all the modifications are captured by logging the inode item
> itself. Hence we don't actually need to join the cluster buffer into
> the transaction until just before it is committed. This means we do
> not perturb any of the existing buffer lock orders in transactions,
> and the inode cluster buffer is always locked last in a transaction
> that doesn't otherwise touch inode cluster buffers.
> 
> We do this by introducing a precommit log item method. A log item
> method is used because it is likely dquots will be moved to this
> same ordered buffer tracking scheme and hence will need a similar
> callout. This commit just introduces the mechanism; the inode item
> implementation is in followup commits.
> 
> The precommit items need to be sorted into consistent order as we
> may be locking multiple items here. Hence if we have two dirty
> inodes in cluster buffers A and B, and some other transaction has
> two separate dirty inodes in the same cluster buffers, locking them
> in different orders opens us up to ABBA deadlocks. Hence we sort the
> items on the transaction based on the presence of a sort log item
> method.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Seems like a nice abstraction, particularly when you consider the other
use cases you described that should fall into place over time. A couple
minor comments..

>  fs/xfs/xfs_icache.c |  1 +
>  fs/xfs/xfs_trans.c  | 90 +++++++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_trans.h  |  6 ++-
>  3 files changed, 95 insertions(+), 2 deletions(-)
> 
...
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index 3c94e5ff4316..6f350490f84b 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -799,6 +799,89 @@ xfs_trans_committed_bulk(
>  	spin_unlock(&ailp->ail_lock);
>  }
>  
> +/*
> + * Sort transaction items prior to running precommit operations. This will
> + * attempt to order the items such that they will always be locked in the same
> + * order. Items that have no sort function are moved to the end of the list
> + * and so are locked last (XXX: need to check the logic matches the comment).
> + *

Heh, I was going to ask what the expected behavior was with the various
!iop_sort() cases and whether we can really expect those items to be
isolated at the end of the list.

> + * This may need refinement as different types of objects add sort functions.
> + *
> + * Function is more complex than it needs to be because we are comparing 64 bit
> + * values and the function only returns 32 bit values.
> + */
> +static int
> +xfs_trans_precommit_sort(
> +	void			*unused_arg,
> +	struct list_head	*a,
> +	struct list_head	*b)
> +{
> +	struct xfs_log_item	*lia = container_of(a,
> +					struct xfs_log_item, li_trans);
> +	struct xfs_log_item	*lib = container_of(b,
> +					struct xfs_log_item, li_trans);
> +	int64_t			diff;
> +
> +	if (!lia->li_ops->iop_sort && !lib->li_ops->iop_sort)
> +		return 0;
> +	if (!lia->li_ops->iop_sort)
> +		return 1;
> +	if (!lib->li_ops->iop_sort)
> +		return -1;

I'm a little confused on what these values are supposed to mean if one
of the two items is non-sortable. Is the purpose of this simply to move
sortable items to the head and non-sortable toward the tail, as noted
above?

> +
> +	diff = lia->li_ops->iop_sort(lia) - lib->li_ops->iop_sort(lib);
> +	if (diff < 0)
> +		return -1;
> +	if (diff > 0)
> +		return 1;
> +	return 0;
> +}
> +
> +/*
> + * Run transaction precommit functions.
> + *
> + * If there is an error in any of the callouts, then stop immediately and
> + * trigger a shutdown to abort the transaction. There is no recovery possible
> + * from errors at this point as the transaction is dirty....
> + */
> +static int
> +xfs_trans_run_precommits(
> +	struct xfs_trans	*tp)
> +{
> +	struct xfs_mount	*mp = tp->t_mountp;
> +	struct xfs_log_item	*lip, *n;
> +	int			error = 0;
> +
> +	if (XFS_FORCED_SHUTDOWN(mp))
> +		return -EIO;
> +

I'd rather not change behavior here. This effectively overrides the
shutdown check in the caller because we get here regardless of whether
the transaction has any pre-commit callouts or not. It seems like this
is unnecessary, at least for the time being, if the precommit is
primarily focused on sorting.

Brian

> +	/*
> +	 * Sort the item list to avoid ABBA deadlocks with other transactions
> +	 * running precommit operations that lock multiple shared items such as
> +	 * inode cluster buffers.
> +	 */
> +	list_sort(NULL, &tp->t_items, xfs_trans_precommit_sort);
> +
> +	/*
> +	 * Precommit operations can remove the log item from the transaction
> +	 * if the log item exists purely to delay modifications until they
> +	 * can be ordered against other operations. Hence we have to use
> +	 * list_for_each_entry_safe() here.
> +	 */
> +	list_for_each_entry_safe(lip, n, &tp->t_items, li_trans) {
> +		if (!test_bit(XFS_LI_DIRTY, &lip->li_flags))
> +			continue;
> +		if (lip->li_ops->iop_precommit) {
> +			error = lip->li_ops->iop_precommit(tp, lip);
> +			if (error)
> +				break;
> +		}
> +	}
> +	if (error)
> +		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> +	return error;
> +}
> +
>  /*
>   * Commit the given transaction to the log.
>   *
> @@ -823,6 +906,13 @@ __xfs_trans_commit(
>  
>  	trace_xfs_trans_commit(tp, _RET_IP_);
>  
> +	error = xfs_trans_run_precommits(tp);
> +	if (error) {
> +		if (tp->t_flags & XFS_TRANS_PERM_LOG_RES)
> +			xfs_defer_cancel(tp);
> +		goto out_unreserve;
> +	}
> +
>  	/*
>  	 * Finish deferred items on final commit. Only permanent transactions
>  	 * should ever have deferred ops.
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index b752501818d2..26ea19bd0621 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -70,10 +70,12 @@ struct xfs_item_ops {
>  	void (*iop_format)(struct xfs_log_item *, struct xfs_log_vec *);
>  	void (*iop_pin)(struct xfs_log_item *);
>  	void (*iop_unpin)(struct xfs_log_item *, int remove);
> -	uint (*iop_push)(struct xfs_log_item *, struct list_head *);
> +	uint64_t (*iop_sort)(struct xfs_log_item *);
> +	int (*iop_precommit)(struct xfs_trans *, struct xfs_log_item *);
>  	void (*iop_committing)(struct xfs_log_item *, xfs_lsn_t commit_lsn);
> -	void (*iop_release)(struct xfs_log_item *);
>  	xfs_lsn_t (*iop_committed)(struct xfs_log_item *, xfs_lsn_t);
> +	uint (*iop_push)(struct xfs_log_item *, struct list_head *);
> +	void (*iop_release)(struct xfs_log_item *);
>  	int (*iop_recover)(struct xfs_log_item *lip, struct xfs_trans *tp);
>  	bool (*iop_match)(struct xfs_log_item *item, uint64_t id);
>  };
> -- 
> 2.26.2.761.g0e0b3e54be
> 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 3/4] xfs: track unlinked inodes in core inode
  2020-06-23  9:50 ` [PATCH 3/4] xfs: track unlinked inodes in core inode Dave Chinner
  2020-07-01  8:59   ` Gao Xiang
@ 2020-07-01 14:31   ` Brian Foster
  2020-07-01 22:18     ` Dave Chinner
  2020-07-07 14:39   ` Gao Xiang
  2 siblings, 1 reply; 22+ messages in thread
From: Brian Foster @ 2020-07-01 14:31 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Jun 23, 2020 at 07:50:14PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Currently we cache unlinked inode list backrefs through a separate
> cache which has to be maintained via memory allocation and a hash
> table. When the inode is on the unlinked list, we have an existence
> guarantee for the inode in memory.
> 
> That is, if the inode is on the unlinked list, there must be a
> reference to the inode from the core VFS because dropping the last
> reference to the inode causes it to be removed from the unlinked
> list. Hence if we hold the AGI locked, we guarantee that any inode
> on the unlinked list is pinned in memory. That means we can actually
> track the entire unlinked list on the inode itself and use
> unreferenced inode cache lookups to update the list pointers as
> needed.
> 
> However, we don't use this relationship because log recovery has
> no in memory state and so has to work directly from buffers.
> However, because unlink recovery only removes from the head of the
> list, we can easily fake this in memory state as the inode we read
> in from the AGI bucket has a pointer to the next inode. Hence we can
> play reference leapfrog in the recovery loop always reading the
> second inode on the list and updating pointers before dropping the
> reference to the first inode. Hence the in-memory state is always
> valid for recovery, too.
> 
> This means we can tear out the old inode unlinked list cache and
> update mechanisms and replace it with a much simpler "insert" and
> "remove" functions that use in-memory inode state rather than on
> buffer state to track the list. The diffstat speaks for itself.
> 
> Food for thought: This obliviates the need for the on-disk AGI
> unlinked hash - we because we track as a double linked list in
> memory, we don't need to keep hash chains on disk short to minimise
> previous inode lookup overhead on list removal. Hence we probably
> should just convert the unlinked list code to use a single list
> on disk...
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Looks interesting, but are you planning to break this up into smaller
pieces? E.g., perhaps add the new inode pointers and set them in one
patch, then replace the whole backref thing with the in-core pointers,
then update the insert/remove, then log recovery, etc. I'm sure there's
various ways it can or cannot be split, but regardless this patch looks
like it could be a series in and of itself.

Brian

>  fs/xfs/libxfs/xfs_inode_buf.c |   1 +
>  fs/xfs/xfs_icache.c           |   2 +
>  fs/xfs/xfs_inode.c            | 672 ++++++++++------------------------
>  fs/xfs/xfs_inode.h            |   5 +-
>  fs/xfs/xfs_log_recover.c      | 166 +++++----
>  fs/xfs/xfs_mount.c            |   5 -
>  6 files changed, 297 insertions(+), 554 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> index 6b6f67595bf4..74f713e4a6b5 100644
> --- a/fs/xfs/libxfs/xfs_inode_buf.c
> +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> @@ -225,6 +225,7 @@ xfs_inode_from_disk(
>  	to->di_dmevmask	= be32_to_cpu(from->di_dmevmask);
>  	to->di_dmstate	= be16_to_cpu(from->di_dmstate);
>  	to->di_flags	= be16_to_cpu(from->di_flags);
> +	ip->i_next_unlinked = be32_to_cpu(from->di_next_unlinked);
>  
>  	if (xfs_sb_version_has_v3inode(&ip->i_mount->m_sb)) {
>  		inode_set_iversion_queried(inode,
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 1c744dbb313f..dadf417fb9b4 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -71,6 +71,8 @@ xfs_inode_alloc(
>  	INIT_WORK(&ip->i_ioend_work, xfs_end_io);
>  	INIT_LIST_HEAD(&ip->i_ioend_list);
>  	spin_lock_init(&ip->i_ioend_lock);
> +	ip->i_next_unlinked = NULLAGINO;
> +	ip->i_prev_unlinked = NULLAGINO;
>  
>  	return ip;
>  }
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index ba86d27b5226..1f1c8819330b 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -1876,211 +1876,123 @@ xfs_inactive(
>  }
>  
>  /*
> - * In-Core Unlinked List Lookups
> - * =============================
> - *
> - * Every inode is supposed to be reachable from some other piece of metadata
> - * with the exception of the root directory.  Inodes with a connection to a
> - * file descriptor but not linked from anywhere in the on-disk directory tree
> - * are collectively known as unlinked inodes, though the filesystem itself
> - * maintains links to these inodes so that on-disk metadata are consistent.
> - *
> - * XFS implements a per-AG on-disk hash table of unlinked inodes.  The AGI
> - * header contains a number of buckets that point to an inode, and each inode
> - * record has a pointer to the next inode in the hash chain.  This
> - * singly-linked list causes scaling problems in the iunlink remove function
> - * because we must walk that list to find the inode that points to the inode
> - * being removed from the unlinked hash bucket list.
> - *
> - * What if we modelled the unlinked list as a collection of records capturing
> - * "X.next_unlinked = Y" relations?  If we indexed those records on Y, we'd
> - * have a fast way to look up unlinked list predecessors, which avoids the
> - * slow list walk.  That's exactly what we do here (in-core) with a per-AG
> - * rhashtable.
> - *
> - * Because this is a backref cache, we ignore operational failures since the
> - * iunlink code can fall back to the slow bucket walk.  The only errors that
> - * should bubble out are for obviously incorrect situations.
> - *
> - * All users of the backref cache MUST hold the AGI buffer lock to serialize
> - * access or have otherwise provided for concurrency control.
> - */
> -
> -/* Capture a "X.next_unlinked = Y" relationship. */
> -struct xfs_iunlink {
> -	struct rhash_head	iu_rhash_head;
> -	xfs_agino_t		iu_agino;		/* X */
> -	xfs_agino_t		iu_next_unlinked;	/* Y */
> -};
> -
> -/* Unlinked list predecessor lookup hashtable construction */
> -static int
> -xfs_iunlink_obj_cmpfn(
> -	struct rhashtable_compare_arg	*arg,
> -	const void			*obj)
> -{
> -	const xfs_agino_t		*key = arg->key;
> -	const struct xfs_iunlink	*iu = obj;
> -
> -	if (iu->iu_next_unlinked != *key)
> -		return 1;
> -	return 0;
> -}
> -
> -static const struct rhashtable_params xfs_iunlink_hash_params = {
> -	.min_size		= XFS_AGI_UNLINKED_BUCKETS,
> -	.key_len		= sizeof(xfs_agino_t),
> -	.key_offset		= offsetof(struct xfs_iunlink,
> -					   iu_next_unlinked),
> -	.head_offset		= offsetof(struct xfs_iunlink, iu_rhash_head),
> -	.automatic_shrinking	= true,
> -	.obj_cmpfn		= xfs_iunlink_obj_cmpfn,
> -};
> -
> -/*
> - * Return X, where X.next_unlinked == @agino.  Returns NULLAGINO if no such
> - * relation is found.
> + * Find an inode on the unlinked list. This does not take references to the
> + * inode, we have existence guarantees by holding the AGI buffer lock and
> + * that only unlinked, referenced inodes can be on the unlinked inode list.
> + * If we don't find the inode in cache, then let the caller handle the
> + * situation.
>   */
> -static xfs_agino_t
> -xfs_iunlink_lookup_backref(
> +static struct xfs_inode *
> +xfs_iunlink_ilookup(
>  	struct xfs_perag	*pag,
>  	xfs_agino_t		agino)
>  {
> -	struct xfs_iunlink	*iu;
> +	struct xfs_mount	*mp = pag->pag_mount;
> +	struct xfs_inode	*ip;
>  
> -	iu = rhashtable_lookup_fast(&pag->pagi_unlinked_hash, &agino,
> -			xfs_iunlink_hash_params);
> -	return iu ? iu->iu_agino : NULLAGINO;
> +	rcu_read_lock();
> +	ip = radix_tree_lookup(&pag->pag_ici_root, agino);
> +
> +	/* Inode not in memory, nothing to do */
> +	if (!ip) {
> +		rcu_read_unlock();
> +		return NULL;
> +	}
> +	spin_lock(&ip->i_flags_lock);
> +	if (ip->i_ino != XFS_AGINO_TO_INO(mp, pag->pag_agno, agino) ||
> +	    __xfs_iflags_test(ip, XFS_IRECLAIM)) {
> +		/* Uh-oh! */
> +		ip = NULL;
> +	}
> +	spin_unlock(&ip->i_flags_lock);
> +	rcu_read_unlock();
> +	return ip;
>  }
>  
>  /*
> - * Take ownership of an iunlink cache entry and insert it into the hash table.
> - * If successful, the entry will be owned by the cache; if not, it is freed.
> - * Either way, the caller does not own @iu after this call.
> + * Return the inode before @ip on the unlinked list without taking a reference
> + * to it. Caller must hold the AGI buffer locked to guarantee existence of the
> + * inode. Returns the inode or NULL if at the head of the list.
> + * If a lookup fails, return corruption error.
>   */
> -static int
> -xfs_iunlink_insert_backref(
> +static struct xfs_inode *
> +xfs_iunlink_lookup_prev(
>  	struct xfs_perag	*pag,
> -	struct xfs_iunlink	*iu)
> +	struct xfs_inode	*ip)
>  {
> -	int			error;
> -
> -	error = rhashtable_insert_fast(&pag->pagi_unlinked_hash,
> -			&iu->iu_rhash_head, xfs_iunlink_hash_params);
> -	/*
> -	 * Fail loudly if there already was an entry because that's a sign of
> -	 * corruption of in-memory data.  Also fail loudly if we see an error
> -	 * code we didn't anticipate from the rhashtable code.  Currently we
> -	 * only anticipate ENOMEM.
> -	 */
> -	if (error) {
> -		WARN(error != -ENOMEM, "iunlink cache insert error %d", error);
> -		kmem_free(iu);
> -	}
> -	/*
> -	 * Absorb any runtime errors that aren't a result of corruption because
> -	 * this is a cache and we can always fall back to bucket list scanning.
> -	 */
> -	if (error != 0 && error != -EEXIST)
> -		error = 0;
> -	return error;
> +	struct xfs_inode	*prev_ip;
> +
> +	if (ip->i_prev_unlinked == NULLAGINO)
> +		return NULL;
> +	prev_ip = xfs_iunlink_ilookup(pag, ip->i_prev_unlinked);
> +	if (!prev_ip)
> +		return ERR_PTR(-EFSCORRUPTED);
> +	return prev_ip;
>  }
>  
> -/* Remember that @prev_agino.next_unlinked = @this_agino. */
> -static int
> -xfs_iunlink_add_backref(
> +static struct xfs_inode *
> +xfs_iunlink_lookup_next(
>  	struct xfs_perag	*pag,
> -	xfs_agino_t		prev_agino,
> -	xfs_agino_t		this_agino)
> +	struct xfs_inode	*ip)
>  {
> -	struct xfs_iunlink	*iu;
> -
> -	if (XFS_TEST_ERROR(false, pag->pag_mount, XFS_ERRTAG_IUNLINK_FALLBACK))
> -		return 0;
> -
> -	iu = kmem_zalloc(sizeof(*iu), KM_NOFS);
> -	iu->iu_agino = prev_agino;
> -	iu->iu_next_unlinked = this_agino;
> -
> -	return xfs_iunlink_insert_backref(pag, iu);
> +	struct xfs_inode	*next_ip;
> +
> +	if (ip->i_next_unlinked == NULLAGINO)
> +		return NULL;
> +	next_ip = xfs_iunlink_ilookup(pag, ip->i_next_unlinked);
> +	if (!next_ip)
> +		return ERR_PTR(-EFSCORRUPTED);
> +	return next_ip;
>  }
>  
> -/*
> - * Replace X.next_unlinked = @agino with X.next_unlinked = @next_unlinked.
> - * If @next_unlinked is NULLAGINO, we drop the backref and exit.  If there
> - * wasn't any such entry then we don't bother.
> - */
> -static int
> -xfs_iunlink_change_backref(
> -	struct xfs_perag	*pag,
> +/* Set an on-disk inode's next_unlinked pointer. */
> +STATIC void
> +xfs_iunlink_update_dinode(
> +	struct xfs_trans	*tp,
> +	xfs_agnumber_t		agno,
>  	xfs_agino_t		agino,
> -	xfs_agino_t		next_unlinked)
> +	struct xfs_buf		*ibp,
> +	struct xfs_dinode	*dip,
> +	struct xfs_imap		*imap,
> +	xfs_agino_t		next_agino)
>  {
> -	struct xfs_iunlink	*iu;
> -	int			error;
> -
> -	/* Look up the old entry; if there wasn't one then exit. */
> -	iu = rhashtable_lookup_fast(&pag->pagi_unlinked_hash, &agino,
> -			xfs_iunlink_hash_params);
> -	if (!iu)
> -		return 0;
> -
> -	/*
> -	 * Remove the entry.  This shouldn't ever return an error, but if we
> -	 * couldn't remove the old entry we don't want to add it again to the
> -	 * hash table, and if the entry disappeared on us then someone's
> -	 * violated the locking rules and we need to fail loudly.  Either way
> -	 * we cannot remove the inode because internal state is or would have
> -	 * been corrupt.
> -	 */
> -	error = rhashtable_remove_fast(&pag->pagi_unlinked_hash,
> -			&iu->iu_rhash_head, xfs_iunlink_hash_params);
> -	if (error)
> -		return error;
> -
> -	/* If there is no new next entry just free our item and return. */
> -	if (next_unlinked == NULLAGINO) {
> -		kmem_free(iu);
> -		return 0;
> -	}
> +	struct xfs_mount	*mp = tp->t_mountp;
> +	int			offset;
>  
> -	/* Update the entry and re-add it to the hash table. */
> -	iu->iu_next_unlinked = next_unlinked;
> -	return xfs_iunlink_insert_backref(pag, iu);
> -}
> +	ASSERT(xfs_verify_agino_or_null(mp, agno, next_agino));
>  
> -/* Set up the in-core predecessor structures. */
> -int
> -xfs_iunlink_init(
> -	struct xfs_perag	*pag)
> -{
> -	return rhashtable_init(&pag->pagi_unlinked_hash,
> -			&xfs_iunlink_hash_params);
> -}
> +	trace_xfs_iunlink_update_dinode(mp, agno, agino,
> +			be32_to_cpu(dip->di_next_unlinked), next_agino);
>  
> -/* Free the in-core predecessor structures. */
> -static void
> -xfs_iunlink_free_item(
> -	void			*ptr,
> -	void			*arg)
> -{
> -	struct xfs_iunlink	*iu = ptr;
> -	bool			*freed_anything = arg;
> +	dip->di_next_unlinked = cpu_to_be32(next_agino);
> +	offset = imap->im_boffset +
> +			offsetof(struct xfs_dinode, di_next_unlinked);
>  
> -	*freed_anything = true;
> -	kmem_free(iu);
> +	/* need to recalc the inode CRC if appropriate */
> +	xfs_dinode_calc_crc(mp, dip);
> +	xfs_trans_inode_buf(tp, ibp);
> +	xfs_trans_log_buf(tp, ibp, offset, offset + sizeof(xfs_agino_t) - 1);
>  }
>  
> -void
> -xfs_iunlink_destroy(
> -	struct xfs_perag	*pag)
> +/* Set an in-core inode's unlinked pointer and return the old value. */
> +static int
> +xfs_iunlink_update_inode(
> +	struct xfs_trans	*tp,
> +	struct xfs_inode	*ip)
>  {
> -	bool			freed_anything = false;
> +	struct xfs_mount	*mp = tp->t_mountp;
> +	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
> +	struct xfs_dinode	*dip;
> +	struct xfs_buf		*ibp;
> +	int			error;
>  
> -	rhashtable_free_and_destroy(&pag->pagi_unlinked_hash,
> -			xfs_iunlink_free_item, &freed_anything);
> +	error = xfs_imap_to_bp(mp, tp, &ip->i_imap, &dip, &ibp, 0);
> +	if (error)
> +		return error;
>  
> -	ASSERT(freed_anything == false || XFS_FORCED_SHUTDOWN(pag->pag_mount));
> +	xfs_iunlink_update_dinode(tp, agno, XFS_INO_TO_AGINO(mp, ip->i_ino),
> +			ibp, dip, &ip->i_imap, ip->i_next_unlinked);
> +	return 0;
>  }
>  
>  /*
> @@ -2122,120 +2034,25 @@ xfs_iunlink_update_bucket(
>  	return 0;
>  }
>  
> -/* Set an on-disk inode's next_unlinked pointer. */
> -STATIC void
> -xfs_iunlink_update_dinode(
> -	struct xfs_trans	*tp,
> -	xfs_agnumber_t		agno,
> -	xfs_agino_t		agino,
> -	struct xfs_buf		*ibp,
> -	struct xfs_dinode	*dip,
> -	struct xfs_imap		*imap,
> -	xfs_agino_t		next_agino)
> -{
> -	struct xfs_mount	*mp = tp->t_mountp;
> -	int			offset;
> -
> -	ASSERT(xfs_verify_agino_or_null(mp, agno, next_agino));
> -
> -	trace_xfs_iunlink_update_dinode(mp, agno, agino,
> -			be32_to_cpu(dip->di_next_unlinked), next_agino);
> -
> -	dip->di_next_unlinked = cpu_to_be32(next_agino);
> -	offset = imap->im_boffset +
> -			offsetof(struct xfs_dinode, di_next_unlinked);
> -
> -	/* need to recalc the inode CRC if appropriate */
> -	xfs_dinode_calc_crc(mp, dip);
> -	xfs_trans_inode_buf(tp, ibp);
> -	xfs_trans_log_buf(tp, ibp, offset, offset + sizeof(xfs_agino_t) - 1);
> -}
> -
> -/* Set an in-core inode's unlinked pointer and return the old value. */
> -STATIC int
> -xfs_iunlink_update_inode(
> -	struct xfs_trans	*tp,
> -	struct xfs_inode	*ip,
> -	xfs_agnumber_t		agno,
> -	xfs_agino_t		next_agino,
> -	xfs_agino_t		*old_next_agino)
> -{
> -	struct xfs_mount	*mp = tp->t_mountp;
> -	struct xfs_dinode	*dip;
> -	struct xfs_buf		*ibp;
> -	xfs_agino_t		old_value;
> -	int			error;
> -
> -	ASSERT(xfs_verify_agino_or_null(mp, agno, next_agino));
> -
> -	error = xfs_imap_to_bp(mp, tp, &ip->i_imap, &dip, &ibp, 0);
> -	if (error)
> -		return error;
> -
> -	/* Make sure the old pointer isn't garbage. */
> -	old_value = be32_to_cpu(dip->di_next_unlinked);
> -	if (!xfs_verify_agino_or_null(mp, agno, old_value)) {
> -		xfs_inode_verifier_error(ip, -EFSCORRUPTED, __func__, dip,
> -				sizeof(*dip), __this_address);
> -		error = -EFSCORRUPTED;
> -		goto out;
> -	}
> -
> -	/*
> -	 * Since we're updating a linked list, we should never find that the
> -	 * current pointer is the same as the new value, unless we're
> -	 * terminating the list.
> -	 */
> -	*old_next_agino = old_value;
> -	if (old_value == next_agino) {
> -		if (next_agino != NULLAGINO) {
> -			xfs_inode_verifier_error(ip, -EFSCORRUPTED, __func__,
> -					dip, sizeof(*dip), __this_address);
> -			error = -EFSCORRUPTED;
> -		}
> -		goto out;
> -	}
> -
> -	/* Ok, update the new pointer. */
> -	xfs_iunlink_update_dinode(tp, agno, XFS_INO_TO_AGINO(mp, ip->i_ino),
> -			ibp, dip, &ip->i_imap, next_agino);
> -	return 0;
> -out:
> -	xfs_trans_brelse(tp, ibp);
> -	return error;
> -}
> -
>  /*
> - * This is called when the inode's link count has gone to 0 or we are creating
> - * a tmpfile via O_TMPFILE.  The inode @ip must have nlink == 0.
> - *
> - * We place the on-disk inode on a list in the AGI.  It will be pulled from this
> - * list when the inode is freed.
> + * Always insert at the head, so we only have to do a next inode lookup to
> + * update it's prev pointer. The AGI bucket will point at the one we are
> + * inserting.
>   */
> -STATIC int
> -xfs_iunlink(
> +static int
> +xfs_iunlink_insert_inode(
>  	struct xfs_trans	*tp,
> +	struct xfs_buf		*agibp,
>  	struct xfs_inode	*ip)
>  {
>  	struct xfs_mount	*mp = tp->t_mountp;
> -	struct xfs_agi		*agi;
> -	struct xfs_buf		*agibp;
> -	xfs_agino_t		next_agino;
> -	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
> +	struct xfs_agi		*agi = agibp->b_addr;
> +	xfs_agino_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
>  	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
> +	xfs_agino_t		next_agino;
>  	short			bucket_index = agino % XFS_AGI_UNLINKED_BUCKETS;
>  	int			error;
>  
> -	ASSERT(VFS_I(ip)->i_nlink == 0);
> -	ASSERT(VFS_I(ip)->i_mode != 0);
> -	trace_xfs_iunlink(ip);
> -
> -	/* Get the agi buffer first.  It ensures lock ordering on the list. */
> -	error = xfs_read_agi(mp, tp, agno, &agibp);
> -	if (error)
> -		return error;
> -	agi = agibp->b_addr;
> -
>  	/*
>  	 * Get the index into the agi hash table for the list this inode will
>  	 * go on.  Make sure the pointer isn't garbage and that this inode
> @@ -2248,27 +2065,24 @@ xfs_iunlink(
>  		return -EFSCORRUPTED;
>  	}
>  
> -	if (next_agino != NULLAGINO) {
> -		struct xfs_perag	*pag;
> -		xfs_agino_t		old_agino;
> +	ip->i_prev_unlinked = NULLAGINO;
> +	ip->i_next_unlinked = next_agino;
> +	if (ip->i_next_unlinked != NULLAGINO) {
> +		struct xfs_inode	*nip;
>  
> -		/*
> -		 * There is already another inode in the bucket, so point this
> -		 * inode to the current head of the list.
> -		 */
> -		error = xfs_iunlink_update_inode(tp, ip, agno, next_agino,
> -				&old_agino);
> -		if (error)
> -			return error;
> -		ASSERT(old_agino == NULLAGINO);
> +		nip = xfs_iunlink_lookup_next(agibp->b_pag, ip);
> +		if (IS_ERR_OR_NULL(nip))
> +			return -EFSCORRUPTED;
>  
> -		/*
> -		 * agino has been unlinked, add a backref from the next inode
> -		 * back to agino.
> -		 */
> -		pag = xfs_perag_get(mp, agno);
> -		error = xfs_iunlink_add_backref(pag, agino, next_agino);
> -		xfs_perag_put(pag);
> +		if (nip->i_prev_unlinked != NULLAGINO) {
> +			xfs_inode_verifier_error(ip, -EFSCORRUPTED, __func__,
> +						NULL, 0, __this_address);
> +			return -EFSCORRUPTED;
> +		}
> +		nip->i_prev_unlinked = agino;
> +
> +		/* update the on disk inode now */
> +		error = xfs_iunlink_update_inode(tp, ip);
>  		if (error)
>  			return error;
>  	}
> @@ -2277,118 +2091,110 @@ xfs_iunlink(
>  	return xfs_iunlink_update_bucket(tp, agno, agibp, bucket_index, agino);
>  }
>  
> -/* Return the imap, dinode pointer, and buffer for an inode. */
> -STATIC int
> -xfs_iunlink_map_ino(
> +/*
> + * Remove can be from anywhere in the list, so we have to do two adjacent inode
> + * lookups here so we can update list pointers. We may be at the head or the
> + * tail of the list, so we have to handle those cases as well.
> + */
> +static int
> +xfs_iunlink_remove_inode(
>  	struct xfs_trans	*tp,
> -	xfs_agnumber_t		agno,
> -	xfs_agino_t		agino,
> -	struct xfs_imap		*imap,
> -	struct xfs_dinode	**dipp,
> -	struct xfs_buf		**bpp)
> +	struct xfs_buf		*agibp,
> +	struct xfs_inode	*ip)
>  {
>  	struct xfs_mount	*mp = tp->t_mountp;
> +	struct xfs_agi		*agi = agibp->b_addr;
> +	xfs_agino_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
> +	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
> +	xfs_agino_t		next_agino = ip->i_next_unlinked;
> +	short			bucket_index = agino % XFS_AGI_UNLINKED_BUCKETS;
>  	int			error;
>  
> -	imap->im_blkno = 0;
> -	error = xfs_imap(mp, tp, XFS_AGINO_TO_INO(mp, agno, agino), imap, 0);
> -	if (error) {
> -		xfs_warn(mp, "%s: xfs_imap returned error %d.",
> -				__func__, error);
> -		return error;
> +	if (ip->i_prev_unlinked == NULLAGINO) {
> +		/* remove from head of list */
> +		if (be32_to_cpu(agi->agi_unlinked[bucket_index]) != agino) {
> +			xfs_buf_mark_corrupt(agibp);
> +			return -EFSCORRUPTED;
> +		}
> +		if (next_agino == agino ||
> +		    !xfs_verify_agino_or_null(mp, agno, next_agino))
> +			return -EFSCORRUPTED;
> +
> +		error = xfs_iunlink_update_bucket(tp, agno, agibp,
> +					bucket_index, next_agino);
> +		if (error)
> +			return -EFSCORRUPTED;
> +	} else {
> +		/* lookup previous inode and update to point at next */
> +		struct xfs_inode	*pip;
> +
> +		pip = xfs_iunlink_lookup_prev(agibp->b_pag, ip);
> +		if (IS_ERR_OR_NULL(pip))
> +			return -EFSCORRUPTED;
> +
> +		if (pip->i_next_unlinked != agino) {
> +			xfs_inode_verifier_error(ip, -EFSCORRUPTED, __func__,
> +						NULL, 0, __this_address);
> +			return -EFSCORRUPTED;
> +		}
> +
> +		/* update the on disk inode now */
> +		pip->i_next_unlinked = next_agino;
> +		error = xfs_iunlink_update_inode(tp, pip);
> +		if (error)
> +			return error;
>  	}
>  
> -	error = xfs_imap_to_bp(mp, tp, imap, dipp, bpp, 0);
> -	if (error) {
> -		xfs_warn(mp, "%s: xfs_imap_to_bp returned error %d.",
> -				__func__, error);
> -		return error;
> +	/* lookup the next inode and update to point at prev */
> +	if (ip->i_next_unlinked != NULLAGINO) {
> +		struct xfs_inode	*nip;
> +
> +		nip = xfs_iunlink_lookup_next(agibp->b_pag, ip);
> +		if (IS_ERR_OR_NULL(nip))
> +			return -EFSCORRUPTED;
> +
> +		if (nip->i_prev_unlinked != agino) {
> +			xfs_inode_verifier_error(ip, -EFSCORRUPTED, __func__,
> +						NULL, 0, __this_address);
> +			return -EFSCORRUPTED;
> +		}
> +		/* in memory update only */
> +		nip->i_prev_unlinked = ip->i_prev_unlinked;
>  	}
>  
> -	return 0;
> +	/* now clear prev/next from this inode and update on disk */
> +	ip->i_prev_unlinked = NULLAGINO;
> +	ip->i_next_unlinked = NULLAGINO;
> +	return xfs_iunlink_update_inode(tp, ip);
>  }
>  
>  /*
> - * Walk the unlinked chain from @head_agino until we find the inode that
> - * points to @target_agino.  Return the inode number, map, dinode pointer,
> - * and inode cluster buffer of that inode as @agino, @imap, @dipp, and @bpp.
> - *
> - * @tp, @pag, @head_agino, and @target_agino are input parameters.
> - * @agino, @imap, @dipp, and @bpp are all output parameters.
> + * This is called when the inode's link count has gone to 0 or we are creating
> + * a tmpfile via O_TMPFILE.  The inode @ip must have nlink == 0.
>   *
> - * Do not call this function if @target_agino is the head of the list.
> + * We place the on-disk inode on a list in the AGI.  It will be pulled from this
> + * list when the inode is freed.
>   */
>  STATIC int
> -xfs_iunlink_map_prev(
> +xfs_iunlink(
>  	struct xfs_trans	*tp,
> -	xfs_agnumber_t		agno,
> -	xfs_agino_t		head_agino,
> -	xfs_agino_t		target_agino,
> -	xfs_agino_t		*agino,
> -	struct xfs_imap		*imap,
> -	struct xfs_dinode	**dipp,
> -	struct xfs_buf		**bpp,
> -	struct xfs_perag	*pag)
> +	struct xfs_inode	*ip)
>  {
>  	struct xfs_mount	*mp = tp->t_mountp;
> -	xfs_agino_t		next_agino;
> +	struct xfs_buf		*agibp;
> +	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
>  	int			error;
>  
> -	ASSERT(head_agino != target_agino);
> -	*bpp = NULL;
> -
> -	/* See if our backref cache can find it faster. */
> -	*agino = xfs_iunlink_lookup_backref(pag, target_agino);
> -	if (*agino != NULLAGINO) {
> -		error = xfs_iunlink_map_ino(tp, agno, *agino, imap, dipp, bpp);
> -		if (error)
> -			return error;
> -
> -		if (be32_to_cpu((*dipp)->di_next_unlinked) == target_agino)
> -			return 0;
> -
> -		/*
> -		 * If we get here the cache contents were corrupt, so drop the
> -		 * buffer and fall back to walking the bucket list.
> -		 */
> -		xfs_trans_brelse(tp, *bpp);
> -		*bpp = NULL;
> -		WARN_ON_ONCE(1);
> -	}
> -
> -	trace_xfs_iunlink_map_prev_fallback(mp, agno);
> -
> -	/* Otherwise, walk the entire bucket until we find it. */
> -	next_agino = head_agino;
> -	while (next_agino != target_agino) {
> -		xfs_agino_t	unlinked_agino;
> -
> -		if (*bpp)
> -			xfs_trans_brelse(tp, *bpp);
> -
> -		*agino = next_agino;
> -		error = xfs_iunlink_map_ino(tp, agno, next_agino, imap, dipp,
> -				bpp);
> -		if (error)
> -			return error;
> +	ASSERT(VFS_I(ip)->i_nlink == 0);
> +	ASSERT(VFS_I(ip)->i_mode != 0);
> +	trace_xfs_iunlink(ip);
>  
> -		unlinked_agino = be32_to_cpu((*dipp)->di_next_unlinked);
> -		/*
> -		 * Make sure this pointer is valid and isn't an obvious
> -		 * infinite loop.
> -		 */
> -		if (!xfs_verify_agino(mp, agno, unlinked_agino) ||
> -		    next_agino == unlinked_agino) {
> -			XFS_CORRUPTION_ERROR(__func__,
> -					XFS_ERRLEVEL_LOW, mp,
> -					*dipp, sizeof(**dipp));
> -			error = -EFSCORRUPTED;
> -			return error;
> -		}
> -		next_agino = unlinked_agino;
> -	}
> +	/* Get the agi buffer first.  It ensures lock ordering on the list. */
> +	error = xfs_read_agi(mp, tp, agno, &agibp);
> +	if (error)
> +		return error;
>  
> -	return 0;
> +	return xfs_iunlink_insert_inode(tp, agibp, ip);
>  }
>  
>  /*
> @@ -2400,16 +2206,8 @@ xfs_iunlink_remove(
>  	struct xfs_inode	*ip)
>  {
>  	struct xfs_mount	*mp = tp->t_mountp;
> -	struct xfs_agi		*agi;
>  	struct xfs_buf		*agibp;
> -	struct xfs_buf		*last_ibp;
> -	struct xfs_dinode	*last_dip = NULL;
> -	struct xfs_perag	*pag = NULL;
>  	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
> -	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
> -	xfs_agino_t		next_agino;
> -	xfs_agino_t		head_agino;
> -	short			bucket_index = agino % XFS_AGI_UNLINKED_BUCKETS;
>  	int			error;
>  
>  	trace_xfs_iunlink_remove(ip);
> @@ -2418,84 +2216,8 @@ xfs_iunlink_remove(
>  	error = xfs_read_agi(mp, tp, agno, &agibp);
>  	if (error)
>  		return error;
> -	agi = agibp->b_addr;
> -
> -	/*
> -	 * Get the index into the agi hash table for the list this inode will
> -	 * go on.  Make sure the head pointer isn't garbage.
> -	 */
> -	head_agino = be32_to_cpu(agi->agi_unlinked[bucket_index]);
> -	if (!xfs_verify_agino(mp, agno, head_agino)) {
> -		XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp,
> -				agi, sizeof(*agi));
> -		return -EFSCORRUPTED;
> -	}
> -
> -	/*
> -	 * Set our inode's next_unlinked pointer to NULL and then return
> -	 * the old pointer value so that we can update whatever was previous
> -	 * to us in the list to point to whatever was next in the list.
> -	 */
> -	error = xfs_iunlink_update_inode(tp, ip, agno, NULLAGINO, &next_agino);
> -	if (error)
> -		return error;
> -
> -	/*
> -	 * If there was a backref pointing from the next inode back to this
> -	 * one, remove it because we've removed this inode from the list.
> -	 *
> -	 * Later, if this inode was in the middle of the list we'll update
> -	 * this inode's backref to point from the next inode.
> -	 */
> -	if (next_agino != NULLAGINO) {
> -		pag = xfs_perag_get(mp, agno);
> -		error = xfs_iunlink_change_backref(pag, next_agino,
> -				NULLAGINO);
> -		if (error)
> -			goto out;
> -	}
> -
> -	if (head_agino == agino) {
> -		/* Point the head of the list to the next unlinked inode. */
> -		error = xfs_iunlink_update_bucket(tp, agno, agibp, bucket_index,
> -				next_agino);
> -		if (error)
> -			goto out;
> -	} else {
> -		struct xfs_imap	imap;
> -		xfs_agino_t	prev_agino;
> -
> -		if (!pag)
> -			pag = xfs_perag_get(mp, agno);
> -
> -		/* We need to search the list for the inode being freed. */
> -		error = xfs_iunlink_map_prev(tp, agno, head_agino, agino,
> -				&prev_agino, &imap, &last_dip, &last_ibp,
> -				pag);
> -		if (error)
> -			goto out;
>  
> -		/* Point the previous inode on the list to the next inode. */
> -		xfs_iunlink_update_dinode(tp, agno, prev_agino, last_ibp,
> -				last_dip, &imap, next_agino);
> -
> -		/*
> -		 * Now we deal with the backref for this inode.  If this inode
> -		 * pointed at a real inode, change the backref that pointed to
> -		 * us to point to our old next.  If this inode was the end of
> -		 * the list, delete the backref that pointed to us.  Note that
> -		 * change_backref takes care of deleting the backref if
> -		 * next_agino is NULLAGINO.
> -		 */
> -		error = xfs_iunlink_change_backref(pag, agino, next_agino);
> -		if (error)
> -			goto out;
> -	}
> -
> -out:
> -	if (pag)
> -		xfs_perag_put(pag);
> -	return error;
> +	return xfs_iunlink_remove_inode(tp, agibp, ip);
>  }
>  
>  /*
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index 991ef00370d5..c3fb62fbdeb1 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -56,6 +56,8 @@ typedef struct xfs_inode {
>  	uint64_t		i_delayed_blks;	/* count of delay alloc blks */
>  
>  	struct xfs_icdinode	i_d;		/* most of ondisk inode */
> +	xfs_agino_t		i_prev_unlinked;
> +	xfs_agino_t		i_next_unlinked;
>  
>  	/* VFS inode */
>  	struct inode		i_vnode;	/* embedded VFS inode */
> @@ -463,9 +465,6 @@ extern struct kmem_zone	*xfs_inode_zone;
>  /* The default CoW extent size hint. */
>  #define XFS_DEFAULT_COWEXTSZ_HINT 32
>  
> -int xfs_iunlink_init(struct xfs_perag *pag);
> -void xfs_iunlink_destroy(struct xfs_perag *pag);
> -
>  void xfs_end_io(struct work_struct *work);
>  
>  #endif	/* __XFS_INODE_H__ */
> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> index 52a65a74208f..d47eea31c165 100644
> --- a/fs/xfs/xfs_log_recover.c
> +++ b/fs/xfs/xfs_log_recover.c
> @@ -2682,15 +2682,13 @@ xlog_recover_clear_agi_bucket(
>  	return;
>  }
>  
> -STATIC xfs_agino_t
> -xlog_recover_process_one_iunlink(
> +static struct xfs_inode *
> +xlog_recover_get_one_iunlink(
>  	struct xfs_mount		*mp,
>  	xfs_agnumber_t			agno,
>  	xfs_agino_t			agino,
>  	int				bucket)
>  {
> -	struct xfs_buf			*ibp;
> -	struct xfs_dinode		*dip;
>  	struct xfs_inode		*ip;
>  	xfs_ino_t			ino;
>  	int				error;
> @@ -2698,45 +2696,20 @@ xlog_recover_process_one_iunlink(
>  	ino = XFS_AGINO_TO_INO(mp, agno, agino);
>  	error = xfs_iget(mp, NULL, ino, 0, 0, &ip);
>  	if (error)
> -		goto fail;
> -
> -	/*
> -	 * Get the on disk inode to find the next inode in the bucket.
> -	 */
> -	error = xfs_imap_to_bp(mp, NULL, &ip->i_imap, &dip, &ibp, 0);
> -	if (error)
> -		goto fail_iput;
> +		return NULL;
>  
>  	xfs_iflags_clear(ip, XFS_IRECOVERY);
>  	ASSERT(VFS_I(ip)->i_nlink == 0);
>  	ASSERT(VFS_I(ip)->i_mode != 0);
>  
> -	/* setup for the next pass */
> -	agino = be32_to_cpu(dip->di_next_unlinked);
> -	xfs_buf_relse(ibp);
> -
>  	/*
>  	 * Prevent any DMAPI event from being sent when the reference on
>  	 * the inode is dropped.
>  	 */
>  	ip->i_d.di_dmevmask = 0;
>  
> -	xfs_irele(ip);
> -	return agino;
> +	return ip;
>  
> - fail_iput:
> -	xfs_irele(ip);
> - fail:
> -	/*
> -	 * We can't read in the inode this bucket points to, or this inode
> -	 * is messed up.  Just ditch this bucket of inodes.  We will lose
> -	 * some inodes and space, but at least we won't hang.
> -	 *
> -	 * Call xlog_recover_clear_agi_bucket() to perform a transaction to
> -	 * clear the inode pointer in the bucket.
> -	 */
> -	xlog_recover_clear_agi_bucket(mp, agno, bucket);
> -	return NULLAGINO;
>  }
>  
>  /*
> @@ -2762,56 +2735,107 @@ xlog_recover_process_one_iunlink(
>   * scheduled on this CPU to ensure other scheduled work can run without undue
>   * latency.
>   */
> -STATIC void
> -xlog_recover_process_iunlinks(
> -	struct xlog	*log)
> +static int
> +xlog_recover_iunlink_ag(
> +	struct xfs_mount	*mp,
> +	xfs_agnumber_t		agno)
>  {
> -	xfs_mount_t	*mp;
> -	xfs_agnumber_t	agno;
> -	xfs_agi_t	*agi;
> -	xfs_buf_t	*agibp;
> -	xfs_agino_t	agino;
> -	int		bucket;
> -	int		error;
> +	struct xfs_agi		*agi;
> +	struct xfs_buf		*agibp;
> +	int			bucket;
> +	int			error;
>  
> -	mp = log->l_mp;
> +	/*
> +	 * Find the agi for this ag.
> +	 */
> +	error = xfs_read_agi(mp, NULL, agno, &agibp);
> +	if (error) {
>  
> -	for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
>  		/*
> -		 * Find the agi for this ag.
> +		 * AGI is b0rked. Don't process it.
> +		 *
> +		 * We should probably mark the filesystem as corrupt after we've
> +		 * recovered all the ag's we can....
>  		 */
> -		error = xfs_read_agi(mp, NULL, agno, &agibp);
> +		return 0;
> +	}
> +
> +	/*
> +	 * Unlock the buffer so that it can be acquired in the normal course of
> +	 * the transaction to truncate and free each inode.  Because we are not
> +	 * racing with anyone else here for the AGI buffer, we don't even need
> +	 * to hold it locked to read the initial unlinked bucket entries out of
> +	 * the buffer. We keep buffer reference though, so that it stays pinned
> +	 * in memory while we need the buffer.
> +	 */
> +	agi = agibp->b_addr;
> +	xfs_buf_unlock(agibp);
> +
> +	/*
> +	 * The unlinked inode list is maintained on incore inodes as a double
> +	 * linked list. We don't have any of that state in memory, so we have to
> +	 * create it as we go. This is simple as we are only removing from the
> +	 * head of the list and that means we only need to pull the current
> +	 * inode in and the next inode.  Inodes are unlinked when their
> +	 * reference count goes to zero, so we can overlap the xfs_iget() and
> +	 * xfs_irele() calls so we always have the first two inodes on the list
> +	 * in memory. Hence we can fake up the necessary in memory state for the
> +	 * unlink to "just work".
> +	 */
> +	for (bucket = 0; bucket < XFS_AGI_UNLINKED_BUCKETS; bucket++) {
> +		struct xfs_inode	*ip, *prev_ip = NULL;
> +		xfs_agino_t		agino, prev_agino = NULLAGINO;
> +
> +		agino = be32_to_cpu(agi->agi_unlinked[bucket]);
> +		while (agino != NULLAGINO) {
> +			ip = xlog_recover_get_one_iunlink(mp, agno, agino,
> +							  bucket);
> +			if (!ip) {
> +				/*
> +				 * something busted, but still got to release
> +				 * prev_ip, so make it look like it's at the end
> +				 * of the list before it gets released.
> +				 */
> +				error = -EFSCORRUPTED;
> +				if (prev_ip)
> +					prev_ip->i_next_unlinked = NULLAGINO;
> +				break;
> +			}
> +			if (prev_ip) {
> +				ip->i_prev_unlinked = prev_agino;
> +				xfs_irele(prev_ip);
> +			}
> +			prev_agino = agino;
> +			prev_ip = ip;
> +			agino = ip->i_next_unlinked;
> +			cond_resched();
> +		}
> +		if (prev_ip)
> +			xfs_irele(prev_ip);
>  		if (error) {
>  			/*
> -			 * AGI is b0rked. Don't process it.
> -			 *
> -			 * We should probably mark the filesystem as corrupt
> -			 * after we've recovered all the ag's we can....
> +			 * We can't read an inode this bucket points to, or an
> +			 * inode is messed up.  Just ditch this bucket of
> +			 * inodes.  We will lose some inodes and space, but at
> +			 * least we won't hang.
>  			 */
> -			continue;
> -		}
> -		/*
> -		 * Unlock the buffer so that it can be acquired in the normal
> -		 * course of the transaction to truncate and free each inode.
> -		 * Because we are not racing with anyone else here for the AGI
> -		 * buffer, we don't even need to hold it locked to read the
> -		 * initial unlinked bucket entries out of the buffer. We keep
> -		 * buffer reference though, so that it stays pinned in memory
> -		 * while we need the buffer.
> -		 */
> -		agi = agibp->b_addr;
> -		xfs_buf_unlock(agibp);
> -
> -		for (bucket = 0; bucket < XFS_AGI_UNLINKED_BUCKETS; bucket++) {
> -			agino = be32_to_cpu(agi->agi_unlinked[bucket]);
> -			while (agino != NULLAGINO) {
> -				agino = xlog_recover_process_one_iunlink(mp,
> -							agno, agino, bucket);
> -				cond_resched();
> -			}
> +			xlog_recover_clear_agi_bucket(mp, agno, bucket);
> +			break;
>  		}
> -		xfs_buf_rele(agibp);
>  	}
> +	xfs_buf_rele(agibp);
> +	return error;
> +}
> +
> +STATIC void
> +xlog_recover_process_iunlinks(
> +	struct xlog		*log)
> +{
> +	struct xfs_mount	*mp = log->l_mp;
> +	xfs_agnumber_t		agno;
> +
> +	for (agno = 0; agno < mp->m_sb.sb_agcount; agno++)
> +		xlog_recover_iunlink_ag(mp, agno);
>  }
>  
>  STATIC void
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index c8ae49a1e99c..031e96ff022d 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -146,7 +146,6 @@ xfs_free_perag(
>  		spin_unlock(&mp->m_perag_lock);
>  		ASSERT(pag);
>  		ASSERT(atomic_read(&pag->pag_ref) == 0);
> -		xfs_iunlink_destroy(pag);
>  		xfs_buf_hash_destroy(pag);
>  		call_rcu(&pag->rcu_head, __xfs_free_perag);
>  	}
> @@ -223,9 +222,6 @@ xfs_initialize_perag(
>  		/* first new pag is fully initialized */
>  		if (first_initialised == NULLAGNUMBER)
>  			first_initialised = index;
> -		error = xfs_iunlink_init(pag);
> -		if (error)
> -			goto out_hash_destroy;
>  		spin_lock_init(&pag->pag_state_lock);
>  	}
>  
> @@ -248,7 +244,6 @@ xfs_initialize_perag(
>  		if (!pag)
>  			break;
>  		xfs_buf_hash_destroy(pag);
> -		xfs_iunlink_destroy(pag);
>  		kmem_free(pag);
>  	}
>  	return error;
> -- 
> 2.26.2.761.g0e0b3e54be
> 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/4] xfs: introduce inode unlink log item
  2020-06-23  9:50 ` [PATCH 4/4] xfs: introduce inode unlink log item Dave Chinner
  2020-06-30 18:19   ` Darrick J. Wong
@ 2020-07-01 14:32   ` Brian Foster
  2020-07-01 22:24     ` Dave Chinner
  1 sibling, 1 reply; 22+ messages in thread
From: Brian Foster @ 2020-07-01 14:32 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Jun 23, 2020 at 07:50:15PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Tracking dirty inodes via cluster buffers creates lock ordering
> issues with logging unlinked inode updates direct to the cluster
> buffer. The unlinked inode list is unordered, so we can lock cluster
> buffers in random orders and that causes deadlocks.
> 
> To solve this problem, we really want to dealy locking the cluster
> buffers until the pre-commit phase where we can order the buffers
> correctly along with all the other inode cluster buffers that are
> locked by the transaction. However, to do this we need to be able to
> tell the transaction which inodes need to have there unlinked list
> updated and what it should be updated to.
> 
> We can delay the buffer update to the pre-commit phase based on the
> fact taht all unlinked inode list updates are serialised by the AGI
> buffer. It will be locked into the transaction before the list
> update starts, and will remain locked until the transaction commits.
> Hence we can lock and update the cluster buffers safely any time
> during the transaction and we are still safe from other racing
> unlinked list updates.
> 
> The iunlink log item currently only exists in memory. we need a log
> item to attach information to the transaction, but it's context
> is completely owned by the transaction. Hence it is never formatted
> or inserted into the CIL, nor is it seen by the journal, the AIL or
> log recovery.
> 
> This makes it a very simple log item, and the changes makes results
> in adding addition buffer log items to the transaction. Hence once
> the iunlink log item has run it's pre-commit operation, it can be
> dropped by the transaction and released.
> 
> The creation of this in-memory intent does not prevent us from
> extending it in future to the journal to replace buffer based
> logging of the unlinked list. Changing the format of the items we
> write to the on disk journal is beyond the scope of this patchset,
> hence we limit it to being in-memory only.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/Makefile           |   1 +
>  fs/xfs/xfs_inode.c        |  70 +++----------------
>  fs/xfs/xfs_inode_item.c   |   3 +-
>  fs/xfs/xfs_iunlink_item.c | 141 ++++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_iunlink_item.h |  24 +++++++
>  fs/xfs/xfs_super.c        |  10 +++
>  6 files changed, 189 insertions(+), 60 deletions(-)
>  create mode 100644 fs/xfs/xfs_iunlink_item.c
>  create mode 100644 fs/xfs/xfs_iunlink_item.h
> 
...
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index 0494b907c63d..bc1970c37edc 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -488,8 +488,9 @@ xfs_inode_item_push(
>  	ASSERT(iip->ili_item.li_buf);
>  
>  	if (xfs_ipincount(ip) > 0 || xfs_buf_ispinned(bp) ||
> -	    (ip->i_flags & XFS_ISTALE))
> +	    (ip->i_flags & XFS_ISTALE)) {
>  		return XFS_ITEM_PINNED;
> +	}

Spurious change..?

>  
>  	/* If the inode is already flush locked, we're already flushing. */
>  	if (xfs_iflags_test(ip, XFS_IFLUSHING))
> diff --git a/fs/xfs/xfs_iunlink_item.c b/fs/xfs/xfs_iunlink_item.c
> new file mode 100644
> index 000000000000..83f1dc81133b
> --- /dev/null
> +++ b/fs/xfs/xfs_iunlink_item.c
> @@ -0,0 +1,141 @@
...
> +
> +static const struct xfs_item_ops xfs_iunlink_item_ops = {
> +	.flags		= XFS_ITEM_RELEASE_WHEN_COMMITTED,
> +	.iop_release	= xfs_iunlink_item_release,

Presumably we need the release callback for transaction abort, but the
flag looks unnecessary. That triggers a release on commit to the on-disk
log, which IIUC should never happen for this item.

> +	.iop_sort	= xfs_iunlink_item_sort,
> +	.iop_precommit	= xfs_iunlink_item_precommit,
> +};
> +
> +
> +/*
> + * Initialize the inode log item for a newly allocated (in-core) inode.
> + *
> + * Inode extents can only reside within an AG. Hence specify the starting
> + * block for the inode chunk by offset within an AG as well as the
> + * length of the allocated extent.
> + *
> + * This joins the item to the transaction and marks it dirty so
> + * that we don't need a separate call to do this, nor does the
> + * caller need to know anything about the iunlink item.
> + */

Looks like some copy/paste remnants in the comment.

Brian

> +void
> +xfs_iunlink_log(
> +	struct xfs_trans	*tp,
> +	struct xfs_inode	*ip)
> +{
> +	struct xfs_iunlink_item	*iup;
> +
> +	iup = kmem_zone_zalloc(xfs_iunlink_zone, 0);
> +
> +	xfs_log_item_init(tp->t_mountp, &iup->iu_item, XFS_LI_IUNLINK,
> +			  &xfs_iunlink_item_ops);
> +
> +	iup->iu_ino = ip->i_ino;
> +	iup->iu_next_unlinked = ip->i_next_unlinked;
> +	iup->iu_imap = ip->i_imap;
> +
> +	xfs_trans_add_item(tp, &iup->iu_item);
> +	tp->t_flags |= XFS_TRANS_DIRTY;
> +	set_bit(XFS_LI_DIRTY, &iup->iu_item.li_flags);
> +}
> diff --git a/fs/xfs/xfs_iunlink_item.h b/fs/xfs/xfs_iunlink_item.h
> new file mode 100644
> index 000000000000..c9e58acf4ccf
> --- /dev/null
> +++ b/fs/xfs/xfs_iunlink_item.h
> @@ -0,0 +1,24 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (c) 2020, Red Hat, Inc.
> + * All Rights Reserved.
> + */
> +#ifndef XFS_IUNLINK_ITEM_H
> +#define XFS_IUNLINK_ITEM_H	1
> +
> +struct xfs_trans;
> +struct xfs_inode;
> +
> +/* in memory log item structure */
> +struct xfs_iunlink_item {
> +	struct xfs_log_item	iu_item;
> +	struct xfs_imap		iu_imap;
> +	xfs_ino_t		iu_ino;
> +	xfs_agino_t		iu_next_unlinked;
> +};
> +
> +extern kmem_zone_t *xfs_iunlink_zone;
> +
> +void xfs_iunlink_log(struct xfs_trans *tp, struct xfs_inode *ip);
> +
> +#endif	/* XFS_IUNLINK_ITEM_H */
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 5a5d9453cf51..a36dfb0e7e5b 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -35,6 +35,7 @@
>  #include "xfs_refcount_item.h"
>  #include "xfs_bmap_item.h"
>  #include "xfs_reflink.h"
> +#include "xfs_iunlink_item.h"
>  
>  #include <linux/magic.h>
>  #include <linux/fs_context.h>
> @@ -1955,8 +1956,16 @@ xfs_init_zones(void)
>  	if (!xfs_bui_zone)
>  		goto out_destroy_bud_zone;
>  
> +	xfs_iunlink_zone = kmem_cache_create("xfs_iul_item",
> +					     sizeof(struct xfs_iunlink_item),
> +					     0, 0, NULL);
> +	if (!xfs_iunlink_zone)
> +		goto out_destroy_bui_zone;
> +
>  	return 0;
>  
> + out_destroy_bui_zone:
> +	kmem_cache_destroy(xfs_bui_zone);
>   out_destroy_bud_zone:
>  	kmem_cache_destroy(xfs_bud_zone);
>   out_destroy_cui_zone:
> @@ -2003,6 +2012,7 @@ xfs_destroy_zones(void)
>  	 * destroy caches.
>  	 */
>  	rcu_barrier();
> +	kmem_cache_destroy(xfs_iunlink_zone);
>  	kmem_cache_destroy(xfs_bui_zone);
>  	kmem_cache_destroy(xfs_bud_zone);
>  	kmem_cache_destroy(xfs_cui_zone);
> -- 
> 2.26.2.761.g0e0b3e54be
> 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/4] xfs: add log item precommit operation
  2020-07-01 14:30   ` Brian Foster
@ 2020-07-01 22:02     ` Dave Chinner
  0 siblings, 0 replies; 22+ messages in thread
From: Dave Chinner @ 2020-07-01 22:02 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Wed, Jul 01, 2020 at 10:30:57AM -0400, Brian Foster wrote:
> On Tue, Jun 23, 2020 at 07:50:13PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > For inodes that are dirty, we have an attached cluster buffer that
> > we want to use to track the dirty inode through the AIL.
> > Unfortunately, locking the cluster buffer and adding it to the
> > transaction when the inode is first logged in a transaction leads to
> > buffer lock ordering inversions.
> > 
> > The specific problem is ordering against the AGI buffer. When
> > modifying unlinked lists, the buffer lock order is AGI -> inode
> > cluster buffer as the AGI buffer lock serialises all access to the
> > unlinked lists. Unfortunately, functionality like xfs_droplink()
> > logs the inode before calling xfs_iunlink(), as do various directory
> > manipulation functions. The inode can be logged way down in the
> > stack as far as the bmapi routines and hence, without a major
> > rewrite of lots of APIs there's no way we can avoid the inode being
> > logged by something until after the AGI has been logged.
> > 
> > As we are going to be using ordered buffers for inode AIL tracking,
> > there isn't a need to actually lock that buffer against modification
> > as all the modifications are captured by logging the inode item
> > itself. Hence we don't actually need to join the cluster buffer into
> > the transaction until just before it is committed. This means we do
> > not perturb any of the existing buffer lock orders in transactions,
> > and the inode cluster buffer is always locked last in a transaction
> > that doesn't otherwise touch inode cluster buffers.
> > 
> > We do this by introducing a precommit log item method. A log item
> > method is used because it is likely dquots will be moved to this
> > same ordered buffer tracking scheme and hence will need a similar
> > callout. This commit just introduces the mechanism; the inode item
> > implementation is in followup commits.
> > 
> > The precommit items need to be sorted into consistent order as we
> > may be locking multiple items here. Hence if we have two dirty
> > inodes in cluster buffers A and B, and some other transaction has
> > two separate dirty inodes in the same cluster buffers, locking them
> > in different orders opens us up to ABBA deadlocks. Hence we sort the
> > items on the transaction based on the presence of a sort log item
> > method.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> 
> Seems like a nice abstraction, particularly when you consider the other
> use cases you described that should fall into place over time. A couple
> minor comments..
> 
> >  fs/xfs/xfs_icache.c |  1 +
> >  fs/xfs/xfs_trans.c  | 90 +++++++++++++++++++++++++++++++++++++++++++++
> >  fs/xfs/xfs_trans.h  |  6 ++-
> >  3 files changed, 95 insertions(+), 2 deletions(-)
> > 
> ...
> > diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> > index 3c94e5ff4316..6f350490f84b 100644
> > --- a/fs/xfs/xfs_trans.c
> > +++ b/fs/xfs/xfs_trans.c
> > @@ -799,6 +799,89 @@ xfs_trans_committed_bulk(
> >  	spin_unlock(&ailp->ail_lock);
> >  }
> >  
> > +/*
> > + * Sort transaction items prior to running precommit operations. This will
> > + * attempt to order the items such that they will always be locked in the same
> > + * order. Items that have no sort function are moved to the end of the list
> > + * and so are locked last (XXX: need to check the logic matches the comment).
> > + *
> 
> Heh, I was going to ask what the expected behavior was with the various
> !iop_sort() cases and whether we can really expect those items to be
> isolated at the end of the list.
> 
> > + * This may need refinement as different types of objects add sort functions.
> > + *
> > + * Function is more complex than it needs to be because we are comparing 64 bit
> > + * values and the function only returns 32 bit values.
> > + */
> > +static int
> > +xfs_trans_precommit_sort(
> > +	void			*unused_arg,
> > +	struct list_head	*a,
> > +	struct list_head	*b)
> > +{
> > +	struct xfs_log_item	*lia = container_of(a,
> > +					struct xfs_log_item, li_trans);
> > +	struct xfs_log_item	*lib = container_of(b,
> > +					struct xfs_log_item, li_trans);
> > +	int64_t			diff;
> > +
> > +	if (!lia->li_ops->iop_sort && !lib->li_ops->iop_sort)
> > +		return 0;
> > +	if (!lia->li_ops->iop_sort)
> > +		return 1;
> > +	if (!lib->li_ops->iop_sort)
> > +		return -1;
> 
> I'm a little confused on what these values are supposed to mean if one
> of the two items is non-sortable. Is the purpose of this simply to move
> sortable items to the head and non-sortable toward the tail, as noted
> above?

If the log item doesn't have a sort function, it implies the object
is already locked and modified and there's no pre-commit operation
going to be performed on it. In that case, I decided to move them to
the tail of the list so that it would be easier to verify that the
items that need sorting were, indeed, sorted into the correct order.

The choice was arbitrary - the could be moved to the head of the
list or they could be left where they are any everything else is
ordered around them, but I went for the behaviour that it easy to
verify visually with debug output or via a list walk in a debugger...

> > +static int
> > +xfs_trans_run_precommits(
> > +	struct xfs_trans	*tp)
> > +{
> > +	struct xfs_mount	*mp = tp->t_mountp;
> > +	struct xfs_log_item	*lip, *n;
> > +	int			error = 0;
> > +
> > +	if (XFS_FORCED_SHUTDOWN(mp))
> > +		return -EIO;
> > +
> 
> I'd rather not change behavior here. This effectively overrides the
> shutdown check in the caller because we get here regardless of whether
> the transaction has any pre-commit callouts or not. It seems like this
> is unnecessary, at least for the time being, if the precommit is
> primarily focused on sorting.

I put that there because if we are already shut down then there's no
point in even sorting or running pre-commits - they are going to
error out trying to access the objects they need to modify anyway.

It really isn't critical, just seemed superfluous to run code that
we already know will be cancelled and/or error out...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 3/4] xfs: track unlinked inodes in core inode
  2020-07-01  8:59   ` Gao Xiang
@ 2020-07-01 22:06     ` Dave Chinner
  0 siblings, 0 replies; 22+ messages in thread
From: Dave Chinner @ 2020-07-01 22:06 UTC (permalink / raw)
  To: Gao Xiang; +Cc: linux-xfs

On Wed, Jul 01, 2020 at 04:59:04PM +0800, Gao Xiang wrote:
> On Tue, Jun 23, 2020 at 07:50:14PM +1000, Dave Chinner wrote:
> 
> ...
> 
> > +static struct xfs_inode *
> > +xfs_iunlink_ilookup(
> >  	struct xfs_perag	*pag,
> >  	xfs_agino_t		agino)
> >  {
> > -	struct xfs_iunlink	*iu;
> > +	struct xfs_mount	*mp = pag->pag_mount;
> > +	struct xfs_inode	*ip;
> >  
> > -	iu = rhashtable_lookup_fast(&pag->pagi_unlinked_hash, &agino,
> > -			xfs_iunlink_hash_params);
> 
> Just notice that when working on this patchset. Since pagi_unlinked_hash
> is unused now, let's kill pagi_unlinked_hash in xfs_perag as well.
> 
> typedef struct xfs_perag {
> ...
> struct rhashtable       pagi_unlinked_hash;
> ...
> };

Good catch. I'll remove that for the next iteration.

> Also I noticed xfs_iunlink_insert_inode and xfs_iunlink_remove_inode
> are used once now, maybe folding into the caller would be better...
> (Just my personal thought...)

I used them to make it obvious what operation was being performed on
the unlinked list. In the end they might be simple enough to remove,
but when I first wrote this patch is was necessary to document what
operations were being performed by the code rather than it just
being a big tangle of spagetti...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 3/4] xfs: track unlinked inodes in core inode
  2020-07-01 14:31   ` Brian Foster
@ 2020-07-01 22:18     ` Dave Chinner
  2020-07-02 12:24       ` Brian Foster
  0 siblings, 1 reply; 22+ messages in thread
From: Dave Chinner @ 2020-07-01 22:18 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Wed, Jul 01, 2020 at 10:31:21AM -0400, Brian Foster wrote:
> On Tue, Jun 23, 2020 at 07:50:14PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Currently we cache unlinked inode list backrefs through a separate
> > cache which has to be maintained via memory allocation and a hash
> > table. When the inode is on the unlinked list, we have an existence
> > guarantee for the inode in memory.
> > 
> > That is, if the inode is on the unlinked list, there must be a
> > reference to the inode from the core VFS because dropping the last
> > reference to the inode causes it to be removed from the unlinked
> > list. Hence if we hold the AGI locked, we guarantee that any inode
> > on the unlinked list is pinned in memory. That means we can actually
> > track the entire unlinked list on the inode itself and use
> > unreferenced inode cache lookups to update the list pointers as
> > needed.
> > 
> > However, we don't use this relationship because log recovery has
> > no in memory state and so has to work directly from buffers.
> > However, because unlink recovery only removes from the head of the
> > list, we can easily fake this in memory state as the inode we read
> > in from the AGI bucket has a pointer to the next inode. Hence we can
> > play reference leapfrog in the recovery loop always reading the
> > second inode on the list and updating pointers before dropping the
> > reference to the first inode. Hence the in-memory state is always
> > valid for recovery, too.
> > 
> > This means we can tear out the old inode unlinked list cache and
> > update mechanisms and replace it with a much simpler "insert" and
> > "remove" functions that use in-memory inode state rather than on
> > buffer state to track the list. The diffstat speaks for itself.
> > 
> > Food for thought: This obliviates the need for the on-disk AGI
> > unlinked hash - we because we track as a double linked list in
> > memory, we don't need to keep hash chains on disk short to minimise
> > previous inode lookup overhead on list removal. Hence we probably
> > should just convert the unlinked list code to use a single list
> > on disk...
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> 
> Looks interesting, but are you planning to break this up into smaller
> pieces? E.g., perhaps add the new inode pointers and set them in one
> patch, then replace the whole backref thing with the in-core pointers,
> then update the insert/remove, then log recovery, etc.

Likely, yes.

> I'm sure there's
> various ways it can or cannot be split, but regardless this patch looks
> like it could be a series in and of itself.

This RFC series is largely centered around this single patch, so
splitting it out into a separate series makes no sense.

FWIW, This is basically the same sort of thing that the inode
flushing patchset started out as - a single patch that I wrote in
few hours and got working as a whole. It does need to be split up,
but given that the inode flushing rework took several months to turn
a few hours of coding into a mergable patchset, I haven't bothered
to do that for this patch set yet.

I'd kinda like to avoid having this explode into 30 patches as that
previous patchset did - this is a very self-contained change, so
there's really only 4-5 pieces it can be split up into. Trying to
split it more finely than that is going to make it quite hard to
find clean places to split it...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/4] xfs: introduce inode unlink log item
  2020-07-01 14:32   ` Brian Foster
@ 2020-07-01 22:24     ` Dave Chinner
  2020-07-02 12:25       ` Brian Foster
  0 siblings, 1 reply; 22+ messages in thread
From: Dave Chinner @ 2020-07-01 22:24 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Wed, Jul 01, 2020 at 10:32:19AM -0400, Brian Foster wrote:
> On Tue, Jun 23, 2020 at 07:50:15PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Tracking dirty inodes via cluster buffers creates lock ordering
> > issues with logging unlinked inode updates direct to the cluster
> > buffer. The unlinked inode list is unordered, so we can lock cluster
> > buffers in random orders and that causes deadlocks.
> > 
> > To solve this problem, we really want to dealy locking the cluster
> > buffers until the pre-commit phase where we can order the buffers
> > correctly along with all the other inode cluster buffers that are
> > locked by the transaction. However, to do this we need to be able to
> > tell the transaction which inodes need to have there unlinked list
> > updated and what it should be updated to.
> > 
> > We can delay the buffer update to the pre-commit phase based on the
> > fact taht all unlinked inode list updates are serialised by the AGI
> > buffer. It will be locked into the transaction before the list
> > update starts, and will remain locked until the transaction commits.
> > Hence we can lock and update the cluster buffers safely any time
> > during the transaction and we are still safe from other racing
> > unlinked list updates.
> > 
> > The iunlink log item currently only exists in memory. we need a log
> > item to attach information to the transaction, but it's context
> > is completely owned by the transaction. Hence it is never formatted
> > or inserted into the CIL, nor is it seen by the journal, the AIL or
> > log recovery.
> > 
> > This makes it a very simple log item, and the changes makes results
> > in adding addition buffer log items to the transaction. Hence once
> > the iunlink log item has run it's pre-commit operation, it can be
> > dropped by the transaction and released.
> > 
> > The creation of this in-memory intent does not prevent us from
> > extending it in future to the journal to replace buffer based
> > logging of the unlinked list. Changing the format of the items we
> > write to the on disk journal is beyond the scope of this patchset,
> > hence we limit it to being in-memory only.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/Makefile           |   1 +
> >  fs/xfs/xfs_inode.c        |  70 +++----------------
> >  fs/xfs/xfs_inode_item.c   |   3 +-
> >  fs/xfs/xfs_iunlink_item.c | 141 ++++++++++++++++++++++++++++++++++++++
> >  fs/xfs/xfs_iunlink_item.h |  24 +++++++
> >  fs/xfs/xfs_super.c        |  10 +++
> >  6 files changed, 189 insertions(+), 60 deletions(-)
> >  create mode 100644 fs/xfs/xfs_iunlink_item.c
> >  create mode 100644 fs/xfs/xfs_iunlink_item.h
> > 
> ...
> > diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> > index 0494b907c63d..bc1970c37edc 100644
> > --- a/fs/xfs/xfs_inode_item.c
> > +++ b/fs/xfs/xfs_inode_item.c
> > @@ -488,8 +488,9 @@ xfs_inode_item_push(
> >  	ASSERT(iip->ili_item.li_buf);
> >  
> >  	if (xfs_ipincount(ip) > 0 || xfs_buf_ispinned(bp) ||
> > -	    (ip->i_flags & XFS_ISTALE))
> > +	    (ip->i_flags & XFS_ISTALE)) {
> >  		return XFS_ITEM_PINNED;
> > +	}
> 
> Spurious change..?

*nod*

Likely a left over from adding debug traceprints...

> >  	/* If the inode is already flush locked, we're already flushing. */
> >  	if (xfs_iflags_test(ip, XFS_IFLUSHING))
> > diff --git a/fs/xfs/xfs_iunlink_item.c b/fs/xfs/xfs_iunlink_item.c
> > new file mode 100644
> > index 000000000000..83f1dc81133b
> > --- /dev/null
> > +++ b/fs/xfs/xfs_iunlink_item.c
> > @@ -0,0 +1,141 @@
> ...
> > +
> > +static const struct xfs_item_ops xfs_iunlink_item_ops = {
> > +	.flags		= XFS_ITEM_RELEASE_WHEN_COMMITTED,
> > +	.iop_release	= xfs_iunlink_item_release,
> 
> Presumably we need the release callback for transaction abort, but the
> flag looks unnecessary. That triggers a release on commit to the on-disk
> log, which IIUC should never happen for this item.

You are probably right - I didn't look that further than "it should
be freed at commit time" and the flag name implies it is freed at
commit time.

Which, of course, then raises the question: "Which commit are we
talking about here?". But because it's RFC work at this point I
didn't bother chasing that detail down because the code worked and I
had other things to do.....

> > +	.iop_sort	= xfs_iunlink_item_sort,
> > +	.iop_precommit	= xfs_iunlink_item_precommit,
> > +};
> > +
> > +
> > +/*
> > + * Initialize the inode log item for a newly allocated (in-core) inode.
> > + *
> > + * Inode extents can only reside within an AG. Hence specify the starting
> > + * block for the inode chunk by offset within an AG as well as the
> > + * length of the allocated extent.
> > + *
> > + * This joins the item to the transaction and marks it dirty so
> > + * that we don't need a separate call to do this, nor does the
> > + * caller need to know anything about the iunlink item.
> > + */
> 
> Looks like some copy/paste remnants in the comment.

Yup, I did just copy-pasta at lot of stuff around here...

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 3/4] xfs: track unlinked inodes in core inode
  2020-07-01 22:18     ` Dave Chinner
@ 2020-07-02 12:24       ` Brian Foster
  0 siblings, 0 replies; 22+ messages in thread
From: Brian Foster @ 2020-07-02 12:24 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Jul 02, 2020 at 08:18:39AM +1000, Dave Chinner wrote:
> On Wed, Jul 01, 2020 at 10:31:21AM -0400, Brian Foster wrote:
> > On Tue, Jun 23, 2020 at 07:50:14PM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > Currently we cache unlinked inode list backrefs through a separate
> > > cache which has to be maintained via memory allocation and a hash
> > > table. When the inode is on the unlinked list, we have an existence
> > > guarantee for the inode in memory.
> > > 
> > > That is, if the inode is on the unlinked list, there must be a
> > > reference to the inode from the core VFS because dropping the last
> > > reference to the inode causes it to be removed from the unlinked
> > > list. Hence if we hold the AGI locked, we guarantee that any inode
> > > on the unlinked list is pinned in memory. That means we can actually
> > > track the entire unlinked list on the inode itself and use
> > > unreferenced inode cache lookups to update the list pointers as
> > > needed.
> > > 
> > > However, we don't use this relationship because log recovery has
> > > no in memory state and so has to work directly from buffers.
> > > However, because unlink recovery only removes from the head of the
> > > list, we can easily fake this in memory state as the inode we read
> > > in from the AGI bucket has a pointer to the next inode. Hence we can
> > > play reference leapfrog in the recovery loop always reading the
> > > second inode on the list and updating pointers before dropping the
> > > reference to the first inode. Hence the in-memory state is always
> > > valid for recovery, too.
> > > 
> > > This means we can tear out the old inode unlinked list cache and
> > > update mechanisms and replace it with a much simpler "insert" and
> > > "remove" functions that use in-memory inode state rather than on
> > > buffer state to track the list. The diffstat speaks for itself.
> > > 
> > > Food for thought: This obliviates the need for the on-disk AGI
> > > unlinked hash - we because we track as a double linked list in
> > > memory, we don't need to keep hash chains on disk short to minimise
> > > previous inode lookup overhead on list removal. Hence we probably
> > > should just convert the unlinked list code to use a single list
> > > on disk...
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > 
> > Looks interesting, but are you planning to break this up into smaller
> > pieces? E.g., perhaps add the new inode pointers and set them in one
> > patch, then replace the whole backref thing with the in-core pointers,
> > then update the insert/remove, then log recovery, etc.
> 
> Likely, yes.
> 
> > I'm sure there's
> > various ways it can or cannot be split, but regardless this patch looks
> > like it could be a series in and of itself.
> 
> This RFC series is largely centered around this single patch, so
> splitting it out into a separate series makes no sense.
> 

I was just speaking generally that this patch looked quite overloaded. I
don't mean to imply it should be separated from the others.

> FWIW, This is basically the same sort of thing that the inode
> flushing patchset started out as - a single patch that I wrote in
> few hours and got working as a whole. It does need to be split up,
> but given that the inode flushing rework took several months to turn
> a few hours of coding into a mergable patchset, I haven't bothered
> to do that for this patch set yet.
> 

Understood.

> I'd kinda like to avoid having this explode into 30 patches as that
> previous patchset did - this is a very self-contained change, so
> there's really only 4-5 pieces it can be split up into. Trying to
> split it more finely than that is going to make it quite hard to
> find clean places to split it...
> 

I'm not expecting 30 patches. :) The quick flow I noted above, perhaps
with the addition of pushing refactoring changes towards the end, lands
right around 4 or 5 and seems like it would improve reviewability quite
a bit. Of course things change that might lead to more or less once you
get into the details/context of breaking things down...

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/4] xfs: introduce inode unlink log item
  2020-07-01 22:24     ` Dave Chinner
@ 2020-07-02 12:25       ` Brian Foster
  0 siblings, 0 replies; 22+ messages in thread
From: Brian Foster @ 2020-07-02 12:25 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Jul 02, 2020 at 08:24:28AM +1000, Dave Chinner wrote:
> On Wed, Jul 01, 2020 at 10:32:19AM -0400, Brian Foster wrote:
> > On Tue, Jun 23, 2020 at 07:50:15PM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > Tracking dirty inodes via cluster buffers creates lock ordering
> > > issues with logging unlinked inode updates direct to the cluster
> > > buffer. The unlinked inode list is unordered, so we can lock cluster
> > > buffers in random orders and that causes deadlocks.
> > > 
> > > To solve this problem, we really want to dealy locking the cluster
> > > buffers until the pre-commit phase where we can order the buffers
> > > correctly along with all the other inode cluster buffers that are
> > > locked by the transaction. However, to do this we need to be able to
> > > tell the transaction which inodes need to have there unlinked list
> > > updated and what it should be updated to.
> > > 
> > > We can delay the buffer update to the pre-commit phase based on the
> > > fact taht all unlinked inode list updates are serialised by the AGI
> > > buffer. It will be locked into the transaction before the list
> > > update starts, and will remain locked until the transaction commits.
> > > Hence we can lock and update the cluster buffers safely any time
> > > during the transaction and we are still safe from other racing
> > > unlinked list updates.
> > > 
> > > The iunlink log item currently only exists in memory. we need a log
> > > item to attach information to the transaction, but it's context
> > > is completely owned by the transaction. Hence it is never formatted
> > > or inserted into the CIL, nor is it seen by the journal, the AIL or
> > > log recovery.
> > > 
> > > This makes it a very simple log item, and the changes makes results
> > > in adding addition buffer log items to the transaction. Hence once
> > > the iunlink log item has run it's pre-commit operation, it can be
> > > dropped by the transaction and released.
> > > 
> > > The creation of this in-memory intent does not prevent us from
> > > extending it in future to the journal to replace buffer based
> > > logging of the unlinked list. Changing the format of the items we
> > > write to the on disk journal is beyond the scope of this patchset,
> > > hence we limit it to being in-memory only.
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >  fs/xfs/Makefile           |   1 +
> > >  fs/xfs/xfs_inode.c        |  70 +++----------------
> > >  fs/xfs/xfs_inode_item.c   |   3 +-
> > >  fs/xfs/xfs_iunlink_item.c | 141 ++++++++++++++++++++++++++++++++++++++
> > >  fs/xfs/xfs_iunlink_item.h |  24 +++++++
> > >  fs/xfs/xfs_super.c        |  10 +++
> > >  6 files changed, 189 insertions(+), 60 deletions(-)
> > >  create mode 100644 fs/xfs/xfs_iunlink_item.c
> > >  create mode 100644 fs/xfs/xfs_iunlink_item.h
> > > 
...
> > > diff --git a/fs/xfs/xfs_iunlink_item.c b/fs/xfs/xfs_iunlink_item.c
> > > new file mode 100644
> > > index 000000000000..83f1dc81133b
> > > --- /dev/null
> > > +++ b/fs/xfs/xfs_iunlink_item.c
> > > @@ -0,0 +1,141 @@
> > ...
> > > +
> > > +static const struct xfs_item_ops xfs_iunlink_item_ops = {
> > > +	.flags		= XFS_ITEM_RELEASE_WHEN_COMMITTED,
> > > +	.iop_release	= xfs_iunlink_item_release,
> > 
> > Presumably we need the release callback for transaction abort, but the
> > flag looks unnecessary. That triggers a release on commit to the on-disk
> > log, which IIUC should never happen for this item.
> 
> You are probably right - I didn't look that further than "it should
> be freed at commit time" and the flag name implies it is freed at
> commit time.
> 
> Which, of course, then raises the question: "Which commit are we
> talking about here?". But because it's RFC work at this point I
> didn't bother chasing that detail down because the code worked and I
> had other things to do.....
> 

I think it's funcionally harmless since this item would never be in a
situation where that flag has an effect. FWIW, I was never a big fan of
the iop_release() factoring and it hasn't really grown on me since. I
routinely have to double/triple check these callbacks where I didn't
before, especially when we have to consider things like this behavior
altering flag and cases where some of the other callbacks route back
into the release callback, etc.

Brian

> > > +	.iop_sort	= xfs_iunlink_item_sort,
> > > +	.iop_precommit	= xfs_iunlink_item_precommit,
> > > +};
> > > +
> > > +
> > > +/*
> > > + * Initialize the inode log item for a newly allocated (in-core) inode.
> > > + *
> > > + * Inode extents can only reside within an AG. Hence specify the starting
> > > + * block for the inode chunk by offset within an AG as well as the
> > > + * length of the allocated extent.
> > > + *
> > > + * This joins the item to the transaction and marks it dirty so
> > > + * that we don't need a separate call to do this, nor does the
> > > + * caller need to know anything about the iunlink item.
> > > + */
> > 
> > Looks like some copy/paste remnants in the comment.
> 
> Yup, I did just copy-pasta at lot of stuff around here...
> 
> Cheers,
> 
> Dave.
> 
> -- 
> Dave Chinner
> david@fromorbit.com
> 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 3/4] xfs: track unlinked inodes in core inode
  2020-06-23  9:50 ` [PATCH 3/4] xfs: track unlinked inodes in core inode Dave Chinner
  2020-07-01  8:59   ` Gao Xiang
  2020-07-01 14:31   ` Brian Foster
@ 2020-07-07 14:39   ` Gao Xiang
  2 siblings, 0 replies; 22+ messages in thread
From: Gao Xiang @ 2020-07-07 14:39 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

Hi Dave,

During working on this stuff, I found some another noticiable places
to mention. Hope it of some help...

On Tue, Jun 23, 2020 at 07:50:14PM +1000, Dave Chinner wrote:

...

>  /*
> - * This is called when the inode's link count has gone to 0 or we are creating
> - * a tmpfile via O_TMPFILE.  The inode @ip must have nlink == 0.
> - *
> - * We place the on-disk inode on a list in the AGI.  It will be pulled from this
> - * list when the inode is freed.
> + * Always insert at the head, so we only have to do a next inode lookup to
> + * update it's prev pointer. The AGI bucket will point at the one we are
> + * inserting.
>   */
> -STATIC int
> -xfs_iunlink(
> +static int
> +xfs_iunlink_insert_inode(
>  	struct xfs_trans	*tp,
> +	struct xfs_buf		*agibp,
>  	struct xfs_inode	*ip)
>  {
>  	struct xfs_mount	*mp = tp->t_mountp;
> -	struct xfs_agi		*agi;
> -	struct xfs_buf		*agibp;
> -	xfs_agino_t		next_agino;
> -	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
> +	struct xfs_agi		*agi = agibp->b_addr;
> +	xfs_agino_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);

unnecessary modification here... (use xfs_agnumber_t instead)

> -/* Return the imap, dinode pointer, and buffer for an inode. */
> -STATIC int
> -xfs_iunlink_map_ino(
> +/*
> + * Remove can be from anywhere in the list, so we have to do two adjacent inode
> + * lookups here so we can update list pointers. We may be at the head or the
> + * tail of the list, so we have to handle those cases as well.
> + */
> +static int
> +xfs_iunlink_remove_inode(
>  	struct xfs_trans	*tp,
> -	xfs_agnumber_t		agno,
> -	xfs_agino_t		agino,
> -	struct xfs_imap		*imap,
> -	struct xfs_dinode	**dipp,
> -	struct xfs_buf		**bpp)
> +	struct xfs_buf		*agibp,
> +	struct xfs_inode	*ip)
>  {
>  	struct xfs_mount	*mp = tp->t_mountp;
> +	struct xfs_agi		*agi = agibp->b_addr;
> +	xfs_agino_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);

same here

>  /*
> @@ -2762,56 +2735,107 @@ xlog_recover_process_one_iunlink(
>   * scheduled on this CPU to ensure other scheduled work can run without undue
>   * latency.
>   */
> -STATIC void
> -xlog_recover_process_iunlinks(
> -	struct xlog	*log)
> +static int
> +xlog_recover_iunlink_ag(
> +	struct xfs_mount	*mp,
> +	xfs_agnumber_t		agno)
>  {
> -	xfs_mount_t	*mp;
> -	xfs_agnumber_t	agno;
> -	xfs_agi_t	*agi;
> -	xfs_buf_t	*agibp;
> -	xfs_agino_t	agino;
> -	int		bucket;
> -	int		error;
> +	struct xfs_agi		*agi;
> +	struct xfs_buf		*agibp;
> +	int			bucket;
> +	int			error;
>  
> -	mp = log->l_mp;
> +	/*
> +	 * Find the agi for this ag.
> +	 */
> +	error = xfs_read_agi(mp, NULL, agno, &agibp);
> +	if (error) {
>  
> -	for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
>  		/*
> -		 * Find the agi for this ag.
> +		 * AGI is b0rked. Don't process it.
> +		 *
> +		 * We should probably mark the filesystem as corrupt after we've
> +		 * recovered all the ag's we can....
>  		 */
> -		error = xfs_read_agi(mp, NULL, agno, &agibp);
> +		return 0;
> +	}
> +
> +	/*
> +	 * Unlock the buffer so that it can be acquired in the normal course of
> +	 * the transaction to truncate and free each inode.  Because we are not
> +	 * racing with anyone else here for the AGI buffer, we don't even need
> +	 * to hold it locked to read the initial unlinked bucket entries out of
> +	 * the buffer. We keep buffer reference though, so that it stays pinned
> +	 * in memory while we need the buffer.
> +	 */
> +	agi = agibp->b_addr;
> +	xfs_buf_unlock(agibp);
> +
> +	/*
> +	 * The unlinked inode list is maintained on incore inodes as a double
> +	 * linked list. We don't have any of that state in memory, so we have to
> +	 * create it as we go. This is simple as we are only removing from the
> +	 * head of the list and that means we only need to pull the current
> +	 * inode in and the next inode.  Inodes are unlinked when their
> +	 * reference count goes to zero, so we can overlap the xfs_iget() and
> +	 * xfs_irele() calls so we always have the first two inodes on the list
> +	 * in memory. Hence we can fake up the necessary in memory state for the
> +	 * unlink to "just work".
> +	 */
> +	for (bucket = 0; bucket < XFS_AGI_UNLINKED_BUCKETS; bucket++) {
> +		struct xfs_inode	*ip, *prev_ip = NULL;
> +		xfs_agino_t		agino, prev_agino = NULLAGINO;
> +
> +		agino = be32_to_cpu(agi->agi_unlinked[bucket]);
> +		while (agino != NULLAGINO) {
> +			ip = xlog_recover_get_one_iunlink(mp, agno, agino,
> +							  bucket);
> +			if (!ip) {
> +				/*
> +				 * something busted, but still got to release
> +				 * prev_ip, so make it look like it's at the end
> +				 * of the list before it gets released.
> +				 */
> +				error = -EFSCORRUPTED;
> +				if (prev_ip)
> +					prev_ip->i_next_unlinked = NULLAGINO;
> +				break;
> +			}
> +			if (prev_ip) {
> +				ip->i_prev_unlinked = prev_agino;
> +				xfs_irele(prev_ip);
> +			}

(little confused about this, why not xfs_irele the current ip and move it after
agino = ip->i_next_unlinked; ....)


And some comments about "[PATCH 4/4] xfs: introduce inode unlink log item.
(no need to raise another thread about this since I'm not fully reviewing that..)

> @@ -2185,6 +2136,7 @@ xfs_iunlink(
>  	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
>  	int			error;
>  
> +	ASSERT(ip->i_next_unlinked == NULLAGINO);
>  	ASSERT(VFS_I(ip)->i_nlink == 0);
>  	ASSERT(VFS_I(ip)->i_mode != 0);
>  	trace_xfs_iunlink(ip);

this ASSERT is not necessary since some unreasonable numbers could be loaded
from disk for crafted images... (seems only for debugging use...)

Thanks,
Gao Xiang


^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2020-07-07 14:39 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-23  9:50 [PATCH 0/4] [RFC] xfs: in memory inode unlink log items Dave Chinner
2020-06-23  9:50 ` [PATCH 1/4] xfs: xfs_iflock is no longer a completion Dave Chinner
2020-06-24 15:36   ` Brian Foster
2020-07-01  5:48     ` Dave Chinner
2020-06-23  9:50 ` [PATCH 2/4] xfs: add log item precommit operation Dave Chinner
2020-06-30 18:06   ` Darrick J. Wong
2020-07-01 14:30   ` Brian Foster
2020-07-01 22:02     ` Dave Chinner
2020-06-23  9:50 ` [PATCH 3/4] xfs: track unlinked inodes in core inode Dave Chinner
2020-07-01  8:59   ` Gao Xiang
2020-07-01 22:06     ` Dave Chinner
2020-07-01 14:31   ` Brian Foster
2020-07-01 22:18     ` Dave Chinner
2020-07-02 12:24       ` Brian Foster
2020-07-07 14:39   ` Gao Xiang
2020-06-23  9:50 ` [PATCH 4/4] xfs: introduce inode unlink log item Dave Chinner
2020-06-30 18:19   ` Darrick J. Wong
2020-06-30 22:31     ` Gao Xiang
2020-07-01  6:26       ` Dave Chinner
2020-07-01 14:32   ` Brian Foster
2020-07-01 22:24     ` Dave Chinner
2020-07-02 12:25       ` Brian Foster

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).