linux-xfs.vger.kernel.org archive mirror
* [PATCH 00/13] xfs: in memory inode unlink log items
@ 2020-08-12  9:25 Dave Chinner
  2020-08-12  9:25 ` [PATCH 01/13] xfs: xfs_iflock is no longer a completion Dave Chinner
                   ` (13 more replies)
  0 siblings, 14 replies; 51+ messages in thread
From: Dave Chinner @ 2020-08-12  9:25 UTC (permalink / raw)
  To: linux-xfs

Hi folks,

This is a cleaned up version of the original RFC I posted here:

https://lore.kernel.org/linux-xfs/20200623095015.1934171-1-david@fromorbit.com/

The original description is preserved below for quick reference,
I'll just walk through the changes in this version:

- rebased on current TOT and xfs/for-next
- split up into many smaller patches
- includes Xiang's single unlinked list bucket modification
- uses a list_head for the in memory double unlinked inode list
  rather than aginos and lockless inode lookups
- much simpler as it doesn't need to look up inodes from agino
  values
- iunlink log item changed to take an xfs_inode pointer rather than
  an imap and agino values
- a handful of small cleanups that breaking up into small patches
  allowed.

The patchset passes fstests for v5 filesystems - v4 filesystem
testing is currently running, though I don't expect any new problems
there.

Code can be found here:

git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git xfs-iunlink-item-2

Comments, thoughts, testing, etc all welcome.

-Dave.

============

[Original RFC text]

Inode cluster buffer pinning by dirty inodes allows us to improve
dirty inode tracking efficiency in the log by logging the inode
cluster buffer as an ordered transaction. However, this brings with
it some new issues, namely the order in which we lock inode cluster
buffers.

That is, transactions that dirty and commit multiple inodes in a
transaction will now need to lock multiple inode cluster buffers
in each transaction (e.g. create, rename, etc). This introduces new
lock ordering constraints in these operations. It also introduces
lock ordering constraints between the AGI and inode cluster buffers
as a result of allocation/freeing being serialised by the AGI
buffer lock. And then there is unlinked inode list logging, which
currently has no fixed order of inode cluster buffer locking.

It's a bit messy.

Locking pure inode modifications in order is relatively easy. We
don't actually need to attach and log the buffer to the transaction
until the last moment. We have all the inodes locked, so nothing
other than unlinked inode list modification can race with the
transaction modifying inodes. Hence we can safely move the
attachment of the inodes to the cluster buffer from when we first
dirty them in xfs_trans_log_inode to just before we commit the
transaction.

At this point, all the inodes that have been dirtied in the
transaction have already been locked, modified, logged and attached
to the transaction. Hence if we add a hook into xfs_trans_commit()
to run a "precommit" operation on these log items, we can use this
operation to attach the inodes to the cluster buffer at commit time
instead of in xfs_trans_log_inode().

This, by itself, doesn't solve the lock ordering problem. What it
does do, however, is give us a place where we can -order- all the
dirty items in the transaction list. Hence before we call the
precommit operation on each log item, we sort them. This allows us
to sort all the inode items so that the pre-commit functions that
lock and log the cluster buffers are run in a deterministic order.
This solves the lock order problem for pure inode modifications.
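
To make the sort-then-precommit idea concrete, here is a minimal userspace
sketch. It is not the kernel code: struct demo_item, demo_item_cmp() and
demo_sort_for_precommit() are hypothetical names, and the sort key is
assumed to be something like the cluster buffer address. Items without a
key are pushed to the end, and keys are compared explicitly because they
are 64 bit while the comparator only returns an int.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a dirty log item attached to a transaction. */
struct demo_item {
	int		sortable;	/* 0: item provides no sort key */
	uint64_t	sort_key;	/* assumed: e.g. cluster buffer address */
};

/*
 * Order sortable items by key and push unsortable items to the end.
 * Compare the keys explicitly instead of returning their difference,
 * which would be truncated when squeezed into an int.
 */
static int demo_item_cmp(const void *a, const void *b)
{
	const struct demo_item *ia = a;
	const struct demo_item *ib = b;

	if (!ia->sortable && !ib->sortable)
		return 0;
	if (!ia->sortable)
		return 1;
	if (!ib->sortable)
		return -1;
	if (ia->sort_key < ib->sort_key)
		return -1;
	if (ia->sort_key > ib->sort_key)
		return 1;
	return 0;
}

/*
 * Put the items into the order the precommit callouts would run in.
 * Every transaction that sorts the same way locks shared buffers in
 * the same order, which is what avoids ABBA deadlocks.
 */
static void demo_sort_for_precommit(struct demo_item *items, size_t nr)
{
	assert(items != NULL || nr == 0);
	if (nr)
		qsort(items, nr, sizeof(*items), demo_item_cmp);
}
```

Note that qsort() is not a stable sort, so the relative order of items
that have no sort key is not preserved in this sketch.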

The unlinked inode list buffer locking is more complex. The unlinked
list is unordered - we add to the tail, remove from wherever the
inode is in the list. Hence we might need to lock two inode buffers
here (previous inode in list and the one being removed). While we
can order the locking of these buffers correctly within the confines
of the unlinked list, there may be other inodes that need buffer
locking in the same transaction. e.g. O_TMPFILE being linked into a
directory also modifies the directory inode.

Hence we need a mechanism for deferring unlinked inode list updates
to the pre-commit operation where it can be sorted into the correct
order. We can do this by first observing that we serialise unlinked
list modifications by holding the AGI buffer lock. IOWs, the AGI is
going to be locked until the transaction commits any time we modify
the unlinked list. Hence it doesn't matter when in the transaction
we actually load, lock and modify the inode cluster buffer.

IOWs, what we need is an unlinked inode log item to defer the inode
cluster buffer update to transaction commit time where it can be
ordered with all the other inode cluster operations. Essentially all
we need to do is record the inodes that need to have their unlinked
list pointer updated in a new log item that we attach to the
transaction.

This log item exists purely for the purpose of delaying the update
of the unlinked list pointer until the inode cluster buffer can be
locked in the correct order around the other inode cluster buffers.
It plays no part in the actual commit, and there's no change to
anything that is written to the log. i.e. the inode cluster buffers
still have to be fully logged here (not just ordered) as log
recovery depends on this to replay mods to the unlinked inode
list.

To make this unlinked inode list processing simpler and easier to
implement as a log item, we need to change the way we track the
unlinked list in memory. Starting from the observation that an inode
on the unlinked list is pinned in memory by the VFS, we can use the
xfs_inode itself to track the unlinked list. To do this efficiently,
we want the unlinked list to be a double linked list. The current
implementation takes the approach of minimising the memory footprint
of this list in case we don't want to burn 16 bytes of memory per
inode for a largely unused list head. [*]

We can get this down to 8 bytes per inode because the unlinked list
is per-ag, and hence we only need to store the agino portion of the
inode number as list pointers. We can then use these for lockless
inode cache lookups to retrieve the inode. The aginos in the inode
are modified only under the AGI lock, just like the cluster buffer
pointers, so we don't need any extra locking here.  The
i_next_unlinked field tracks the on-disk value of the unlinked list,
and the i_prev_unlinked is a purely in-memory pointer that enables
us to efficiently remove inodes from the middle of the list.
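
A rough userspace sketch of the agino-based double linked list follows.
Everything prefixed demo_ is hypothetical: the array indexed by agino
stands in for the lockless inode cache lookup, insertion at the head is
an assumption made to keep the example short, and there is no locking
because the AGI lock is assumed to be held by the caller.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t demo_agino_t;
#define DEMO_NULLAGINO	((demo_agino_t)-1)

/* Two 32 bit aginos: 8 bytes per inode instead of two pointers. */
struct demo_inode {
	demo_agino_t	next_unlinked;	/* mirrors the on-disk pointer */
	demo_agino_t	prev_unlinked;	/* in-memory only back pointer */
};

_Static_assert(sizeof(struct demo_inode) == 8,
	       "agino list pointers should cost 8 bytes");

/* Add an inode at the head of the list (assumption of this sketch). */
static void demo_unlinked_insert(struct demo_inode *cache,
				 demo_agino_t *head, demo_agino_t agino)
{
	struct demo_inode *ip = &cache[agino];

	assert(agino != DEMO_NULLAGINO);
	ip->prev_unlinked = DEMO_NULLAGINO;
	ip->next_unlinked = *head;
	if (*head != DEMO_NULLAGINO)
		cache[*head].prev_unlinked = agino;
	*head = agino;
}

/* Remove an inode from anywhere in the list without walking it. */
static void demo_unlinked_remove(struct demo_inode *cache,
				 demo_agino_t *head, demo_agino_t agino)
{
	struct demo_inode *ip = &cache[agino];

	assert(agino != DEMO_NULLAGINO);
	if (ip->prev_unlinked != DEMO_NULLAGINO)
		cache[ip->prev_unlinked].next_unlinked = ip->next_unlinked;
	else
		*head = ip->next_unlinked;
	if (ip->next_unlinked != DEMO_NULLAGINO)
		cache[ip->next_unlinked].prev_unlinked = ip->prev_unlinked;
	ip->next_unlinked = DEMO_NULLAGINO;
	ip->prev_unlinked = DEMO_NULLAGINO;
}
```

Each cache[] dereference here is where the real code would pay for an
inode cache lookup, which is the extra CPU cost being traded against
memory.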

IOWs, we burn a bit more CPU to resolve the unlinked list pointers
to save 8 bytes of memory per inode. If we decide that 8 bytes of
memory isn't a big cost, we can convert this to a list_head and just
link the inodes directly to an unlinked list head in the perag.[**]

This gets rid of the entire unlinked list reference hash table that
is used to track this back pointer relationship, greatly simplifying
the unlinked list modification code.

Comments, flames, thoughts all welcome.

-Dave.

[*] An in-memory double linked list removes the need for keeping
lists short to minimise previous inode lookup overhead when removing
from the list. The current backref hash has this function, but it's
not obvious that it can do this and it's a kinda complex way of
implementing a double linked list.

Once we've removed the need for keeping the lists short, we no
longer need the on-disk hash for unlinked lists, so we can put all
the inodes in a single list....

[**] A single unlinked list in the per-ag then leads to a mutex in
the per-ag to protect the list, removing the AGI lock from needing
to be held to modify the unlinked list unless the head of the list
is being modified. We can then add to the tail of the list instead
of the head, hence largely removing the AGI from the unlinked list
processing entirely when there is more than one inode on the
unlinked list.[***]

This is another advantage of moving to single unlinked list - we are
much more likely to have multiple inodes on a single unlinked list
than when they are spread across 64 lists. Hence we are more likely
to be able to elide AGI locking for the unlinked list modifications
the more pressure we put on the unlinked list...

[***] Taking the AGI out of the unlinked list processing means the
only thing it "protects" is the contents of the AGI itself. This is
basically updating accounting and tracking btree root pointers. We
could add another in-memory log item for AGI updates such that the
AGI only needs to be locked, updated and logged in the precommit
function, greatly reducing the time it spends locked for inode
unlink processing [*^4]. This will improve performance of inode
alloc/freeing on AG constrained filesystems as we spend less time
serialising on the AGI lock.....

[*^4] This is how superblock updates work, except it's not by a
generic in-memory SB log item - the changes to accounting are stored
directly in the struct xfs_trans as deltas and then applied in
xfs_trans_commit() via xfs_trans_apply_sb_deltas() which locks,
applies and logs the superblock buffer. This could be converted to a
precommit operation, too. [*^5]

Note that this superblock locking is elided for the freespace and
inode accounting when lazy superblock updates are enabled. This
prevents the superblock buffer lock for transactional accounting
update from being a major global contention point.
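
As an illustration of the delta pattern described in this footnote, here
is a small userspace sketch. The demo_ structures and functions are
hypothetical and much simpler than struct xfs_trans or the real
xfs_trans_apply_sb_deltas(); the nr_locks counter only stands in for
taking the superblock buffer lock.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical shared counters; nr_locks counts "buffer lock" events. */
struct demo_sb {
	uint64_t	icount;
	uint64_t	ifree;
	unsigned int	nr_locks;
};

/* Per-transaction deltas, accumulated without touching the shared sb. */
struct demo_trans {
	int64_t		icount_delta;
	int64_t		ifree_delta;
};

static void demo_trans_mod_icount(struct demo_trans *tp, int64_t delta)
{
	tp->icount_delta += delta;
}

static void demo_trans_mod_ifree(struct demo_trans *tp, int64_t delta)
{
	tp->ifree_delta += delta;
}

/* Apply one signed delta to an unsigned counter, checking for underflow. */
static void demo_apply_delta(uint64_t *counter, int64_t delta)
{
	if (delta < 0)
		assert(*counter >= (uint64_t)0 - (uint64_t)delta);
	*counter += (uint64_t)delta;
}

/*
 * Commit-time step: take the shared "lock" once, apply all deltas and
 * reset them. Nothing is locked if the transaction made no changes.
 */
static void demo_trans_apply_sb_deltas(struct demo_trans *tp,
				       struct demo_sb *sb)
{
	if (!tp->icount_delta && !tp->ifree_delta)
		return;
	sb->nr_locks++;
	demo_apply_delta(&sb->icount, tp->icount_delta);
	demo_apply_delta(&sb->ifree, tp->ifree_delta);
	tp->icount_delta = 0;
	tp->ifree_delta = 0;
}
```

The shared structure is only touched once per transaction, however many
modifications were made along the way.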

[*^5] dquots also use a delta accounting structure hard coded into
the struct xfs_trans - the xfs_dquot_acct structure. This gets
allocated when dquot modifications are reserved, and then updated
with each quota modification that is made in the transaction.

Then, in xfs_trans_commit(), it calls xfs_trans_apply_dquot_deltas()
which then orders the locking of the dquots correctly, reads, loads
and locks the dquots, modifies the in-memory on-disk dquots and logs
them. This could also be converted to pre-commit operations. [*^6]

[*^6] It should be obvious by now that the pattern of "pre-commit
processing" for "delayed object modification" is not a new idea.
It's been in the code for 25-odd years and copy-pasta'd through the
ages as needed. It's never been turned into a useful, formalised
infrastructure mechanism - that's what this patchset starts us down
the path of. It kinda reminds me of the btree infrastructure
abstraction I did years ago to get rid of the 15,000 lines of
copy-pasta'd btree code and set us on the path to the (relatively)
easy addition of more btrees....



Dave Chinner (12):
  xfs: xfs_iflock is no longer a completion
  xfs: add log item precommit operation
  xfs: factor the xfs_iunlink functions
  xfs: add unlink list pointers to xfs_inode
  xfs: replace iunlink backref lookups with list lookups
  xfs: mapping unlinked inodes is now redundant
  xfs: updating i_next_unlinked doesn't need to return old value
  xfs: validate the unlinked list pointer on update
  xfs: re-order AGI updates in unlink list updates
  xfs: combine iunlink inode update functions
  xfs: add in-memory iunlink log item
  xfs: reorder iunlink remove operation in xfs_ifree

Gao Xiang (1):
  xfs: arrange all unlinked inodes into one list

 fs/xfs/Makefile           |   1 +
 fs/xfs/xfs_error.c        |   2 -
 fs/xfs/xfs_icache.c       |  19 +-
 fs/xfs/xfs_inode.c        | 688 ++++++++------------------------------
 fs/xfs/xfs_inode.h        |  37 +-
 fs/xfs/xfs_inode_item.c   |  15 +-
 fs/xfs/xfs_inode_item.h   |   4 +-
 fs/xfs/xfs_iunlink_item.c | 168 ++++++++++
 fs/xfs/xfs_iunlink_item.h |  25 ++
 fs/xfs/xfs_log_recover.c  | 179 ++++++----
 fs/xfs/xfs_mount.c        |  17 +-
 fs/xfs/xfs_mount.h        |   1 +
 fs/xfs/xfs_super.c        |  20 +-
 fs/xfs/xfs_trace.h        |   1 -
 fs/xfs/xfs_trans.c        |  91 +++++
 fs/xfs/xfs_trans.h        |   6 +-
 16 files changed, 587 insertions(+), 687 deletions(-)
 create mode 100644 fs/xfs/xfs_iunlink_item.c
 create mode 100644 fs/xfs/xfs_iunlink_item.h

-- 
2.26.2.761.g0e0b3e54be



* [PATCH 01/13] xfs: xfs_iflock is no longer a completion
  2020-08-12  9:25 [PATCH 00/13] xfs: in memory inode unlink log items Dave Chinner
@ 2020-08-12  9:25 ` Dave Chinner
  2020-08-18 23:44   ` Darrick J. Wong
  2020-08-22  7:41   ` Christoph Hellwig
  2020-08-12  9:25 ` [PATCH 02/13] xfs: add log item precommit operation Dave Chinner
                   ` (12 subsequent siblings)
  13 siblings, 2 replies; 51+ messages in thread
From: Dave Chinner @ 2020-08-12  9:25 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

With the recent rework of the inode cluster flushing, we no longer
ever wait on the inode flush "lock". It was never a lock in the
first place, just a completion to allow callers to wait for inode IO
to complete. We now never wait for flush completion as all inode
flushing is non-blocking. Hence we can get rid of all the iflock
infrastructure and instead just set and check a state flag.

Rename the XFS_IFLOCK flag to XFS_IFLUSHING, convert all the
xfs_iflock_nowait() calls to test-and-set operations on that flag,
and replace all the xfs_ifunlock() calls with clear operations.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_icache.c     | 17 ++++------
 fs/xfs/xfs_inode.c      | 73 +++++++++++++++--------------------------
 fs/xfs/xfs_inode.h      | 33 +------------------
 fs/xfs/xfs_inode_item.c | 15 ++++-----
 fs/xfs/xfs_inode_item.h |  4 +--
 fs/xfs/xfs_mount.c      | 11 ++++---
 fs/xfs/xfs_super.c      | 10 +++---
 7 files changed, 55 insertions(+), 108 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 101028ebb571..aa6aad258670 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -52,7 +52,6 @@ xfs_inode_alloc(
 
 	XFS_STATS_INC(mp, vn_active);
 	ASSERT(atomic_read(&ip->i_pincount) == 0);
-	ASSERT(!xfs_isiflocked(ip));
 	ASSERT(ip->i_ino == 0);
 
 	/* initialise the xfs inode */
@@ -123,7 +122,7 @@ void
 xfs_inode_free(
 	struct xfs_inode	*ip)
 {
-	ASSERT(!xfs_isiflocked(ip));
+	ASSERT(!xfs_iflags_test(ip, XFS_IFLUSHING));
 
 	/*
 	 * Because we use RCU freeing we need to ensure the inode always
@@ -1035,23 +1034,21 @@ xfs_reclaim_inode(
 
 	if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
 		goto out;
-	if (!xfs_iflock_nowait(ip))
+	if (xfs_iflags_test_and_set(ip, XFS_IFLUSHING))
 		goto out_iunlock;
 
 	if (XFS_FORCED_SHUTDOWN(ip->i_mount)) {
 		xfs_iunpin_wait(ip);
-		/* xfs_iflush_abort() drops the flush lock */
 		xfs_iflush_abort(ip);
 		goto reclaim;
 	}
 	if (xfs_ipincount(ip))
-		goto out_ifunlock;
+		goto out_clear_flush;
 	if (!xfs_inode_clean(ip))
-		goto out_ifunlock;
+		goto out_clear_flush;
 
-	xfs_ifunlock(ip);
+	xfs_iflags_clear(ip, XFS_IFLUSHING);
 reclaim:
-	ASSERT(!xfs_isiflocked(ip));
 
 	/*
 	 * Because we use RCU freeing we need to ensure the inode always appears
@@ -1101,8 +1098,8 @@ xfs_reclaim_inode(
 	__xfs_inode_free(ip);
 	return;
 
-out_ifunlock:
-	xfs_ifunlock(ip);
+out_clear_flush:
+	xfs_iflags_clear(ip, XFS_IFLUSHING);
 out_iunlock:
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 out:
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index c06129cffba9..2072bd25989a 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -598,22 +598,6 @@ xfs_lock_two_inodes(
 	}
 }
 
-void
-__xfs_iflock(
-	struct xfs_inode	*ip)
-{
-	wait_queue_head_t *wq = bit_waitqueue(&ip->i_flags, __XFS_IFLOCK_BIT);
-	DEFINE_WAIT_BIT(wait, &ip->i_flags, __XFS_IFLOCK_BIT);
-
-	do {
-		prepare_to_wait_exclusive(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
-		if (xfs_isiflocked(ip))
-			io_schedule();
-	} while (!xfs_iflock_nowait(ip));
-
-	finish_wait(wq, &wait.wq_entry);
-}
-
 STATIC uint
 _xfs_dic2xflags(
 	uint16_t		di_flags,
@@ -2531,11 +2515,8 @@ xfs_ifree_mark_inode_stale(
 	 * valid, the wrong inode or stale.
 	 */
 	spin_lock(&ip->i_flags_lock);
-	if (ip->i_ino != inum || __xfs_iflags_test(ip, XFS_ISTALE)) {
-		spin_unlock(&ip->i_flags_lock);
-		rcu_read_unlock();
-		return;
-	}
+	if (ip->i_ino != inum || __xfs_iflags_test(ip, XFS_ISTALE))
+		goto out_iflags_unlock;
 
 	/*
 	 * Don't try to lock/unlock the current inode, but we _cannot_ skip the
@@ -2552,16 +2533,14 @@ xfs_ifree_mark_inode_stale(
 		}
 	}
 	ip->i_flags |= XFS_ISTALE;
-	spin_unlock(&ip->i_flags_lock);
-	rcu_read_unlock();
 
 	/*
-	 * If we can't get the flush lock, the inode is already attached.  All
+	 * If the inode is flushing, it is already attached to the buffer.  All
 	 * we needed to do here is mark the inode stale so buffer IO completion
 	 * will remove it from the AIL.
 	 */
 	iip = ip->i_itemp;
-	if (!xfs_iflock_nowait(ip)) {
+	if (__xfs_iflags_test(ip, XFS_IFLUSHING)) {
 		ASSERT(!list_empty(&iip->ili_item.li_bio_list));
 		ASSERT(iip->ili_last_fields);
 		goto out_iunlock;
@@ -2573,10 +2552,12 @@ xfs_ifree_mark_inode_stale(
 	 * commit as the flock synchronises removal of the inode from the
 	 * cluster buffer against inode reclaim.
 	 */
-	if (!iip || list_empty(&iip->ili_item.li_bio_list)) {
-		xfs_ifunlock(ip);
+	if (!iip || list_empty(&iip->ili_item.li_bio_list))
 		goto out_iunlock;
-	}
+
+	__xfs_iflags_set(ip, XFS_IFLUSHING);
+	spin_unlock(&ip->i_flags_lock);
+	rcu_read_unlock();
 
 	/* we have a dirty inode in memory that has not yet been flushed. */
 	spin_lock(&iip->ili_lock);
@@ -2586,9 +2567,16 @@ xfs_ifree_mark_inode_stale(
 	spin_unlock(&iip->ili_lock);
 	ASSERT(iip->ili_last_fields);
 
+	if (ip != free_ip)
+		xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	return;
+
 out_iunlock:
 	if (ip != free_ip)
 		xfs_iunlock(ip, XFS_ILOCK_EXCL);
+out_iflags_unlock:
+	spin_unlock(&ip->i_flags_lock);
+	rcu_read_unlock();
 }
 
 /*
@@ -2631,8 +2619,9 @@ xfs_ifree_cluster(
 
 		/*
 		 * We obtain and lock the backing buffer first in the process
-		 * here, as we have to ensure that any dirty inode that we
-		 * can't get the flush lock on is attached to the buffer.
+		 * here to ensure dirty inodes attached to the buffer remain in
+		 * the flushing state while we mark them stale.
+		 *
 		 * If we scan the in-memory inodes first, then buffer IO can
 		 * complete before we get a lock on it, and hence we may fail
 		 * to mark all the active inodes on the buffer stale.
@@ -3443,7 +3432,7 @@ xfs_iflush(
 	int			error;
 
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL|XFS_ILOCK_SHARED));
-	ASSERT(xfs_isiflocked(ip));
+	ASSERT(xfs_iflags_test(ip, XFS_IFLUSHING));
 	ASSERT(ip->i_df.if_format != XFS_DINODE_FMT_BTREE ||
 	       ip->i_df.if_nextents > XFS_IFORK_MAXEXT(ip, XFS_DATA_FORK));
 	ASSERT(iip->ili_item.li_buf == bp);
@@ -3613,7 +3602,7 @@ xfs_iflush_cluster(
 		/*
 		 * Quick and dirty check to avoid locks if possible.
 		 */
-		if (__xfs_iflags_test(ip, XFS_IRECLAIM | XFS_IFLOCK))
+		if (__xfs_iflags_test(ip, XFS_IRECLAIM | XFS_IFLUSHING))
 			continue;
 		if (xfs_ipincount(ip))
 			continue;
@@ -3627,7 +3616,7 @@ xfs_iflush_cluster(
 		 */
 		spin_lock(&ip->i_flags_lock);
 		ASSERT(!__xfs_iflags_test(ip, XFS_ISTALE));
-		if (__xfs_iflags_test(ip, XFS_IRECLAIM | XFS_IFLOCK)) {
+		if (__xfs_iflags_test(ip, XFS_IRECLAIM | XFS_IFLUSHING)) {
 			spin_unlock(&ip->i_flags_lock);
 			continue;
 		}
@@ -3635,23 +3624,16 @@ xfs_iflush_cluster(
 		/*
 		 * ILOCK will pin the inode against reclaim and prevent
 		 * concurrent transactions modifying the inode while we are
-		 * flushing the inode.
+		 * flushing the inode. If we get the lock, set the flushing
+		 * state before we drop the i_flags_lock.
 		 */
 		if (!xfs_ilock_nowait(ip, XFS_ILOCK_SHARED)) {
 			spin_unlock(&ip->i_flags_lock);
 			continue;
 		}
+		__xfs_iflags_set(ip, XFS_IFLUSHING);
 		spin_unlock(&ip->i_flags_lock);
 
-		/*
-		 * Skip inodes that are already flush locked as they have
-		 * already been written to the buffer.
-		 */
-		if (!xfs_iflock_nowait(ip)) {
-			xfs_iunlock(ip, XFS_ILOCK_SHARED);
-			continue;
-		}
-
 		/*
 		 * Abort flushing this inode if we are shut down because the
 		 * inode may not currently be in the AIL. This can occur when
@@ -3661,7 +3643,6 @@ xfs_iflush_cluster(
 		 */
 		if (XFS_FORCED_SHUTDOWN(mp)) {
 			xfs_iunpin_wait(ip);
-			/* xfs_iflush_abort() drops the flush lock */
 			xfs_iflush_abort(ip);
 			xfs_iunlock(ip, XFS_ILOCK_SHARED);
 			error = -EIO;
@@ -3670,7 +3651,7 @@ xfs_iflush_cluster(
 
 		/* don't block waiting on a log force to unpin dirty inodes */
 		if (xfs_ipincount(ip)) {
-			xfs_ifunlock(ip);
+			xfs_iflags_clear(ip, XFS_IFLUSHING);
 			xfs_iunlock(ip, XFS_ILOCK_SHARED);
 			continue;
 		}
@@ -3678,7 +3659,7 @@ xfs_iflush_cluster(
 		if (!xfs_inode_clean(ip))
 			error = xfs_iflush(ip, bp);
 		else
-			xfs_ifunlock(ip);
+			xfs_iflags_clear(ip, XFS_IFLUSHING);
 		xfs_iunlock(ip, XFS_ILOCK_SHARED);
 		if (error)
 			break;
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index e9a8bb184d1f..5ea962c6cf98 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -211,8 +211,7 @@ static inline bool xfs_inode_has_cow_data(struct xfs_inode *ip)
 #define XFS_INEW		(1 << __XFS_INEW_BIT)
 #define XFS_ITRUNCATED		(1 << 5) /* truncated down so flush-on-close */
 #define XFS_IDIRTY_RELEASE	(1 << 6) /* dirty release already seen */
-#define __XFS_IFLOCK_BIT	7	 /* inode is being flushed right now */
-#define XFS_IFLOCK		(1 << __XFS_IFLOCK_BIT)
+#define XFS_IFLUSHING		(1 << 7) /* inode is being flushed */
 #define __XFS_IPINNED_BIT	8	 /* wakeup key for zero pin count */
 #define XFS_IPINNED		(1 << __XFS_IPINNED_BIT)
 #define XFS_IEOFBLOCKS		(1 << 9) /* has the preallocblocks tag set */
@@ -233,36 +232,6 @@ static inline bool xfs_inode_has_cow_data(struct xfs_inode *ip)
 	(XFS_IRECLAIMABLE | XFS_IRECLAIM | \
 	 XFS_IDIRTY_RELEASE | XFS_ITRUNCATED)
 
-/*
- * Synchronize processes attempting to flush the in-core inode back to disk.
- */
-
-static inline int xfs_isiflocked(struct xfs_inode *ip)
-{
-	return xfs_iflags_test(ip, XFS_IFLOCK);
-}
-
-extern void __xfs_iflock(struct xfs_inode *ip);
-
-static inline int xfs_iflock_nowait(struct xfs_inode *ip)
-{
-	return !xfs_iflags_test_and_set(ip, XFS_IFLOCK);
-}
-
-static inline void xfs_iflock(struct xfs_inode *ip)
-{
-	if (!xfs_iflock_nowait(ip))
-		__xfs_iflock(ip);
-}
-
-static inline void xfs_ifunlock(struct xfs_inode *ip)
-{
-	ASSERT(xfs_isiflocked(ip));
-	xfs_iflags_clear(ip, XFS_IFLOCK);
-	smp_mb();
-	wake_up_bit(&ip->i_flags, __XFS_IFLOCK_BIT);
-}
-
 /*
  * Flags for inode locking.
  * Bit ranges:	1<<1  - 1<<16-1 -- iolock/ilock modes (bitfield)
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 6c65938cee1c..099ae8ee7908 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -491,8 +491,7 @@ xfs_inode_item_push(
 	    (ip->i_flags & XFS_ISTALE))
 		return XFS_ITEM_PINNED;
 
-	/* If the inode is already flush locked, we're already flushing. */
-	if (xfs_isiflocked(ip))
+	if (xfs_iflags_test(ip, XFS_IFLUSHING))
 		return XFS_ITEM_FLUSHING;
 
 	if (!xfs_buf_trylock(bp))
@@ -703,7 +702,7 @@ xfs_iflush_finish(
 		iip->ili_last_fields = 0;
 		iip->ili_flush_lsn = 0;
 		spin_unlock(&iip->ili_lock);
-		xfs_ifunlock(iip->ili_inode);
+		xfs_iflags_clear(iip->ili_inode, XFS_IFLUSHING);
 		if (drop_buffer)
 			xfs_buf_rele(bp);
 	}
@@ -711,8 +710,8 @@ xfs_iflush_finish(
 
 /*
  * Inode buffer IO completion routine.  It is responsible for removing inodes
- * attached to the buffer from the AIL if they have not been re-logged, as well
- * as completing the flush and unlocking the inode.
+ * attached to the buffer from the AIL if they have not been re-logged and
+ * completing the inode flush.
  */
 void
 xfs_iflush_done(
@@ -755,10 +754,10 @@ xfs_iflush_done(
 }
 
 /*
- * This is the inode flushing abort routine.  It is called from xfs_iflush when
+ * This is the inode flushing abort routine.  It is called when
  * the filesystem is shutting down to clean up the inode state.  It is
  * responsible for removing the inode item from the AIL if it has not been
- * re-logged, and unlocking the inode's flush lock.
+ * re-logged and clearing the inode's flush state.
  */
 void
 xfs_iflush_abort(
@@ -790,7 +789,7 @@ xfs_iflush_abort(
 		list_del_init(&iip->ili_item.li_bio_list);
 		spin_unlock(&iip->ili_lock);
 	}
-	xfs_ifunlock(ip);
+	xfs_iflags_clear(ip, XFS_IFLUSHING);
 	if (bp)
 		xfs_buf_rele(bp);
 }
diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h
index 048b5e7dee90..23a7b4928727 100644
--- a/fs/xfs/xfs_inode_item.h
+++ b/fs/xfs/xfs_inode_item.h
@@ -25,8 +25,8 @@ struct xfs_inode_log_item {
 	 *
 	 * We need atomic changes between inode dirtying, inode flushing and
 	 * inode completion, but these all hold different combinations of
-	 * ILOCK and iflock and hence we need some other method of serialising
-	 * updates to the flush state.
+	 * ILOCK and IFLUSHING and hence we need some other method of
+	 * serialising updates to the flush state.
 	 */
 	spinlock_t		ili_lock;	   /* flush state lock */
 	unsigned int		ili_last_fields;   /* fields when flushed */
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index c8ae49a1e99c..bbfd1d5b1c04 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -1059,11 +1059,12 @@ xfs_unmountfs(
 	 * We can potentially deadlock here if we have an inode cluster
 	 * that has been freed has its buffer still pinned in memory because
 	 * the transaction is still sitting in a iclog. The stale inodes
-	 * on that buffer will have their flush locks held until the
-	 * transaction hits the disk and the callbacks run. the inode
-	 * flush takes the flush lock unconditionally and with nothing to
-	 * push out the iclog we will never get that unlocked. hence we
-	 * need to force the log first.
+	 * on that buffer will be pinned to the buffer until the
+	 * transaction hits the disk and the callbacks run. Pushing the AIL will
+	 * skip the stale inodes and may never see the pinned buffer, so
+	 * nothing will push out the iclog and unpin the buffer. Hence we
+	 * need to force the log here to ensure all items are flushed into the
+	 * AIL before we go any further.
 	 */
 	xfs_log_force(mp, XFS_LOG_SYNC);
 
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 71ac6c1cdc36..68ec8db12cc7 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -654,11 +654,11 @@ xfs_fs_destroy_inode(
 	ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_IRECLAIM));
 
 	/*
-	 * We always use background reclaim here because even if the
-	 * inode is clean, it still may be under IO and hence we have
-	 * to take the flush lock. The background reclaim path handles
-	 * this more efficiently than we can here, so simply let background
-	 * reclaim tear down all inodes.
+	 * We always use background reclaim here because even if the inode is
+	 * clean, it still may be under IO and hence we have to wait for IO
+	 * completion to occur before we can reclaim the inode. The background
+	 * reclaim path handles this more efficiently than we can here, so
+	 * simply let background reclaim tear down all inodes.
 	 */
 	xfs_inode_set_reclaim_tag(ip);
 }
-- 
2.26.2.761.g0e0b3e54be



* [PATCH 02/13] xfs: add log item precommit operation
  2020-08-12  9:25 [PATCH 00/13] xfs: in memory inode unlink log items Dave Chinner
  2020-08-12  9:25 ` [PATCH 01/13] xfs: xfs_iflock is no longer a completion Dave Chinner
@ 2020-08-12  9:25 ` Dave Chinner
  2020-08-22  9:06   ` Christoph Hellwig
  2020-08-12  9:25 ` [PATCH 03/13] xfs: factor the xfs_iunlink functions Dave Chinner
                   ` (11 subsequent siblings)
  13 siblings, 1 reply; 51+ messages in thread
From: Dave Chinner @ 2020-08-12  9:25 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

For inodes that are dirty, we have an attached cluster buffer that
we want to use to track the dirty inode through the AIL.
Unfortunately, locking the cluster buffer and adding it to the
transaction when the inode is first logged in a transaction leads to
buffer lock ordering inversions.

The specific problem is ordering against the AGI buffer. When
modifying unlinked lists, the buffer lock order is AGI -> inode
cluster buffer as the AGI buffer lock serialises all access to the
unlinked lists. Unfortunately, functionality like xfs_droplink()
logs the inode before calling xfs_iunlink(), as do various directory
manipulation functions. The inode can be logged way down in the
stack as far as the bmapi routines and hence, without a major
rewrite of lots of APIs there's no way we can avoid the inode being
logged by something until after the AGI has been logged.

As we are going to be using ordered buffers for inode AIL tracking,
there isn't a need to actually lock that buffer against modification
as all the modifications are captured by logging the inode item
itself. Hence we don't actually need to join the cluster buffer into
the transaction until just before it is committed. This means we do
not perturb any of the existing buffer lock orders in transactions,
and the inode cluster buffer is always locked last in a transaction
that doesn't otherwise touch inode cluster buffers.

We do this by introducing a precommit log item method. A log item
method is used because it is likely dquots will be moved to this
same ordered buffer tracking scheme and hence will need a similar
callout. This commit just introduces the mechanism; the inode item
implementation is in followup commits.

The precommit items need to be sorted into consistent order as we
may be locking multiple items here. Hence if we have two dirty
inodes in cluster buffers A and B, and some other transaction has
two separate dirty inodes in the same cluster buffers, locking them
in different orders opens us up to ABBA deadlocks. Hence we sort the
items on the transaction based on the presence of a sort log item
method.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_icache.c |  1 +
 fs/xfs/xfs_trans.c  | 91 +++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_trans.h  |  6 ++-
 3 files changed, 96 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index aa6aad258670..5cdded02cdc8 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1065,6 +1065,7 @@ xfs_reclaim_inode(
 	ip->i_ino = 0;
 	spin_unlock(&ip->i_flags_lock);
 
+	ASSERT(!ip->i_itemp || ip->i_itemp->ili_item.li_buf == NULL);
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 
 	XFS_STATS_INC(ip->i_mount, xs_ig_reclaims);
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index ed72867b1a19..68b03446db8e 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -816,6 +816,90 @@ xfs_trans_committed_bulk(
 	spin_unlock(&ailp->ail_lock);
 }
 
+/*
+ * Sort transaction items prior to running precommit operations. This will
+ * attempt to order the items such that they will always be locked in the same
+ * order. Items that have no sort function are moved to the end of the list
+ * and so are locked last (XXX: need to check the logic matches the comment).
+ *
+ * This may need refinement as different types of objects add sort functions.
+ *
+ * Function is more complex than it needs to be because we are comparing 64 bit
+ * values and the function only returns 32 bit values.
+ */
+static int
+xfs_trans_precommit_sort(
+	void			*unused_arg,
+	struct list_head	*a,
+	struct list_head	*b)
+{
+	struct xfs_log_item	*lia = container_of(a,
+					struct xfs_log_item, li_trans);
+	struct xfs_log_item	*lib = container_of(b,
+					struct xfs_log_item, li_trans);
+	int64_t			diff;
+
+	/*
+	 * If both items are non-sortable, leave them alone. If only one is
+	 * sortable, move the non-sortable item towards the end of the list.
+	 */
+	if (!lia->li_ops->iop_sort && !lib->li_ops->iop_sort)
+		return 0;
+	if (!lia->li_ops->iop_sort)
+		return 1;
+	if (!lib->li_ops->iop_sort)
+		return -1;
+
+	diff = lia->li_ops->iop_sort(lia) - lib->li_ops->iop_sort(lib);
+	if (diff < 0)
+		return -1;
+	if (diff > 0)
+		return 1;
+	return 0;
+}
+
+/*
+ * Run transaction precommit functions.
+ *
+ * If there is an error in any of the callouts, then stop immediately and
+ * trigger a shutdown to abort the transaction. There is no recovery possible
+ * from errors at this point as the transaction is dirty....
+ */
+static int
+xfs_trans_run_precommits(
+	struct xfs_trans	*tp)
+{
+	struct xfs_mount	*mp = tp->t_mountp;
+	struct xfs_log_item	*lip, *n;
+	int			error = 0;
+
+	/*
+	 * Sort the item list to avoid ABBA deadlocks with other transactions
+	 * running precommit operations that lock multiple shared items such as
+	 * inode cluster buffers.
+	 */
+	list_sort(NULL, &tp->t_items, xfs_trans_precommit_sort);
+
+	/*
+	 * Precommit operations can remove the log item from the transaction
+	 * if the log item exists purely to delay modifications until they
+	 * can be ordered against other operations. Hence we have to use
+	 * list_for_each_entry_safe() here.
+	 */
+	list_for_each_entry_safe(lip, n, &tp->t_items, li_trans) {
+		if (!test_bit(XFS_LI_DIRTY, &lip->li_flags))
+			continue;
+		if (lip->li_ops->iop_precommit) {
+			error = lip->li_ops->iop_precommit(tp, lip);
+			if (error)
+				break;
+		}
+	}
+	if (error)
+		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
+	return error;
+}
+
 /*
  * Commit the given transaction to the log.
  *
@@ -840,6 +924,13 @@ __xfs_trans_commit(
 
 	trace_xfs_trans_commit(tp, _RET_IP_);
 
+	error = xfs_trans_run_precommits(tp);
+	if (error) {
+		if (tp->t_flags & XFS_TRANS_PERM_LOG_RES)
+			xfs_defer_cancel(tp);
+		goto out_unreserve;
+	}
+
 	/*
 	 * Finish deferred items on final commit. Only permanent transactions
 	 * should ever have deferred ops.
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index b752501818d2..26ea19bd0621 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -70,10 +70,12 @@ struct xfs_item_ops {
 	void (*iop_format)(struct xfs_log_item *, struct xfs_log_vec *);
 	void (*iop_pin)(struct xfs_log_item *);
 	void (*iop_unpin)(struct xfs_log_item *, int remove);
-	uint (*iop_push)(struct xfs_log_item *, struct list_head *);
+	uint64_t (*iop_sort)(struct xfs_log_item *);
+	int (*iop_precommit)(struct xfs_trans *, struct xfs_log_item *);
 	void (*iop_committing)(struct xfs_log_item *, xfs_lsn_t commit_lsn);
-	void (*iop_release)(struct xfs_log_item *);
 	xfs_lsn_t (*iop_committed)(struct xfs_log_item *, xfs_lsn_t);
+	uint (*iop_push)(struct xfs_log_item *, struct list_head *);
+	void (*iop_release)(struct xfs_log_item *);
 	int (*iop_recover)(struct xfs_log_item *lip, struct xfs_trans *tp);
 	bool (*iop_match)(struct xfs_log_item *item, uint64_t id);
 };
-- 
2.26.2.761.g0e0b3e54be
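
For reference, the ordering rule used by the precommit sort can be modelled
in userspace. This is only a sketch under stated assumptions: struct item,
has_key and cmp_items() are hypothetical names that do not appear in the
patch, and has_key stands in for "the item has an iop_sort method". Note
that the sketch compares the 64 bit keys directly instead of subtracting
them as the patch does; with subtraction, two keys more than INT64_MAX apart
wrap and compare the wrong way around.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a log item: has_key models "iop_sort != NULL". */
struct item {
	bool		has_key;
	uint64_t	key;
};

/*
 * Return <0, 0 or >0 in the style of a list_sort() comparator. Items without
 * a key sort after all items with a key; two keyless items compare equal so
 * a stable sort leaves their relative order alone.
 */
static int
cmp_items(const struct item *a, const struct item *b)
{
	if (!a->has_key && !b->has_key)
		return 0;
	if (!a->has_key)
		return 1;
	if (!b->has_key)
		return -1;

	/* Compare, don't subtract: the result has to fit in an int. */
	if (a->key < b->key)
		return -1;
	if (a->key > b->key)
		return 1;
	return 0;
}
```

Because every transaction sorts with the same rule, any two transactions
that share keyed items will visit them in the same relative order.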



* [PATCH 03/13] xfs: factor the xfs_iunlink functions
  2020-08-12  9:25 [PATCH 00/13] xfs: in memory inode unlink log items Dave Chinner
  2020-08-12  9:25 ` [PATCH 01/13] xfs: xfs_iflock is no longer a completion Dave Chinner
  2020-08-12  9:25 ` [PATCH 02/13] xfs: add log item precommit operation Dave Chinner
@ 2020-08-12  9:25 ` Dave Chinner
  2020-08-18 23:49   ` Darrick J. Wong
  2020-08-22  7:45   ` Christoph Hellwig
  2020-08-12  9:25 ` [PATCH 04/13] xfs: arrange all unlinked inodes into one list Dave Chinner
                   ` (10 subsequent siblings)
  13 siblings, 2 replies; 51+ messages in thread
From: Dave Chinner @ 2020-08-12  9:25 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Prep work that separates the locking that protects the unlinked list
from the actual operations being performed. This also helps document
the fact they are performing list insert and remove operations. No
functional code change.
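
The shape of the split is roughly as follows. This is a minimal userspace
sketch, not the kernel code: struct agi_model and the model_*() functions
are hypothetical, and the "lock" is just a flag standing in for the locked
AGI buffer returned by xfs_read_agi(). The sketch never drops the lock; it
assumes, as with a buffer joined to a transaction, that something outside
the sketch releases it later.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the AGI buffer: a lock flag plus a list head. */
struct agi_model {
	bool	locked;
	int	head;		/* -1 means the list is empty */
};

/* Models acquiring the AGI buffer: fails if the buffer is unavailable. */
static int
agi_model_lock(struct agi_model *agi)
{
	if (!agi)
		return -EIO;
	agi->locked = true;
	return 0;
}

/* List operation only: the caller must already hold the lock. */
static int
model_insert_locked(struct agi_model *agi, int ino, int *next)
{
	if (!agi->locked)
		return -EINVAL;
	*next = agi->head;	/* the new entry points at the old head */
	agi->head = ino;
	return 0;
}

/* Wrapper: take the lock first, then run the list operation. */
static int
model_insert(struct agi_model *agi, int ino, int *next)
{
	int	error;

	error = agi_model_lock(agi);
	if (error)
		return error;
	return model_insert_locked(agi, ino, next);
}
```

Keeping the locked helper separate means a later caller that already holds
the lock can run the list operation without going through the wrapper.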

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_inode.c | 92 ++++++++++++++++++++++++++++++----------------
 1 file changed, 60 insertions(+), 32 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 2072bd25989a..f2f502b65691 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2205,35 +2205,20 @@ xfs_iunlink_update_inode(
 	return error;
 }
 
-/*
- * This is called when the inode's link count has gone to 0 or we are creating
- * a tmpfile via O_TMPFILE.  The inode @ip must have nlink == 0.
- *
- * We place the on-disk inode on a list in the AGI.  It will be pulled from this
- * list when the inode is freed.
- */
-STATIC int
-xfs_iunlink(
+static int
+xfs_iunlink_insert_inode(
 	struct xfs_trans	*tp,
+	struct xfs_buf		*agibp,
 	struct xfs_inode	*ip)
 {
 	struct xfs_mount	*mp = tp->t_mountp;
 	struct xfs_agi		*agi;
-	struct xfs_buf		*agibp;
 	xfs_agino_t		next_agino;
-	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
 	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
+	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
 	short			bucket_index = agino % XFS_AGI_UNLINKED_BUCKETS;
 	int			error;
 
-	ASSERT(VFS_I(ip)->i_nlink == 0);
-	ASSERT(VFS_I(ip)->i_mode != 0);
-	trace_xfs_iunlink(ip);
-
-	/* Get the agi buffer first.  It ensures lock ordering on the list. */
-	error = xfs_read_agi(mp, tp, agno, &agibp);
-	if (error)
-		return error;
 	agi = agibp->b_addr;
 
 	/*
@@ -2274,6 +2259,35 @@ xfs_iunlink(
 	return xfs_iunlink_update_bucket(tp, agno, agibp, bucket_index, agino);
 }
 
+/*
+ * This is called when the inode's link count has gone to 0 or we are creating
+ * a tmpfile via O_TMPFILE.  The inode @ip must have nlink == 0.
+ *
+ * We place the on-disk inode on a list in the AGI.  It will be pulled from this
+ * list when the inode is freed.
+ */
+STATIC int
+xfs_iunlink(
+	struct xfs_trans	*tp,
+	struct xfs_inode	*ip)
+{
+	struct xfs_mount	*mp = tp->t_mountp;
+	struct xfs_buf		*agibp;
+	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
+	int			error;
+
+	ASSERT(VFS_I(ip)->i_nlink == 0);
+	ASSERT(VFS_I(ip)->i_mode != 0);
+	trace_xfs_iunlink(ip);
+
+	/* Get the agi buffer first.  It ensures lock ordering on the list. */
+	error = xfs_read_agi(mp, tp, agno, &agibp);
+	if (error)
+		return error;
+
+	return xfs_iunlink_insert_inode(tp, agibp, ip);
+}
+
 /* Return the imap, dinode pointer, and buffer for an inode. */
 STATIC int
 xfs_iunlink_map_ino(
@@ -2388,32 +2402,23 @@ xfs_iunlink_map_prev(
 	return 0;
 }
 
-/*
- * Pull the on-disk inode from the AGI unlinked list.
- */
-STATIC int
-xfs_iunlink_remove(
+static int
+xfs_iunlink_remove_inode(
 	struct xfs_trans	*tp,
+	struct xfs_buf		*agibp,
 	struct xfs_inode	*ip)
 {
 	struct xfs_mount	*mp = tp->t_mountp;
 	struct xfs_agi		*agi;
-	struct xfs_buf		*agibp;
 	struct xfs_buf		*last_ibp;
 	struct xfs_dinode	*last_dip = NULL;
-	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
 	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
+	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
 	xfs_agino_t		next_agino;
 	xfs_agino_t		head_agino;
 	short			bucket_index = agino % XFS_AGI_UNLINKED_BUCKETS;
 	int			error;
 
-	trace_xfs_iunlink_remove(ip);
-
-	/* Get the agi buffer first.  It ensures lock ordering on the list. */
-	error = xfs_read_agi(mp, tp, agno, &agibp);
-	if (error)
-		return error;
 	agi = agibp->b_addr;
 
 	/*
@@ -2482,6 +2487,29 @@ xfs_iunlink_remove(
 			next_agino);
 }
 
+/*
+ * Pull the on-disk inode from the AGI unlinked list.
+ */
+STATIC int
+xfs_iunlink_remove(
+	struct xfs_trans	*tp,
+	struct xfs_inode	*ip)
+{
+	struct xfs_mount	*mp = tp->t_mountp;
+	struct xfs_buf		*agibp;
+	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
+	int			error;
+
+	trace_xfs_iunlink_remove(ip);
+
+	/* Get the agi buffer first.  It ensures lock ordering on the list. */
+	error = xfs_read_agi(mp, tp, agno, &agibp);
+	if (error)
+		return error;
+
+	return xfs_iunlink_remove_inode(tp, agibp, ip);
+}
+
 /*
  * Look up the inode number specified and if it is not already marked XFS_ISTALE
  * mark it stale. We should only find clean inodes in this lookup that aren't
-- 
2.26.2.761.g0e0b3e54be



* [PATCH 04/13] xfs: arrange all unlinked inodes into one list
  2020-08-12  9:25 [PATCH 00/13] xfs: in memory inode unlink log items Dave Chinner
                   ` (2 preceding siblings ...)
  2020-08-12  9:25 ` [PATCH 03/13] xfs: factor the xfs_iunlink functions Dave Chinner
@ 2020-08-12  9:25 ` Dave Chinner
  2020-08-18 23:59   ` Darrick J. Wong
  2020-08-12  9:25 ` [PATCH 05/13] xfs: add unlink list pointers to xfs_inode Dave Chinner
                   ` (9 subsequent siblings)
  13 siblings, 1 reply; 51+ messages in thread
From: Dave Chinner @ 2020-08-12  9:25 UTC (permalink / raw)
  To: linux-xfs

From: Gao Xiang <hsiangkao@redhat.com>

We currently keep unlinked lists short on disk by hashing the inodes
across multiple buckets. We don't need to keep them short anymore
as we no longer need to traverse the entire list to remove an inode
from it. The in-memory back reference index provides the previous
inode in the list for us instead.

Log recovery still has to handle existing filesystems that use all
64 on-disk buckets, so we detect and handle this case specially so
that inode eviction can still work properly in recovery.
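
The bucket selection this describes can be sketched in isolation. The code
below is an illustrative userspace model, not the patch itself:
pick_bucket(), NBUCKETS and NULL_INO are hypothetical names mirroring
XFS_AGI_UNLINKED_BUCKETS and NULLAGINO, and the heads[] array stands in for
agi_unlinked[]. Outside recovery only bucket 0 is consulted; in recovery the
old hashed bucket is tried when bucket 0 does not hold the expected head.

```c
#include <stdbool.h>
#include <stdint.h>

#define NBUCKETS	64		/* mirrors XFS_AGI_UNLINKED_BUCKETS */
#define NULL_INO	UINT32_MAX	/* mirrors NULLAGINO */

/*
 * Pick the bucket whose head is expected to be @old_head. Returns the bucket
 * index, or -1 if the chosen bucket's head does not match, which the caller
 * would treat as corruption.
 */
static int
pick_bucket(const uint32_t heads[NBUCKETS], uint32_t old_head, bool in_recovery)
{
	int	bucket = 0;

	/* Only recovery may still see inodes hashed across all buckets. */
	if (in_recovery && old_head != NULL_INO && heads[0] != old_head)
		bucket = old_head % NBUCKETS;

	return heads[bucket] == old_head ? bucket : -1;
}
```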

[dchinner: imported into parent patch series early on and modified
to fit cleanly. ]

Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_inode.c | 49 +++++++++++++++++++++++++++-------------------
 1 file changed, 29 insertions(+), 20 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index f2f502b65691..fa92bdf6e0da 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -33,6 +33,7 @@
 #include "xfs_symlink.h"
 #include "xfs_trans_priv.h"
 #include "xfs_log.h"
+#include "xfs_log_priv.h"
 #include "xfs_bmap_btree.h"
 #include "xfs_reflink.h"
 
@@ -2092,25 +2093,32 @@ xfs_iunlink_update_bucket(
 	struct xfs_trans	*tp,
 	xfs_agnumber_t		agno,
 	struct xfs_buf		*agibp,
-	unsigned int		bucket_index,
+	xfs_agino_t		old_agino,
 	xfs_agino_t		new_agino)
 {
+	struct xlog		*log = tp->t_mountp->m_log;
 	struct xfs_agi		*agi = agibp->b_addr;
 	xfs_agino_t		old_value;
-	int			offset;
+	unsigned int		bucket_index;
+	int			offset;
 
 	ASSERT(xfs_verify_agino_or_null(tp->t_mountp, agno, new_agino));
 
+	bucket_index = 0;
+	/* During recovery, the old multiple bucket index can be applied */
+	if (!log || log->l_flags & XLOG_RECOVERY_NEEDED) {
+		ASSERT(old_agino != NULLAGINO);
+
+		if (be32_to_cpu(agi->agi_unlinked[0]) != old_agino)
+			bucket_index = old_agino % XFS_AGI_UNLINKED_BUCKETS;
+	}
+
 	old_value = be32_to_cpu(agi->agi_unlinked[bucket_index]);
 	trace_xfs_iunlink_update_bucket(tp->t_mountp, agno, bucket_index,
 			old_value, new_agino);
 
-	/*
-	 * We should never find the head of the list already set to the value
-	 * passed in because either we're adding or removing ourselves from the
-	 * head of the list.
-	 */
-	if (old_value == new_agino) {
+	/* check if the old agi_unlinked head is as expected */
+	if (old_value != old_agino) {
 		xfs_buf_mark_corrupt(agibp);
 		return -EFSCORRUPTED;
 	}
@@ -2216,17 +2224,18 @@ xfs_iunlink_insert_inode(
 	xfs_agino_t		next_agino;
 	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
 	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
-	short			bucket_index = agino % XFS_AGI_UNLINKED_BUCKETS;
 	int			error;
 
 	agi = agibp->b_addr;
 
 	/*
-	 * Get the index into the agi hash table for the list this inode will
-	 * go on.  Make sure the pointer isn't garbage and that this inode
-	 * isn't already on the list.
+	 * We don't need to traverse the on disk unlinked list to find the
+	 * previous inode in the list when removing inodes anymore, so we don't
+	 * need multiple on-disk lists anymore. Hence we always use bucket 0.
+	 * Make sure the pointer isn't garbage and that this inode isn't already
+	 * on the list.
 	 */
-	next_agino = be32_to_cpu(agi->agi_unlinked[bucket_index]);
+	next_agino = be32_to_cpu(agi->agi_unlinked[0]);
 	if (next_agino == agino ||
 	    !xfs_verify_agino_or_null(mp, agno, next_agino)) {
 		xfs_buf_mark_corrupt(agibp);
@@ -2256,7 +2265,7 @@ xfs_iunlink_insert_inode(
 	}
 
 	/* Point the head of the list to point to this inode. */
-	return xfs_iunlink_update_bucket(tp, agno, agibp, bucket_index, agino);
+	return xfs_iunlink_update_bucket(tp, agno, agibp, next_agino, agino);
 }
 
 /*
@@ -2416,16 +2425,17 @@ xfs_iunlink_remove_inode(
 	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
 	xfs_agino_t		next_agino;
 	xfs_agino_t		head_agino;
-	short			bucket_index = agino % XFS_AGI_UNLINKED_BUCKETS;
 	int			error;
 
 	agi = agibp->b_addr;
 
 	/*
-	 * Get the index into the agi hash table for the list this inode will
-	 * go on.  Make sure the head pointer isn't garbage.
+	 * We don't need to traverse the on disk unlinked list to find the
+	 * previous inode in the list when removing inodes anymore, so we don't
+	 * need multiple on-disk lists anymore. Hence we always use bucket 0.
+	 * Make sure the head pointer isn't garbage.
 	 */
-	head_agino = be32_to_cpu(agi->agi_unlinked[bucket_index]);
+	head_agino = be32_to_cpu(agi->agi_unlinked[0]);
 	if (!xfs_verify_agino(mp, agno, head_agino)) {
 		XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp,
 				agi, sizeof(*agi));
@@ -2483,8 +2493,7 @@ xfs_iunlink_remove_inode(
 	}
 
 	/* Point the head of the list to the next unlinked inode. */
-	return xfs_iunlink_update_bucket(tp, agno, agibp, bucket_index,
-			next_agino);
+	return xfs_iunlink_update_bucket(tp, agno, agibp, agino, next_agino);
 }
 
 /*
-- 
2.26.2.761.g0e0b3e54be



* [PATCH 05/13] xfs: add unlink list pointers to xfs_inode
  2020-08-12  9:25 [PATCH 00/13] xfs: in memory inode unlink log items Dave Chinner
                   ` (3 preceding siblings ...)
  2020-08-12  9:25 ` [PATCH 04/13] xfs: arrange all unlinked inodes into one list Dave Chinner
@ 2020-08-12  9:25 ` Dave Chinner
  2020-08-19  0:02   ` Darrick J. Wong
  2020-08-22  9:03   ` Christoph Hellwig
  2020-08-12  9:25 ` [PATCH 06/13] xfs: replace iunlink backref lookups with list lookups Dave Chinner
                   ` (8 subsequent siblings)
  13 siblings, 2 replies; 51+ messages in thread
From: Dave Chinner @ 2020-08-12  9:25 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

To move away from using the on disk inode buffers to track and log
unlinked inodes, we need pointers to track them in memory. Because
inodes can be removed from the list in arbitrary order, it needs to
be a doubly linked list.

We start by noting that inodes are always in memory when they are
active on the unlinked list, and hence we can track these inodes
without needing to take references to the inodes or store them in
the list. We cannot, however, use inode locks to protect the inodes
on the list - the list needs an external lock to serialise all
inserts and removals. We can use the existing AGI buffer lock for
this right now as that already serialises all unlinked list
traversals and modifications.

Hence we can convert the in-memory unlinked list to a simple
list_head list in the perag. We can use list_empty() to detect an
empty unlinked list, likewise we can detect the end of the list when
the inode next pointer points back to the perag list_head. This
makes insert, remove and traversal simple.

The only complication here is log recovery of old filesystems that
have multiple lists. These always remove from the head of the list,
so we can easily construct just enough of the unlinked list for
recovery from any list to work correctly.
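
To illustrate the list properties relied on above, here is a small
userspace model of a circular doubly linked list with a sentinel head. It
is a sketch only: struct node and the helper names are hypothetical and
this is not the kernel's list_head implementation, though it follows the
same idea of an empty list being a head that points at itself.

```c
#include <stdbool.h>

/* Userspace model of a circular doubly linked list with a sentinel head. */
struct node {
	struct node	*next;
	struct node	*prev;
};

static void
node_init(struct node *n)
{
	n->next = n;
	n->prev = n;
}

/* An empty list is a head that points back at itself. */
static bool
list_is_empty(const struct node *head)
{
	return head->next == head;
}

/* Insert @n immediately after @head, i.e. at the front of the list. */
static void
list_add_front(struct node *n, struct node *head)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

/* Unlink @n from wherever it is; no traversal is needed. */
static void
list_remove(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	node_init(n);
}

/* @n is the last entry when its next pointer is the sentinel head. */
static bool
node_is_last(const struct node *n, const struct node *head)
{
	return n->next == head;
}
```

Removal from the middle of the list touches only the two neighbours, which
is what makes arbitrary removal order cheap.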

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_icache.c      |   1 +
 fs/xfs/xfs_inode.c       |  21 ++++-
 fs/xfs/xfs_inode.h       |   1 +
 fs/xfs/xfs_log_recover.c | 179 +++++++++++++++++++++++----------------
 fs/xfs/xfs_mount.c       |   1 +
 fs/xfs/xfs_mount.h       |   1 +
 6 files changed, 130 insertions(+), 74 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 5cdded02cdc8..0c04a66bf88d 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -66,6 +66,7 @@ xfs_inode_alloc(
 	memset(&ip->i_d, 0, sizeof(ip->i_d));
 	ip->i_sick = 0;
 	ip->i_checked = 0;
+	INIT_LIST_HEAD(&ip->i_unlink);
 	INIT_WORK(&ip->i_ioend_work, xfs_end_io);
 	INIT_LIST_HEAD(&ip->i_ioend_list);
 	spin_lock_init(&ip->i_ioend_lock);
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index fa92bdf6e0da..dcf80ac51267 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2294,7 +2294,17 @@ xfs_iunlink(
 	if (error)
 		return error;
 
-	return xfs_iunlink_insert_inode(tp, agibp, ip);
+	/*
+	 * Insert the inode into the on disk unlinked list, and if that
+	 * succeeds, then insert it into the in memory list. We do it in this
+	 * order so that the modifications required to the on disk list are not
+	 * impacted by already having this inode in the list.
+	 */
+	error = xfs_iunlink_insert_inode(tp, agibp, ip);
+	if (!error)
+		list_add(&ip->i_unlink, &agibp->b_pag->pag_ici_unlink_list);
+
+	return error;
 }
 
 /* Return the imap, dinode pointer, and buffer for an inode. */
@@ -2516,7 +2526,14 @@ xfs_iunlink_remove(
 	if (error)
 		return error;
 
-	return xfs_iunlink_remove_inode(tp, agibp, ip);
+	/*
+	 * Remove the inode from the on-disk list and then remove it from the
+	 * in-memory list. This order of operations ensures we can look up both
+	 * next and previous inode in the on-disk list via the in-memory list.
+	 */
+	error = xfs_iunlink_remove_inode(tp, agibp, ip);
+	list_del(&ip->i_unlink);
+	return error;
 }
 
 /*
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 5ea962c6cf98..73f36908a1ce 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -56,6 +56,7 @@ typedef struct xfs_inode {
 	uint64_t		i_delayed_blks;	/* count of delay alloc blks */
 
 	struct xfs_icdinode	i_d;		/* most of ondisk inode */
+	struct list_head	i_unlink;
 
 	/* VFS inode */
 	struct inode		i_vnode;	/* embedded VFS inode */
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index e2ec91b2d0f4..b3481f4e2f96 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -2682,11 +2682,11 @@ xlog_recover_clear_agi_bucket(
 	return;
 }
 
-STATIC xfs_agino_t
-xlog_recover_process_one_iunlink(
+static struct xfs_inode *
+xlog_recover_get_one_iunlink(
 	struct xfs_mount		*mp,
 	xfs_agnumber_t			agno,
-	xfs_agino_t			agino,
+	xfs_agino_t			*agino,
 	int				bucket)
 {
 	struct xfs_buf			*ibp;
@@ -2695,48 +2695,35 @@ xlog_recover_process_one_iunlink(
 	xfs_ino_t			ino;
 	int				error;
 
-	ino = XFS_AGINO_TO_INO(mp, agno, agino);
+	ino = XFS_AGINO_TO_INO(mp, agno, *agino);
 	error = xfs_iget(mp, NULL, ino, 0, 0, &ip);
 	if (error)
-		goto fail;
+		return NULL;
 
 	/*
-	 * Get the on disk inode to find the next inode in the bucket.
+	 * Get the on disk inode to find the next inode in the bucket. Should
+	 * not fail because we just read the inode from this buffer, but if it
+	 * does then we still have to allow the caller to set up and release
+	 * the inode we just looked up. Make sure the list walk terminates here,
+	 * though.
 	 */
 	error = xfs_imap_to_bp(mp, NULL, &ip->i_imap, &dip, &ibp, 0);
-	if (error)
-		goto fail_iput;
+	if (error) {
+		ASSERT(0);
+		*agino = NULLAGINO;
+		return ip;
+	}
+
 
 	xfs_iflags_clear(ip, XFS_IRECOVERY);
 	ASSERT(VFS_I(ip)->i_nlink == 0);
 	ASSERT(VFS_I(ip)->i_mode != 0);
 
-	/* setup for the next pass */
-	agino = be32_to_cpu(dip->di_next_unlinked);
+	/* Get the next inode we will be looking up. */
+	*agino = be32_to_cpu(dip->di_next_unlinked);
 	xfs_buf_relse(ibp);
 
-	/*
-	 * Prevent any DMAPI event from being sent when the reference on
-	 * the inode is dropped.
-	 */
-	ip->i_d.di_dmevmask = 0;
-
-	xfs_irele(ip);
-	return agino;
-
- fail_iput:
-	xfs_irele(ip);
- fail:
-	/*
-	 * We can't read in the inode this bucket points to, or this inode
-	 * is messed up.  Just ditch this bucket of inodes.  We will lose
-	 * some inodes and space, but at least we won't hang.
-	 *
-	 * Call xlog_recover_clear_agi_bucket() to perform a transaction to
-	 * clear the inode pointer in the bucket.
-	 */
-	xlog_recover_clear_agi_bucket(mp, agno, bucket);
-	return NULLAGINO;
+	return ip;
 }
 
 /*
@@ -2762,58 +2749,106 @@ xlog_recover_process_one_iunlink(
  * scheduled on this CPU to ensure other scheduled work can run without undue
  * latency.
  */
-STATIC void
-xlog_recover_process_iunlinks(
-	struct xlog	*log)
+static int
+xlog_recover_iunlinks_ag(
+	struct xfs_mount	*mp,
+	xfs_agnumber_t		agno)
 {
-	xfs_mount_t	*mp;
-	xfs_agnumber_t	agno;
-	xfs_agi_t	*agi;
-	xfs_buf_t	*agibp;
-	xfs_agino_t	agino;
-	int		bucket;
-	int		error;
+	struct xfs_agi		*agi;
+	struct xfs_buf		*agibp;
+	int			bucket;
+	int			error;
 
-	mp = log->l_mp;
+	/*
+	 * Find the agi for this ag.
+	 */
+	error = xfs_read_agi(mp, NULL, agno, &agibp);
+	if (error) {
 
-	for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
 		/*
-		 * Find the agi for this ag.
+		 * AGI is b0rked. Don't process it.
+		 *
+		 * We should probably mark the filesystem as corrupt after we've
+		 * recovered all the ag's we can....
 		 */
-		error = xfs_read_agi(mp, NULL, agno, &agibp);
+		return 0;
+	}
+
+	/*
+	 * Unlock the buffer so that it can be acquired in the normal course of
+	 * the transaction to truncate and free each inode.  Because we are not
+	 * racing with anyone else here for the AGI buffer, we don't even need
+	 * to hold it locked to read the initial unlinked bucket entries out of
+	 * the buffer. We keep buffer reference though, so that it stays pinned
+	 * in memory while we need the buffer.
+	 */
+	agi = agibp->b_addr;
+	xfs_buf_unlock(agibp);
+
+	/*
+	 * The unlinked inode list is maintained on incore inodes as a double
+	 * linked list. We don't have any of that state in memory, so we have to
+	 * create it as we go. This is simple as we are only removing from the
+	 * head of the list and that means we only need to pull the current
+	 * inode in and the next inode.  Inodes are unlinked when their
+	 * reference count goes to zero, so we can overlap the xfs_iget() and
+	 * xfs_irele() calls so we always have the first two inodes on the list
+	 * in memory. Hence we can fake up the necessary in memory state for the
+	 * unlink to "just work".
+	 */
+	for (bucket = 0; bucket < XFS_AGI_UNLINKED_BUCKETS; bucket++) {
+		struct xfs_inode	*ip, *prev_ip = NULL;
+		xfs_agino_t		agino;
+
+		agino = be32_to_cpu(agi->agi_unlinked[bucket]);
+		while (agino != NULLAGINO) {
+			ip = xlog_recover_get_one_iunlink(mp, agno, &agino,
+							  bucket);
+			if (!ip) {
+				/*
+				 * something busted, but still got to release
+				 * prev_ip, so make it look like it's at the end
+				 * of the list before it gets released.
+				 */
+				error = -EFSCORRUPTED;
+				break;
+			}
+			list_add_tail(&ip->i_unlink,
+					&agibp->b_pag->pag_ici_unlink_list);
+			if (prev_ip)
+				xfs_irele(prev_ip);
+			prev_ip = ip;
+			cond_resched();
+		}
+		if (prev_ip)
+			xfs_irele(prev_ip);
 		if (error) {
 			/*
-			 * AGI is b0rked. Don't process it.
-			 *
-			 * We should probably mark the filesystem as corrupt
-			 * after we've recovered all the ag's we can....
+			 * We can't read an inode this bucket points to, or an
+			 * inode is messed up.  Just ditch this bucket of
+			 * inodes.  We will lose some inodes and space, but at
+			 * least we won't hang.
 			 */
-			continue;
-		}
-		/*
-		 * Unlock the buffer so that it can be acquired in the normal
-		 * course of the transaction to truncate and free each inode.
-		 * Because we are not racing with anyone else here for the AGI
-		 * buffer, we don't even need to hold it locked to read the
-		 * initial unlinked bucket entries out of the buffer. We keep
-		 * buffer reference though, so that it stays pinned in memory
-		 * while we need the buffer.
-		 */
-		agi = agibp->b_addr;
-		xfs_buf_unlock(agibp);
-
-		for (bucket = 0; bucket < XFS_AGI_UNLINKED_BUCKETS; bucket++) {
-			agino = be32_to_cpu(agi->agi_unlinked[bucket]);
-			while (agino != NULLAGINO) {
-				agino = xlog_recover_process_one_iunlink(mp,
-							agno, agino, bucket);
-				cond_resched();
-			}
+			xlog_recover_clear_agi_bucket(mp, agno, bucket);
+			break;
 		}
-		xfs_buf_rele(agibp);
 	}
+	xfs_buf_rele(agibp);
+	return error;
 }
 
+void
+xlog_recover_process_iunlinks(
+	struct xlog		*log)
+{
+	struct xfs_mount	*mp = log->l_mp;
+	xfs_agnumber_t		agno;
+
+	for (agno = 0; agno < mp->m_sb.sb_agcount; agno++)
+		xlog_recover_iunlinks_ag(mp, agno);
+}
+
+
 STATIC void
 xlog_unpack_data(
 	struct xlog_rec_header	*rhead,
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index bbfd1d5b1c04..2def15297a5f 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -200,6 +200,7 @@ xfs_initialize_perag(
 		pag->pag_mount = mp;
 		spin_lock_init(&pag->pag_ici_lock);
 		INIT_RADIX_TREE(&pag->pag_ici_root, GFP_ATOMIC);
+		INIT_LIST_HEAD(&pag->pag_ici_unlink_list);
 		if (xfs_buf_hash_init(pag))
 			goto out_free_pag;
 		init_waitqueue_head(&pag->pagb_wait);
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index a72cfcaa4ad1..c35a6c463529 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -355,6 +355,7 @@ typedef struct xfs_perag {
 	struct radix_tree_root pag_ici_root;	/* incore inode cache root */
 	int		pag_ici_reclaimable;	/* reclaimable inodes */
 	unsigned long	pag_ici_reclaim_cursor;	/* reclaim restart point */
+	struct list_head pag_ici_unlink_list;	/* unlinked inode list */
 
 	/* buffer cache index */
 	spinlock_t	pag_buf_lock;	/* lock for pag_buf_hash */
-- 
2.26.2.761.g0e0b3e54be
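
The overlapped lookup/release walk used by recovery in this patch can be
modelled as below. This is a hedged userspace sketch: struct rinode, rget(),
rput() and walk_overlapped() are hypothetical, and plain counters stand in
for the inode references taken by xfs_iget() and dropped by xfs_irele(). The
point it shows is that the next entry is always referenced before the
previous one is released, so at most two entries are held at once.

```c
#include <stddef.h>

/* Hypothetical inode model: a reference count and a link to the next entry. */
struct rinode {
	int		refs;
	struct rinode	*next;
};

static struct rinode *rget(struct rinode *ip) { ip->refs++; return ip; }
static void rput(struct rinode *ip) { ip->refs--; }

/*
 * Walk the chain from @first, taking a reference on the current entry
 * before dropping the one on its predecessor. Returns the largest number of
 * references held at any one time.
 */
static int
walk_overlapped(struct rinode *first)
{
	struct rinode	*prev = NULL;
	struct rinode	*ip;
	int		held = 0;
	int		max_held = 0;

	for (ip = first; ip; ip = ip->next) {
		rget(ip);
		held++;
		if (held > max_held)
			max_held = held;
		if (prev) {
			rput(prev);
			held--;
		}
		prev = ip;
	}
	if (prev)
		rput(prev);
	return max_held;
}
```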



* [PATCH 06/13] xfs: replace iunlink backref lookups with list lookups
  2020-08-12  9:25 [PATCH 00/13] xfs: in memory inode unlink log items Dave Chinner
                   ` (4 preceding siblings ...)
  2020-08-12  9:25 ` [PATCH 05/13] xfs: add unlink list pointers to xfs_inode Dave Chinner
@ 2020-08-12  9:25 ` Dave Chinner
  2020-08-19  0:13   ` Darrick J. Wong
  2020-08-12  9:25 ` [PATCH 07/13] xfs: mapping unlinked inodes is now redundant Dave Chinner
                   ` (7 subsequent siblings)
  13 siblings, 1 reply; 51+ messages in thread
From: Dave Chinner @ 2020-08-12  9:25 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Now we have an in memory linked list of all the inodes on the
unlinked list, use that to look up inodes in the list that we need
to modify when adding or removing from the list.

This means we are no longer using the backref cache to maintain the
previous inode lookups, so we can remove all that infrastructure
now.
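
As a rough illustration of why the backref cache becomes unnecessary, the
sketch below keeps an in-memory doubly linked list alongside a modelled
on-disk singly linked list. All names here (struct minode, struct mlist,
mlist_insert(), mlist_remove(), NULL_INO) are hypothetical and the on-disk
fields are plain integers; it is not the code in this patch. On removal the
predecessor whose next pointer needs rewriting comes straight from the
in-memory prev pointer.

```c
#include <stddef.h>
#include <stdint.h>

#define NULL_INO	UINT32_MAX	/* mirrors NULLAGINO */

/* Hypothetical in-memory inode: list links plus the modelled on-disk pointer. */
struct minode {
	struct minode	*prev;		/* NULL when first on the list */
	struct minode	*next;		/* NULL when last on the list */
	uint32_t	agino;
	uint32_t	next_unlinked;	/* models di_next_unlinked */
};

struct mlist {
	struct minode	*first;
	uint32_t	disk_head;	/* models agi_unlinked[0] */
};

/* Push @ip on the front of both the in-memory and modelled on-disk list. */
static void
mlist_insert(struct mlist *l, struct minode *ip)
{
	ip->prev = NULL;
	ip->next = l->first;
	ip->next_unlinked = l->disk_head;
	if (l->first)
		l->first->prev = ip;
	l->first = ip;
	l->disk_head = ip->agino;
}

/* Remove @ip; the predecessor comes from ip->prev, not from a list walk. */
static void
mlist_remove(struct mlist *l, struct minode *ip)
{
	if (ip->prev) {
		ip->prev->next_unlinked = ip->next_unlinked;
		ip->prev->next = ip->next;
	} else {
		l->disk_head = ip->next_unlinked;
		l->first = ip->next;
	}
	if (ip->next)
		ip->next->prev = ip->prev;
	ip->prev = ip->next = NULL;
	ip->next_unlinked = NULL_INO;
}
```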

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_error.c |   2 -
 fs/xfs/xfs_inode.c | 327 ++++++++-------------------------------------
 fs/xfs/xfs_inode.h |   3 -
 fs/xfs/xfs_mount.c |   5 -
 fs/xfs/xfs_trace.h |   1 -
 5 files changed, 54 insertions(+), 284 deletions(-)

diff --git a/fs/xfs/xfs_error.c b/fs/xfs/xfs_error.c
index 7f6e20899473..829a89418830 100644
--- a/fs/xfs/xfs_error.c
+++ b/fs/xfs/xfs_error.c
@@ -162,7 +162,6 @@ XFS_ERRORTAG_ATTR_RW(log_item_pin,	XFS_ERRTAG_LOG_ITEM_PIN);
 XFS_ERRORTAG_ATTR_RW(buf_lru_ref,	XFS_ERRTAG_BUF_LRU_REF);
 XFS_ERRORTAG_ATTR_RW(force_repair,	XFS_ERRTAG_FORCE_SCRUB_REPAIR);
 XFS_ERRORTAG_ATTR_RW(bad_summary,	XFS_ERRTAG_FORCE_SUMMARY_RECALC);
-XFS_ERRORTAG_ATTR_RW(iunlink_fallback,	XFS_ERRTAG_IUNLINK_FALLBACK);
 XFS_ERRORTAG_ATTR_RW(buf_ioerror,	XFS_ERRTAG_BUF_IOERROR);
 
 static struct attribute *xfs_errortag_attrs[] = {
@@ -200,7 +199,6 @@ static struct attribute *xfs_errortag_attrs[] = {
 	XFS_ERRORTAG_ATTR_LIST(buf_lru_ref),
 	XFS_ERRORTAG_ATTR_LIST(force_repair),
 	XFS_ERRORTAG_ATTR_LIST(bad_summary),
-	XFS_ERRORTAG_ATTR_LIST(iunlink_fallback),
 	XFS_ERRORTAG_ATTR_LIST(buf_ioerror),
 	NULL,
 };
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index dcf80ac51267..2c930de99561 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1893,196 +1893,29 @@ xfs_inactive(
  * because we must walk that list to find the inode that points to the inode
  * being removed from the unlinked hash bucket list.
  *
- * What if we modelled the unlinked list as a collection of records capturing
- * "X.next_unlinked = Y" relations?  If we indexed those records on Y, we'd
- * have a fast way to look up unlinked list predecessors, which avoids the
- * slow list walk.  That's exactly what we do here (in-core) with a per-AG
- * rhashtable.
+ * However, inodes that are on the unlinked list are also guaranteed to be in
+ * memory as they are loaded and then pinned in memory by whatever holds
+ * references to the inode to perform the unlink. Same goes for the O_TMPFILE
+ * usage of the unlinked list - those files are pinned in memory by an open file
+ * descriptor. Hence the inodes on the list are pinned in memory until they are
+ * removed from the list.
  *
- * Because this is a backref cache, we ignore operational failures since the
- * iunlink code can fall back to the slow bucket walk.  The only errors that
- * should bubble out are for obviously incorrect situations.
+ * That means we can simply use an in-memory double linked list to track inodes
+ * on the unlinked list. As we've removed the scalability problem resulting from
+ * removal on a single linked list requiring traversal, we also no longer use
+ * the on-disk hash to keep traversals short. We just use a single list on disk
+ * now, and track the previous inode in the list in memory.
  *
- * All users of the backref cache MUST hold the AGI buffer lock to serialize
- * access or have otherwise provided for concurrency control.
- */
-
-/* Capture a "X.next_unlinked = Y" relationship. */
-struct xfs_iunlink {
-	struct rhash_head	iu_rhash_head;
-	xfs_agino_t		iu_agino;		/* X */
-	xfs_agino_t		iu_next_unlinked;	/* Y */
-};
-
-/* Unlinked list predecessor lookup hashtable construction */
-static int
-xfs_iunlink_obj_cmpfn(
-	struct rhashtable_compare_arg	*arg,
-	const void			*obj)
-{
-	const xfs_agino_t		*key = arg->key;
-	const struct xfs_iunlink	*iu = obj;
-
-	if (iu->iu_next_unlinked != *key)
-		return 1;
-	return 0;
-}
-
-static const struct rhashtable_params xfs_iunlink_hash_params = {
-	.min_size		= XFS_AGI_UNLINKED_BUCKETS,
-	.key_len		= sizeof(xfs_agino_t),
-	.key_offset		= offsetof(struct xfs_iunlink,
-					   iu_next_unlinked),
-	.head_offset		= offsetof(struct xfs_iunlink, iu_rhash_head),
-	.automatic_shrinking	= true,
-	.obj_cmpfn		= xfs_iunlink_obj_cmpfn,
-};
-
-/*
- * Return X, where X.next_unlinked == @agino.  Returns NULLAGINO if no such
- * relation is found.
+ * To provide the guarantee that inodes are always on this in memory list, log
+ * recovery does what is necessary to populate the list sufficient to perform
+ * removal from the head of the list correctly. As such, we can now always rely
+ * on the in-memory list and if it differs from what we find on disk then we
+ * have a memory corruption problem or a software bug and so mismatches are now
+ * considered EFSCORRUPTED errors and are not recoverable.
+ *
+ * All users of the unlinked list MUST hold the AGI buffer lock to serialize
+ * access to the list.
  */
-static xfs_agino_t
-xfs_iunlink_lookup_backref(
-	struct xfs_perag	*pag,
-	xfs_agino_t		agino)
-{
-	struct xfs_iunlink	*iu;
-
-	iu = rhashtable_lookup_fast(&pag->pagi_unlinked_hash, &agino,
-			xfs_iunlink_hash_params);
-	return iu ? iu->iu_agino : NULLAGINO;
-}
-
-/*
- * Take ownership of an iunlink cache entry and insert it into the hash table.
- * If successful, the entry will be owned by the cache; if not, it is freed.
- * Either way, the caller does not own @iu after this call.
- */
-static int
-xfs_iunlink_insert_backref(
-	struct xfs_perag	*pag,
-	struct xfs_iunlink	*iu)
-{
-	int			error;
-
-	error = rhashtable_insert_fast(&pag->pagi_unlinked_hash,
-			&iu->iu_rhash_head, xfs_iunlink_hash_params);
-	/*
-	 * Fail loudly if there already was an entry because that's a sign of
-	 * corruption of in-memory data.  Also fail loudly if we see an error
-	 * code we didn't anticipate from the rhashtable code.  Currently we
-	 * only anticipate ENOMEM.
-	 */
-	if (error) {
-		WARN(error != -ENOMEM, "iunlink cache insert error %d", error);
-		kmem_free(iu);
-	}
-	/*
-	 * Absorb any runtime errors that aren't a result of corruption because
-	 * this is a cache and we can always fall back to bucket list scanning.
-	 */
-	if (error != 0 && error != -EEXIST)
-		error = 0;
-	return error;
-}
-
-/* Remember that @prev_agino.next_unlinked = @this_agino. */
-static int
-xfs_iunlink_add_backref(
-	struct xfs_perag	*pag,
-	xfs_agino_t		prev_agino,
-	xfs_agino_t		this_agino)
-{
-	struct xfs_iunlink	*iu;
-
-	if (XFS_TEST_ERROR(false, pag->pag_mount, XFS_ERRTAG_IUNLINK_FALLBACK))
-		return 0;
-
-	iu = kmem_zalloc(sizeof(*iu), KM_NOFS);
-	iu->iu_agino = prev_agino;
-	iu->iu_next_unlinked = this_agino;
-
-	return xfs_iunlink_insert_backref(pag, iu);
-}
-
-/*
- * Replace X.next_unlinked = @agino with X.next_unlinked = @next_unlinked.
- * If @next_unlinked is NULLAGINO, we drop the backref and exit.  If there
- * wasn't any such entry then we don't bother.
- */
-static int
-xfs_iunlink_change_backref(
-	struct xfs_perag	*pag,
-	xfs_agino_t		agino,
-	xfs_agino_t		next_unlinked)
-{
-	struct xfs_iunlink	*iu;
-	int			error;
-
-	/* Look up the old entry; if there wasn't one then exit. */
-	iu = rhashtable_lookup_fast(&pag->pagi_unlinked_hash, &agino,
-			xfs_iunlink_hash_params);
-	if (!iu)
-		return 0;
-
-	/*
-	 * Remove the entry.  This shouldn't ever return an error, but if we
-	 * couldn't remove the old entry we don't want to add it again to the
-	 * hash table, and if the entry disappeared on us then someone's
-	 * violated the locking rules and we need to fail loudly.  Either way
-	 * we cannot remove the inode because internal state is or would have
-	 * been corrupt.
-	 */
-	error = rhashtable_remove_fast(&pag->pagi_unlinked_hash,
-			&iu->iu_rhash_head, xfs_iunlink_hash_params);
-	if (error)
-		return error;
-
-	/* If there is no new next entry just free our item and return. */
-	if (next_unlinked == NULLAGINO) {
-		kmem_free(iu);
-		return 0;
-	}
-
-	/* Update the entry and re-add it to the hash table. */
-	iu->iu_next_unlinked = next_unlinked;
-	return xfs_iunlink_insert_backref(pag, iu);
-}
-
-/* Set up the in-core predecessor structures. */
-int
-xfs_iunlink_init(
-	struct xfs_perag	*pag)
-{
-	return rhashtable_init(&pag->pagi_unlinked_hash,
-			&xfs_iunlink_hash_params);
-}
-
-/* Free the in-core predecessor structures. */
-static void
-xfs_iunlink_free_item(
-	void			*ptr,
-	void			*arg)
-{
-	struct xfs_iunlink	*iu = ptr;
-	bool			*freed_anything = arg;
-
-	*freed_anything = true;
-	kmem_free(iu);
-}
-
-void
-xfs_iunlink_destroy(
-	struct xfs_perag	*pag)
-{
-	bool			freed_anything = false;
-
-	rhashtable_free_and_destroy(&pag->pagi_unlinked_hash,
-			xfs_iunlink_free_item, &freed_anything);
-
-	ASSERT(freed_anything == false || XFS_FORCED_SHUTDOWN(pag->pag_mount));
-}
 
 /*
  * Point the AGI unlinked bucket at an inode and log the results.  The caller
@@ -2221,6 +2054,7 @@ xfs_iunlink_insert_inode(
 {
 	struct xfs_mount	*mp = tp->t_mountp;
 	struct xfs_agi		*agi;
+	struct xfs_inode	*nip;
 	xfs_agino_t		next_agino;
 	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
 	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
@@ -2242,9 +2076,13 @@ xfs_iunlink_insert_inode(
 		return -EFSCORRUPTED;
 	}
 
-	if (next_agino != NULLAGINO) {
+	nip = list_first_entry_or_null(&agibp->b_pag->pag_ici_unlink_list,
+					struct xfs_inode, i_unlink);
+	if (nip) {
 		xfs_agino_t		old_agino;
 
+		ASSERT(next_agino == XFS_INO_TO_AGINO(mp, nip->i_ino));
+
 		/*
 		 * There is already another inode in the bucket, so point this
 		 * inode to the current head of the list.
@@ -2254,14 +2092,8 @@ xfs_iunlink_insert_inode(
 		if (error)
 			return error;
 		ASSERT(old_agino == NULLAGINO);
-
-		/*
-		 * agino has been unlinked, add a backref from the next inode
-		 * back to agino.
-		 */
-		error = xfs_iunlink_add_backref(agibp->b_pag, agino, next_agino);
-		if (error)
-			return error;
+	} else {
+		ASSERT(next_agino == NULLAGINO);
 	}
 
 	/* Point the head of the list to point to this inode. */
@@ -2354,70 +2186,24 @@ xfs_iunlink_map_prev(
 	xfs_agnumber_t		agno,
 	xfs_agino_t		head_agino,
 	xfs_agino_t		target_agino,
-	xfs_agino_t		*agino,
+	xfs_agino_t		agino,
 	struct xfs_imap		*imap,
 	struct xfs_dinode	**dipp,
 	struct xfs_buf		**bpp,
 	struct xfs_perag	*pag)
 {
-	struct xfs_mount	*mp = tp->t_mountp;
-	xfs_agino_t		next_agino;
 	int			error;
 
 	ASSERT(head_agino != target_agino);
 	*bpp = NULL;
 
-	/* See if our backref cache can find it faster. */
-	*agino = xfs_iunlink_lookup_backref(pag, target_agino);
-	if (*agino != NULLAGINO) {
-		error = xfs_iunlink_map_ino(tp, agno, *agino, imap, dipp, bpp);
-		if (error)
-			return error;
-
-		if (be32_to_cpu((*dipp)->di_next_unlinked) == target_agino)
-			return 0;
-
-		/*
-		 * If we get here the cache contents were corrupt, so drop the
-		 * buffer and fall back to walking the bucket list.
-		 */
-		xfs_trans_brelse(tp, *bpp);
-		*bpp = NULL;
-		WARN_ON_ONCE(1);
-	}
-
-	trace_xfs_iunlink_map_prev_fallback(mp, agno);
-
-	/* Otherwise, walk the entire bucket until we find it. */
-	next_agino = head_agino;
-	while (next_agino != target_agino) {
-		xfs_agino_t	unlinked_agino;
-
-		if (*bpp)
-			xfs_trans_brelse(tp, *bpp);
-
-		*agino = next_agino;
-		error = xfs_iunlink_map_ino(tp, agno, next_agino, imap, dipp,
-				bpp);
-		if (error)
-			return error;
-
-		unlinked_agino = be32_to_cpu((*dipp)->di_next_unlinked);
-		/*
-		 * Make sure this pointer is valid and isn't an obvious
-		 * infinite loop.
-		 */
-		if (!xfs_verify_agino(mp, agno, unlinked_agino) ||
-		    next_agino == unlinked_agino) {
-			XFS_CORRUPTION_ERROR(__func__,
-					XFS_ERRLEVEL_LOW, mp,
-					*dipp, sizeof(**dipp));
-			error = -EFSCORRUPTED;
-			return error;
-		}
-		next_agino = unlinked_agino;
-	}
+	ASSERT(agino != NULLAGINO);
+	error = xfs_iunlink_map_ino(tp, agno, agino, imap, dipp, bpp);
+	if (error)
+		return error;
 
+	if (be32_to_cpu((*dipp)->di_next_unlinked) != target_agino)
+		return -EFSCORRUPTED;
 	return 0;
 }
 
@@ -2461,27 +2247,31 @@ xfs_iunlink_remove_inode(
 	if (error)
 		return error;
 
-	/*
-	 * If there was a backref pointing from the next inode back to this
-	 * one, remove it because we've removed this inode from the list.
-	 *
-	 * Later, if this inode was in the middle of the list we'll update
-	 * this inode's backref to point from the next inode.
-	 */
-	if (next_agino != NULLAGINO) {
-		error = xfs_iunlink_change_backref(agibp->b_pag, next_agino,
-				NULLAGINO);
-		if (error)
-			return error;
+#ifdef DEBUG
+	{
+	struct xfs_inode *nip = list_next_entry(ip, i_unlink);
+	if (nip)
+		ASSERT(next_agino == XFS_INO_TO_AGINO(mp, nip->i_ino));
+	else
+		ASSERT(next_agino == NULLAGINO);
 	}
+#endif
+
+	if (ip != list_first_entry(&agibp->b_pag->pag_ici_unlink_list,
+					struct xfs_inode, i_unlink)) {
 
-	if (head_agino != agino) {
+		struct xfs_inode *pip;
 		struct xfs_imap	imap;
 		xfs_agino_t	prev_agino;
 
+		ASSERT(head_agino != agino);
+
+		pip = list_prev_entry(ip, i_unlink);
+		prev_agino = XFS_INO_TO_AGINO(mp, pip->i_ino);
+
 		/* We need to search the list for the inode being freed. */
 		error = xfs_iunlink_map_prev(tp, agno, head_agino, agino,
-				&prev_agino, &imap, &last_dip, &last_ibp,
+				prev_agino, &imap, &last_dip, &last_ibp,
 				agibp->b_pag);
 		if (error)
 			return error;
@@ -2490,16 +2280,7 @@ xfs_iunlink_remove_inode(
 		xfs_iunlink_update_dinode(tp, agno, prev_agino, last_ibp,
 				last_dip, &imap, next_agino);
 
-		/*
-		 * Now we deal with the backref for this inode.  If this inode
-		 * pointed at a real inode, change the backref that pointed to
-		 * us to point to our old next.  If this inode was the end of
-		 * the list, delete the backref that pointed to us.  Note that
-		 * change_backref takes care of deleting the backref if
-		 * next_agino is NULLAGINO.
-		 */
-		return xfs_iunlink_change_backref(agibp->b_pag, agino,
-				next_agino);
+		return 0;
 	}
 
 	/* Point the head of the list to the next unlinked inode. */
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 73f36908a1ce..7f8fbb7c8594 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -464,9 +464,6 @@ extern struct kmem_zone	*xfs_inode_zone;
 /* The default CoW extent size hint. */
 #define XFS_DEFAULT_COWEXTSZ_HINT 32
 
-int xfs_iunlink_init(struct xfs_perag *pag);
-void xfs_iunlink_destroy(struct xfs_perag *pag);
-
 void xfs_end_io(struct work_struct *work);
 
 int xfs_ilock2_io_mmap(struct xfs_inode *ip1, struct xfs_inode *ip2);
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 2def15297a5f..f28c969af272 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -146,7 +146,6 @@ xfs_free_perag(
 		spin_unlock(&mp->m_perag_lock);
 		ASSERT(pag);
 		ASSERT(atomic_read(&pag->pag_ref) == 0);
-		xfs_iunlink_destroy(pag);
 		xfs_buf_hash_destroy(pag);
 		call_rcu(&pag->rcu_head, __xfs_free_perag);
 	}
@@ -224,9 +223,6 @@ xfs_initialize_perag(
 		/* first new pag is fully initialized */
 		if (first_initialised == NULLAGNUMBER)
 			first_initialised = index;
-		error = xfs_iunlink_init(pag);
-		if (error)
-			goto out_hash_destroy;
 		spin_lock_init(&pag->pag_state_lock);
 	}
 
@@ -249,7 +245,6 @@ xfs_initialize_perag(
 		if (!pag)
 			break;
 		xfs_buf_hash_destroy(pag);
-		xfs_iunlink_destroy(pag);
 		kmem_free(pag);
 	}
 	return error;
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index abb1d859f226..acddc60f6d88 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -3514,7 +3514,6 @@ DEFINE_EVENT(xfs_ag_inode_class, name, \
 	TP_ARGS(ip))
 DEFINE_AGINODE_EVENT(xfs_iunlink);
 DEFINE_AGINODE_EVENT(xfs_iunlink_remove);
-DEFINE_AG_EVENT(xfs_iunlink_map_prev_fallback);
 
 DECLARE_EVENT_CLASS(xfs_fs_corrupt_class,
 	TP_PROTO(struct xfs_mount *mp, unsigned int flags),
-- 
2.26.2.761.g0e0b3e54be



* [PATCH 07/13] xfs: mapping unlinked inodes is now redundant
  2020-08-12  9:25 [PATCH 00/13] xfs: in memory inode unlink log items Dave Chinner
                   ` (5 preceding siblings ...)
  2020-08-12  9:25 ` [PATCH 06/13] xfs: replace iunlink backref lookups with list lookups Dave Chinner
@ 2020-08-12  9:25 ` Dave Chinner
  2020-08-19  0:14   ` Darrick J. Wong
  2020-08-12  9:25 ` [PATCH 08/13] xfs: updating i_next_unlinked doesn't need to return old value Dave Chinner
                   ` (6 subsequent siblings)
  13 siblings, 1 reply; 51+ messages in thread
From: Dave Chinner @ 2020-08-12  9:25 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

We now have a direct pointer to the xfs_inodes in the unlinked
lists, so we can use the imap built into the inode to read the
underlying cluster buffer. Hence we can remove all the "lookup by
agino" code that currently exists in the iunlink list processing.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_inode.c | 88 ++++++----------------------------------------
 1 file changed, 10 insertions(+), 78 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 2c930de99561..bacd5ae9f5a7 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2139,74 +2139,6 @@ xfs_iunlink(
 	return error;
 }
 
-/* Return the imap, dinode pointer, and buffer for an inode. */
-STATIC int
-xfs_iunlink_map_ino(
-	struct xfs_trans	*tp,
-	xfs_agnumber_t		agno,
-	xfs_agino_t		agino,
-	struct xfs_imap		*imap,
-	struct xfs_dinode	**dipp,
-	struct xfs_buf		**bpp)
-{
-	struct xfs_mount	*mp = tp->t_mountp;
-	int			error;
-
-	imap->im_blkno = 0;
-	error = xfs_imap(mp, tp, XFS_AGINO_TO_INO(mp, agno, agino), imap, 0);
-	if (error) {
-		xfs_warn(mp, "%s: xfs_imap returned error %d.",
-				__func__, error);
-		return error;
-	}
-
-	error = xfs_imap_to_bp(mp, tp, imap, dipp, bpp, 0);
-	if (error) {
-		xfs_warn(mp, "%s: xfs_imap_to_bp returned error %d.",
-				__func__, error);
-		return error;
-	}
-
-	return 0;
-}
-
-/*
- * Walk the unlinked chain from @head_agino until we find the inode that
- * points to @target_agino.  Return the inode number, map, dinode pointer,
- * and inode cluster buffer of that inode as @agino, @imap, @dipp, and @bpp.
- *
- * @tp, @pag, @head_agino, and @target_agino are input parameters.
- * @agino, @imap, @dipp, and @bpp are all output parameters.
- *
- * Do not call this function if @target_agino is the head of the list.
- */
-STATIC int
-xfs_iunlink_map_prev(
-	struct xfs_trans	*tp,
-	xfs_agnumber_t		agno,
-	xfs_agino_t		head_agino,
-	xfs_agino_t		target_agino,
-	xfs_agino_t		agino,
-	struct xfs_imap		*imap,
-	struct xfs_dinode	**dipp,
-	struct xfs_buf		**bpp,
-	struct xfs_perag	*pag)
-{
-	int			error;
-
-	ASSERT(head_agino != target_agino);
-	*bpp = NULL;
-
-	ASSERT(agino != NULLAGINO);
-	error = xfs_iunlink_map_ino(tp, agno, agino, imap, dipp, bpp);
-	if (error)
-		return error;
-
-	if (be32_to_cpu((*dipp)->di_next_unlinked) != target_agino)
-		return -EFSCORRUPTED;
-	return 0;
-}
-
 static int
 xfs_iunlink_remove_inode(
 	struct xfs_trans	*tp,
@@ -2215,8 +2147,6 @@ xfs_iunlink_remove_inode(
 {
 	struct xfs_mount	*mp = tp->t_mountp;
 	struct xfs_agi		*agi;
-	struct xfs_buf		*last_ibp;
-	struct xfs_dinode	*last_dip = NULL;
 	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
 	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
 	xfs_agino_t		next_agino;
@@ -2260,25 +2190,27 @@ xfs_iunlink_remove_inode(
 	if (ip != list_first_entry(&agibp->b_pag->pag_ici_unlink_list,
 					struct xfs_inode, i_unlink)) {
 
-		struct xfs_inode *pip;
-		struct xfs_imap	imap;
-		xfs_agino_t	prev_agino;
+		struct xfs_inode	*pip;
+		xfs_agino_t		prev_agino;
+		struct xfs_buf		*last_ibp;
+		struct xfs_dinode	*last_dip = NULL;
 
 		ASSERT(head_agino != agino);
 
 		pip = list_prev_entry(ip, i_unlink);
 		prev_agino = XFS_INO_TO_AGINO(mp, pip->i_ino);
 
-		/* We need to search the list for the inode being freed. */
-		error = xfs_iunlink_map_prev(tp, agno, head_agino, agino,
-				prev_agino, &imap, &last_dip, &last_ibp,
-				agibp->b_pag);
+		error = xfs_imap_to_bp(mp, tp, &pip->i_imap, &last_dip, 
+						&last_ibp, 0);
 		if (error)
 			return error;
 
+		if (be32_to_cpu(last_dip->di_next_unlinked) != agino)
+			return -EFSCORRUPTED;
+
 		/* Point the previous inode on the list to the next inode. */
 		xfs_iunlink_update_dinode(tp, agno, prev_agino, last_ibp,
-				last_dip, &imap, next_agino);
+				last_dip, &pip->i_imap, next_agino);
 
 		return 0;
 	}
-- 
2.26.2.761.g0e0b3e54be



* [PATCH 08/13] xfs: updating i_next_unlinked doesn't need to return old value
  2020-08-12  9:25 [PATCH 00/13] xfs: in memory inode unlink log items Dave Chinner
                   ` (6 preceding siblings ...)
  2020-08-12  9:25 ` [PATCH 07/13] xfs: mapping unlinked inodes is now redundant Dave Chinner
@ 2020-08-12  9:25 ` Dave Chinner
  2020-08-19  0:19   ` Darrick J. Wong
  2020-08-12  9:25 ` [PATCH 09/13] xfs: validate the unlinked list pointer on update Dave Chinner
                   ` (5 subsequent siblings)
  13 siblings, 1 reply; 51+ messages in thread
From: Dave Chinner @ 2020-08-12  9:25 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

We already know what the next inode in the unlinked list is supposed
to be from the in-memory list, so we do not need to look it up first
from the current inode to be able to update the in-memory list
pointers...
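
As an illustration of the point above, here is a minimal userspace sketch of
how the neighbouring agino values fall out of a doubly linked in-memory list
without reading anything from disk. Everything in it (struct list_node,
struct demo_inode and the demo_* helpers) is a hypothetical stand-in for this
example, not the kernel's list_head API or struct xfs_inode.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NULLAGINO ((uint32_t)-1)

/* Minimal stand-in for a circular, doubly linked list head (hypothetical). */
struct list_node {
	struct list_node *prev, *next;
};

/* Hypothetical cut-down inode: an AG inode number plus list linkage. */
struct demo_inode {
	uint32_t		agino;
	struct list_node	unlink;
};

static void demo_list_init(struct list_node *head)
{
	head->prev = head->next = head;
}

/* New unlinked inodes go on the front of the list. */
static void demo_list_add_head(struct list_node *head, struct list_node *n)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

static struct demo_inode *demo_node_to_inode(struct list_node *n)
{
	return (struct demo_inode *)((char *)n -
			offsetof(struct demo_inode, unlink));
}

/* agino of the inode after @ip, or NULLAGINO if @ip is the list tail. */
static uint32_t demo_next_agino(struct list_node *head, struct demo_inode *ip)
{
	assert(ip->unlink.next != NULL);	/* must already be on a list */
	if (ip->unlink.next == head)
		return NULLAGINO;
	return demo_node_to_inode(ip->unlink.next)->agino;
}

/* agino of the inode before @ip, or NULLAGINO if @ip is the list head. */
static uint32_t demo_prev_agino(struct list_node *head, struct demo_inode *ip)
{
	assert(ip->unlink.prev != NULL);	/* must already be on a list */
	if (ip->unlink.prev == head)
		return NULLAGINO;
	return demo_node_to_inode(ip->unlink.prev)->agino;
}
```

Note that the neighbour lookups compare against the list head rather than
testing for NULL, because a circular list never yields a NULL neighbour.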

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_inode.c | 63 +++++++++++-----------------------------------
 1 file changed, 14 insertions(+), 49 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index bacd5ae9f5a7..4dde1970f7cd 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1998,13 +1998,11 @@ xfs_iunlink_update_inode(
 	struct xfs_trans	*tp,
 	struct xfs_inode	*ip,
 	xfs_agnumber_t		agno,
-	xfs_agino_t		next_agino,
-	xfs_agino_t		*old_next_agino)
+	xfs_agino_t		next_agino)
 {
 	struct xfs_mount	*mp = tp->t_mountp;
 	struct xfs_dinode	*dip;
 	struct xfs_buf		*ibp;
-	xfs_agino_t		old_value;
 	int			error;
 
 	ASSERT(xfs_verify_agino_or_null(mp, agno, next_agino));
@@ -2013,37 +2011,10 @@ xfs_iunlink_update_inode(
 	if (error)
 		return error;
 
-	/* Make sure the old pointer isn't garbage. */
-	old_value = be32_to_cpu(dip->di_next_unlinked);
-	if (!xfs_verify_agino_or_null(mp, agno, old_value)) {
-		xfs_inode_verifier_error(ip, -EFSCORRUPTED, __func__, dip,
-				sizeof(*dip), __this_address);
-		error = -EFSCORRUPTED;
-		goto out;
-	}
-
-	/*
-	 * Since we're updating a linked list, we should never find that the
-	 * current pointer is the same as the new value, unless we're
-	 * terminating the list.
-	 */
-	*old_next_agino = old_value;
-	if (old_value == next_agino) {
-		if (next_agino != NULLAGINO) {
-			xfs_inode_verifier_error(ip, -EFSCORRUPTED, __func__,
-					dip, sizeof(*dip), __this_address);
-			error = -EFSCORRUPTED;
-		}
-		goto out;
-	}
-
 	/* Ok, update the new pointer. */
 	xfs_iunlink_update_dinode(tp, agno, XFS_INO_TO_AGINO(mp, ip->i_ino),
 			ibp, dip, &ip->i_imap, next_agino);
 	return 0;
-out:
-	xfs_trans_brelse(tp, ibp);
-	return error;
 }
 
 static int
@@ -2079,19 +2050,15 @@ xfs_iunlink_insert_inode(
 	nip = list_first_entry_or_null(&agibp->b_pag->pag_ici_unlink_list,
 					struct xfs_inode, i_unlink);
 	if (nip) {
-		xfs_agino_t		old_agino;
-
 		ASSERT(next_agino == XFS_INO_TO_AGINO(mp, nip->i_ino));
 
 		/*
 		 * There is already another inode in the bucket, so point this
 		 * inode to the current head of the list.
 		 */
-		error = xfs_iunlink_update_inode(tp, ip, agno, next_agino,
-				&old_agino);
+		error = xfs_iunlink_update_inode(tp, ip, agno, next_agino);
 		if (error)
 			return error;
-		ASSERT(old_agino == NULLAGINO);
 	} else {
 		ASSERT(next_agino == NULLAGINO);
 	}
@@ -2149,7 +2116,7 @@ xfs_iunlink_remove_inode(
 	struct xfs_agi		*agi;
 	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
 	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
-	xfs_agino_t		next_agino;
+	xfs_agino_t		next_agino = NULLAGINO;
 	xfs_agino_t		head_agino;
 	int			error;
 
@@ -2169,23 +2136,21 @@ xfs_iunlink_remove_inode(
 	}
 
 	/*
-	 * Set our inode's next_unlinked pointer to NULL and then return
-	 * the old pointer value so that we can update whatever was previous
-	 * to us in the list to point to whatever was next in the list.
+	 * Get the next agino in the list. If we are at the end of the list,
+	 * then the previous inode's i_next_unlinked filed will get cleared.
 	 */
-	error = xfs_iunlink_update_inode(tp, ip, agno, NULLAGINO, &next_agino);
+	if (ip != list_last_entry(&agibp->b_pag->pag_ici_unlink_list,
+					struct xfs_inode, i_unlink)) {
+		struct xfs_inode *nip = list_next_entry(ip, i_unlink);
+
+		next_agino = XFS_INO_TO_AGINO(mp, nip->i_ino);
+	}
+
+	/* Clear the on disk next unlinked pointer for this inode. */
+	error = xfs_iunlink_update_inode(tp, ip, agno, NULLAGINO);
 	if (error)
 		return error;
 
-#ifdef DEBUG
-	{
-	struct xfs_inode *nip = list_next_entry(ip, i_unlink);
-	if (nip)
-		ASSERT(next_agino == XFS_INO_TO_AGINO(mp, nip->i_ino));
-	else
-		ASSERT(next_agino == NULLAGINO);
-	}
-#endif
 
 	if (ip != list_first_entry(&agibp->b_pag->pag_ici_unlink_list,
 					struct xfs_inode, i_unlink)) {
-- 
2.26.2.761.g0e0b3e54be



* [PATCH 09/13] xfs: validate the unlinked list pointer on update
  2020-08-12  9:25 [PATCH 00/13] xfs: in memory inode unlink log items Dave Chinner
                   ` (7 preceding siblings ...)
  2020-08-12  9:25 ` [PATCH 08/13] xfs: updating i_next_unlinked doesn't need to return old value Dave Chinner
@ 2020-08-12  9:25 ` Dave Chinner
  2020-08-19  0:23   ` Darrick J. Wong
  2020-08-12  9:25 ` [PATCH 10/13] xfs: re-order AGI updates in unlink list updates Dave Chinner
                   ` (4 subsequent siblings)
  13 siblings, 1 reply; 51+ messages in thread
From: Dave Chinner @ 2020-08-12  9:25 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Factor this check into xfs_iunlink_update_inode() when we are updating
the code. This replaces the checks that were removed in previous
patches as bits of functionality were removed from the update
process.
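
The shape of the check can be sketched in userspace as below: the caller
passes the value it expects to find, and the update is refused if the stored
pointer differs. This is only a sketch under stated assumptions: struct
demo_dinode and demo_update_next() are hypothetical names, the field is kept
in host byte order rather than big-endian, and DEMO_EFSCORRUPTED mirrors the
kernel's mapping of EFSCORRUPTED to EUCLEAN.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define NULLAGINO ((uint32_t)-1)

/* XFS maps EFSCORRUPTED to EUCLEAN; use a placeholder where it is absent. */
#ifdef EUCLEAN
#define DEMO_EFSCORRUPTED	EUCLEAN
#else
#define DEMO_EFSCORRUPTED	117
#endif

/* Hypothetical on-disk inode fragment; host-endian for simplicity. */
struct demo_dinode {
	uint32_t	next_unlinked;
};

/*
 * Change @dip's next pointer from @old_agino to @next_agino. If the stored
 * value is not what the caller expected, leave it untouched and report
 * corruption.
 */
static int demo_update_next(struct demo_dinode *dip, uint32_t old_agino,
			    uint32_t next_agino)
{
	assert(dip != NULL);

	if (dip->next_unlinked != old_agino)
		return -DEMO_EFSCORRUPTED;

	dip->next_unlinked = next_agino;
	return 0;
}
```

With this shape, an insert passes NULLAGINO as the expected old value and a
remove passes the next agino taken from the in-memory list, matching the two
call sites in the diff below.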

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_inode.c | 38 ++++++++++++++------------------------
 1 file changed, 14 insertions(+), 24 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 4dde1970f7cd..b098e5df07e7 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1998,6 +1998,7 @@ xfs_iunlink_update_inode(
 	struct xfs_trans	*tp,
 	struct xfs_inode	*ip,
 	xfs_agnumber_t		agno,
+	xfs_agino_t		old_agino,
 	xfs_agino_t		next_agino)
 {
 	struct xfs_mount	*mp = tp->t_mountp;
@@ -2011,6 +2012,13 @@ xfs_iunlink_update_inode(
 	if (error)
 		return error;
 
+	if (be32_to_cpu(dip->di_next_unlinked) != old_agino) {
+		xfs_inode_verifier_error(ip, -EFSCORRUPTED, __func__, dip,
+					sizeof(*dip), __this_address);
+		xfs_trans_brelse(tp, ibp);
+		return -EFSCORRUPTED;
+	}
+
 	/* Ok, update the new pointer. */
 	xfs_iunlink_update_dinode(tp, agno, XFS_INO_TO_AGINO(mp, ip->i_ino),
 			ibp, dip, &ip->i_imap, next_agino);
@@ -2056,7 +2064,8 @@ xfs_iunlink_insert_inode(
 		 * There is already another inode in the bucket, so point this
 		 * inode to the current head of the list.
 		 */
-		error = xfs_iunlink_update_inode(tp, ip, agno, next_agino);
+		error = xfs_iunlink_update_inode(tp, ip, agno, NULLAGINO,
+						 next_agino);
 		if (error)
 			return error;
 	} else {
@@ -2147,37 +2156,18 @@ xfs_iunlink_remove_inode(
 	}
 
 	/* Clear the on disk next unlinked pointer for this inode. */
-	error = xfs_iunlink_update_inode(tp, ip, agno, NULLAGINO);
+	error = xfs_iunlink_update_inode(tp, ip, agno, next_agino, NULLAGINO);
 	if (error)
 		return error;
 
 
 	if (ip != list_first_entry(&agibp->b_pag->pag_ici_unlink_list,
 					struct xfs_inode, i_unlink)) {
-
-		struct xfs_inode	*pip;
-		xfs_agino_t		prev_agino;
-		struct xfs_buf		*last_ibp;
-		struct xfs_dinode	*last_dip = NULL;
+		struct xfs_inode *pip = list_prev_entry(ip, i_unlink);
 
 		ASSERT(head_agino != agino);
-
-		pip = list_prev_entry(ip, i_unlink);
-		prev_agino = XFS_INO_TO_AGINO(mp, pip->i_ino);
-
-		error = xfs_imap_to_bp(mp, tp, &pip->i_imap, &last_dip, 
-						&last_ibp, 0);
-		if (error)
-			return error;
-
-		if (be32_to_cpu(last_dip->di_next_unlinked) != agino)
-			return -EFSCORRUPTED;
-
-		/* Point the previous inode on the list to the next inode. */
-		xfs_iunlink_update_dinode(tp, agno, prev_agino, last_ibp,
-				last_dip, &pip->i_imap, next_agino);
-
-		return 0;
+		return xfs_iunlink_update_inode(tp, pip, agno, agino,
+						next_agino);
 	}
 
 	/* Point the head of the list to the next unlinked inode. */
-- 
2.26.2.761.g0e0b3e54be



* [PATCH 10/13] xfs: re-order AGI updates in unlink list updates
  2020-08-12  9:25 [PATCH 00/13] xfs: in memory inode unlink log items Dave Chinner
                   ` (8 preceding siblings ...)
  2020-08-12  9:25 ` [PATCH 09/13] xfs: validate the unlinked list pointer on update Dave Chinner
@ 2020-08-12  9:25 ` Dave Chinner
  2020-08-19  0:29   ` Darrick J. Wong
  2020-08-12  9:25 ` [PATCH 11/13] xfs: combine iunlink inode update functions Dave Chinner
                   ` (3 subsequent siblings)
  13 siblings, 1 reply; 51+ messages in thread
From: Dave Chinner @ 2020-08-12  9:25 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

We always access and check the AGI bucket entry for the unlinked
list even if we are not going to need it either for lookup or remove
purposes. Move the code that accesses the AGI to the code that
modifies the AGI, hence keeping the AGI accesses local to the code
that needs to modify it.
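
A rough userspace sketch of the bucket selection and validation that ends up
in the bucket update function follows. The names are hypothetical, the
in_recovery flag stands in for the log state test in the real code, the
bucket count of 64 is assumed to match XFS_AGI_UNLINKED_BUCKETS, and the
agino range verification is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NULLAGINO		((uint32_t)-1)
#define DEMO_UNLINKED_BUCKETS	64	/* assumed XFS_AGI_UNLINKED_BUCKETS */

/*
 * Pick the bucket to modify. Outside of recovery only bucket 0 is used.
 * During recovery an older filesystem may still have inodes hashed across
 * several buckets, so fall back to the hashed index if bucket 0 does not
 * hold the inode we expect.
 */
static unsigned int demo_bucket_index(const uint32_t *buckets,
				      bool in_recovery, uint32_t cur_agino)
{
	if (!in_recovery)
		return 0;

	assert(cur_agino != NULLAGINO);
	if (buckets[0] != cur_agino)
		return cur_agino % DEMO_UNLINKED_BUCKETS;
	return 0;
}

/*
 * Point the chosen bucket at @new_agino, but only if it currently holds
 * @cur_agino and the update would actually change it. Returns 0 on success
 * and -1 where the real code would report corruption.
 */
static int demo_update_bucket(uint32_t *buckets, bool in_recovery,
			      uint32_t cur_agino, uint32_t new_agino)
{
	unsigned int	idx = demo_bucket_index(buckets, in_recovery, cur_agino);
	uint32_t	old_agino = buckets[idx];

	if (new_agino == old_agino || cur_agino != old_agino)
		return -1;

	buckets[idx] = new_agino;
	return 0;
}
```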

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_inode.c | 84 ++++++++++++++++------------------------------
 1 file changed, 28 insertions(+), 56 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index b098e5df07e7..4f616e1b64dc 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1918,44 +1918,53 @@ xfs_inactive(
  */
 
 /*
- * Point the AGI unlinked bucket at an inode and log the results.  The caller
- * is responsible for validating the old value.
+ * Point the AGI unlinked bucket at an inode and log the results. The caller
+ * passes in the expected current agino the bucket points at via @cur_agino so
+ * we can validate that we are about to remove the inode we expect to be
+ * removing from the AGI bucket.
  */
-STATIC int
+static int
 xfs_iunlink_update_bucket(
 	struct xfs_trans	*tp,
 	xfs_agnumber_t		agno,
 	struct xfs_buf		*agibp,
-	xfs_agino_t		old_agino,
+	xfs_agino_t		cur_agino,
 	xfs_agino_t		new_agino)
 {
-	struct xlog		*log = tp->t_mountp->m_log;
+	struct xfs_mount	*mp = tp->t_mountp;
+	struct xlog		*log = mp->m_log;
 	struct xfs_agi		*agi = agibp->b_addr;
-	xfs_agino_t		old_value;
+	xfs_agino_t		old_agino;
 	unsigned int		bucket_index;
 	int                     offset;
 
-	ASSERT(xfs_verify_agino_or_null(tp->t_mountp, agno, new_agino));
+	ASSERT(xfs_verify_agino_or_null(mp, agno, new_agino));
 
+	/*
+	 * We don't need to traverse the on disk unlinked list to find the
+	 * previous inode in the list when removing inodes anymore, so we don't
+	 * use multiple on-disk lists anymore. Hence we always use bucket 0
+	 * unless we are in log recovery in which case we might be recovering an
+	 * old filesystem that has multiple buckets.
+	 */
 	bucket_index = 0;
-	/* During recovery, the old multiple bucket index can be applied */
 	if (!log || log->l_flags & XLOG_RECOVERY_NEEDED) {
-		ASSERT(old_agino != NULLAGINO);
+		ASSERT(cur_agino != NULLAGINO);
 
-		if (be32_to_cpu(agi->agi_unlinked[0]) != old_agino)
-			bucket_index = old_agino % XFS_AGI_UNLINKED_BUCKETS;
+		if (be32_to_cpu(agi->agi_unlinked[0]) != cur_agino)
+			bucket_index = cur_agino % XFS_AGI_UNLINKED_BUCKETS;
 	}
 
-	old_value = be32_to_cpu(agi->agi_unlinked[bucket_index]);
-	trace_xfs_iunlink_update_bucket(tp->t_mountp, agno, bucket_index,
-			old_value, new_agino);
-
-	/* check if the old agi_unlinked head is as expected */
-	if (old_value != old_agino) {
+	old_agino = be32_to_cpu(agi->agi_unlinked[bucket_index]);
+	if (new_agino == old_agino || cur_agino != old_agino ||
+	    !xfs_verify_agino_or_null(mp, agno, old_agino)) {
 		xfs_buf_mark_corrupt(agibp);
 		return -EFSCORRUPTED;
 	}
 
+	trace_xfs_iunlink_update_bucket(mp, agno, bucket_index,
+			old_agino, new_agino);
+
 	agi->agi_unlinked[bucket_index] = cpu_to_be32(new_agino);
 	offset = offsetof(struct xfs_agi, agi_unlinked) +
 			(sizeof(xfs_agino_t) * bucket_index);
@@ -2032,44 +2041,25 @@ xfs_iunlink_insert_inode(
 	struct xfs_inode	*ip)
 {
 	struct xfs_mount	*mp = tp->t_mountp;
-	struct xfs_agi		*agi;
 	struct xfs_inode	*nip;
-	xfs_agino_t		next_agino;
+	xfs_agino_t		next_agino = NULLAGINO;
 	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
 	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
 	int			error;
 
-	agi = agibp->b_addr;
-
-	/*
-	 * We don't need to traverse the on disk unlinked list to find the
-	 * previous inode in the list when removing inodes anymore, so we don't
-	 * need multiple on-disk lists anymore. Hence we always use bucket 0.
-	 * Make sure the pointer isn't garbage and that this inode isn't already
-	 * on the list.
-	 */
-	next_agino = be32_to_cpu(agi->agi_unlinked[0]);
-	if (next_agino == agino ||
-	    !xfs_verify_agino_or_null(mp, agno, next_agino)) {
-		xfs_buf_mark_corrupt(agibp);
-		return -EFSCORRUPTED;
-	}
-
 	nip = list_first_entry_or_null(&agibp->b_pag->pag_ici_unlink_list,
 					struct xfs_inode, i_unlink);
 	if (nip) {
-		ASSERT(next_agino == XFS_INO_TO_AGINO(mp, nip->i_ino));
 
 		/*
 		 * There is already another inode in the bucket, so point this
 		 * inode to the current head of the list.
 		 */
+		next_agino = XFS_INO_TO_AGINO(mp, nip->i_ino);
 		error = xfs_iunlink_update_inode(tp, ip, agno, NULLAGINO,
 						 next_agino);
 		if (error)
 			return error;
-	} else {
-		ASSERT(next_agino == NULLAGINO);
 	}
 
 	/* Point the head of the list to point to this inode. */
@@ -2122,28 +2112,11 @@ xfs_iunlink_remove_inode(
 	struct xfs_inode	*ip)
 {
 	struct xfs_mount	*mp = tp->t_mountp;
-	struct xfs_agi		*agi;
 	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
 	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
 	xfs_agino_t		next_agino = NULLAGINO;
-	xfs_agino_t		head_agino;
 	int			error;
 
-	agi = agibp->b_addr;
-
-	/*
-	 * We don't need to traverse the on disk unlinked list to find the
-	 * previous inode in the list when removing inodes anymore, so we don't
-	 * need multiple on-disk lists anymore. Hence we always use bucket 0.
-	 * Make sure the head pointer isn't garbage.
-	 */
-	head_agino = be32_to_cpu(agi->agi_unlinked[0]);
-	if (!xfs_verify_agino(mp, agno, head_agino)) {
-		XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp,
-				agi, sizeof(*agi));
-		return -EFSCORRUPTED;
-	}
-
 	/*
 	 * Get the next agino in the list. If we are at the end of the list,
 	 * then the previous inode's i_next_unlinked filed will get cleared.
@@ -2165,7 +2138,6 @@ xfs_iunlink_remove_inode(
 					struct xfs_inode, i_unlink)) {
 		struct xfs_inode *pip = list_prev_entry(ip, i_unlink);
 
-		ASSERT(head_agino != agino);
 		return xfs_iunlink_update_inode(tp, pip, agno, agino,
 						next_agino);
 	}
-- 
2.26.2.761.g0e0b3e54be



* [PATCH 11/13] xfs: combine iunlink inode update functions
  2020-08-12  9:25 [PATCH 00/13] xfs: in memory inode unlink log items Dave Chinner
                   ` (9 preceding siblings ...)
  2020-08-12  9:25 ` [PATCH 10/13] xfs: re-order AGI updates in unlink list updates Dave Chinner
@ 2020-08-12  9:25 ` Dave Chinner
  2020-08-19  0:30   ` Darrick J. Wong
  2020-08-12  9:25 ` [PATCH 12/13] xfs: add in-memory iunlink log item Dave Chinner
                   ` (2 subsequent siblings)
  13 siblings, 1 reply; 51+ messages in thread
From: Dave Chinner @ 2020-08-12  9:25 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Combine the logging of the inode unlink list update into the
calling function that looks up the buffer we end up logging. These
do not need to be separate functions as they are both short, simple
operations and there's only a single call path through them. This
new function will end up being the core of the iunlink log item
processing...
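
For reference, the byte range that gets logged in the combined function is
just the inode's offset within the cluster buffer plus the field offset
within the on-disk inode. A small sketch of that arithmetic, using a made-up
struct demo_dinode layout (not the real struct xfs_dinode) and hypothetical
helper names:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Made-up on-disk inode layout, for illustration only. */
struct demo_dinode {
	uint16_t	magic;
	uint16_t	mode;
	uint32_t	next_unlinked;
};

/* Inclusive byte range within a buffer. */
struct demo_range {
	size_t	first;
	size_t	last;
};

/*
 * Given the byte offset of an inode within its cluster buffer, return the
 * inclusive range covering only its next_unlinked field, which is all that
 * needs to be logged when the list pointer changes.
 */
static struct demo_range demo_next_unlinked_range(size_t boffset)
{
	struct demo_range	r;

	/* Guard against wrapping when adding the field offset. */
	assert(boffset <= SIZE_MAX - sizeof(struct demo_dinode));

	r.first = boffset + offsetof(struct demo_dinode, next_unlinked);
	r.last = r.first + sizeof(uint32_t) - 1;
	return r;
}
```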

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_inode.c | 58 ++++++++++++++++------------------------------
 1 file changed, 20 insertions(+), 38 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 4f616e1b64dc..82242d15b1d7 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1972,38 +1972,12 @@ xfs_iunlink_update_bucket(
 	return 0;
 }
 
-/* Set an on-disk inode's next_unlinked pointer. */
-STATIC void
-xfs_iunlink_update_dinode(
-	struct xfs_trans	*tp,
-	xfs_agnumber_t		agno,
-	xfs_agino_t		agino,
-	struct xfs_buf		*ibp,
-	struct xfs_dinode	*dip,
-	struct xfs_imap		*imap,
-	xfs_agino_t		next_agino)
-{
-	struct xfs_mount	*mp = tp->t_mountp;
-	int			offset;
-
-	ASSERT(xfs_verify_agino_or_null(mp, agno, next_agino));
-
-	trace_xfs_iunlink_update_dinode(mp, agno, agino,
-			be32_to_cpu(dip->di_next_unlinked), next_agino);
-
-	dip->di_next_unlinked = cpu_to_be32(next_agino);
-	offset = imap->im_boffset +
-			offsetof(struct xfs_dinode, di_next_unlinked);
-
-	/* need to recalc the inode CRC if appropriate */
-	xfs_dinode_calc_crc(mp, dip);
-	xfs_trans_inode_buf(tp, ibp);
-	xfs_trans_log_buf(tp, ibp, offset, offset + sizeof(xfs_agino_t) - 1);
-}
-
-/* Set an in-core inode's unlinked pointer and return the old value. */
+/*
+ * Look up the inode cluster buffer and log the on-disk unlinked inode change
+ * we need to make.
+ */
 STATIC int
-xfs_iunlink_update_inode(
+xfs_iunlink_log_inode(
 	struct xfs_trans	*tp,
 	struct xfs_inode	*ip,
 	xfs_agnumber_t		agno,
@@ -2013,6 +1987,7 @@ xfs_iunlink_update_inode(
 	struct xfs_mount	*mp = tp->t_mountp;
 	struct xfs_dinode	*dip;
 	struct xfs_buf		*ibp;
+	int			offset;
 	int			error;
 
 	ASSERT(xfs_verify_agino_or_null(mp, agno, next_agino));
@@ -2028,9 +2003,17 @@ xfs_iunlink_update_inode(
 		return -EFSCORRUPTED;
 	}
 
-	/* Ok, update the new pointer. */
-	xfs_iunlink_update_dinode(tp, agno, XFS_INO_TO_AGINO(mp, ip->i_ino),
-			ibp, dip, &ip->i_imap, next_agino);
+	trace_xfs_iunlink_update_dinode(mp, agno,
+			XFS_INO_TO_AGINO(mp, ip->i_ino),
+			be32_to_cpu(dip->di_next_unlinked), next_agino);
+
+	dip->di_next_unlinked = cpu_to_be32(next_agino);
+	offset = ip->i_imap.im_boffset +
+			offsetof(struct xfs_dinode, di_next_unlinked);
+
+	xfs_dinode_calc_crc(mp, dip);
+	xfs_trans_inode_buf(tp, ibp);
+	xfs_trans_log_buf(tp, ibp, offset, offset + sizeof(xfs_agino_t) - 1);
 	return 0;
 }
 
@@ -2056,7 +2039,7 @@ xfs_iunlink_insert_inode(
 		 * inode to the current head of the list.
 		 */
 		next_agino = XFS_INO_TO_AGINO(mp, nip->i_ino);
-		error = xfs_iunlink_update_inode(tp, ip, agno, NULLAGINO,
+		error = xfs_iunlink_log_inode(tp, ip, agno, NULLAGINO,
 						 next_agino);
 		if (error)
 			return error;
@@ -2129,7 +2112,7 @@ xfs_iunlink_remove_inode(
 	}
 
 	/* Clear the on disk next unlinked pointer for this inode. */
-	error = xfs_iunlink_update_inode(tp, ip, agno, next_agino, NULLAGINO);
+	error = xfs_iunlink_log_inode(tp, ip, agno, next_agino, NULLAGINO);
 	if (error)
 		return error;
 
@@ -2138,8 +2121,7 @@ xfs_iunlink_remove_inode(
 					struct xfs_inode, i_unlink)) {
 		struct xfs_inode *pip = list_prev_entry(ip, i_unlink);
 
-		return xfs_iunlink_update_inode(tp, pip, agno, agino,
-						next_agino);
+		return xfs_iunlink_log_inode(tp, pip, agno, agino, next_agino);
 	}
 
 	/* Point the head of the list to the next unlinked inode. */
-- 
2.26.2.761.g0e0b3e54be


^ permalink raw reply related	[flat|nested] 51+ messages in thread
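
Editor's note on patch 11: the byte range handed to xfs_trans_log_buf() is the inode's offset within the cluster buffer plus the offset of di_next_unlinked within the on-disk inode, and it covers exactly one agino. The sketch below reproduces only that arithmetic in userspace C. It assumes a made-up, much smaller `struct demo_dinode` (not the real struct xfs_dinode layout), and `demo_next_unlinked_range()` is a hypothetical helper name, not a kernel function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the on-disk inode; NOT the real xfs_dinode. */
struct demo_dinode {
	uint16_t	di_magic;
	uint16_t	di_mode;
	uint32_t	di_pad;
	uint32_t	di_next_unlinked;	/* the only field being logged */
};

/* Inclusive byte range within the cluster buffer. */
struct demo_range {
	unsigned int	first;
	unsigned int	last;
};

/*
 * im_boffset plays the role of ip->i_imap.im_boffset: the byte offset of
 * this inode within its cluster buffer.
 */
static struct demo_range
demo_next_unlinked_range(unsigned int im_boffset)
{
	struct demo_range	r;

	r.first = im_boffset +
			(unsigned int)offsetof(struct demo_dinode, di_next_unlinked);
	/* The end of the range is inclusive, hence the "- 1". */
	r.last = r.first + (unsigned int)sizeof(uint32_t) - 1;
	return r;
}
```

Logging only these four bytes, rather than the whole inode, matches what the patch does with `offset` and `offset + sizeof(xfs_agino_t) - 1`.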

* [PATCH 12/13] xfs: add in-memory iunlink log item
  2020-08-12  9:25 [PATCH 00/13] xfs: in memory inode unlink log items Dave Chinner
                   ` (10 preceding siblings ...)
  2020-08-12  9:25 ` [PATCH 11/13] xfs: combine iunlink inode update functions Dave Chinner
@ 2020-08-12  9:25 ` Dave Chinner
  2020-08-19  0:35   ` Darrick J. Wong
  2020-08-12  9:25 ` [PATCH 13/13] xfs: reorder iunlink remove operation in xfs_ifree Dave Chinner
  2020-08-18 18:17 ` [PATCH 00/13] xfs: in memory inode unlink log items Darrick J. Wong
  13 siblings, 1 reply; 51+ messages in thread
From: Dave Chinner @ 2020-08-12  9:25 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Now that we have a clean operation to update the di_next_unlinked
field of inode cluster buffers, we can easily defer this operation
to transaction commit time so we can order the inode cluster buffer
locking consistently.

To do this, we introduce a new in-memory log item to track the
unlinked list item modification that we are going to make. This
follows the same observations as the in-memory double linked list
used to track unlinked inodes in that the inodes on the list are
pinned in memory and cannot go away, and hence we can simply
reference them for the duration of the transaction without needing
to take active references or pin them or look them up.

This allows us to pass the xfs_inode to the transaction commit code
along with the modification to be made, and then order the logged
modifications via the ->iop_sort and ->iop_precommit operations
for the new log item type. As this is an in-memory log item, it
doesn't have formatting, CIL or AIL operational hooks - it exists
purely to run the inode unlink modifications and is then removed
from the transaction item list and freed once the precommit
operation has run.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/Makefile           |   1 +
 fs/xfs/xfs_inode.c        |  61 ++------------
 fs/xfs/xfs_iunlink_item.c | 168 ++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_iunlink_item.h |  25 ++++++
 fs/xfs/xfs_super.c        |  10 +++
 5 files changed, 209 insertions(+), 56 deletions(-)
 create mode 100644 fs/xfs/xfs_iunlink_item.c
 create mode 100644 fs/xfs/xfs_iunlink_item.h

diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 04611a1068b4..febdf034ca94 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -105,6 +105,7 @@ xfs-y				+= xfs_log.o \
 				   xfs_icreate_item.o \
 				   xfs_inode_item.o \
 				   xfs_inode_item_recover.o \
+				   xfs_iunlink_item.o \
 				   xfs_refcount_item.o \
 				   xfs_rmap_item.o \
 				   xfs_log_recover.o \
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 82242d15b1d7..ce128ff12762 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -36,6 +36,7 @@
 #include "xfs_log_priv.h"
 #include "xfs_bmap_btree.h"
 #include "xfs_reflink.h"
+#include "xfs_iunlink_item.h"
 
 kmem_zone_t *xfs_inode_zone;
 
@@ -1972,51 +1973,6 @@ xfs_iunlink_update_bucket(
 	return 0;
 }
 
-/*
- * Look up the inode cluster buffer and log the on-disk unlinked inode change
- * we need to make.
- */
-STATIC int
-xfs_iunlink_log_inode(
-	struct xfs_trans	*tp,
-	struct xfs_inode	*ip,
-	xfs_agnumber_t		agno,
-	xfs_agino_t		old_agino,
-	xfs_agino_t		next_agino)
-{
-	struct xfs_mount	*mp = tp->t_mountp;
-	struct xfs_dinode	*dip;
-	struct xfs_buf		*ibp;
-	int			offset;
-	int			error;
-
-	ASSERT(xfs_verify_agino_or_null(mp, agno, next_agino));
-
-	error = xfs_imap_to_bp(mp, tp, &ip->i_imap, &dip, &ibp, 0);
-	if (error)
-		return error;
-
-	if (be32_to_cpu(dip->di_next_unlinked) != old_agino) {
-		xfs_inode_verifier_error(ip, -EFSCORRUPTED, __func__, dip,
-					sizeof(*dip), __this_address);
-		xfs_trans_brelse(tp, ibp);
-		return -EFSCORRUPTED;
-	}
-
-	trace_xfs_iunlink_update_dinode(mp, agno,
-			XFS_INO_TO_AGINO(mp, ip->i_ino),
-			be32_to_cpu(dip->di_next_unlinked), next_agino);
-
-	dip->di_next_unlinked = cpu_to_be32(next_agino);
-	offset = ip->i_imap.im_boffset +
-			offsetof(struct xfs_dinode, di_next_unlinked);
-
-	xfs_dinode_calc_crc(mp, dip);
-	xfs_trans_inode_buf(tp, ibp);
-	xfs_trans_log_buf(tp, ibp, offset, offset + sizeof(xfs_agino_t) - 1);
-	return 0;
-}
-
 static int
 xfs_iunlink_insert_inode(
 	struct xfs_trans	*tp,
@@ -2028,7 +1984,6 @@ xfs_iunlink_insert_inode(
 	xfs_agino_t		next_agino = NULLAGINO;
 	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
 	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
-	int			error;
 
 	nip = list_first_entry_or_null(&agibp->b_pag->pag_ici_unlink_list,
 					struct xfs_inode, i_unlink);
@@ -2039,10 +1994,7 @@ xfs_iunlink_insert_inode(
 		 * inode to the current head of the list.
 		 */
 		next_agino = XFS_INO_TO_AGINO(mp, nip->i_ino);
-		error = xfs_iunlink_log_inode(tp, ip, agno, NULLAGINO,
-						 next_agino);
-		if (error)
-			return error;
+		xfs_iunlink_log(tp, ip, NULLAGINO, next_agino);
 	}
 
 	/* Point the head of the list to point to this inode. */
@@ -2098,7 +2050,6 @@ xfs_iunlink_remove_inode(
 	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
 	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
 	xfs_agino_t		next_agino = NULLAGINO;
-	int			error;
 
 	/*
 	 * Get the next agino in the list. If we are at the end of the list,
@@ -2112,16 +2063,14 @@ xfs_iunlink_remove_inode(
 	}
 
 	/* Clear the on disk next unlinked pointer for this inode. */
-	error = xfs_iunlink_log_inode(tp, ip, agno, next_agino, NULLAGINO);
-	if (error)
-		return error;
-
+	xfs_iunlink_log(tp, ip, next_agino, NULLAGINO);
 
 	if (ip != list_first_entry(&agibp->b_pag->pag_ici_unlink_list,
 					struct xfs_inode, i_unlink)) {
 		struct xfs_inode *pip = list_prev_entry(ip, i_unlink);
 
-		return xfs_iunlink_log_inode(tp, pip, agno, agino, next_agino);
+		xfs_iunlink_log(tp, pip, agino, next_agino);
+		return 0;
 	}
 
 	/* Point the head of the list to the next unlinked inode. */
diff --git a/fs/xfs/xfs_iunlink_item.c b/fs/xfs/xfs_iunlink_item.c
new file mode 100644
index 000000000000..2ee05f98aa97
--- /dev/null
+++ b/fs/xfs/xfs_iunlink_item.c
@@ -0,0 +1,168 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2020, Red Hat, Inc.
+ * All Rights Reserved.
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_inode.h"
+#include "xfs_trans.h"
+#include "xfs_trans_priv.h"
+#include "xfs_iunlink_item.h"
+#include "xfs_trace.h"
+#include "xfs_error.h"
+
+struct kmem_cache	*xfs_iunlink_zone;
+
+static inline struct xfs_iunlink_item *IUL_ITEM(struct xfs_log_item *lip)
+{
+	return container_of(lip, struct xfs_iunlink_item, iu_item);
+}
+
+static void
+xfs_iunlink_item_release(
+	struct xfs_log_item	*lip)
+{
+	kmem_cache_free(xfs_iunlink_zone, IUL_ITEM(lip));
+}
+
+
+static uint64_t
+xfs_iunlink_item_sort(
+	struct xfs_log_item	*lip)
+{
+	return IUL_ITEM(lip)->iu_ip->i_ino;
+}
+
+/*
+ * Look up the inode cluster buffer and log the on-disk unlinked inode change
+ * we need to make.
+ */
+static int
+xfs_iunlink_log_inode(
+	struct xfs_trans	*tp,
+	struct xfs_inode	*ip,
+	xfs_agino_t		old_agino,
+	xfs_agino_t		next_agino)
+{
+	struct xfs_mount	*mp = tp->t_mountp;
+	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
+	struct xfs_dinode	*dip;
+	struct xfs_buf		*ibp;
+	int			offset;
+	int			error;
+
+	ASSERT(xfs_verify_agino_or_null(mp, agno, next_agino));
+
+	error = xfs_imap_to_bp(mp, tp, &ip->i_imap, &dip, &ibp, 0);
+	if (error)
+		return error;
+
+	/*
+	 * Don't bother updating the unlinked field on stale buffers as
+	 * it will never get to disk anyway.
+	 */
+	if (ibp->b_flags & XBF_STALE)
+		return 0;
+
+	if (be32_to_cpu(dip->di_next_unlinked) != old_agino) {
+		xfs_inode_verifier_error(ip, -EFSCORRUPTED, __func__, dip,
+					sizeof(*dip), __this_address);
+		xfs_trans_brelse(tp, ibp);
+		return -EFSCORRUPTED;
+	}
+
+	trace_xfs_iunlink_update_dinode(mp, agno,
+			XFS_INO_TO_AGINO(mp, ip->i_ino),
+			be32_to_cpu(dip->di_next_unlinked), next_agino);
+
+	dip->di_next_unlinked = cpu_to_be32(next_agino);
+	offset = ip->i_imap.im_boffset +
+			offsetof(struct xfs_dinode, di_next_unlinked);
+
+	xfs_dinode_calc_crc(mp, dip);
+	xfs_trans_inode_buf(tp, ibp);
+	xfs_trans_log_buf(tp, ibp, offset, offset + sizeof(xfs_agino_t) - 1);
+	return 0;
+}
+
+/*
+ * On precommit, we grab the inode cluster buffer for the inode number
+ * we were passed, then update the next unlinked field for that inode in
+ * the buffer and log the buffer. This ensures that the inode cluster buffer
+ * was logged in the correct order w.r.t. other inode cluster buffers.
+ *
+ * Note: if the inode cluster buffer is marked stale, this transaction is
+ * actually freeing the inode cluster. In that case, do not relog the buffer
+ * as this removes the stale state from it. That then causes the post-commit
+ * processing that is dependent on the cluster buffer being stale to go wrong
+ * and we'll leave stale inodes in the AIL that cannot be removed, hanging the
+ * log.
+ */
+static int
+xfs_iunlink_item_precommit(
+	struct xfs_trans	*tp,
+	struct xfs_log_item	*lip)
+{
+	struct xfs_iunlink_item	*iup = IUL_ITEM(lip);
+	int			error;
+
+	error = xfs_iunlink_log_inode(tp, iup->iu_ip, iup->iu_old_agino,
+					iup->iu_next_agino);
+
+	/*
+	 * This log item only exists to perform this action. We now remove
+	 * it from the transaction and free it as it should never reach the
+	 * CIL.
+	 */
+	list_del(&lip->li_trans);
+	xfs_iunlink_item_release(lip);
+	return error;
+}
+
+static const struct xfs_item_ops xfs_iunlink_item_ops = {
+	.iop_release	= xfs_iunlink_item_release,
+	.iop_sort	= xfs_iunlink_item_sort,
+	.iop_precommit	= xfs_iunlink_item_precommit,
+};
+
+
+/*
+ * Allocate and initialise an in-memory iunlink log item for this inode.
+ *
+ * The item records the inode along with the old and new values of its
+ * on-disk next unlinked pointer. The inode cluster buffer itself is not
+ * touched until the precommit operation runs.
+ *
+ * This joins the item to the transaction and marks it dirty so
+ * that we don't need a separate call to do this, nor does the
+ * caller need to know anything about the iunlink item.
+ */
+void
+xfs_iunlink_log(
+	struct xfs_trans	*tp,
+	struct xfs_inode	*ip,
+	xfs_agino_t		old_agino,
+	xfs_agino_t		next_agino)
+{
+	struct xfs_iunlink_item	*iup;
+
+	iup = kmem_cache_zalloc(xfs_iunlink_zone, GFP_KERNEL | __GFP_NOFAIL);
+
+	xfs_log_item_init(tp->t_mountp, &iup->iu_item, XFS_LI_IUNLINK,
+			  &xfs_iunlink_item_ops);
+
+	iup->iu_ip = ip;
+	iup->iu_next_agino = next_agino;
+	iup->iu_old_agino = old_agino;
+
+	xfs_trans_add_item(tp, &iup->iu_item);
+	tp->t_flags |= XFS_TRANS_DIRTY;
+	set_bit(XFS_LI_DIRTY, &iup->iu_item.li_flags);
+}
+
diff --git a/fs/xfs/xfs_iunlink_item.h b/fs/xfs/xfs_iunlink_item.h
new file mode 100644
index 000000000000..f2b95032cf6b
--- /dev/null
+++ b/fs/xfs/xfs_iunlink_item.h
@@ -0,0 +1,25 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2020, Red Hat, Inc.
+ * All Rights Reserved.
+ */
+#ifndef XFS_IUNLINK_ITEM_H
+#define XFS_IUNLINK_ITEM_H	1
+
+struct xfs_trans;
+struct xfs_inode;
+
+/* in memory log item structure */
+struct xfs_iunlink_item {
+	struct xfs_log_item	iu_item;
+	struct xfs_inode	*iu_ip;
+	xfs_agino_t		iu_next_agino;
+	xfs_agino_t		iu_old_agino;
+};
+
+extern kmem_zone_t *xfs_iunlink_zone;
+
+void xfs_iunlink_log(struct xfs_trans *tp, struct xfs_inode *ip,
+			xfs_agino_t old_agino, xfs_agino_t next_agino);
+
+#endif	/* XFS_IUNLINK_ITEM_H */
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 68ec8db12cc7..b8f66ccc7090 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -35,6 +35,7 @@
 #include "xfs_refcount_item.h"
 #include "xfs_bmap_item.h"
 #include "xfs_reflink.h"
+#include "xfs_iunlink_item.h"
 
 #include <linux/magic.h>
 #include <linux/fs_context.h>
@@ -1969,8 +1970,16 @@ xfs_init_zones(void)
 	if (!xfs_bui_zone)
 		goto out_destroy_bud_zone;
 
+	xfs_iunlink_zone = kmem_cache_create("xfs_iul_item",
+					     sizeof(struct xfs_iunlink_item),
+					     0, 0, NULL);
+	if (!xfs_iunlink_zone)
+		goto out_destroy_bui_zone;
+
 	return 0;
 
+ out_destroy_bui_zone:
+	kmem_cache_destroy(xfs_bui_zone);
  out_destroy_bud_zone:
 	kmem_cache_destroy(xfs_bud_zone);
  out_destroy_cui_zone:
@@ -2017,6 +2026,7 @@ xfs_destroy_zones(void)
 	 * destroy caches.
 	 */
 	rcu_barrier();
+	kmem_cache_destroy(xfs_iunlink_zone);
 	kmem_cache_destroy(xfs_bui_zone);
 	kmem_cache_destroy(xfs_bud_zone);
 	kmem_cache_destroy(xfs_cui_zone);
-- 
2.26.2.761.g0e0b3e54be


^ permalink raw reply related	[flat|nested] 51+ messages in thread
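
Editor's note on patch 12: the commit message describes items that carry an inode plus the old and new unlinked pointer values, are sorted, have their precommit operation run in that order, and are then freed without ever reaching the CIL. The userspace sketch below illustrates that life cycle only. All `demo_*` names are hypothetical, sorting with qsort() on the inode number is an assumption (the transaction-side sorting code is not part of this message), and recording the inode number stands in for locking and logging the inode cluster buffer.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define DEMO_NULLAGINO	((uint32_t)-1)

/* Hypothetical stand-in for struct xfs_iunlink_item. */
struct demo_iunlink_item {
	uint64_t	ino;		/* sort key, as returned by ->iop_sort */
	uint32_t	old_agino;
	uint32_t	next_agino;
};

static struct demo_iunlink_item *
demo_item_alloc(uint64_t ino, uint32_t old_agino, uint32_t next_agino)
{
	struct demo_iunlink_item *item = calloc(1, sizeof(*item));

	if (!item)
		abort();
	item->ino = ino;
	item->old_agino = old_agino;
	item->next_agino = next_agino;
	return item;
}

/* qsort() comparator over an array of item pointers, ascending by ino. */
static int
demo_item_cmp(const void *a, const void *b)
{
	const struct demo_iunlink_item *ia = *(struct demo_iunlink_item * const *)a;
	const struct demo_iunlink_item *ib = *(struct demo_iunlink_item * const *)b;

	if (ia->ino < ib->ino)
		return -1;
	return ia->ino > ib->ino;
}

/*
 * Sort the items, "precommit" each one in sorted order, then free it.
 * order[] receives the inode numbers in the order they were processed.
 */
static size_t
demo_precommit_all(struct demo_iunlink_item **items, size_t nr, uint64_t *order)
{
	size_t	i;

	qsort(items, nr, sizeof(*items), demo_item_cmp);
	for (i = 0; i < nr; i++) {
		order[i] = items[i]->ino;	/* buffer lock + log would go here */
		free(items[i]);			/* item never outlives precommit */
		items[i] = NULL;
	}
	return nr;
}
```

Whatever order the callers queued the modifications in, the buffers end up being visited in a single deterministic order, which is the property the patch relies on.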

* [PATCH 13/13] xfs: reorder iunlink remove operation in xfs_ifree
  2020-08-12  9:25 [PATCH 00/13] xfs: in memory inode unlink log items Dave Chinner
                   ` (11 preceding siblings ...)
  2020-08-12  9:25 ` [PATCH 12/13] xfs: add in-memory iunlink log item Dave Chinner
@ 2020-08-12  9:25 ` Dave Chinner
  2020-08-12 11:12   ` Gao Xiang
  2020-08-19  0:45   ` Darrick J. Wong
  2020-08-18 18:17 ` [PATCH 00/13] xfs: in memory inode unlink log items Darrick J. Wong
  13 siblings, 2 replies; 51+ messages in thread
From: Dave Chinner @ 2020-08-12  9:25 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

The O_TMPFILE creation implementation creates a specific order of
operations for inode allocation/freeing and unlinked list
modification. Currently both are serialised by the AGI, so the order
doesn't strictly matter as long as they are both in the same
transaction.

However, if we want to move the unlinked list insertions largely
out from under the AGI lock, then we have to be concerned about the
order in which we do unlinked list modification operations.
O_TMPFILE creation tells us this order is inode allocation/free,
then unlinked list modification.

Change xfs_ifree() to use this same ordering on unlinked list
removal. This way we always guarantee that when we enter the
iunlinked list removal code from this path, we have the AGI already
locked and we don't have to worry about lock nesting AGI reads
inside unlink list locks because it's already locked and attached to
the transaction.

We can do this safely as the inode freeing and unlinked list removal
are done in the same transaction and hence are atomic operations
with respect to log recovery.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_inode.c | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index ce128ff12762..7ee778bcde06 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2283,14 +2283,13 @@ xfs_ifree_cluster(
 }
 
 /*
- * This is called to return an inode to the inode free list.
- * The inode should already be truncated to 0 length and have
- * no pages associated with it.  This routine also assumes that
- * the inode is already a part of the transaction.
+ * This is called to return an inode to the inode free list.  The inode should
+ * already be truncated to 0 length and have no pages associated with it.  This
+ * routine also assumes that the inode is already a part of the transaction.
  *
- * The on-disk copy of the inode will have been added to the list
- * of unlinked inodes in the AGI. We need to remove the inode from
- * that list atomically with respect to freeing it here.
+ * The on-disk copy of the inode will have been added to the list of unlinked
+ * inodes in the AGI. We need to remove the inode from that list atomically with
+ * respect to freeing it here.
  */
 int
 xfs_ifree(
@@ -2308,13 +2307,16 @@ xfs_ifree(
 	ASSERT(ip->i_d.di_nblocks == 0);
 
 	/*
-	 * Pull the on-disk inode from the AGI unlinked list.
+	 * Free the inode first so that we guarantee that the AGI lock is going
+	 * to be taken before we remove the inode from the unlinked list. This
+	 * makes the AGI lock -> unlinked list modification order the same as
+	 * used in O_TMPFILE creation.
 	 */
-	error = xfs_iunlink_remove(tp, ip);
+	error = xfs_difree(tp, ip->i_ino, &xic);
 	if (error)
 		return error;
 
-	error = xfs_difree(tp, ip->i_ino, &xic);
+	error = xfs_iunlink_remove(tp, ip);
 	if (error)
 		return error;
 
-- 
2.26.2.761.g0e0b3e54be


^ permalink raw reply related	[flat|nested] 51+ messages in thread
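
Editor's note on patch 13: the reordering is about a lock hierarchy in which the AGI is always taken before any unlinked list lock. The toy checker below sketches that rule in userspace C. It assumes a hypothetical separate unlinked list lock (not introduced in the patches shown here), and `demo_lock()` and `struct demo_trans` are illustrative names rather than XFS interfaces.

```c
#include <assert.h>

/* Lock ranks: lower values must be taken first. */
enum demo_lock {
	DEMO_LOCK_AGI = 0,
	DEMO_LOCK_IUNLINK = 1,	/* hypothetical unlinked list lock */
};

/* Bitmask of locks already held by (attached to) this transaction. */
struct demo_trans {
	unsigned int	held;
};

/*
 * Returns 0 if the lock can be taken without violating the ordering, or
 * -1 if a higher-ranked lock is already held and this one is not.
 */
static int
demo_lock(struct demo_trans *tp, enum demo_lock lock)
{
	unsigned int	bit = 1u << lock;
	unsigned int	higher = ~(bit | (bit - 1));

	if (tp->held & bit)
		return 0;	/* already locked and attached: nothing to do */
	if (tp->held & higher)
		return -1;	/* would nest a lower rank inside a higher one */
	tp->held |= bit;
	return 0;
}
```

With xfs_difree() called first, the AGI is already held when the unlinked list is modified, so a later AGI lookup is the harmless "already held" case. With the old call order and a separate list lock, the AGI would be requested while the list lock is held, which is the nesting the commit message wants to avoid.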

* Re: [PATCH 13/13] xfs: reorder iunlink remove operation in xfs_ifree
  2020-08-12  9:25 ` [PATCH 13/13] xfs: reorder iunlink remove operation in xfs_ifree Dave Chinner
@ 2020-08-12 11:12   ` Gao Xiang
  2020-08-19  0:45   ` Darrick J. Wong
  1 sibling, 0 replies; 51+ messages in thread
From: Gao Xiang @ 2020-08-12 11:12 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Aug 12, 2020 at 07:25:56PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The O_TMPFILE creation implementation creates a specific order of
> operations for inode allocation/freeing and unlinked list
> modification. Currently both are serialised by the AGI, so the order
> doesn't strictly matter as long as they are both in the same
> transaction.
> 
> However, if we want to move the unlinked list insertions largely
> out from under the AGI lock, then we have to be concerned about the
> order in which we do unlinked list modification operations.
> O_TMPFILE creation tells us this order is inode allocation/free,
> then unlinked list modification.
> 
> Change xfs_ifree() to use this same ordering on unlinked list
> removal. This way we always guarantee that when we enter the
> iunlinked list removal code from this path, we have the AGI already
> locked and we don't have to worry about lock nesting AGI reads
> inside unlink list locks because it's already locked and attached to
> the transaction.
> 
> We can do this safely as the inode freeing and unlinked list removal
> are done in the same transaction and hence are atomic operations
> with respect to log recovery.

Yeah, given all these constraints, this reordering is much cleaner.
Otherwise xfs_iunlink_remove() has to forcibly take the AGI lock in
advance, which is what I did in my new v3 (because of the existing
AGI lock in xfs_difree())...
https://git.kernel.org/pub/scm/linux/kernel/git/xiang/linux.git/tree/fs/xfs/xfs_inode.c?h=xfs/iunlink_opt_v3#n2511
https://git.kernel.org/pub/scm/linux/kernel/git/xiang/linux.git/commit/fs/xfs/xfs_inode.c?h=xfs/iunlink_opt_v3&id=79a6a18a7f13d12726c2554e2581a56fc473b152

Since the new base patchset is out, I will look into this patchset
and skip sending out my v3 (it looks like the previous logging order
issues have been resolved), and directly rebase the remaining patches
into v4.

Thanks,
Gao Xiang

> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_inode.c | 22 ++++++++++++----------
>  1 file changed, 12 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index ce128ff12762..7ee778bcde06 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -2283,14 +2283,13 @@ xfs_ifree_cluster(
>  }
>  
>  /*
> - * This is called to return an inode to the inode free list.
> - * The inode should already be truncated to 0 length and have
> - * no pages associated with it.  This routine also assumes that
> - * the inode is already a part of the transaction.
> + * This is called to return an inode to the inode free list.  The inode should
> + * already be truncated to 0 length and have no pages associated with it.  This
> + * routine also assumes that the inode is already a part of the transaction.
>   *
> - * The on-disk copy of the inode will have been added to the list
> - * of unlinked inodes in the AGI. We need to remove the inode from
> - * that list atomically with respect to freeing it here.
> + * The on-disk copy of the inode will have been added to the list of unlinked
> + * inodes in the AGI. We need to remove the inode from that list atomically with
> + * respect to freeing it here.
>   */
>  int
>  xfs_ifree(
> @@ -2308,13 +2307,16 @@ xfs_ifree(
>  	ASSERT(ip->i_d.di_nblocks == 0);
>  
>  	/*
> -	 * Pull the on-disk inode from the AGI unlinked list.
> +	 * Free the inode first so that we guarantee that the AGI lock is going
> +	 * to be taken before we remove the inode from the unlinked list. This
> +	 * makes the AGI lock -> unlinked list modification order the same as
> +	 * used in O_TMPFILE creation.
>  	 */
> -	error = xfs_iunlink_remove(tp, ip);
> +	error = xfs_difree(tp, ip->i_ino, &xic);
>  	if (error)
>  		return error;
>  
> -	error = xfs_difree(tp, ip->i_ino, &xic);
> +	error = xfs_iunlink_remove(tp, ip);
>  	if (error)
>  		return error;
>  
> -- 
> 2.26.2.761.g0e0b3e54be
> 


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 00/13] xfs: in memory inode unlink log items
  2020-08-12  9:25 [PATCH 00/13] xfs: in memory inode unlink log items Dave Chinner
                   ` (12 preceding siblings ...)
  2020-08-12  9:25 ` [PATCH 13/13] xfs: reorder iunlink remove operation in xfs_ifree Dave Chinner
@ 2020-08-18 18:17 ` Darrick J. Wong
  2020-08-18 20:01   ` Gao Xiang
  2020-08-18 21:42   ` Dave Chinner
  13 siblings, 2 replies; 51+ messages in thread
From: Darrick J. Wong @ 2020-08-18 18:17 UTC (permalink / raw)
  To: Dave Chinner, hsiangkao; +Cc: linux-xfs

On Wed, Aug 12, 2020 at 07:25:43PM +1000, Dave Chinner wrote:
> Hi folks,
> 
> This is a cleaned up version of the original RFC I posted here:
> 
> https://lore.kernel.org/linux-xfs/20200623095015.1934171-1-david@fromorbit.com/
> 
> The original description is preserved below for quick reference,
> I'll just walk through the changes in this version:
> 
> - rebased on current TOT and xfs/for-next
> - split up into many smaller patches
> - includes Xiang's single unlinked list bucket modification
> - uses a list_head for the in memory double unlinked inode list
>   rather than aginos and lockless inode lookups
> - much simpler as it doesn't need to look up inodes from agino
>   values
> - iunlink log item changed to take an xfs_inode pointer rather than
>   an imap and agino values
> - a handful of small cleanups that breaking up into small patches
>   allowed.

Two questions: How does this patchset intersect with the other one that
changes the iunlink series?  I guess the v4 of that series (when it
appears) is intended to be applied directly after this one?

The second is that I got this corruption warning on generic/043 with...

FSTYP         -- xfs (debug)
PLATFORM      -- Linux/x86_64 ca-nfsdev6-mtr01 5.9.0-rc1-djw #rc1 SMP PREEMPT Mon Aug 17 20:13:04 PDT 2020
MKFS_OPTIONS  -- -f -m reflink=1,rmapbt=1 -i sparse=1, -b size=1024, /dev/sdd
MOUNT_OPTIONS -- -o usrquota,grpquota,prjquota, /dev/sdd /opt

[16533.664277] run fstests generic/043 at 2020-08-18 00:50:48
[16534.875994] XFS (sde): Mounting V5 Filesystem
[16534.889508] XFS (sde): Ending clean mount
[16534.893661] xfs filesystem being mounted at /mnt supports timestamps until 2038 (0x7fffffff)
[16535.403285] XFS (sdd): Mounting V5 Filesystem
[16535.412082] XFS (sdd): Ending clean mount
[16535.414126] XFS (sdd): Quotacheck needed: Please wait.
[16535.450551] XFS (sdd): Quotacheck: Done.
[16535.453583] xfs filesystem being mounted at /opt supports timestamps until 2038 (0x7fffffff)
[16535.468595] XFS (sdd): User initiated shutdown received. Shutting down filesystem
[16535.477876] XFS (sdd): Unmounting Filesystem
[16535.787559] XFS (sdd): Mounting V5 Filesystem
[16535.797105] XFS (sdd): Ending clean mount
[16535.801363] XFS (sdd): Quotacheck needed: Please wait.
[16535.838561] XFS (sdd): Quotacheck: Done.
[16535.841371] xfs filesystem being mounted at /opt supports timestamps until 2038 (0x7fffffff)
[16556.765496] XFS (sdd): User initiated shutdown received. Shutting down filesystem
[16556.898239] XFS (sdd): Unmounting Filesystem
[16556.903292] list_del corruption. next->prev should be ffff88802dbb0fc8, but was ffff888008d46050
[16556.905487] ------------[ cut here ]------------
[16556.906424] kernel BUG at lib/list_debug.c:54!
[16556.907314] invalid opcode: 0000 [#1] PREEMPT SMP
[16556.908216] CPU: 0 PID: 2975390 Comm: xfsaild/sdd Tainted: G        W         5.9.0-rc1-djw #rc1
[16556.909816] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1 04/01/2014
[16556.911406] RIP: 0010:__list_del_entry_valid.cold+0x1d/0x51
[16556.912453] Code: c7 c7 e0 a2 e4 81 e8 55 1e cf ff 0f 0b 48 89 fe 48 c7 c7 70 a3 e4 81 e8 44 1e cf ff 0f 0b 48 c7 c7 20 a4 e4 81 e8 36 1e cf ff <0f> 0b 48 89 f2 48 89 fe 48 c7 c7 e0 a3 e4 81 e8 22 1e cf ff 0f 0b
[16556.915781] RSP: 0018:ffffc900018abd58 EFLAGS: 00010246
[16556.916782] RAX: 0000000000000054 RBX: ffff88802dbb0f78 RCX: 0000000000000000
[16556.918081] RDX: 0000000000000000 RSI: ffffffff81e2b51a RDI: 00000000ffffffff
[16556.919385] RBP: ffff88804e615a00 R08: 0000000000000001 R09: 0000000000000001
[16556.920691] R10: 0000000000000001 R11: 0000000000000001 R12: ffff88804e615be0
[16556.921995] R13: ffff88802dbb0fc8 R14: ffff88806a6dd340 R15: ffff88802dbb1018
[16556.923304] FS:  0000000000000000(0000) GS:ffff88803ea00000(0000) knlGS:0000000000000000
[16556.924860] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[16556.925939] CR2: 00005593e4e4dd5c CR3: 000000003d5f0002 CR4: 00000000001706b0
[16556.927280] Call Trace:
[16556.927894]  xfs_iflush_abort+0x80/0x110 [xfs]
[16556.928828]  xfs_iflush_cluster+0x4fb/0x920 [xfs]
[16556.929734]  ? rcu_read_lock_sched_held+0x56/0x80
[16556.930721]  xfs_inode_item_push+0xac/0x150 [xfs]
[16556.931707]  xfsaild+0x61b/0x13b0 [xfs]
[16556.932452]  ? kvm_clock_read+0x14/0x30
[16556.933197]  ? sched_clock+0x9/0x10
[16556.933886]  ? trace_hardirqs_on+0x20/0xf0
[16556.934760]  ? xfs_trans_ail_cursor_first+0x80/0x80 [xfs]
[16556.935755]  kthread+0x13c/0x180
[16556.936343]  ? kthread_park+0x90/0x90
[16556.937027]  ret_from_fork+0x1f/0x30

--D

> The patchset passes fstests for v5 filesystems - v4 filesystems
> testing is currently running, though I don't expect any new problems
> there.
> 
> Code can be found here:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git xfs-iunlink-item-2
> 
> Comments, thoughts, testing, etc all welcome.
> 
> -Dave.
> 
> ============
> 
> [Original RFC text]
> 
> Inode cluster buffer pinning by dirty inodes allows us to improve
> dirty inode tracking efficiency in the log by logging the inode
> cluster buffer as an ordered transaction. However, this brings with
> it some new issues, namely the order in which we lock inode cluster
> buffers.
> 
> That is, transactions that dirty and commit multiple inodes in a
> transaction will now need to lock multiple inode cluster buffers
> in each transaction (e.g. create, rename, etc). This introduces new 
> lock ordering constraints in these operations. It also introduces
> lock ordering constraints between the AGI and inode cluster buffers
> as a result of allocation/freeing being serialised by the AGI
> buffer lock. And then there is unlinked inode list logging, which
> currently has no fixed order of inode cluster buffer locking.
> 
> It's a bit messy.
> 
> Locking pure inode modifications in order is relatively easy. We
> don't actually need to attach and log the buffer to the transaction
> until the last moment. We have all the inodes locked, so nothing
> other than unlinked inode list modification can race with the
> transaction modifying inodes. Hence we can safely move the
> attachment of the inodes to the cluster buffer from when we first
> dirty them in xfs_trans_log_inode to just before we commit the
> transaction.
> 
> At this point, all the inodes that have been dirtied in the
> transaction have already been locked, modified, logged and attached
> to the transaction. Hence if we add a hook into xfs_trans_commit()
> to run a "precommit" operation on these log items, we can use this
> operation to attach the inodes to the cluster buffer at commit time
> instead of in xfs_trans_log_inode().
> 
> This, by itself, doesn't solve the lock ordering problem. What it
> does do, however, is give us a place where we can -order- all the
> dirty items in the transaction list. Hence before we call the
> precommit operation on each log item, we sort them. This allows us
> to sort all the inode items so that the pre-commit functions that
> lock and log the cluster buffers are run in a deterministic order.
> This solves the lock order problem for pure inode modifications.
> 
> The unlinked inode list buffer locking is more complex. The unlinked
> list is unordered - we add to the tail, remove from wherever the
> inode is in the list. Hence we might need to lock two inode buffers
> here (previous inode in list and the one being removed). While we
> can order the locking of these buffers correctly within the confines
> of the unlinked list, there may be other inodes that need buffer
> locking in the same transaction. e.g. O_TMPFILE being linked into a
> directory also modifies the directory inode.
> 
> Hence we need a mechanism for deferring unlinked inode list updates
> to the pre-commit operation where it can be sorted into the correct
> order. We can do this by first observing that we serialise unlinked
> list modifications by holding the AGI buffer lock. IOWs, the AGI is
> going to be locked until the transaction commits any time we modify
> the unlinked list. Hence it doesn't matter when in the transaction
> we actually load, lock and modify the inode cluster buffer.
> 
> IOWs, what we need is an unlinked inode log item to defer the inode
> cluster buffer update to transaction commit time where it can be
> ordered with all the other inode cluster operations. Essentially all
> we need to do is record the inodes that need to have their unlinked
> list pointer updated in a new log item that we attach to the
> transaction.
> 
> This log item exists purely for the purpose of delaying the update
> of the unlinked list pointer until the inode cluster buffer can be
> locked in the correct order around the other inode cluster buffers.
> It plays no part in the actual commit, and there's no change to
> anything that is written to the log. i.e. the inode cluster buffers
> still have to be fully logged here (not just ordered) as log
> recovery depends on this to replay mods to the unlinked inode
> list.
> 
> To make this unlinked inode list processing simpler and easier to
> implement as a log item, we need to change the way we track the
> unlinked list in memory. Starting from the observation that an inode
> on the unlinked list is pinned in memory by the VFS, we can use the
> xfs_inode itself to track the unlinked list. To do this efficiently,
> we want the unlinked list to be a double linked list. The current
> implementation takes the approach of minimising the memory footprint
> of this list in case we don't want to burn 16 bytes of memory per
> inode for a largely unused list head. [*]
> 
> We can get this down to 8 bytes per inode because the unlinked list
> is per-ag, and hence we only need to store the agino portion of the
> inode number as list pointers. We can then use these for lockless
> inode cache lookups to retrieve the inode. The aginos in the inode
> are modified only under the AGI lock, just like the cluster buffer
> pointers, so we don't need any extra locking here.  The
> i_next_unlinked field tracks the on-disk value of the unlinked list,
> and the i_prev_unlinked is a purely in-memory pointer that enables
> us to efficiently remove inodes from the middle of the list.
> 
> IOWs, we burn a bit more CPU to resolve the unlinked list pointers
> to save 8 bytes of memory per inode. If we decide that 8 bytes of
> memory isn't a big cost, we can convert this to a list_head and just
> link the inodes directly to an unlinked list head in the perag.[**]
> 
> This gets rid of the entire unlinked list reference hash table that
> is used to track this back pointer relationship, greatly simplifying
> the unlinked list modification code.
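
A rough userspace sketch of why a doubly linked list embedded in the
inode removes the need for a back-reference lookup: removal from the
middle of the list only touches the neighbours. The demo_* names are
hypothetical; the kernel would use struct list_head and the list_*()
helpers rather than these hand-rolled ones.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modelled on the kernel's list_head. */
struct demo_list_head {
	struct demo_list_head	*next;
	struct demo_list_head	*prev;
};

/* Hypothetical in-memory inode carrying its own unlinked list linkage. */
struct demo_inode {
	unsigned long		ino;
	struct demo_list_head	unlinked;
};

static void
demo_list_init(struct demo_list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Insert @entry just before @head, i.e. at the tail of the list. */
static void
demo_list_add_tail(struct demo_list_head *entry, struct demo_list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

/* O(1) removal: no walk from the list head to find the previous entry. */
static void
demo_list_del(struct demo_list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	demo_list_init(entry);
}

/* Recover the containing inode from its embedded list linkage. */
#define demo_inode_of(ptr) \
	((struct demo_inode *)((char *)(ptr) - \
			       offsetof(struct demo_inode, unlinked)))
```

The trade-off is the one discussed above: two pointers (16 bytes on
64-bit) per inode instead of two 32-bit aginos, in exchange for not
having to resolve aginos back to inodes.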
> 
> Comments, flames, thoughts all welcome.
> 
> -Dave.
> 
> [*] An in-memory double linked list removes the need for keeping
> lists short to minimise previous inode lookup overhead when removing
> from the list. The current backref hash has this function, but it's
> not obvious that it can do this and it's a kinda complex way of
> implementing a double linked list.
> 
> Once we've removed the need for keeping the lists short, we no
> longer need the on-disk hash for unlinked lists, so we can put all
> the inodes in a single list....
> 
> [**] A single unlinked list in the per-ag then leads to a mutex in
> the per-ag to protect the list, removing the AGI lock from needing
> to be held to modify the unlinked list unless the head of the list
> is being modified. We can then add to the tail of the list instead
> of the head, hence largely removing the AGI from the unlinked list
> processing entirely when there is more than one inode on the
> unlinked list.[***]
> 
> This is another advantage of moving to a single unlinked list - we are
> much more likely to have multiple inodes on a single unlinked list
> than when they are spread across 64 lists. Hence we are more likely
> to be able to elide AGI locking for the unlinked list modifications
> the more pressure we put on the unlinked list...
> 
> [***] Taking the AGI out of the unlinked list processing means the
> only thing it "protects" is the contents of the AGI itself. This is
> basically updating accounting and tracking btree root pointers. We
> could add another in-memory log item for AGI updates such that the
> AGI only needs to be locked, updated and logged in the precommit
> function, greatly reducing the time it spends locked for inode
> unlink processing [*^4]. This will improve performance of inode
> alloc/freeing on AG constrained filesystems as we spend less time
> serialising on the AGI lock.....
> 
> [*^4] This is how superblock updates work, except it's not by a
> generic in-memory SB log item - the changes to accounting are stored
> directly in the struct xfs_trans as deltas and then applied in
> xfs_trans_commit() via xfs_trans_apply_sb_deltas() which locks,
> applies and logs the superblock buffer. This could be converted to a
> precommit operation, too. [*^5]
> 
> Note that this superblock locking is elided for the freespace and
> inode accounting when lazy superblock updates are enabled. This
> prevents the superblock buffer lock for transactional accounting
> update from being a major global contention point.
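
A simplified sketch of the delta-accounting pattern described in [*^4]:
modifications are accumulated in the transaction and the shared object
is only locked, updated and released at commit time. All demo_* names
and fields are hypothetical; this is not the layout of struct xfs_trans
or the body of xfs_trans_apply_sb_deltas().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical shared object, standing in for the superblock buffer. */
struct demo_sb {
	bool	locked;
	long	icount;
	long	ifree;
};

/* Hypothetical transaction carrying only the pending deltas. */
struct demo_trans {
	long	icount_delta;
	long	ifree_delta;
};

/* During the transaction: record the change, touch nothing shared. */
static void
demo_trans_mod_icount(struct demo_trans *tp, long delta)
{
	tp->icount_delta += delta;
}

static void
demo_trans_mod_ifree(struct demo_trans *tp, long delta)
{
	tp->ifree_delta += delta;
}

/* At commit: lock, apply and unlock in one short critical section. */
static void
demo_trans_apply_sb_deltas(struct demo_trans *tp, struct demo_sb *sb)
{
	if (!tp->icount_delta && !tp->ifree_delta)
		return;		/* nothing pending, never take the lock */

	assert(!sb->locked);
	sb->locked = true;
	sb->icount += tp->icount_delta;
	sb->ifree += tp->ifree_delta;
	sb->locked = false;

	tp->icount_delta = 0;
	tp->ifree_delta = 0;
}
```

The point of the pattern is that the shared object is held only for the
duration of the apply step, not for the whole life of the transaction.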
> 
> [*^5] dquots also use a delta accounting structure hard coded into
> the struct xfs_trans - the xfs_dquot_acct structure. This gets
> allocated when dquot modifications are reserved, and then updated
> with each quota modification that is made in the transaction.
> 
> Then, in xfs_trans_commit(), it calls xfs_trans_apply_dquot_deltas()
> which then orders the locking of the dquots correctly, reads, loads
> and locks the dquots, modifies the in-memory on-disk dquots and logs
> them. This could also be converted to pre-commit operations. [*^6]
> 
> [*^6] It should be obvious by now that the pattern of "pre-commit
> processing" for "delayed object modification" is not a new idea.
> It's been in the code for 25-odd years and copy-pasta'd through the
> ages as needed. It's never been turned into a useful, formalised
> infrastructure mechanism - that's what this patchset starts us down
> the path of. It kinda reminds me of the btree infrastructure
> abstraction I did years ago to get rid of the 15,000 lines of
> copy-pastad btree code and set us on the path to the (relatively)
> easy addition of more btrees....
> 
> 
> 
> Dave Chinner (12):
>   xfs: xfs_iflock is no longer a completion
>   xfs: add log item precommit operation
>   xfs: factor the xfs_iunlink functions
>   xfs: add unlink list pointers to xfs_inode
>   xfs: replace iunlink backref lookups with list lookups
>   xfs: mapping unlinked inodes is now redundant
>   xfs: updating i_next_unlinked doesn't need to return old value
>   xfs: validate the unlinked list pointer on update
>   xfs: re-order AGI updates in unlink list updates
>   xfs: combine iunlink inode update functions
>   xfs: add in-memory iunlink log item
>   xfs: reorder iunlink remove operation in xfs_ifree
> 
> Gao Xiang (1):
>   xfs: arrange all unlinked inodes into one list
> 
>  fs/xfs/Makefile           |   1 +
>  fs/xfs/xfs_error.c        |   2 -
>  fs/xfs/xfs_icache.c       |  19 +-
>  fs/xfs/xfs_inode.c        | 688 ++++++++------------------------------
>  fs/xfs/xfs_inode.h        |  37 +-
>  fs/xfs/xfs_inode_item.c   |  15 +-
>  fs/xfs/xfs_inode_item.h   |   4 +-
>  fs/xfs/xfs_iunlink_item.c | 168 ++++++++++
>  fs/xfs/xfs_iunlink_item.h |  25 ++
>  fs/xfs/xfs_log_recover.c  | 179 ++++++----
>  fs/xfs/xfs_mount.c        |  17 +-
>  fs/xfs/xfs_mount.h        |   1 +
>  fs/xfs/xfs_super.c        |  20 +-
>  fs/xfs/xfs_trace.h        |   1 -
>  fs/xfs/xfs_trans.c        |  91 +++++
>  fs/xfs/xfs_trans.h        |   6 +-
>  16 files changed, 587 insertions(+), 687 deletions(-)
>  create mode 100644 fs/xfs/xfs_iunlink_item.c
>  create mode 100644 fs/xfs/xfs_iunlink_item.h
> 
> -- 
> 2.26.2.761.g0e0b3e54be
> 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 00/13] xfs: in memory inode unlink log items
  2020-08-18 18:17 ` [PATCH 00/13] xfs: in memory inode unlink log items Darrick J. Wong
@ 2020-08-18 20:01   ` Gao Xiang
  2020-08-18 21:42   ` Dave Chinner
  1 sibling, 0 replies; 51+ messages in thread
From: Gao Xiang @ 2020-08-18 20:01 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs

On Tue, Aug 18, 2020 at 11:17:45AM -0700, Darrick J. Wong wrote:
> On Wed, Aug 12, 2020 at 07:25:43PM +1000, Dave Chinner wrote:
> > Hi folks,
> > 
> > This is a cleaned up version of the original RFC I posted here:
> > 
> > https://lore.kernel.org/linux-xfs/20200623095015.1934171-1-david@fromorbit.com/
> > 
> > The original description is preserved below for quick reference,
> > I'll just walk through the changes in this version:
> > 
> > - rebased on current TOT and xfs/for-next
> > - split up into many smaller patches
> > - includes Xiang's single unlinked list bucket modification
> > - uses a list_head for the in memory double unlinked inode list
> >   rather than aginos and lockless inode lookups
> > - much simpler as it doesn't need to look up inodes from agino
> >   values
> > - iunlink log item changed to take an xfs_inode pointer rather than
> >   an imap and agino values
> > - a handful of small cleanups that breaking up into small patches
> >   allowed.
> 
> Two questions: How does this patchset intersect with the other one that
> changes the iunlink series?  I guess the v4 of that series (when it
> appears) is intended to be applied directly after this one?

(confirmed from IRC) Yeah, I looked through this patchset these days
and sent out another rebased version and yes it can be applied directly
instead.

also put a link here:
https://lore.kernel.org/r/20200818133015.25398-1-hsiangkao@redhat.com

Sorry about that, I shouldn't have used --in-reply-to this deep in the thread.

Thanks,
Gao Xiang


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 00/13] xfs: in memory inode unlink log items
  2020-08-18 18:17 ` [PATCH 00/13] xfs: in memory inode unlink log items Darrick J. Wong
  2020-08-18 20:01   ` Gao Xiang
@ 2020-08-18 21:42   ` Dave Chinner
  1 sibling, 0 replies; 51+ messages in thread
From: Dave Chinner @ 2020-08-18 21:42 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: hsiangkao, linux-xfs

On Tue, Aug 18, 2020 at 11:17:45AM -0700, Darrick J. Wong wrote:
> On Wed, Aug 12, 2020 at 07:25:43PM +1000, Dave Chinner wrote:
> > Hi folks,
> > 
> > This is a cleaned up version of the original RFC I posted here:
> > 
> > https://lore.kernel.org/linux-xfs/20200623095015.1934171-1-david@fromorbit.com/
> > 
> > The original description is preserved below for quick reference,
> > I'll just walk through the changes in this version:
> > 
> > - rebased on current TOT and xfs/for-next
> > - split up into many smaller patches
> > - includes Xiang's single unlinked list bucket modification
> > - uses a list_head for the in memory double unlinked inode list
> >   rather than aginos and lockless inode lookups
> > - much simpler as it doesn't need to look up inodes from agino
> >   values
> > - iunlink log item changed to take an xfs_inode pointer rather than
> >   an imap and agino values
> > - a handful of small cleanups that breaking up into small patches
> >   allowed.
> 
> Two questions: How does this patchset intersect with the other one that
> changes the iunlink series?  I guess the v4 of that series (when it
> appears) is intended to be applied directly after this one?

*nod*

> The second is that I got this corruption warning on generic/043 with...

I haven't seen that. I'll see if I can reproduce it.

> FSTYP         -- xfs (debug)
> PLATFORM      -- Linux/x86_64 ca-nfsdev6-mtr01 5.9.0-rc1-djw #rc1 SMP PREEMPT Mon Aug 17 20:13:04 PDT 2020
> MKFS_OPTIONS  -- -f -m reflink=1,rmapbt=1 -i sparse=1, -b size=1024, /dev/sdd
> MOUNT_OPTIONS -- -o usrquota,grpquota,prjquota, /dev/sdd /opt
>
> [16533.664277] run fstests generic/043 at 2020-08-18 00:50:48

I've run that config through fstests, too. I'll go run this test in
a loop, see if I can trigger it...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 01/13] xfs: xfs_iflock is no longer a completion
  2020-08-12  9:25 ` [PATCH 01/13] xfs: xfs_iflock is no longer a completion Dave Chinner
@ 2020-08-18 23:44   ` Darrick J. Wong
  2020-08-22  7:41   ` Christoph Hellwig
  1 sibling, 0 replies; 51+ messages in thread
From: Darrick J. Wong @ 2020-08-18 23:44 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Aug 12, 2020 at 07:25:44PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> With the recent rework of the inode cluster flushing, we no longer
> ever wait on the inode flush "lock". It was never a lock in the
> first place, just a completion to allow callers to wait for inode IO
> to complete. We now never wait for flush completion as all inode
> flushing is non-blocking. Hence we can get rid of all the iflock
> infrastructure and instead just set and check a state flag.
> 
> Rename the XFS_IFLOCK flag to XFS_IFLUSHING, convert all the
> xfs_iflock_nowait() calls to test-and-set operations on that flag,
> and replace all the xfs_ifunlock() calls with clear operations.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Hm.  I /think/ this looks fairly straightforward, at least once I
realized that nobody calls xfs_iflock anymore.

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D
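
A small userspace sketch of the conversion the commit message describes:
with no waiters left, the flush "lock" reduces to a state flag that is
test-and-set to start a flush and cleared to finish it. This uses C11
atomics as a stand-in; the kernel helpers (xfs_iflags_test_and_set() and
friends) serialise on ip->i_flags_lock instead, and all demo_* names
here are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define DEMO_IFLUSHING	(1u << 7)	/* inode is being flushed */

/* Hypothetical inode carrying only the flags word. */
struct demo_inode_flags {
	atomic_uint	i_flags;
};

/* Set @flag and report whether it was already set. */
static bool
demo_flags_test_and_set(struct demo_inode_flags *ip, unsigned int flag)
{
	return (atomic_fetch_or(&ip->i_flags, flag) & flag) != 0;
}

static void
demo_flags_clear(struct demo_inode_flags *ip, unsigned int flag)
{
	atomic_fetch_and(&ip->i_flags, ~flag);
}

static bool
demo_flags_test(struct demo_inode_flags *ip, unsigned int flag)
{
	return (atomic_load(&ip->i_flags) & flag) != 0;
}

/*
 * Old style:  if (!xfs_iflock_nowait(ip)) back off;
 * New style:  if (test_and_set(ip, IFLUSHING)) back off;
 * Returns true if the caller now owns the flush.
 */
static bool
demo_try_start_flush(struct demo_inode_flags *ip)
{
	return !demo_flags_test_and_set(ip, DEMO_IFLUSHING);
}
```

Note the inverted sense: the old nowait helper returned true on success,
while a test-and-set returns true when the flag was already set, which
is why the converted call sites in the diff below drop the negation.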

> ---
>  fs/xfs/xfs_icache.c     | 17 ++++------
>  fs/xfs/xfs_inode.c      | 73 +++++++++++++++--------------------------
>  fs/xfs/xfs_inode.h      | 33 +------------------
>  fs/xfs/xfs_inode_item.c | 15 ++++-----
>  fs/xfs/xfs_inode_item.h |  4 +--
>  fs/xfs/xfs_mount.c      | 11 ++++---
>  fs/xfs/xfs_super.c      | 10 +++---
>  7 files changed, 55 insertions(+), 108 deletions(-)
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 101028ebb571..aa6aad258670 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -52,7 +52,6 @@ xfs_inode_alloc(
>  
>  	XFS_STATS_INC(mp, vn_active);
>  	ASSERT(atomic_read(&ip->i_pincount) == 0);
> -	ASSERT(!xfs_isiflocked(ip));
>  	ASSERT(ip->i_ino == 0);
>  
>  	/* initialise the xfs inode */
> @@ -123,7 +122,7 @@ void
>  xfs_inode_free(
>  	struct xfs_inode	*ip)
>  {
> -	ASSERT(!xfs_isiflocked(ip));
> +	ASSERT(!xfs_iflags_test(ip, XFS_IFLUSHING));
>  
>  	/*
>  	 * Because we use RCU freeing we need to ensure the inode always
> @@ -1035,23 +1034,21 @@ xfs_reclaim_inode(
>  
>  	if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
>  		goto out;
> -	if (!xfs_iflock_nowait(ip))
> +	if (xfs_iflags_test_and_set(ip, XFS_IFLUSHING))
>  		goto out_iunlock;
>  
>  	if (XFS_FORCED_SHUTDOWN(ip->i_mount)) {
>  		xfs_iunpin_wait(ip);
> -		/* xfs_iflush_abort() drops the flush lock */
>  		xfs_iflush_abort(ip);
>  		goto reclaim;
>  	}
>  	if (xfs_ipincount(ip))
> -		goto out_ifunlock;
> +		goto out_clear_flush;
>  	if (!xfs_inode_clean(ip))
> -		goto out_ifunlock;
> +		goto out_clear_flush;
>  
> -	xfs_ifunlock(ip);
> +	xfs_iflags_clear(ip, XFS_IFLUSHING);
>  reclaim:
> -	ASSERT(!xfs_isiflocked(ip));
>  
>  	/*
>  	 * Because we use RCU freeing we need to ensure the inode always appears
> @@ -1101,8 +1098,8 @@ xfs_reclaim_inode(
>  	__xfs_inode_free(ip);
>  	return;
>  
> -out_ifunlock:
> -	xfs_ifunlock(ip);
> +out_clear_flush:
> +	xfs_iflags_clear(ip, XFS_IFLUSHING);
>  out_iunlock:
>  	xfs_iunlock(ip, XFS_ILOCK_EXCL);
>  out:
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index c06129cffba9..2072bd25989a 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -598,22 +598,6 @@ xfs_lock_two_inodes(
>  	}
>  }
>  
> -void
> -__xfs_iflock(
> -	struct xfs_inode	*ip)
> -{
> -	wait_queue_head_t *wq = bit_waitqueue(&ip->i_flags, __XFS_IFLOCK_BIT);
> -	DEFINE_WAIT_BIT(wait, &ip->i_flags, __XFS_IFLOCK_BIT);
> -
> -	do {
> -		prepare_to_wait_exclusive(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
> -		if (xfs_isiflocked(ip))
> -			io_schedule();
> -	} while (!xfs_iflock_nowait(ip));
> -
> -	finish_wait(wq, &wait.wq_entry);
> -}
> -
>  STATIC uint
>  _xfs_dic2xflags(
>  	uint16_t		di_flags,
> @@ -2531,11 +2515,8 @@ xfs_ifree_mark_inode_stale(
>  	 * valid, the wrong inode or stale.
>  	 */
>  	spin_lock(&ip->i_flags_lock);
> -	if (ip->i_ino != inum || __xfs_iflags_test(ip, XFS_ISTALE)) {
> -		spin_unlock(&ip->i_flags_lock);
> -		rcu_read_unlock();
> -		return;
> -	}
> +	if (ip->i_ino != inum || __xfs_iflags_test(ip, XFS_ISTALE))
> +		goto out_iflags_unlock;
>  
>  	/*
>  	 * Don't try to lock/unlock the current inode, but we _cannot_ skip the
> @@ -2552,16 +2533,14 @@ xfs_ifree_mark_inode_stale(
>  		}
>  	}
>  	ip->i_flags |= XFS_ISTALE;
> -	spin_unlock(&ip->i_flags_lock);
> -	rcu_read_unlock();
>  
>  	/*
> -	 * If we can't get the flush lock, the inode is already attached.  All
> +	 * If the inode is flushing, it is already attached to the buffer.  All
>  	 * we needed to do here is mark the inode stale so buffer IO completion
>  	 * will remove it from the AIL.
>  	 */
>  	iip = ip->i_itemp;
> -	if (!xfs_iflock_nowait(ip)) {
> +	if (__xfs_iflags_test(ip, XFS_IFLUSHING)) {
>  		ASSERT(!list_empty(&iip->ili_item.li_bio_list));
>  		ASSERT(iip->ili_last_fields);
>  		goto out_iunlock;
> @@ -2573,10 +2552,12 @@ xfs_ifree_mark_inode_stale(
>  	 * commit as the flock synchronises removal of the inode from the
>  	 * cluster buffer against inode reclaim.
>  	 */
> -	if (!iip || list_empty(&iip->ili_item.li_bio_list)) {
> -		xfs_ifunlock(ip);
> +	if (!iip || list_empty(&iip->ili_item.li_bio_list))
>  		goto out_iunlock;
> -	}
> +
> +	__xfs_iflags_set(ip, XFS_IFLUSHING);
> +	spin_unlock(&ip->i_flags_lock);
> +	rcu_read_unlock();
>  
>  	/* we have a dirty inode in memory that has not yet been flushed. */
>  	spin_lock(&iip->ili_lock);
> @@ -2586,9 +2567,16 @@ xfs_ifree_mark_inode_stale(
>  	spin_unlock(&iip->ili_lock);
>  	ASSERT(iip->ili_last_fields);
>  
> +	if (ip != free_ip)
> +		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +	return;
> +
>  out_iunlock:
>  	if (ip != free_ip)
>  		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +out_iflags_unlock:
> +	spin_unlock(&ip->i_flags_lock);
> +	rcu_read_unlock();
>  }
>  
>  /*
> @@ -2631,8 +2619,9 @@ xfs_ifree_cluster(
>  
>  		/*
>  		 * We obtain and lock the backing buffer first in the process
> -		 * here, as we have to ensure that any dirty inode that we
> -		 * can't get the flush lock on is attached to the buffer.
> +		 * here to ensure dirty inodes attached to the buffer remain in
> +		 * the flushing state while we mark them stale.
> +		 *
>  		 * If we scan the in-memory inodes first, then buffer IO can
>  		 * complete before we get a lock on it, and hence we may fail
>  		 * to mark all the active inodes on the buffer stale.
> @@ -3443,7 +3432,7 @@ xfs_iflush(
>  	int			error;
>  
>  	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL|XFS_ILOCK_SHARED));
> -	ASSERT(xfs_isiflocked(ip));
> +	ASSERT(xfs_iflags_test(ip, XFS_IFLUSHING));
>  	ASSERT(ip->i_df.if_format != XFS_DINODE_FMT_BTREE ||
>  	       ip->i_df.if_nextents > XFS_IFORK_MAXEXT(ip, XFS_DATA_FORK));
>  	ASSERT(iip->ili_item.li_buf == bp);
> @@ -3613,7 +3602,7 @@ xfs_iflush_cluster(
>  		/*
>  		 * Quick and dirty check to avoid locks if possible.
>  		 */
> -		if (__xfs_iflags_test(ip, XFS_IRECLAIM | XFS_IFLOCK))
> +		if (__xfs_iflags_test(ip, XFS_IRECLAIM | XFS_IFLUSHING))
>  			continue;
>  		if (xfs_ipincount(ip))
>  			continue;
> @@ -3627,7 +3616,7 @@ xfs_iflush_cluster(
>  		 */
>  		spin_lock(&ip->i_flags_lock);
>  		ASSERT(!__xfs_iflags_test(ip, XFS_ISTALE));
> -		if (__xfs_iflags_test(ip, XFS_IRECLAIM | XFS_IFLOCK)) {
> +		if (__xfs_iflags_test(ip, XFS_IRECLAIM | XFS_IFLUSHING)) {
>  			spin_unlock(&ip->i_flags_lock);
>  			continue;
>  		}
> @@ -3635,23 +3624,16 @@ xfs_iflush_cluster(
>  		/*
>  		 * ILOCK will pin the inode against reclaim and prevent
>  		 * concurrent transactions modifying the inode while we are
> -		 * flushing the inode.
> +		 * flushing the inode. If we get the lock, set the flushing
> +		 * state before we drop the i_flags_lock.
>  		 */
>  		if (!xfs_ilock_nowait(ip, XFS_ILOCK_SHARED)) {
>  			spin_unlock(&ip->i_flags_lock);
>  			continue;
>  		}
> +		__xfs_iflags_set(ip, XFS_IFLUSHING);
>  		spin_unlock(&ip->i_flags_lock);
>  
> -		/*
> -		 * Skip inodes that are already flush locked as they have
> -		 * already been written to the buffer.
> -		 */
> -		if (!xfs_iflock_nowait(ip)) {
> -			xfs_iunlock(ip, XFS_ILOCK_SHARED);
> -			continue;
> -		}
> -
>  		/*
>  		 * Abort flushing this inode if we are shut down because the
>  		 * inode may not currently be in the AIL. This can occur when
> @@ -3661,7 +3643,6 @@ xfs_iflush_cluster(
>  		 */
>  		if (XFS_FORCED_SHUTDOWN(mp)) {
>  			xfs_iunpin_wait(ip);
> -			/* xfs_iflush_abort() drops the flush lock */
>  			xfs_iflush_abort(ip);
>  			xfs_iunlock(ip, XFS_ILOCK_SHARED);
>  			error = -EIO;
> @@ -3670,7 +3651,7 @@ xfs_iflush_cluster(
>  
>  		/* don't block waiting on a log force to unpin dirty inodes */
>  		if (xfs_ipincount(ip)) {
> -			xfs_ifunlock(ip);
> +			xfs_iflags_clear(ip, XFS_IFLUSHING);
>  			xfs_iunlock(ip, XFS_ILOCK_SHARED);
>  			continue;
>  		}
> @@ -3678,7 +3659,7 @@ xfs_iflush_cluster(
>  		if (!xfs_inode_clean(ip))
>  			error = xfs_iflush(ip, bp);
>  		else
> -			xfs_ifunlock(ip);
> +			xfs_iflags_clear(ip, XFS_IFLUSHING);
>  		xfs_iunlock(ip, XFS_ILOCK_SHARED);
>  		if (error)
>  			break;
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index e9a8bb184d1f..5ea962c6cf98 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -211,8 +211,7 @@ static inline bool xfs_inode_has_cow_data(struct xfs_inode *ip)
>  #define XFS_INEW		(1 << __XFS_INEW_BIT)
>  #define XFS_ITRUNCATED		(1 << 5) /* truncated down so flush-on-close */
>  #define XFS_IDIRTY_RELEASE	(1 << 6) /* dirty release already seen */
> -#define __XFS_IFLOCK_BIT	7	 /* inode is being flushed right now */
> -#define XFS_IFLOCK		(1 << __XFS_IFLOCK_BIT)
> +#define XFS_IFLUSHING		(1 << 7) /* inode is being flushed */
>  #define __XFS_IPINNED_BIT	8	 /* wakeup key for zero pin count */
>  #define XFS_IPINNED		(1 << __XFS_IPINNED_BIT)
>  #define XFS_IEOFBLOCKS		(1 << 9) /* has the preallocblocks tag set */
> @@ -233,36 +232,6 @@ static inline bool xfs_inode_has_cow_data(struct xfs_inode *ip)
>  	(XFS_IRECLAIMABLE | XFS_IRECLAIM | \
>  	 XFS_IDIRTY_RELEASE | XFS_ITRUNCATED)
>  
> -/*
> - * Synchronize processes attempting to flush the in-core inode back to disk.
> - */
> -
> -static inline int xfs_isiflocked(struct xfs_inode *ip)
> -{
> -	return xfs_iflags_test(ip, XFS_IFLOCK);
> -}
> -
> -extern void __xfs_iflock(struct xfs_inode *ip);
> -
> -static inline int xfs_iflock_nowait(struct xfs_inode *ip)
> -{
> -	return !xfs_iflags_test_and_set(ip, XFS_IFLOCK);
> -}
> -
> -static inline void xfs_iflock(struct xfs_inode *ip)
> -{
> -	if (!xfs_iflock_nowait(ip))
> -		__xfs_iflock(ip);
> -}
> -
> -static inline void xfs_ifunlock(struct xfs_inode *ip)
> -{
> -	ASSERT(xfs_isiflocked(ip));
> -	xfs_iflags_clear(ip, XFS_IFLOCK);
> -	smp_mb();
> -	wake_up_bit(&ip->i_flags, __XFS_IFLOCK_BIT);
> -}
> -
>  /*
>   * Flags for inode locking.
>   * Bit ranges:	1<<1  - 1<<16-1 -- iolock/ilock modes (bitfield)
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index 6c65938cee1c..099ae8ee7908 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -491,8 +491,7 @@ xfs_inode_item_push(
>  	    (ip->i_flags & XFS_ISTALE))
>  		return XFS_ITEM_PINNED;
>  
> -	/* If the inode is already flush locked, we're already flushing. */
> -	if (xfs_isiflocked(ip))
> +	if (xfs_iflags_test(ip, XFS_IFLUSHING))
>  		return XFS_ITEM_FLUSHING;
>  
>  	if (!xfs_buf_trylock(bp))
> @@ -703,7 +702,7 @@ xfs_iflush_finish(
>  		iip->ili_last_fields = 0;
>  		iip->ili_flush_lsn = 0;
>  		spin_unlock(&iip->ili_lock);
> -		xfs_ifunlock(iip->ili_inode);
> +		xfs_iflags_clear(iip->ili_inode, XFS_IFLUSHING);
>  		if (drop_buffer)
>  			xfs_buf_rele(bp);
>  	}
> @@ -711,8 +710,8 @@ xfs_iflush_finish(
>  
>  /*
>   * Inode buffer IO completion routine.  It is responsible for removing inodes
> - * attached to the buffer from the AIL if they have not been re-logged, as well
> - * as completing the flush and unlocking the inode.
> + * attached to the buffer from the AIL if they have not been re-logged and
> + * completing the inode flush.
>   */
>  void
>  xfs_iflush_done(
> @@ -755,10 +754,10 @@ xfs_iflush_done(
>  }
>  
>  /*
> - * This is the inode flushing abort routine.  It is called from xfs_iflush when
> + * This is the inode flushing abort routine.  It is called when
>   * the filesystem is shutting down to clean up the inode state.  It is
>   * responsible for removing the inode item from the AIL if it has not been
> - * re-logged, and unlocking the inode's flush lock.
> + * re-logged and clearing the inode's flush state.
>   */
>  void
>  xfs_iflush_abort(
> @@ -790,7 +789,7 @@ xfs_iflush_abort(
>  		list_del_init(&iip->ili_item.li_bio_list);
>  		spin_unlock(&iip->ili_lock);
>  	}
> -	xfs_ifunlock(ip);
> +	xfs_iflags_clear(ip, XFS_IFLUSHING);
>  	if (bp)
>  		xfs_buf_rele(bp);
>  }
> diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h
> index 048b5e7dee90..23a7b4928727 100644
> --- a/fs/xfs/xfs_inode_item.h
> +++ b/fs/xfs/xfs_inode_item.h
> @@ -25,8 +25,8 @@ struct xfs_inode_log_item {
>  	 *
>  	 * We need atomic changes between inode dirtying, inode flushing and
>  	 * inode completion, but these all hold different combinations of
> -	 * ILOCK and iflock and hence we need some other method of serialising
> -	 * updates to the flush state.
> +	 * ILOCK and IFLUSHING and hence we need some other method of
> +	 * serialising updates to the flush state.
>  	 */
>  	spinlock_t		ili_lock;	   /* flush state lock */
>  	unsigned int		ili_last_fields;   /* fields when flushed */
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index c8ae49a1e99c..bbfd1d5b1c04 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -1059,11 +1059,12 @@ xfs_unmountfs(
>  	 * We can potentially deadlock here if we have an inode cluster
>  	 * that has been freed has its buffer still pinned in memory because
>  	 * the transaction is still sitting in a iclog. The stale inodes
> -	 * on that buffer will have their flush locks held until the
> -	 * transaction hits the disk and the callbacks run. the inode
> -	 * flush takes the flush lock unconditionally and with nothing to
> -	 * push out the iclog we will never get that unlocked. hence we
> -	 * need to force the log first.
> +	 * on that buffer will be pinned to the buffer until the
> +	 * transaction hits the disk and the callbacks run. Pushing the AIL will
> +	 * skip the stale inodes and may never see the pinned buffer, so
> +	 * nothing will push out the iclog and unpin the buffer. Hence we
> +	 * need to force the log here to ensure all items are flushed into the
> +	 * AIL before we go any further.
>  	 */
>  	xfs_log_force(mp, XFS_LOG_SYNC);
>  
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 71ac6c1cdc36..68ec8db12cc7 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -654,11 +654,11 @@ xfs_fs_destroy_inode(
>  	ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_IRECLAIM));
>  
>  	/*
> -	 * We always use background reclaim here because even if the
> -	 * inode is clean, it still may be under IO and hence we have
> -	 * to take the flush lock. The background reclaim path handles
> -	 * this more efficiently than we can here, so simply let background
> -	 * reclaim tear down all inodes.
> +	 * We always use background reclaim here because even if the inode is
> +	 * clean, it still may be under IO and hence we have wait for IO
> +	 * completion to occur before we can reclaim the inode. The background
> +	 * reclaim path handles this more efficiently than we can here, so
> +	 * simply let background reclaim tear down all inodes.
>  	 */
>  	xfs_inode_set_reclaim_tag(ip);
>  }
> -- 
> 2.26.2.761.g0e0b3e54be
> 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 03/13] xfs: factor the xfs_iunlink functions
  2020-08-12  9:25 ` [PATCH 03/13] xfs: factor the xfs_iunlink functions Dave Chinner
@ 2020-08-18 23:49   ` Darrick J. Wong
  2020-08-22  7:45   ` Christoph Hellwig
  1 sibling, 0 replies; 51+ messages in thread
From: Darrick J. Wong @ 2020-08-18 23:49 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Aug 12, 2020 at 07:25:46PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Prep work that separates the locking that protects the unlinked list
> from the actual operations being performed. This also helps document
> the fact they are performing list insert and remove operations. No
> functional code change.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Seems pretty straightforward.
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D
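
The shape of this refactoring, sketched in userspace C: the outer
function takes the lock that protects the list and the inner helper
performs the list insert on the assumption that the lock is already
held. The demo_* types are hypothetical and much simpler than the real
AGI buffer and on-disk unlinked list.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_NULLAGINO	0u	/* assumed "end of list" marker */

/* Hypothetical AGI: a lock, an error injection knob and a list head. */
struct demo_agi {
	bool		locked;
	bool		corrupt;
	unsigned int	unlinked_head;
};

/* Stand-in for reading and locking the AGI buffer. */
static int
demo_read_agi(struct demo_agi *agi)
{
	if (agi->corrupt)
		return -1;
	agi->locked = true;
	return 0;
}

/* List operation only: the caller must already hold the AGI. */
static int
demo_iunlink_insert_inode(struct demo_agi *agi, unsigned int agino,
			  unsigned int *next_agino)
{
	assert(agi->locked);
	*next_agino = agi->unlinked_head;	/* new inode points at old head */
	agi->unlinked_head = agino;		/* new inode becomes the head */
	return 0;
}

/* Locking wrapper: take the AGI first, then call the list operation. */
static int
demo_iunlink(struct demo_agi *agi, unsigned int agino,
	     unsigned int *next_agino)
{
	int error;

	error = demo_read_agi(agi);
	if (error)
		return error;
	return demo_iunlink_insert_inode(agi, agino, next_agino);
}
```

Splitting it this way lets a later caller that already holds the AGI
invoke the insert or remove helper directly.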

> ---
>  fs/xfs/xfs_inode.c | 92 ++++++++++++++++++++++++++++++----------------
>  1 file changed, 60 insertions(+), 32 deletions(-)
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 2072bd25989a..f2f502b65691 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -2205,35 +2205,20 @@ xfs_iunlink_update_inode(
>  	return error;
>  }
>  
> -/*
> - * This is called when the inode's link count has gone to 0 or we are creating
> - * a tmpfile via O_TMPFILE.  The inode @ip must have nlink == 0.
> - *
> - * We place the on-disk inode on a list in the AGI.  It will be pulled from this
> - * list when the inode is freed.
> - */
> -STATIC int
> -xfs_iunlink(
> +static int
> +xfs_iunlink_insert_inode(
>  	struct xfs_trans	*tp,
> +	struct xfs_buf		*agibp,
>  	struct xfs_inode	*ip)
>  {
>  	struct xfs_mount	*mp = tp->t_mountp;
>  	struct xfs_agi		*agi;
> -	struct xfs_buf		*agibp;
>  	xfs_agino_t		next_agino;
> -	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
>  	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
> +	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
>  	short			bucket_index = agino % XFS_AGI_UNLINKED_BUCKETS;
>  	int			error;
>  
> -	ASSERT(VFS_I(ip)->i_nlink == 0);
> -	ASSERT(VFS_I(ip)->i_mode != 0);
> -	trace_xfs_iunlink(ip);
> -
> -	/* Get the agi buffer first.  It ensures lock ordering on the list. */
> -	error = xfs_read_agi(mp, tp, agno, &agibp);
> -	if (error)
> -		return error;
>  	agi = agibp->b_addr;
>  
>  	/*
> @@ -2274,6 +2259,35 @@ xfs_iunlink(
>  	return xfs_iunlink_update_bucket(tp, agno, agibp, bucket_index, agino);
>  }
>  
> +/*
> + * This is called when the inode's link count has gone to 0 or we are creating
> + * a tmpfile via O_TMPFILE.  The inode @ip must have nlink == 0.
> + *
> + * We place the on-disk inode on a list in the AGI.  It will be pulled from this
> + * list when the inode is freed.
> + */
> +STATIC int
> +xfs_iunlink(
> +	struct xfs_trans	*tp,
> +	struct xfs_inode	*ip)
> +{
> +	struct xfs_mount	*mp = tp->t_mountp;
> +	struct xfs_buf		*agibp;
> +	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
> +	int			error;
> +
> +	ASSERT(VFS_I(ip)->i_nlink == 0);
> +	ASSERT(VFS_I(ip)->i_mode != 0);
> +	trace_xfs_iunlink(ip);
> +
> +	/* Get the agi buffer first.  It ensures lock ordering on the list. */
> +	error = xfs_read_agi(mp, tp, agno, &agibp);
> +	if (error)
> +		return error;
> +
> +	return xfs_iunlink_insert_inode(tp, agibp, ip);
> +}
> +
>  /* Return the imap, dinode pointer, and buffer for an inode. */
>  STATIC int
>  xfs_iunlink_map_ino(
> @@ -2388,32 +2402,23 @@ xfs_iunlink_map_prev(
>  	return 0;
>  }
>  
> -/*
> - * Pull the on-disk inode from the AGI unlinked list.
> - */
> -STATIC int
> -xfs_iunlink_remove(
> +static int
> +xfs_iunlink_remove_inode(
>  	struct xfs_trans	*tp,
> +	struct xfs_buf		*agibp,
>  	struct xfs_inode	*ip)
>  {
>  	struct xfs_mount	*mp = tp->t_mountp;
>  	struct xfs_agi		*agi;
> -	struct xfs_buf		*agibp;
>  	struct xfs_buf		*last_ibp;
>  	struct xfs_dinode	*last_dip = NULL;
> -	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
>  	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
> +	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
>  	xfs_agino_t		next_agino;
>  	xfs_agino_t		head_agino;
>  	short			bucket_index = agino % XFS_AGI_UNLINKED_BUCKETS;
>  	int			error;
>  
> -	trace_xfs_iunlink_remove(ip);
> -
> -	/* Get the agi buffer first.  It ensures lock ordering on the list. */
> -	error = xfs_read_agi(mp, tp, agno, &agibp);
> -	if (error)
> -		return error;
>  	agi = agibp->b_addr;
>  
>  	/*
> @@ -2482,6 +2487,29 @@ xfs_iunlink_remove(
>  			next_agino);
>  }
>  
> +/*
> + * Pull the on-disk inode from the AGI unlinked list.
> + */
> +STATIC int
> +xfs_iunlink_remove(
> +	struct xfs_trans	*tp,
> +	struct xfs_inode	*ip)
> +{
> +	struct xfs_mount	*mp = tp->t_mountp;
> +	struct xfs_buf		*agibp;
> +	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
> +	int			error;
> +
> +	trace_xfs_iunlink_remove(ip);
> +
> +	/* Get the agi buffer first.  It ensures lock ordering on the list. */
> +	error = xfs_read_agi(mp, tp, agno, &agibp);
> +	if (error)
> +		return error;
> +
> +	return xfs_iunlink_remove_inode(tp, agibp, ip);
> +}
> +
>  /*
>   * Look up the inode number specified and if it is not already marked XFS_ISTALE
>   * mark it stale. We should only find clean inodes in this lookup that aren't
> -- 
> 2.26.2.761.g0e0b3e54be
> 


* Re: [PATCH 04/13] xfs: arrange all unlinked inodes into one list
  2020-08-12  9:25 ` [PATCH 04/13] xfs: arrange all unlinked inodes into one list Dave Chinner
@ 2020-08-18 23:59   ` Darrick J. Wong
  2020-08-19  0:45     ` Dave Chinner
  2020-08-19  0:58     ` Gao Xiang
  0 siblings, 2 replies; 51+ messages in thread
From: Darrick J. Wong @ 2020-08-18 23:59 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Aug 12, 2020 at 07:25:47PM +1000, Dave Chinner wrote:
> From: Gao Xiang <hsiangkao@redhat.com>
> 
> We currently keep unlinked lists short on disk by hashing the inodes
> across multiple buckets. We don't need to keep them short anymore
> as we no longer need to traverse the entire list to remove an inode from
> it. The in-memory back reference index provides the previous inode
> in the list for us instead.
> 
> Log recovery still has to handle existing filesystems that use all
> 64 on-disk buckets, so we detect and handle this case specially so
> that inode eviction can still work properly in recovery.
> 
> [dchinner: imported into parent patch series early on and modified
> to fit cleanly. ]
> 
> Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_inode.c | 49 +++++++++++++++++++++++++++-------------------
>  1 file changed, 29 insertions(+), 20 deletions(-)
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index f2f502b65691..fa92bdf6e0da 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -33,6 +33,7 @@
>  #include "xfs_symlink.h"
>  #include "xfs_trans_priv.h"
>  #include "xfs_log.h"
> +#include "xfs_log_priv.h"
>  #include "xfs_bmap_btree.h"
>  #include "xfs_reflink.h"
>  
> @@ -2092,25 +2093,32 @@ xfs_iunlink_update_bucket(
>  	struct xfs_trans	*tp,
>  	xfs_agnumber_t		agno,
>  	struct xfs_buf		*agibp,
> -	unsigned int		bucket_index,
> +	xfs_agino_t		old_agino,
>  	xfs_agino_t		new_agino)
>  {
> +	struct xlog		*log = tp->t_mountp->m_log;
>  	struct xfs_agi		*agi = agibp->b_addr;
>  	xfs_agino_t		old_value;
> -	int			offset;
> +	unsigned int		bucket_index;
> +	int                     offset;
>  
>  	ASSERT(xfs_verify_agino_or_null(tp->t_mountp, agno, new_agino));
>  
> +	bucket_index = 0;
> +	/* During recovery, the old multiple bucket index can be applied */
> +	if (!log || log->l_flags & XLOG_RECOVERY_NEEDED) {

Does the flag test need parentheses?
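
For reference, C precedence already gives the intended grouping here: bitwise & binds tighter than ||, so the test parses as !log || (log->l_flags & XLOG_RECOVERY_NEEDED), and parentheses would only make that easier to read. A minimal userspace sketch of the two spellings follows; struct fake_log, FAKE_RECOVERY_NEEDED and both helper functions are hypothetical stand-ins, not kernel definitions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct xlog and XLOG_RECOVERY_NEEDED. */
#define FAKE_RECOVERY_NEEDED	0x4u

struct fake_log {
	unsigned int	l_flags;
};

/* The test as written in the patch, with no extra parentheses. */
static bool use_old_buckets_bare(const struct fake_log *log)
{
	return !log || log->l_flags & FAKE_RECOVERY_NEEDED;
}

/* The same test with the grouping spelled out explicitly. */
static bool use_old_buckets_parens(const struct fake_log *log)
{
	return !log || (log->l_flags & FAKE_RECOVERY_NEEDED);
}
```

Both helpers should agree for a NULL log, for a log without the flag set, and for a log with it set, so the choice between them is one of style only.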

It feels a little funny that we pass in old_agino (having gotten it from
agi_unlinked) and then compare it with agi_unlinked, but as the commit
log points out, I guess this is a wart of having to support the old
unlinked list behavior.  It makes sense to me that if we're going to
change the unlinked list behavior we could be a little more careful
about double-checking things.
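
The bucket selection being discussed can be modelled outside the kernel roughly as below. This is only a sketch under stated assumptions: pick_bucket(), FAKE_UNLINKED_BUCKETS and the in_recovery flag are hypothetical names, the heads array holds host-endian values (the real code converts with be32_to_cpu()), and the bucket count of 64 is taken from the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed from the commit message: 64 on-disk unlinked buckets. */
#define FAKE_UNLINKED_BUCKETS	64

/*
 * Model of the bucket choice: bucket 0 is always used outside recovery.
 * During recovery, if the head of bucket 0 is not the inode we expected,
 * fall back to the old hashed bucket for that inode.
 */
static unsigned int
pick_bucket(const uint32_t *heads, bool in_recovery, uint32_t old_agino)
{
	unsigned int	bucket = 0;

	if (in_recovery && heads[0] != old_agino)
		bucket = old_agino % FAKE_UNLINKED_BUCKETS;
	return bucket;
}
```

The caller then compares the head of the chosen bucket against old_agino, which is where the corruption check in the patch comes in.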

Question: if a newer kernel crashes with a super-long unlinked list and
the fs gets recovered on an old kernel, will this lead to insanely high
recovery times?  I think the answer is no, because recovery is single
threaded and the hash only existed to reduce AGI contention during
normal unlinking operations?

--D

> +		ASSERT(old_agino != NULLAGINO);
> +
> +		if (be32_to_cpu(agi->agi_unlinked[0]) != old_agino)
> +			bucket_index = old_agino % XFS_AGI_UNLINKED_BUCKETS;
> +	}
> +
>  	old_value = be32_to_cpu(agi->agi_unlinked[bucket_index]);
>  	trace_xfs_iunlink_update_bucket(tp->t_mountp, agno, bucket_index,
>  			old_value, new_agino);
>  
> -	/*
> -	 * We should never find the head of the list already set to the value
> -	 * passed in because either we're adding or removing ourselves from the
> -	 * head of the list.
> -	 */
> -	if (old_value == new_agino) {
> +	/* check if the old agi_unlinked head is as expected */
> +	if (old_value != old_agino) {
>  		xfs_buf_mark_corrupt(agibp);
>  		return -EFSCORRUPTED;
>  	}
> @@ -2216,17 +2224,18 @@ xfs_iunlink_insert_inode(
>  	xfs_agino_t		next_agino;
>  	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
>  	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
> -	short			bucket_index = agino % XFS_AGI_UNLINKED_BUCKETS;
>  	int			error;
>  
>  	agi = agibp->b_addr;
>  
>  	/*
> -	 * Get the index into the agi hash table for the list this inode will
> -	 * go on.  Make sure the pointer isn't garbage and that this inode
> -	 * isn't already on the list.
> +	 * We don't need to traverse the on disk unlinked list to find the
> +	 * previous inode in the list when removing inodes anymore, so we don't
> +	 * need multiple on-disk lists anymore. Hence we always use bucket 0.
> +	 * Make sure the pointer isn't garbage and that this inode isn't already
> +	 * on the list.
>  	 */
> -	next_agino = be32_to_cpu(agi->agi_unlinked[bucket_index]);
> +	next_agino = be32_to_cpu(agi->agi_unlinked[0]);
>  	if (next_agino == agino ||
>  	    !xfs_verify_agino_or_null(mp, agno, next_agino)) {
>  		xfs_buf_mark_corrupt(agibp);
> @@ -2256,7 +2265,7 @@ xfs_iunlink_insert_inode(
>  	}
>  
>  	/* Point the head of the list to point to this inode. */
> -	return xfs_iunlink_update_bucket(tp, agno, agibp, bucket_index, agino);
> +	return xfs_iunlink_update_bucket(tp, agno, agibp, next_agino, agino);
>  }
>  
>  /*
> @@ -2416,16 +2425,17 @@ xfs_iunlink_remove_inode(
>  	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
>  	xfs_agino_t		next_agino;
>  	xfs_agino_t		head_agino;
> -	short			bucket_index = agino % XFS_AGI_UNLINKED_BUCKETS;
>  	int			error;
>  
>  	agi = agibp->b_addr;
>  
>  	/*
> -	 * Get the index into the agi hash table for the list this inode will
> -	 * go on.  Make sure the head pointer isn't garbage.
> +	 * We don't need to traverse the on disk unlinked list to find the
> +	 * previous inode in the list when removing inodes anymore, so we don't
> +	 * need multiple on-disk lists anymore. Hence we always use bucket 0.
> +	 * Make sure the head pointer isn't garbage.
>  	 */
> -	head_agino = be32_to_cpu(agi->agi_unlinked[bucket_index]);
> +	head_agino = be32_to_cpu(agi->agi_unlinked[0]);
>  	if (!xfs_verify_agino(mp, agno, head_agino)) {
>  		XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp,
>  				agi, sizeof(*agi));
> @@ -2483,8 +2493,7 @@ xfs_iunlink_remove_inode(
>  	}
>  
>  	/* Point the head of the list to the next unlinked inode. */
> -	return xfs_iunlink_update_bucket(tp, agno, agibp, bucket_index,
> -			next_agino);
> +	return xfs_iunlink_update_bucket(tp, agno, agibp, agino, next_agino);
>  }
>  
>  /*
> -- 
> 2.26.2.761.g0e0b3e54be
> 


* Re: [PATCH 05/13] xfs: add unlink list pointers to xfs_inode
  2020-08-12  9:25 ` [PATCH 05/13] xfs: add unlink list pointers to xfs_inode Dave Chinner
@ 2020-08-19  0:02   ` Darrick J. Wong
  2020-08-19  0:47     ` Dave Chinner
  2020-08-22  9:03   ` Christoph Hellwig
  1 sibling, 1 reply; 51+ messages in thread
From: Darrick J. Wong @ 2020-08-19  0:02 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Aug 12, 2020 at 07:25:48PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> To move away from using the on disk inode buffers to track and log
> unlinked inodes, we need pointers to track them in memory. Because
> we have arbitrary remove order from the list, it needs to be a
> double linked list.
> 
> We start by noting that inodes are always in memory when they are
> active on the unlinked list, and hence we can track these inodes
> without needing to take references to the inodes or store them in
> the list. We cannot, however, use inode locks to protect the inodes
> on the list - the list needs an external lock to serialise all
> inserts and removals. We can use the existing AGI buffer lock for
> this right now as that already serialises all unlinked list
> traversals and modifications.
> 
> Hence we can convert the in-memory unlinked list to a simple
> list_head list in the perag. We can use list_empty() to detect an
> empty unlinked list, likewise we can detect the end of the list when
> the inode next pointer points back to the perag list_head. This
> makes insert, remove and traversal simple.
> 
> The only complication here is log recovery of old filesystems that
> have multiple lists. These always remove from the head of the list,
> so we can easily construct just enough of the unlinked list for
> recovery from any list to work correctly.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
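
The list behaviour the commit message relies on (empty detection, end-of-list detection, removal in arbitrary order) can be illustrated with a small userspace model of a circular doubly linked list. Everything prefixed sk_ below is a hypothetical stand-in written for this sketch; it mimics the shape of the kernel's list_head helpers but is not the kernel implementation.

```c
#include <stdbool.h>
#include <stddef.h>

/* Circular doubly linked list node, modelled on the kernel's list_head. */
struct sk_list {
	struct sk_list	*next;
	struct sk_list	*prev;
};

static void sk_list_init(struct sk_list *h)
{
	h->next = h;
	h->prev = h;
}

/* An empty list is a head that points back at itself. */
static bool sk_list_empty(const struct sk_list *h)
{
	return h->next == h;
}

/* Insert @n immediately after @h, i.e. at the front of the list. */
static void sk_list_add(struct sk_list *n, struct sk_list *h)
{
	n->next = h->next;
	n->prev = h;
	h->next->prev = n;
	h->next = n;
}

/* Unlink @n from wherever it sits; no traversal is needed. */
static void sk_list_del(struct sk_list *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	sk_list_init(n);	/* this sketch re-initialises the node */
}

/* Hypothetical stand-in for struct xfs_inode with its i_unlink member. */
struct sk_inode {
	unsigned int	agino;
	struct sk_list	i_unlink;
};

#define sk_inode_of(ptr) \
	((struct sk_inode *)((char *)(ptr) - offsetof(struct sk_inode, i_unlink)))

/* The tail is the entry whose next pointer is the list head itself. */
static bool sk_is_last(const struct sk_inode *ip, const struct sk_list *head)
{
	return ip->i_unlink.next == head;
}
```

In a circular list of this kind the tail's next pointer is never NULL, so tail detection has to compare against the head, as the commit message describes.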

Hm.  This is orthogonal to this patch, but should we get meaner about
failing the mount if the AGI read fails or the unlinked walk fails?

For this patch,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/xfs/xfs_icache.c      |   1 +
>  fs/xfs/xfs_inode.c       |  21 ++++-
>  fs/xfs/xfs_inode.h       |   1 +
>  fs/xfs/xfs_log_recover.c | 179 +++++++++++++++++++++++----------------
>  fs/xfs/xfs_mount.c       |   1 +
>  fs/xfs/xfs_mount.h       |   1 +
>  6 files changed, 130 insertions(+), 74 deletions(-)
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 5cdded02cdc8..0c04a66bf88d 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -66,6 +66,7 @@ xfs_inode_alloc(
>  	memset(&ip->i_d, 0, sizeof(ip->i_d));
>  	ip->i_sick = 0;
>  	ip->i_checked = 0;
> +	INIT_LIST_HEAD(&ip->i_unlink);
>  	INIT_WORK(&ip->i_ioend_work, xfs_end_io);
>  	INIT_LIST_HEAD(&ip->i_ioend_list);
>  	spin_lock_init(&ip->i_ioend_lock);
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index fa92bdf6e0da..dcf80ac51267 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -2294,7 +2294,17 @@ xfs_iunlink(
>  	if (error)
>  		return error;
>  
> -	return xfs_iunlink_insert_inode(tp, agibp, ip);
> +	/*
> +	 * Insert the inode into the on disk unlinked list, and if that
> +	 * succeeds, then insert it into the in memory list. We do it in this
> +	 * order so that the modifications required to the on disk list are not
> +	 * impacted by already having this inode in the list.
> +	 */
> +	error = xfs_iunlink_insert_inode(tp, agibp, ip);
> +	if (!error)
> +		list_add(&ip->i_unlink, &agibp->b_pag->pag_ici_unlink_list);
> +
> +	return error;
>  }
>  
>  /* Return the imap, dinode pointer, and buffer for an inode. */
> @@ -2516,7 +2526,14 @@ xfs_iunlink_remove(
>  	if (error)
>  		return error;
>  
> -	return xfs_iunlink_remove_inode(tp, agibp, ip);
> +	/*
> +	 * Remove the inode from the on-disk list and then remove it from the
> +	 * in-memory list. This order of operations ensures we can look up both
> +	 * next and previous inode in the on-disk list via the in-memory list.
> +	 */
> +	error = xfs_iunlink_remove_inode(tp, agibp, ip);
> +	list_del(&ip->i_unlink);
> +	return error;
>  }
>  
>  /*
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index 5ea962c6cf98..73f36908a1ce 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -56,6 +56,7 @@ typedef struct xfs_inode {
>  	uint64_t		i_delayed_blks;	/* count of delay alloc blks */
>  
>  	struct xfs_icdinode	i_d;		/* most of ondisk inode */
> +	struct list_head	i_unlink;
>  
>  	/* VFS inode */
>  	struct inode		i_vnode;	/* embedded VFS inode */
> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> index e2ec91b2d0f4..b3481f4e2f96 100644
> --- a/fs/xfs/xfs_log_recover.c
> +++ b/fs/xfs/xfs_log_recover.c
> @@ -2682,11 +2682,11 @@ xlog_recover_clear_agi_bucket(
>  	return;
>  }
>  
> -STATIC xfs_agino_t
> -xlog_recover_process_one_iunlink(
> +static struct xfs_inode *
> +xlog_recover_get_one_iunlink(
>  	struct xfs_mount		*mp,
>  	xfs_agnumber_t			agno,
> -	xfs_agino_t			agino,
> +	xfs_agino_t			*agino,
>  	int				bucket)
>  {
>  	struct xfs_buf			*ibp;
> @@ -2695,48 +2695,35 @@ xlog_recover_process_one_iunlink(
>  	xfs_ino_t			ino;
>  	int				error;
>  
> -	ino = XFS_AGINO_TO_INO(mp, agno, agino);
> +	ino = XFS_AGINO_TO_INO(mp, agno, *agino);
>  	error = xfs_iget(mp, NULL, ino, 0, 0, &ip);
>  	if (error)
> -		goto fail;
> +		return NULL;
>  
>  	/*
> -	 * Get the on disk inode to find the next inode in the bucket.
> +	 * Get the on disk inode to find the next inode in the bucket. Should
> +	 * not fail because we just read the inode from this buffer, but if it
> +	 * does then we still have to allow the caller to set up and release
> +	 * the inode we just looked up. Make sure the list walk terminates here,
> +	 * though.
>  	 */
>  	error = xfs_imap_to_bp(mp, NULL, &ip->i_imap, &dip, &ibp, 0);
> -	if (error)
> -		goto fail_iput;
> +	if (error) {
> +		ASSERT(0);
> +		*agino = NULLAGINO;
> +		return ip;
> +	}
> +
>  
>  	xfs_iflags_clear(ip, XFS_IRECOVERY);
>  	ASSERT(VFS_I(ip)->i_nlink == 0);
>  	ASSERT(VFS_I(ip)->i_mode != 0);
>  
> -	/* setup for the next pass */
> -	agino = be32_to_cpu(dip->di_next_unlinked);
> +	/* Get the next inode we will be looking up. */
> +	*agino = be32_to_cpu(dip->di_next_unlinked);
>  	xfs_buf_relse(ibp);
>  
> -	/*
> -	 * Prevent any DMAPI event from being sent when the reference on
> -	 * the inode is dropped.
> -	 */
> -	ip->i_d.di_dmevmask = 0;
> -
> -	xfs_irele(ip);
> -	return agino;
> -
> - fail_iput:
> -	xfs_irele(ip);
> - fail:
> -	/*
> -	 * We can't read in the inode this bucket points to, or this inode
> -	 * is messed up.  Just ditch this bucket of inodes.  We will lose
> -	 * some inodes and space, but at least we won't hang.
> -	 *
> -	 * Call xlog_recover_clear_agi_bucket() to perform a transaction to
> -	 * clear the inode pointer in the bucket.
> -	 */
> -	xlog_recover_clear_agi_bucket(mp, agno, bucket);
> -	return NULLAGINO;
> +	return ip;
>  }
>  
>  /*
> @@ -2762,58 +2749,106 @@ xlog_recover_process_one_iunlink(
>   * scheduled on this CPU to ensure other scheduled work can run without undue
>   * latency.
>   */
> -STATIC void
> -xlog_recover_process_iunlinks(
> -	struct xlog	*log)
> +static int
> +xlog_recover_iunlinks_ag(
> +	struct xfs_mount	*mp,
> +	xfs_agnumber_t		agno)
>  {
> -	xfs_mount_t	*mp;
> -	xfs_agnumber_t	agno;
> -	xfs_agi_t	*agi;
> -	xfs_buf_t	*agibp;
> -	xfs_agino_t	agino;
> -	int		bucket;
> -	int		error;
> +	struct xfs_agi		*agi;
> +	struct xfs_buf		*agibp;
> +	int			bucket;
> +	int			error;
>  
> -	mp = log->l_mp;
> +	/*
> +	 * Find the agi for this ag.
> +	 */
> +	error = xfs_read_agi(mp, NULL, agno, &agibp);
> +	if (error) {
>  
> -	for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
>  		/*
> -		 * Find the agi for this ag.
> +		 * AGI is b0rked. Don't process it.
> +		 *
> +		 * We should probably mark the filesystem as corrupt after we've
> +		 * recovered all the ag's we can....
>  		 */
> -		error = xfs_read_agi(mp, NULL, agno, &agibp);
> +		return 0;
> +	}
> +
> +	/*
> +	 * Unlock the buffer so that it can be acquired in the normal course of
> +	 * the transaction to truncate and free each inode.  Because we are not
> +	 * racing with anyone else here for the AGI buffer, we don't even need
> +	 * to hold it locked to read the initial unlinked bucket entries out of
> +	 * the buffer. We keep buffer reference though, so that it stays pinned
> +	 * in memory while we need the buffer.
> +	 */
> +	agi = agibp->b_addr;
> +	xfs_buf_unlock(agibp);
> +
> +	/*
> +	 * The unlinked inode list is maintained on incore inodes as a double
> +	 * linked list. We don't have any of that state in memory, so we have to
> +	 * create it as we go. This is simple as we are only removing from the
> +	 * head of the list and that means we only need to pull the current
> +	 * inode in and the next inode.  Inodes are unlinked when their
> +	 * reference count goes to zero, so we can overlap the xfs_iget() and
> +	 * xfs_irele() calls so we always have the first two inodes on the list
> +	 * in memory. Hence we can fake up the necessary in memory state for the
> +	 * unlink to "just work".
> +	 */
> +	for (bucket = 0; bucket < XFS_AGI_UNLINKED_BUCKETS; bucket++) {
> +		struct xfs_inode	*ip, *prev_ip = NULL;
> +		xfs_agino_t		agino;
> +
> +		agino = be32_to_cpu(agi->agi_unlinked[bucket]);
> +		while (agino != NULLAGINO) {
> +			ip = xlog_recover_get_one_iunlink(mp, agno, &agino,
> +							  bucket);
> +			if (!ip) {
> +				/*
> +				 * something busted, but still got to release
> +				 * prev_ip, so make it look like it's at the end
> +				 * of the list before it gets released.
> +				 */
> +				error = -EFSCORRUPTED;
> +				break;
> +			}
> +			list_add_tail(&ip->i_unlink,
> +					&agibp->b_pag->pag_ici_unlink_list);
> +			if (prev_ip)
> +				xfs_irele(prev_ip);
> +			prev_ip = ip;
> +			cond_resched();
> +		}
> +		if (prev_ip)
> +			xfs_irele(prev_ip);
>  		if (error) {
>  			/*
> -			 * AGI is b0rked. Don't process it.
> -			 *
> -			 * We should probably mark the filesystem as corrupt
> -			 * after we've recovered all the ag's we can....
> +			 * We can't read an inode this bucket points to, or an
> +			 * inode is messed up.  Just ditch this bucket of
> +			 * inodes.  We will lose some inodes and space, but at
> +			 * least we won't hang.
>  			 */
> -			continue;
> -		}
> -		/*
> -		 * Unlock the buffer so that it can be acquired in the normal
> -		 * course of the transaction to truncate and free each inode.
> -		 * Because we are not racing with anyone else here for the AGI
> -		 * buffer, we don't even need to hold it locked to read the
> -		 * initial unlinked bucket entries out of the buffer. We keep
> -		 * buffer reference though, so that it stays pinned in memory
> -		 * while we need the buffer.
> -		 */
> -		agi = agibp->b_addr;
> -		xfs_buf_unlock(agibp);
> -
> -		for (bucket = 0; bucket < XFS_AGI_UNLINKED_BUCKETS; bucket++) {
> -			agino = be32_to_cpu(agi->agi_unlinked[bucket]);
> -			while (agino != NULLAGINO) {
> -				agino = xlog_recover_process_one_iunlink(mp,
> -							agno, agino, bucket);
> -				cond_resched();
> -			}
> +			xlog_recover_clear_agi_bucket(mp, agno, bucket);
> +			break;
>  		}
> -		xfs_buf_rele(agibp);
>  	}
> +	xfs_buf_rele(agibp);
> +	return error;
>  }
>  
> +void
> +xlog_recover_process_iunlinks(
> +       struct xlog             *log)
> +{
> +       struct xfs_mount        *mp = log->l_mp;
> +       xfs_agnumber_t          agno;
> +
> +       for (agno = 0; agno < mp->m_sb.sb_agcount; agno++)
> +               xlog_recover_iunlinks_ag(mp, agno);
> +}
> +
> +
>  STATIC void
>  xlog_unpack_data(
>  	struct xlog_rec_header	*rhead,
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index bbfd1d5b1c04..2def15297a5f 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -200,6 +200,7 @@ xfs_initialize_perag(
>  		pag->pag_mount = mp;
>  		spin_lock_init(&pag->pag_ici_lock);
>  		INIT_RADIX_TREE(&pag->pag_ici_root, GFP_ATOMIC);
> +		INIT_LIST_HEAD(&pag->pag_ici_unlink_list);
>  		if (xfs_buf_hash_init(pag))
>  			goto out_free_pag;
>  		init_waitqueue_head(&pag->pagb_wait);
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index a72cfcaa4ad1..c35a6c463529 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -355,6 +355,7 @@ typedef struct xfs_perag {
>  	struct radix_tree_root pag_ici_root;	/* incore inode cache root */
>  	int		pag_ici_reclaimable;	/* reclaimable inodes */
>  	unsigned long	pag_ici_reclaim_cursor;	/* reclaim restart point */
> +	struct list_head pag_ici_unlink_list;	/* unlinked inode list */
>  
>  	/* buffer cache index */
>  	spinlock_t	pag_buf_lock;	/* lock for pag_buf_hash */
> -- 
> 2.26.2.761.g0e0b3e54be
> 


* Re: [PATCH 06/13] xfs: replace iunlink backref lookups with list lookups
  2020-08-12  9:25 ` [PATCH 06/13] xfs: replace iunlink backref lookups with list lookups Dave Chinner
@ 2020-08-19  0:13   ` Darrick J. Wong
  2020-08-19  0:52     ` Dave Chinner
  0 siblings, 1 reply; 51+ messages in thread
From: Darrick J. Wong @ 2020-08-19  0:13 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Aug 12, 2020 at 07:25:49PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Now we have an in memory linked list of all the inodes on the
> unlinked list, use that to look up inodes in the list that we need
> to modify when adding or removing from the list.
> 
> This means we are no longer using the backref cache to maintain the
> previous inode lookups, so we can remove all that infrastructure
> now.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_error.c |   2 -
>  fs/xfs/xfs_inode.c | 327 ++++++++-------------------------------------
>  fs/xfs/xfs_inode.h |   3 -
>  fs/xfs/xfs_mount.c |   5 -
>  fs/xfs/xfs_trace.h |   1 -
>  5 files changed, 54 insertions(+), 284 deletions(-)
> 
> diff --git a/fs/xfs/xfs_error.c b/fs/xfs/xfs_error.c
> index 7f6e20899473..829a89418830 100644
> --- a/fs/xfs/xfs_error.c
> +++ b/fs/xfs/xfs_error.c
> @@ -162,7 +162,6 @@ XFS_ERRORTAG_ATTR_RW(log_item_pin,	XFS_ERRTAG_LOG_ITEM_PIN);
>  XFS_ERRORTAG_ATTR_RW(buf_lru_ref,	XFS_ERRTAG_BUF_LRU_REF);
>  XFS_ERRORTAG_ATTR_RW(force_repair,	XFS_ERRTAG_FORCE_SCRUB_REPAIR);
>  XFS_ERRORTAG_ATTR_RW(bad_summary,	XFS_ERRTAG_FORCE_SUMMARY_RECALC);
> -XFS_ERRORTAG_ATTR_RW(iunlink_fallback,	XFS_ERRTAG_IUNLINK_FALLBACK);
>  XFS_ERRORTAG_ATTR_RW(buf_ioerror,	XFS_ERRTAG_BUF_IOERROR);
>  
>  static struct attribute *xfs_errortag_attrs[] = {
> @@ -200,7 +199,6 @@ static struct attribute *xfs_errortag_attrs[] = {
>  	XFS_ERRORTAG_ATTR_LIST(buf_lru_ref),
>  	XFS_ERRORTAG_ATTR_LIST(force_repair),
>  	XFS_ERRORTAG_ATTR_LIST(bad_summary),
> -	XFS_ERRORTAG_ATTR_LIST(iunlink_fallback),
>  	XFS_ERRORTAG_ATTR_LIST(buf_ioerror),
>  	NULL,
>  };
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index dcf80ac51267..2c930de99561 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -1893,196 +1893,29 @@ xfs_inactive(
>   * because we must walk that list to find the inode that points to the inode
>   * being removed from the unlinked hash bucket list.
>   *
> - * What if we modelled the unlinked list as a collection of records capturing
> - * "X.next_unlinked = Y" relations?  If we indexed those records on Y, we'd
> - * have a fast way to look up unlinked list predecessors, which avoids the
> - * slow list walk.  That's exactly what we do here (in-core) with a per-AG
> - * rhashtable.
> + * However, inodes that are on the unlinked list are also guaranteed to be in
> + * memory as they are loaded and then pinned in memory by whatever holds
> + * references to the inode to perform the unlink. Same goes for the O_TMPFILE
> + * usage of the unlinked list - those files are pinned in memory by an open file
> + * descriptor. Hence the inodes on the list are pinned in memory until they are
> + * removed from the list.
>   *
> - * Because this is a backref cache, we ignore operational failures since the
> - * iunlink code can fall back to the slow bucket walk.  The only errors that
> - * should bubble out are for obviously incorrect situations.
> + * That means we can simply use an in-memory double linked list to track inodes
> + * on the unlinked list. As we've removed the scalability problem resulting from
> + * removal on a single linked list requiring traversal, we also no longer use
> + * the on-disk hash to keep traversals short. We just use a single list on disk
> + * now, and track the previous inode in the list in memory.
>   *
> - * All users of the backref cache MUST hold the AGI buffer lock to serialize
> - * access or have otherwise provided for concurrency control.
> - */
> -
> -/* Capture a "X.next_unlinked = Y" relationship. */
> -struct xfs_iunlink {
> -	struct rhash_head	iu_rhash_head;
> -	xfs_agino_t		iu_agino;		/* X */
> -	xfs_agino_t		iu_next_unlinked;	/* Y */
> -};
> -
> -/* Unlinked list predecessor lookup hashtable construction */
> -static int
> -xfs_iunlink_obj_cmpfn(
> -	struct rhashtable_compare_arg	*arg,
> -	const void			*obj)
> -{
> -	const xfs_agino_t		*key = arg->key;
> -	const struct xfs_iunlink	*iu = obj;
> -
> -	if (iu->iu_next_unlinked != *key)
> -		return 1;
> -	return 0;
> -}
> -
> -static const struct rhashtable_params xfs_iunlink_hash_params = {
> -	.min_size		= XFS_AGI_UNLINKED_BUCKETS,
> -	.key_len		= sizeof(xfs_agino_t),
> -	.key_offset		= offsetof(struct xfs_iunlink,
> -					   iu_next_unlinked),
> -	.head_offset		= offsetof(struct xfs_iunlink, iu_rhash_head),
> -	.automatic_shrinking	= true,
> -	.obj_cmpfn		= xfs_iunlink_obj_cmpfn,
> -};
> -
> -/*
> - * Return X, where X.next_unlinked == @agino.  Returns NULLAGINO if no such
> - * relation is found.
> + * To provide the guarantee that inodes are always on this in memory list, log
> + * recovery does what is necessary to populate the list sufficient to perform
> + * removal from the head of the list correctly. As such, we can now always rely
> + * on the in-memory list and if it differs from what we find on disk then we
> + * have a memory corruption problem or a software bug and so mismatches are now
> + * considered EFSCORRUPTED errors and are not recoverable.
> + *
> + * All users of the unlinked list MUST hold the AGI buffer lock to serialize
> + * access to the list.
>   */
> -static xfs_agino_t
> -xfs_iunlink_lookup_backref(
> -	struct xfs_perag	*pag,
> -	xfs_agino_t		agino)
> -{
> -	struct xfs_iunlink	*iu;
> -
> -	iu = rhashtable_lookup_fast(&pag->pagi_unlinked_hash, &agino,
> -			xfs_iunlink_hash_params);
> -	return iu ? iu->iu_agino : NULLAGINO;
> -}
> -
> -/*
> - * Take ownership of an iunlink cache entry and insert it into the hash table.
> - * If successful, the entry will be owned by the cache; if not, it is freed.
> - * Either way, the caller does not own @iu after this call.
> - */
> -static int
> -xfs_iunlink_insert_backref(
> -	struct xfs_perag	*pag,
> -	struct xfs_iunlink	*iu)
> -{
> -	int			error;
> -
> -	error = rhashtable_insert_fast(&pag->pagi_unlinked_hash,
> -			&iu->iu_rhash_head, xfs_iunlink_hash_params);
> -	/*
> -	 * Fail loudly if there already was an entry because that's a sign of
> -	 * corruption of in-memory data.  Also fail loudly if we see an error
> -	 * code we didn't anticipate from the rhashtable code.  Currently we
> -	 * only anticipate ENOMEM.
> -	 */
> -	if (error) {
> -		WARN(error != -ENOMEM, "iunlink cache insert error %d", error);
> -		kmem_free(iu);
> -	}
> -	/*
> -	 * Absorb any runtime errors that aren't a result of corruption because
> -	 * this is a cache and we can always fall back to bucket list scanning.
> -	 */
> -	if (error != 0 && error != -EEXIST)
> -		error = 0;
> -	return error;
> -}
> -
> -/* Remember that @prev_agino.next_unlinked = @this_agino. */
> -static int
> -xfs_iunlink_add_backref(
> -	struct xfs_perag	*pag,
> -	xfs_agino_t		prev_agino,
> -	xfs_agino_t		this_agino)
> -{
> -	struct xfs_iunlink	*iu;
> -
> -	if (XFS_TEST_ERROR(false, pag->pag_mount, XFS_ERRTAG_IUNLINK_FALLBACK))
> -		return 0;
> -
> -	iu = kmem_zalloc(sizeof(*iu), KM_NOFS);
> -	iu->iu_agino = prev_agino;
> -	iu->iu_next_unlinked = this_agino;
> -
> -	return xfs_iunlink_insert_backref(pag, iu);
> -}
> -
> -/*
> - * Replace X.next_unlinked = @agino with X.next_unlinked = @next_unlinked.
> - * If @next_unlinked is NULLAGINO, we drop the backref and exit.  If there
> - * wasn't any such entry then we don't bother.
> - */
> -static int
> -xfs_iunlink_change_backref(
> -	struct xfs_perag	*pag,
> -	xfs_agino_t		agino,
> -	xfs_agino_t		next_unlinked)
> -{
> -	struct xfs_iunlink	*iu;
> -	int			error;
> -
> -	/* Look up the old entry; if there wasn't one then exit. */
> -	iu = rhashtable_lookup_fast(&pag->pagi_unlinked_hash, &agino,
> -			xfs_iunlink_hash_params);
> -	if (!iu)
> -		return 0;
> -
> -	/*
> -	 * Remove the entry.  This shouldn't ever return an error, but if we
> -	 * couldn't remove the old entry we don't want to add it again to the
> -	 * hash table, and if the entry disappeared on us then someone's
> -	 * violated the locking rules and we need to fail loudly.  Either way
> -	 * we cannot remove the inode because internal state is or would have
> -	 * been corrupt.
> -	 */
> -	error = rhashtable_remove_fast(&pag->pagi_unlinked_hash,
> -			&iu->iu_rhash_head, xfs_iunlink_hash_params);
> -	if (error)
> -		return error;
> -
> -	/* If there is no new next entry just free our item and return. */
> -	if (next_unlinked == NULLAGINO) {
> -		kmem_free(iu);
> -		return 0;
> -	}
> -
> -	/* Update the entry and re-add it to the hash table. */
> -	iu->iu_next_unlinked = next_unlinked;
> -	return xfs_iunlink_insert_backref(pag, iu);
> -}
> -
> -/* Set up the in-core predecessor structures. */
> -int
> -xfs_iunlink_init(
> -	struct xfs_perag	*pag)
> -{
> -	return rhashtable_init(&pag->pagi_unlinked_hash,
> -			&xfs_iunlink_hash_params);
> -}
> -
> -/* Free the in-core predecessor structures. */
> -static void
> -xfs_iunlink_free_item(
> -	void			*ptr,
> -	void			*arg)
> -{
> -	struct xfs_iunlink	*iu = ptr;
> -	bool			*freed_anything = arg;
> -
> -	*freed_anything = true;
> -	kmem_free(iu);
> -}
> -
> -void
> -xfs_iunlink_destroy(
> -	struct xfs_perag	*pag)
> -{
> -	bool			freed_anything = false;
> -
> -	rhashtable_free_and_destroy(&pag->pagi_unlinked_hash,
> -			xfs_iunlink_free_item, &freed_anything);
> -
> -	ASSERT(freed_anything == false || XFS_FORCED_SHUTDOWN(pag->pag_mount));
> -}
>  
>  /*
>   * Point the AGI unlinked bucket at an inode and log the results.  The caller
> @@ -2221,6 +2054,7 @@ xfs_iunlink_insert_inode(
>  {
>  	struct xfs_mount	*mp = tp->t_mountp;
>  	struct xfs_agi		*agi;
> +	struct xfs_inode	*nip;
>  	xfs_agino_t		next_agino;
>  	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
>  	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
> @@ -2242,9 +2076,13 @@ xfs_iunlink_insert_inode(
>  		return -EFSCORRUPTED;
>  	}
>  
> -	if (next_agino != NULLAGINO) {
> +	nip = list_first_entry_or_null(&agibp->b_pag->pag_ici_unlink_list,
> +					struct xfs_inode, i_unlink);
> +	if (nip) {
>  		xfs_agino_t		old_agino;
>  
> +		ASSERT(next_agino == XFS_INO_TO_AGINO(mp, nip->i_ino));
> +
>  		/*
>  		 * There is already another inode in the bucket, so point this
>  		 * inode to the current head of the list.
> @@ -2254,14 +2092,8 @@ xfs_iunlink_insert_inode(
>  		if (error)
>  			return error;
>  		ASSERT(old_agino == NULLAGINO);
> -
> -		/*
> -		 * agino has been unlinked, add a backref from the next inode
> -		 * back to agino.
> -		 */
> -		error = xfs_iunlink_add_backref(agibp->b_pag, agino, next_agino);
> -		if (error)
> -			return error;
> +	} else {
> +		ASSERT(next_agino == NULLAGINO);
>  	}
>  
>  	/* Point the head of the list to point to this inode. */
> @@ -2354,70 +2186,24 @@ xfs_iunlink_map_prev(
>  	xfs_agnumber_t		agno,
>  	xfs_agino_t		head_agino,
>  	xfs_agino_t		target_agino,
> -	xfs_agino_t		*agino,
> +	xfs_agino_t		agino,
>  	struct xfs_imap		*imap,
>  	struct xfs_dinode	**dipp,
>  	struct xfs_buf		**bpp,
>  	struct xfs_perag	*pag)
>  {
> -	struct xfs_mount	*mp = tp->t_mountp;
> -	xfs_agino_t		next_agino;
>  	int			error;
>  
>  	ASSERT(head_agino != target_agino);
>  	*bpp = NULL;
>  
> -	/* See if our backref cache can find it faster. */
> -	*agino = xfs_iunlink_lookup_backref(pag, target_agino);
> -	if (*agino != NULLAGINO) {
> -		error = xfs_iunlink_map_ino(tp, agno, *agino, imap, dipp, bpp);
> -		if (error)
> -			return error;
> -
> -		if (be32_to_cpu((*dipp)->di_next_unlinked) == target_agino)
> -			return 0;
> -
> -		/*
> -		 * If we get here the cache contents were corrupt, so drop the
> -		 * buffer and fall back to walking the bucket list.
> -		 */
> -		xfs_trans_brelse(tp, *bpp);
> -		*bpp = NULL;
> -		WARN_ON_ONCE(1);
> -	}
> -
> -	trace_xfs_iunlink_map_prev_fallback(mp, agno);
> -
> -	/* Otherwise, walk the entire bucket until we find it. */
> -	next_agino = head_agino;
> -	while (next_agino != target_agino) {
> -		xfs_agino_t	unlinked_agino;
> -
> -		if (*bpp)
> -			xfs_trans_brelse(tp, *bpp);
> -
> -		*agino = next_agino;
> -		error = xfs_iunlink_map_ino(tp, agno, next_agino, imap, dipp,
> -				bpp);
> -		if (error)
> -			return error;
> -
> -		unlinked_agino = be32_to_cpu((*dipp)->di_next_unlinked);
> -		/*
> -		 * Make sure this pointer is valid and isn't an obvious
> -		 * infinite loop.
> -		 */
> -		if (!xfs_verify_agino(mp, agno, unlinked_agino) ||
> -		    next_agino == unlinked_agino) {
> -			XFS_CORRUPTION_ERROR(__func__,
> -					XFS_ERRLEVEL_LOW, mp,
> -					*dipp, sizeof(**dipp));
> -			error = -EFSCORRUPTED;
> -			return error;
> -		}
> -		next_agino = unlinked_agino;
> -	}
> +	ASSERT(agino != NULLAGINO);
> +	error = xfs_iunlink_map_ino(tp, agno, agino, imap, dipp, bpp);
> +	if (error)
> +		return error;
>  
> +	if (be32_to_cpu((*dipp)->di_next_unlinked) != target_agino)
> +		return -EFSCORRUPTED;

Why drop the corruption report here?

--D

>  	return 0;
>  }
>  
> @@ -2461,27 +2247,31 @@ xfs_iunlink_remove_inode(
>  	if (error)
>  		return error;
>  
> -	/*
> -	 * If there was a backref pointing from the next inode back to this
> -	 * one, remove it because we've removed this inode from the list.
> -	 *
> -	 * Later, if this inode was in the middle of the list we'll update
> -	 * this inode's backref to point from the next inode.
> -	 */
> -	if (next_agino != NULLAGINO) {
> -		error = xfs_iunlink_change_backref(agibp->b_pag, next_agino,
> -				NULLAGINO);
> -		if (error)
> -			return error;
> +#ifdef DEBUG
> +	{
> +	struct xfs_inode *nip = list_next_entry(ip, i_unlink);
> +	if (nip)
> +		ASSERT(next_agino == XFS_INO_TO_AGINO(mp, nip->i_ino));
> +	else
> +		ASSERT(next_agino == NULLAGINO);
>  	}
> +#endif
> +
> +	if (ip != list_first_entry(&agibp->b_pag->pag_ici_unlink_list,
> +					struct xfs_inode, i_unlink)) {
>  
> -	if (head_agino != agino) {
> +		struct xfs_inode *pip;
>  		struct xfs_imap	imap;
>  		xfs_agino_t	prev_agino;
>  
> +		ASSERT(head_agino != agino);
> +
> +		pip = list_prev_entry(ip, i_unlink);
> +		prev_agino = XFS_INO_TO_AGINO(mp, pip->i_ino);
> +
>  		/* We need to search the list for the inode being freed. */
>  		error = xfs_iunlink_map_prev(tp, agno, head_agino, agino,
> -				&prev_agino, &imap, &last_dip, &last_ibp,
> +				prev_agino, &imap, &last_dip, &last_ibp,
>  				agibp->b_pag);
>  		if (error)
>  			return error;
> @@ -2490,16 +2280,7 @@ xfs_iunlink_remove_inode(
>  		xfs_iunlink_update_dinode(tp, agno, prev_agino, last_ibp,
>  				last_dip, &imap, next_agino);
>  
> -		/*
> -		 * Now we deal with the backref for this inode.  If this inode
> -		 * pointed at a real inode, change the backref that pointed to
> -		 * us to point to our old next.  If this inode was the end of
> -		 * the list, delete the backref that pointed to us.  Note that
> -		 * change_backref takes care of deleting the backref if
> -		 * next_agino is NULLAGINO.
> -		 */
> -		return xfs_iunlink_change_backref(agibp->b_pag, agino,
> -				next_agino);
> +		return 0;
>  	}
>  
>  	/* Point the head of the list to the next unlinked inode. */
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index 73f36908a1ce..7f8fbb7c8594 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -464,9 +464,6 @@ extern struct kmem_zone	*xfs_inode_zone;
>  /* The default CoW extent size hint. */
>  #define XFS_DEFAULT_COWEXTSZ_HINT 32
>  
> -int xfs_iunlink_init(struct xfs_perag *pag);
> -void xfs_iunlink_destroy(struct xfs_perag *pag);
> -
>  void xfs_end_io(struct work_struct *work);
>  
>  int xfs_ilock2_io_mmap(struct xfs_inode *ip1, struct xfs_inode *ip2);
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index 2def15297a5f..f28c969af272 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -146,7 +146,6 @@ xfs_free_perag(
>  		spin_unlock(&mp->m_perag_lock);
>  		ASSERT(pag);
>  		ASSERT(atomic_read(&pag->pag_ref) == 0);
> -		xfs_iunlink_destroy(pag);
>  		xfs_buf_hash_destroy(pag);
>  		call_rcu(&pag->rcu_head, __xfs_free_perag);
>  	}
> @@ -224,9 +223,6 @@ xfs_initialize_perag(
>  		/* first new pag is fully initialized */
>  		if (first_initialised == NULLAGNUMBER)
>  			first_initialised = index;
> -		error = xfs_iunlink_init(pag);
> -		if (error)
> -			goto out_hash_destroy;
>  		spin_lock_init(&pag->pag_state_lock);
>  	}
>  
> @@ -249,7 +245,6 @@ xfs_initialize_perag(
>  		if (!pag)
>  			break;
>  		xfs_buf_hash_destroy(pag);
> -		xfs_iunlink_destroy(pag);
>  		kmem_free(pag);
>  	}
>  	return error;
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index abb1d859f226..acddc60f6d88 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -3514,7 +3514,6 @@ DEFINE_EVENT(xfs_ag_inode_class, name, \
>  	TP_ARGS(ip))
>  DEFINE_AGINODE_EVENT(xfs_iunlink);
>  DEFINE_AGINODE_EVENT(xfs_iunlink_remove);
> -DEFINE_AG_EVENT(xfs_iunlink_map_prev_fallback);
>  
>  DECLARE_EVENT_CLASS(xfs_fs_corrupt_class,
>  	TP_PROTO(struct xfs_mount *mp, unsigned int flags),
> -- 
> 2.26.2.761.g0e0b3e54be
> 
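
For readers following along, here is a minimal userspace sketch of what this patch relies on: the predecessor of an unlinked inode comes straight from the in-memory doubly linked list, and the on-disk pointer is only used as a cross-check. Everything named toy_* is hypothetical, the list_head is a cut-down stand-in for the kernel type, ERR_CORRUPT stands in for -EFSCORRUPTED, and the AGI bucket update for the head-of-list case is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NULLAGINO   ((uint32_t)-1)  /* all-ones "no inode" sentinel */
#define ERR_CORRUPT (-117)          /* stand-in for -EFSCORRUPTED */

/* Minimal userspace stand-in for the kernel's struct list_head. */
struct list_head { struct list_head *prev, *next; };

/* Toy inode: list linkage plus a model of the on-disk next pointer. */
struct toy_inode {
	struct list_head unlink;     /* models ip->i_unlink */
	uint32_t         agino;
	uint32_t         disk_next;  /* models di_next_unlinked */
};

#define to_inode(p) \
	((struct toy_inode *)((char *)(p) - offsetof(struct toy_inode, unlink)))

static void list_init(struct list_head *h) { h->prev = h->next = h; }

/* Insert at the head: the new inode points at the old head, if any. */
static void toy_insert(struct list_head *head, struct toy_inode *ip)
{
	ip->disk_next = (head->next == head) ? NULLAGINO
					       : to_inode(head->next)->agino;
	ip->unlink.next = head->next;
	ip->unlink.prev = head;
	head->next->prev = &ip->unlink;
	head->next = &ip->unlink;
}

/*
 * Remove @ip. The predecessor is found in O(1) from the in-memory list;
 * its on-disk pointer is checked against @ip but never used for a search.
 * When @ip is the first entry there is no predecessor to fix up (the AGI
 * bucket would be updated instead, which this sketch does not model).
 */
static int toy_remove(struct list_head *head, struct toy_inode *ip)
{
	uint32_t next = ip->disk_next;

	if (ip->unlink.prev != head) {
		struct toy_inode *pip = to_inode(ip->unlink.prev);

		if (pip->disk_next != ip->agino)
			return ERR_CORRUPT;
		pip->disk_next = next;
	}
	ip->disk_next = NULLAGINO;
	ip->unlink.prev->next = ip->unlink.next;
	ip->unlink.next->prev = ip->unlink.prev;
	return 0;
}
```

Note that the sketch, like the hunk questioned above, returns the error without any corruption report; a real implementation would presumably want to report it.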


* Re: [PATCH 07/13] xfs: mapping unlinked inodes is now redundant
  2020-08-12  9:25 ` [PATCH 07/13] xfs: mapping unlinked inodes is now redundant Dave Chinner
@ 2020-08-19  0:14   ` Darrick J. Wong
  0 siblings, 0 replies; 51+ messages in thread
From: Darrick J. Wong @ 2020-08-19  0:14 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Aug 12, 2020 at 07:25:50PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> We now have a direct pointer to the xfs_inodes in the unlinked
> lists, so we can use the imap built into the inode to read the
> underlying cluster buffer. Hence we can remove all the "lookup by
> agino" code that currently exists in the iunlink list processing.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Looks pretty simple,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/xfs/xfs_inode.c | 88 ++++++----------------------------------------
>  1 file changed, 10 insertions(+), 78 deletions(-)
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 2c930de99561..bacd5ae9f5a7 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -2139,74 +2139,6 @@ xfs_iunlink(
>  	return error;
>  }
>  
> -/* Return the imap, dinode pointer, and buffer for an inode. */
> -STATIC int
> -xfs_iunlink_map_ino(
> -	struct xfs_trans	*tp,
> -	xfs_agnumber_t		agno,
> -	xfs_agino_t		agino,
> -	struct xfs_imap		*imap,
> -	struct xfs_dinode	**dipp,
> -	struct xfs_buf		**bpp)
> -{
> -	struct xfs_mount	*mp = tp->t_mountp;
> -	int			error;
> -
> -	imap->im_blkno = 0;
> -	error = xfs_imap(mp, tp, XFS_AGINO_TO_INO(mp, agno, agino), imap, 0);
> -	if (error) {
> -		xfs_warn(mp, "%s: xfs_imap returned error %d.",
> -				__func__, error);
> -		return error;
> -	}
> -
> -	error = xfs_imap_to_bp(mp, tp, imap, dipp, bpp, 0);
> -	if (error) {
> -		xfs_warn(mp, "%s: xfs_imap_to_bp returned error %d.",
> -				__func__, error);
> -		return error;
> -	}
> -
> -	return 0;
> -}
> -
> -/*
> - * Walk the unlinked chain from @head_agino until we find the inode that
> - * points to @target_agino.  Return the inode number, map, dinode pointer,
> - * and inode cluster buffer of that inode as @agino, @imap, @dipp, and @bpp.
> - *
> - * @tp, @pag, @head_agino, and @target_agino are input parameters.
> - * @agino, @imap, @dipp, and @bpp are all output parameters.
> - *
> - * Do not call this function if @target_agino is the head of the list.
> - */
> -STATIC int
> -xfs_iunlink_map_prev(
> -	struct xfs_trans	*tp,
> -	xfs_agnumber_t		agno,
> -	xfs_agino_t		head_agino,
> -	xfs_agino_t		target_agino,
> -	xfs_agino_t		agino,
> -	struct xfs_imap		*imap,
> -	struct xfs_dinode	**dipp,
> -	struct xfs_buf		**bpp,
> -	struct xfs_perag	*pag)
> -{
> -	int			error;
> -
> -	ASSERT(head_agino != target_agino);
> -	*bpp = NULL;
> -
> -	ASSERT(agino != NULLAGINO);
> -	error = xfs_iunlink_map_ino(tp, agno, agino, imap, dipp, bpp);
> -	if (error)
> -		return error;
> -
> -	if (be32_to_cpu((*dipp)->di_next_unlinked) != target_agino)
> -		return -EFSCORRUPTED;
> -	return 0;
> -}
> -
>  static int
>  xfs_iunlink_remove_inode(
>  	struct xfs_trans	*tp,
> @@ -2215,8 +2147,6 @@ xfs_iunlink_remove_inode(
>  {
>  	struct xfs_mount	*mp = tp->t_mountp;
>  	struct xfs_agi		*agi;
> -	struct xfs_buf		*last_ibp;
> -	struct xfs_dinode	*last_dip = NULL;
>  	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
>  	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
>  	xfs_agino_t		next_agino;
> @@ -2260,25 +2190,27 @@ xfs_iunlink_remove_inode(
>  	if (ip != list_first_entry(&agibp->b_pag->pag_ici_unlink_list,
>  					struct xfs_inode, i_unlink)) {
>  
> -		struct xfs_inode *pip;
> -		struct xfs_imap	imap;
> -		xfs_agino_t	prev_agino;
> +		struct xfs_inode	*pip;
> +		xfs_agino_t		prev_agino;
> +		struct xfs_buf		*last_ibp;
> +		struct xfs_dinode	*last_dip = NULL;
>  
>  		ASSERT(head_agino != agino);
>  
>  		pip = list_prev_entry(ip, i_unlink);
>  		prev_agino = XFS_INO_TO_AGINO(mp, pip->i_ino);
>  
> -		/* We need to search the list for the inode being freed. */
> -		error = xfs_iunlink_map_prev(tp, agno, head_agino, agino,
> -				prev_agino, &imap, &last_dip, &last_ibp,
> -				agibp->b_pag);
> +		error = xfs_imap_to_bp(mp, tp, &pip->i_imap, &last_dip, 
> +						&last_ibp, 0);
>  		if (error)
>  			return error;
>  
> +		if (be32_to_cpu(last_dip->di_next_unlinked) != agino)
> +			return -EFSCORRUPTED;
> +
>  		/* Point the previous inode on the list to the next inode. */
>  		xfs_iunlink_update_dinode(tp, agno, prev_agino, last_ibp,
> -				last_dip, &imap, next_agino);
> +				last_dip, &pip->i_imap, next_agino);
>  
>  		return 0;
>  	}
> -- 
> 2.26.2.761.g0e0b3e54be
> 


* Re: [PATCH 08/13] xfs: updating i_next_unlinked doesn't need to return old value
  2020-08-12  9:25 ` [PATCH 08/13] xfs: updating i_next_unlinked doesn't need to return old value Dave Chinner
@ 2020-08-19  0:19   ` Darrick J. Wong
  0 siblings, 0 replies; 51+ messages in thread
From: Darrick J. Wong @ 2020-08-19  0:19 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Aug 12, 2020 at 07:25:51PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> We already know what the next inode in the unlinked list is supposed
> to be from the in-memory list, so we do not need to look it up first
> from the current inode to be able to update in memory list
> pointers...
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_inode.c | 63 +++++++++++-----------------------------------
>  1 file changed, 14 insertions(+), 49 deletions(-)
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index bacd5ae9f5a7..4dde1970f7cd 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -1998,13 +1998,11 @@ xfs_iunlink_update_inode(
>  	struct xfs_trans	*tp,
>  	struct xfs_inode	*ip,
>  	xfs_agnumber_t		agno,
> -	xfs_agino_t		next_agino,
> -	xfs_agino_t		*old_next_agino)
> +	xfs_agino_t		next_agino)
>  {
>  	struct xfs_mount	*mp = tp->t_mountp;
>  	struct xfs_dinode	*dip;
>  	struct xfs_buf		*ibp;
> -	xfs_agino_t		old_value;
>  	int			error;
>  
>  	ASSERT(xfs_verify_agino_or_null(mp, agno, next_agino));
> @@ -2013,37 +2011,10 @@ xfs_iunlink_update_inode(
>  	if (error)
>  		return error;
>  
> -	/* Make sure the old pointer isn't garbage. */
> -	old_value = be32_to_cpu(dip->di_next_unlinked);
> -	if (!xfs_verify_agino_or_null(mp, agno, old_value)) {
> -		xfs_inode_verifier_error(ip, -EFSCORRUPTED, __func__, dip,
> -				sizeof(*dip), __this_address);
> -		error = -EFSCORRUPTED;
> -		goto out;
> -	}
> -
> -	/*
> -	 * Since we're updating a linked list, we should never find that the
> -	 * current pointer is the same as the new value, unless we're
> -	 * terminating the list.
> -	 */
> -	*old_next_agino = old_value;
> -	if (old_value == next_agino) {
> -		if (next_agino != NULLAGINO) {
> -			xfs_inode_verifier_error(ip, -EFSCORRUPTED, __func__,
> -					dip, sizeof(*dip), __this_address);
> -			error = -EFSCORRUPTED;
> -		}
> -		goto out;
> -	}
> -
>  	/* Ok, update the new pointer. */
>  	xfs_iunlink_update_dinode(tp, agno, XFS_INO_TO_AGINO(mp, ip->i_ino),
>  			ibp, dip, &ip->i_imap, next_agino);
>  	return 0;
> -out:
> -	xfs_trans_brelse(tp, ibp);
> -	return error;
>  }
>  
>  static int
> @@ -2079,19 +2050,15 @@ xfs_iunlink_insert_inode(
>  	nip = list_first_entry_or_null(&agibp->b_pag->pag_ici_unlink_list,
>  					struct xfs_inode, i_unlink);
>  	if (nip) {
> -		xfs_agino_t		old_agino;
> -
>  		ASSERT(next_agino == XFS_INO_TO_AGINO(mp, nip->i_ino));
>  
>  		/*
>  		 * There is already another inode in the bucket, so point this
>  		 * inode to the current head of the list.
>  		 */
> -		error = xfs_iunlink_update_inode(tp, ip, agno, next_agino,
> -				&old_agino);
> +		error = xfs_iunlink_update_inode(tp, ip, agno, next_agino);
>  		if (error)
>  			return error;
> -		ASSERT(old_agino == NULLAGINO);
>  	} else {
>  		ASSERT(next_agino == NULLAGINO);
>  	}
> @@ -2149,7 +2116,7 @@ xfs_iunlink_remove_inode(
>  	struct xfs_agi		*agi;
>  	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
>  	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
> -	xfs_agino_t		next_agino;
> +	xfs_agino_t		next_agino = NULLAGINO;
>  	xfs_agino_t		head_agino;
>  	int			error;
>  
> @@ -2169,23 +2136,21 @@ xfs_iunlink_remove_inode(
>  	}
>  
>  	/*
> -	 * Set our inode's next_unlinked pointer to NULL and then return
> -	 * the old pointer value so that we can update whatever was previous
> -	 * to us in the list to point to whatever was next in the list.
> +	 * Get the next agino in the list. If we are at the end of the list,
> +	 * then the previous inode's i_next_unlinked filed will get cleared.

                                                    "field"

With that fixed,

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

>  	 */
> -	error = xfs_iunlink_update_inode(tp, ip, agno, NULLAGINO, &next_agino);
> +	if (ip != list_last_entry(&agibp->b_pag->pag_ici_unlink_list,
> +					struct xfs_inode, i_unlink)) {
> +		struct xfs_inode *nip = list_next_entry(ip, i_unlink);
> +
> +		next_agino = XFS_INO_TO_AGINO(mp, nip->i_ino);
> +	}
> +
> +	/* Clear the on disk next unlinked pointer for this inode. */
> +	error = xfs_iunlink_update_inode(tp, ip, agno, NULLAGINO);
>  	if (error)
>  		return error;
>  
> -#ifdef DEBUG
> -	{
> -	struct xfs_inode *nip = list_next_entry(ip, i_unlink);
> -	if (nip)
> -		ASSERT(next_agino == XFS_INO_TO_AGINO(mp, nip->i_ino));
> -	else
> -		ASSERT(next_agino == NULLAGINO);
> -	}
> -#endif
>  
>  	if (ip != list_first_entry(&agibp->b_pag->pag_ici_unlink_list,
>  					struct xfs_inode, i_unlink)) {
> -- 
> 2.26.2.761.g0e0b3e54be
> 
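
A small userspace sketch of the "get the next agino from the list" step in this patch, under the assumption of a circular doubly linked list like the kernel's. The toy_* names and the cut-down list_head are hypothetical. The point it illustrates is that the tail has to be detected by comparing against the list head (which is what the list_last_entry() test in the patch amounts to); on a circular list the "next entry" pointer is never NULL, so a NULL test such as the one in the removed DEBUG block would not detect the tail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NULLAGINO ((uint32_t)-1)  /* all-ones "no inode" sentinel */

/* Minimal userspace stand-in for the kernel's struct list_head. */
struct list_head { struct list_head *prev, *next; };

struct toy_inode {
	uint32_t         agino;
	struct list_head unlink;  /* models ip->i_unlink */
};

#define to_inode(p) \
	((struct toy_inode *)((char *)(p) - offsetof(struct toy_inode, unlink)))

static void list_init(struct list_head *h) { h->prev = h->next = h; }

/* Append @n at the tail of the circular list anchored at @head. */
static void toy_list_add_tail(struct list_head *head, struct list_head *n)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/*
 * Return the agino that follows @ip, or NULLAGINO if @ip is the tail.
 * The tail is recognised by its next pointer wrapping back to @head.
 */
static uint32_t toy_next_agino(struct list_head *head, struct toy_inode *ip)
{
	if (ip->unlink.next == head)
		return NULLAGINO;
	return to_inode(ip->unlink.next)->agino;
}
```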


* Re: [PATCH 09/13] xfs: validate the unlinked list pointer on update
  2020-08-12  9:25 ` [PATCH 09/13] xfs: validate the unlinked list pointer on update Dave Chinner
@ 2020-08-19  0:23   ` Darrick J. Wong
  0 siblings, 0 replies; 51+ messages in thread
From: Darrick J. Wong @ 2020-08-19  0:23 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Aug 12, 2020 at 07:25:52PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Factor this check into xfs_iunlink_update_inode() when we are updating
> the code. This replaces the checks that were removed in previous
> patches as bits of functionality were removed from the update
> process.

I had wondered about that, though I saw it end up in xfs_iunlink_item.c
so I hadn't thought too much about that.

> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/xfs/xfs_inode.c | 38 ++++++++++++++------------------------
>  1 file changed, 14 insertions(+), 24 deletions(-)
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 4dde1970f7cd..b098e5df07e7 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -1998,6 +1998,7 @@ xfs_iunlink_update_inode(
>  	struct xfs_trans	*tp,
>  	struct xfs_inode	*ip,
>  	xfs_agnumber_t		agno,
> +	xfs_agino_t		old_agino,
>  	xfs_agino_t		next_agino)
>  {
>  	struct xfs_mount	*mp = tp->t_mountp;
> @@ -2011,6 +2012,13 @@ xfs_iunlink_update_inode(
>  	if (error)
>  		return error;
>  
> +	if (be32_to_cpu(dip->di_next_unlinked) != old_agino) {
> +		xfs_inode_verifier_error(ip, -EFSCORRUPTED, __func__, dip,
> +					sizeof(*dip), __this_address);
> +		xfs_trans_brelse(tp, ibp);
> +		return -EFSCORRUPTED;
> +	}
> +
>  	/* Ok, update the new pointer. */
>  	xfs_iunlink_update_dinode(tp, agno, XFS_INO_TO_AGINO(mp, ip->i_ino),
>  			ibp, dip, &ip->i_imap, next_agino);
> @@ -2056,7 +2064,8 @@ xfs_iunlink_insert_inode(
>  		 * There is already another inode in the bucket, so point this
>  		 * inode to the current head of the list.
>  		 */
> -		error = xfs_iunlink_update_inode(tp, ip, agno, next_agino);
> +		error = xfs_iunlink_update_inode(tp, ip, agno, NULLAGINO,
> +						 next_agino);
>  		if (error)
>  			return error;
>  	} else {
> @@ -2147,37 +2156,18 @@ xfs_iunlink_remove_inode(
>  	}
>  
>  	/* Clear the on disk next unlinked pointer for this inode. */
> -	error = xfs_iunlink_update_inode(tp, ip, agno, NULLAGINO);
> +	error = xfs_iunlink_update_inode(tp, ip, agno, next_agino, NULLAGINO);
>  	if (error)
>  		return error;
>  
>  
>  	if (ip != list_first_entry(&agibp->b_pag->pag_ici_unlink_list,
>  					struct xfs_inode, i_unlink)) {
> -
> -		struct xfs_inode	*pip;
> -		xfs_agino_t		prev_agino;
> -		struct xfs_buf		*last_ibp;
> -		struct xfs_dinode	*last_dip = NULL;
> +		struct xfs_inode *pip = list_prev_entry(ip, i_unlink);
>  
>  		ASSERT(head_agino != agino);
> -
> -		pip = list_prev_entry(ip, i_unlink);
> -		prev_agino = XFS_INO_TO_AGINO(mp, pip->i_ino);
> -
> -		error = xfs_imap_to_bp(mp, tp, &pip->i_imap, &last_dip, 
> -						&last_ibp, 0);
> -		if (error)
> -			return error;
> -
> -		if (be32_to_cpu(last_dip->di_next_unlinked) != agino)
> -			return -EFSCORRUPTED;
> -
> -		/* Point the previous inode on the list to the next inode. */
> -		xfs_iunlink_update_dinode(tp, agno, prev_agino, last_ibp,
> -				last_dip, &pip->i_imap, next_agino);
> -
> -		return 0;
> +		return xfs_iunlink_update_inode(tp, pip, agno, agino,
> +						next_agino);
>  	}
>  
>  	/* Point the head of the list to the next unlinked inode. */
> -- 
> 2.26.2.761.g0e0b3e54be
> 
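
The check this patch adds is essentially a compare-then-update on the on-disk pointer: the caller states what it expects the field to contain, and the update is refused if the disk disagrees. A minimal userspace sketch, assuming htonl()/ntohl() as stand-ins for cpu_to_be32()/be32_to_cpu(); toy_dinode, toy_update_next and ERR_CORRUPT are hypothetical names, and buffer handling and logging are omitted.

```c
#include <assert.h>
#include <stdint.h>
#include <arpa/inet.h>  /* htonl()/ntohl(): stand-ins for the be32 helpers */

#define NULLAGINO   ((uint32_t)-1)  /* all-ones "no inode" sentinel */
#define ERR_CORRUPT (-117)          /* stand-in for -EFSCORRUPTED */

/* Toy on-disk inode: only the field of interest, stored big-endian. */
struct toy_dinode { uint32_t di_next_unlinked; };

/*
 * Set the next-unlinked pointer to @next_agino, but only if the current
 * on-disk value matches @old_agino. On mismatch nothing is modified.
 */
static int toy_update_next(struct toy_dinode *dip, uint32_t old_agino,
			   uint32_t next_agino)
{
	if (ntohl(dip->di_next_unlinked) != old_agino)
		return ERR_CORRUPT;
	dip->di_next_unlinked = htonl(next_agino);
	return 0;
}
```

With this shape, insertion passes (NULLAGINO, next_agino), removal clears the inode with (next_agino, NULLAGINO), and the predecessor is updated with (agino, next_agino), matching the three call sites in the diff.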


* Re: [PATCH 10/13] xfs: re-order AGI updates in unlink list updates
  2020-08-12  9:25 ` [PATCH 10/13] xfs: re-order AGI updates in unlink list updates Dave Chinner
@ 2020-08-19  0:29   ` Darrick J. Wong
  2020-08-19  1:01     ` Dave Chinner
  0 siblings, 1 reply; 51+ messages in thread
From: Darrick J. Wong @ 2020-08-19  0:29 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Aug 12, 2020 at 07:25:53PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> We always access and check the AGI bucket entry for the unlinked
> list even if we are not going to need it either for lookup or remove
> purposes. Move the code that accesses the AGI to the code that
> modifies the AGI, hence keeping the AGI accesses local to the code
> that needs to modify it.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_inode.c | 84 ++++++++++++++++------------------------------
>  1 file changed, 28 insertions(+), 56 deletions(-)
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index b098e5df07e7..4f616e1b64dc 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -1918,44 +1918,53 @@ xfs_inactive(
>   */
>  
>  /*
> - * Point the AGI unlinked bucket at an inode and log the results.  The caller
> - * is responsible for validating the old value.
> + * Point the AGI unlinked bucket at an inode and log the results. The caller
> + * passes in the expected current agino the bucket points at via @cur_agino so
> + * we can validate that we are about to remove the inode we expect to be
> + * removing from the AGI bucket.
>   */
> -STATIC int
> +static int
>  xfs_iunlink_update_bucket(
>  	struct xfs_trans	*tp,
>  	xfs_agnumber_t		agno,
>  	struct xfs_buf		*agibp,
> -	xfs_agino_t		old_agino,
> +	xfs_agino_t		cur_agino,

Hm.  So I think I understand the new role of this function better now
that this patch moves into this function the checking of the bucket
pointer and whatnot.  Would it be difficult to merge this patch with
patch 4?

--D

>  	xfs_agino_t		new_agino)
>  {
> -	struct xlog		*log = tp->t_mountp->m_log;
> +	struct xfs_mount	*mp = tp->t_mountp;
> +	struct xlog		*log = mp->m_log;
>  	struct xfs_agi		*agi = agibp->b_addr;
> -	xfs_agino_t		old_value;
> +	xfs_agino_t		old_agino;
>  	unsigned int		bucket_index;
>  	int                     offset;
>  
> -	ASSERT(xfs_verify_agino_or_null(tp->t_mountp, agno, new_agino));
> +	ASSERT(xfs_verify_agino_or_null(mp, agno, new_agino));
>  
> +	/*
> +	 * We don't need to traverse the on disk unlinked list to find the
> +	 * previous inode in the list when removing inodes anymore, so we don't
> +	 * use multiple on-disk lists anymore. Hence we always use bucket 0
> +	 * unless we are in log recovery in which case we might be recovering an
> +	 * old filesystem that has multiple buckets.
> +	 */
>  	bucket_index = 0;
> -	/* During recovery, the old multiple bucket index can be applied */
>  	if (!log || log->l_flags & XLOG_RECOVERY_NEEDED) {
> -		ASSERT(old_agino != NULLAGINO);
> +		ASSERT(cur_agino != NULLAGINO);
>  
> -		if (be32_to_cpu(agi->agi_unlinked[0]) != old_agino)
> -			bucket_index = old_agino % XFS_AGI_UNLINKED_BUCKETS;
> +		if (be32_to_cpu(agi->agi_unlinked[0]) != cur_agino)
> +			bucket_index = cur_agino % XFS_AGI_UNLINKED_BUCKETS;
>  	}
>  
> -	old_value = be32_to_cpu(agi->agi_unlinked[bucket_index]);
> -	trace_xfs_iunlink_update_bucket(tp->t_mountp, agno, bucket_index,
> -			old_value, new_agino);
> -
> -	/* check if the old agi_unlinked head is as expected */
> -	if (old_value != old_agino) {
> +	old_agino = be32_to_cpu(agi->agi_unlinked[bucket_index]);
> +	if (new_agino == old_agino || cur_agino != old_agino ||
> +	    !xfs_verify_agino_or_null(mp, agno, old_agino)) {
>  		xfs_buf_mark_corrupt(agibp);
>  		return -EFSCORRUPTED;
>  	}
>  
> +	trace_xfs_iunlink_update_bucket(mp, agno, bucket_index,
> +			old_agino, new_agino);
> +
>  	agi->agi_unlinked[bucket_index] = cpu_to_be32(new_agino);
>  	offset = offsetof(struct xfs_agi, agi_unlinked) +
>  			(sizeof(xfs_agino_t) * bucket_index);
> @@ -2032,44 +2041,25 @@ xfs_iunlink_insert_inode(
>  	struct xfs_inode	*ip)
>  {
>  	struct xfs_mount	*mp = tp->t_mountp;
> -	struct xfs_agi		*agi;
>  	struct xfs_inode	*nip;
> -	xfs_agino_t		next_agino;
> +	xfs_agino_t		next_agino = NULLAGINO;
>  	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
>  	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
>  	int			error;
>  
> -	agi = agibp->b_addr;
> -
> -	/*
> -	 * We don't need to traverse the on disk unlinked list to find the
> -	 * previous inode in the list when removing inodes anymore, so we don't
> -	 * need multiple on-disk lists anymore. Hence we always use bucket 0.
> -	 * Make sure the pointer isn't garbage and that this inode isn't already
> -	 * on the list.
> -	 */
> -	next_agino = be32_to_cpu(agi->agi_unlinked[0]);
> -	if (next_agino == agino ||
> -	    !xfs_verify_agino_or_null(mp, agno, next_agino)) {
> -		xfs_buf_mark_corrupt(agibp);
> -		return -EFSCORRUPTED;
> -	}
> -
>  	nip = list_first_entry_or_null(&agibp->b_pag->pag_ici_unlink_list,
>  					struct xfs_inode, i_unlink);
>  	if (nip) {
> -		ASSERT(next_agino == XFS_INO_TO_AGINO(mp, nip->i_ino));
>  
>  		/*
>  		 * There is already another inode in the bucket, so point this
>  		 * inode to the current head of the list.
>  		 */
> +		next_agino = XFS_INO_TO_AGINO(mp, nip->i_ino);
>  		error = xfs_iunlink_update_inode(tp, ip, agno, NULLAGINO,
>  						 next_agino);
>  		if (error)
>  			return error;
> -	} else {
> -		ASSERT(next_agino == NULLAGINO);
>  	}
>  
>  	/* Point the head of the list to point to this inode. */
> @@ -2122,28 +2112,11 @@ xfs_iunlink_remove_inode(
>  	struct xfs_inode	*ip)
>  {
>  	struct xfs_mount	*mp = tp->t_mountp;
> -	struct xfs_agi		*agi;
>  	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
>  	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
>  	xfs_agino_t		next_agino = NULLAGINO;
> -	xfs_agino_t		head_agino;
>  	int			error;
>  
> -	agi = agibp->b_addr;
> -
> -	/*
> -	 * We don't need to traverse the on disk unlinked list to find the
> -	 * previous inode in the list when removing inodes anymore, so we don't
> -	 * need multiple on-disk lists anymore. Hence we always use bucket 0.
> -	 * Make sure the head pointer isn't garbage.
> -	 */
> -	head_agino = be32_to_cpu(agi->agi_unlinked[0]);
> -	if (!xfs_verify_agino(mp, agno, head_agino)) {
> -		XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp,
> -				agi, sizeof(*agi));
> -		return -EFSCORRUPTED;
> -	}
> -
>  	/*
>  	 * Get the next agino in the list. If we are at the end of the list,
>  	 * then the previous inode's i_next_unlinked filed will get cleared.
> @@ -2165,7 +2138,6 @@ xfs_iunlink_remove_inode(
>  					struct xfs_inode, i_unlink)) {
>  		struct xfs_inode *pip = list_prev_entry(ip, i_unlink);
>  
> -		ASSERT(head_agino != agino);
>  		return xfs_iunlink_update_inode(tp, pip, agno, agino,
>  						next_agino);
>  	}
> -- 
> 2.26.2.761.g0e0b3e54be
> 
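
A userspace sketch of the bucket selection and validation that this patch moves into xfs_iunlink_update_bucket(). It is a simplification under stated assumptions: toy_agi keeps the buckets in CPU byte order, in_recovery replaces the log state test, the xfs_verify_agino_or_null() range check is left out, and all toy_* names and ERR_CORRUPT are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NULLAGINO   ((uint32_t)-1)  /* all-ones "no inode" sentinel */
#define NR_BUCKETS  64              /* models XFS_AGI_UNLINKED_BUCKETS */
#define ERR_CORRUPT (-117)          /* stand-in for -EFSCORRUPTED */

/* Toy AGI: just the unlinked bucket array, CPU-endian for brevity. */
struct toy_agi { uint32_t unlinked[NR_BUCKETS]; };

/*
 * Point a bucket at @new_agino. @cur_agino is what the caller believes the
 * bucket currently holds. Bucket 0 is always used, except in recovery where
 * an old multi-bucket layout may place @cur_agino in a hashed bucket.
 */
static int toy_update_bucket(struct toy_agi *agi, bool in_recovery,
			     uint32_t cur_agino, uint32_t new_agino)
{
	unsigned int idx = 0;
	uint32_t old_agino;

	if (in_recovery && agi->unlinked[0] != cur_agino)
		idx = cur_agino % NR_BUCKETS;

	old_agino = agi->unlinked[idx];
	/* Reject no-op updates and buckets that do not hold what we expect. */
	if (new_agino == old_agino || cur_agino != old_agino)
		return ERR_CORRUPT;

	agi->unlinked[idx] = new_agino;
	return 0;
}
```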


* Re: [PATCH 11/13] xfs: combine iunlink inode update functions
  2020-08-12  9:25 ` [PATCH 11/13] xfs: combine iunlink inode update functions Dave Chinner
@ 2020-08-19  0:30   ` Darrick J. Wong
  0 siblings, 0 replies; 51+ messages in thread
From: Darrick J. Wong @ 2020-08-19  0:30 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Aug 12, 2020 at 07:25:54PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Combine the logging of the inode unlink list update into the
> calling function that looks up the buffer we end up logging. These
> do not need to be separate functions as they are both short, simple
> operations and there's only a single call path through them. This
> new function will end up being the core of the iunlink log item
> processing...
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

This was pretty easy to follow,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/xfs/xfs_inode.c | 58 ++++++++++++++++------------------------------
>  1 file changed, 20 insertions(+), 38 deletions(-)
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 4f616e1b64dc..82242d15b1d7 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -1972,38 +1972,12 @@ xfs_iunlink_update_bucket(
>  	return 0;
>  }
>  
> -/* Set an on-disk inode's next_unlinked pointer. */
> -STATIC void
> -xfs_iunlink_update_dinode(
> -	struct xfs_trans	*tp,
> -	xfs_agnumber_t		agno,
> -	xfs_agino_t		agino,
> -	struct xfs_buf		*ibp,
> -	struct xfs_dinode	*dip,
> -	struct xfs_imap		*imap,
> -	xfs_agino_t		next_agino)
> -{
> -	struct xfs_mount	*mp = tp->t_mountp;
> -	int			offset;
> -
> -	ASSERT(xfs_verify_agino_or_null(mp, agno, next_agino));
> -
> -	trace_xfs_iunlink_update_dinode(mp, agno, agino,
> -			be32_to_cpu(dip->di_next_unlinked), next_agino);
> -
> -	dip->di_next_unlinked = cpu_to_be32(next_agino);
> -	offset = imap->im_boffset +
> -			offsetof(struct xfs_dinode, di_next_unlinked);
> -
> -	/* need to recalc the inode CRC if appropriate */
> -	xfs_dinode_calc_crc(mp, dip);
> -	xfs_trans_inode_buf(tp, ibp);
> -	xfs_trans_log_buf(tp, ibp, offset, offset + sizeof(xfs_agino_t) - 1);
> -}
> -
> -/* Set an in-core inode's unlinked pointer and return the old value. */
> +/*
> + * Look up the inode cluster buffer and log the on-disk unlinked inode change
> + * we need to make.
> + */
>  STATIC int
> -xfs_iunlink_update_inode(
> +xfs_iunlink_log_inode(
>  	struct xfs_trans	*tp,
>  	struct xfs_inode	*ip,
>  	xfs_agnumber_t		agno,
> @@ -2013,6 +1987,7 @@ xfs_iunlink_update_inode(
>  	struct xfs_mount	*mp = tp->t_mountp;
>  	struct xfs_dinode	*dip;
>  	struct xfs_buf		*ibp;
> +	int			offset;
>  	int			error;
>  
>  	ASSERT(xfs_verify_agino_or_null(mp, agno, next_agino));
> @@ -2028,9 +2003,17 @@ xfs_iunlink_update_inode(
>  		return -EFSCORRUPTED;
>  	}
>  
> -	/* Ok, update the new pointer. */
> -	xfs_iunlink_update_dinode(tp, agno, XFS_INO_TO_AGINO(mp, ip->i_ino),
> -			ibp, dip, &ip->i_imap, next_agino);
> +	trace_xfs_iunlink_update_dinode(mp, agno,
> +			XFS_INO_TO_AGINO(mp, ip->i_ino),
> +			be32_to_cpu(dip->di_next_unlinked), next_agino);
> +
> +	dip->di_next_unlinked = cpu_to_be32(next_agino);
> +	offset = ip->i_imap.im_boffset +
> +			offsetof(struct xfs_dinode, di_next_unlinked);
> +
> +	xfs_dinode_calc_crc(mp, dip);
> +	xfs_trans_inode_buf(tp, ibp);
> +	xfs_trans_log_buf(tp, ibp, offset, offset + sizeof(xfs_agino_t) - 1);
>  	return 0;
>  }
>  
> @@ -2056,7 +2039,7 @@ xfs_iunlink_insert_inode(
>  		 * inode to the current head of the list.
>  		 */
>  		next_agino = XFS_INO_TO_AGINO(mp, nip->i_ino);
> -		error = xfs_iunlink_update_inode(tp, ip, agno, NULLAGINO,
> +		error = xfs_iunlink_log_inode(tp, ip, agno, NULLAGINO,
>  						 next_agino);
>  		if (error)
>  			return error;
> @@ -2129,7 +2112,7 @@ xfs_iunlink_remove_inode(
>  	}
>  
>  	/* Clear the on disk next unlinked pointer for this inode. */
> -	error = xfs_iunlink_update_inode(tp, ip, agno, next_agino, NULLAGINO);
> +	error = xfs_iunlink_log_inode(tp, ip, agno, next_agino, NULLAGINO);
>  	if (error)
>  		return error;
>  
> @@ -2138,8 +2121,7 @@ xfs_iunlink_remove_inode(
>  					struct xfs_inode, i_unlink)) {
>  		struct xfs_inode *pip = list_prev_entry(ip, i_unlink);
>  
> -		return xfs_iunlink_update_inode(tp, pip, agno, agino,
> -						next_agino);
> +		return xfs_iunlink_log_inode(tp, pip, agno, agino, next_agino);
>  	}
>  
>  	/* Point the head of the list to the next unlinked inode. */
> -- 
> 2.26.2.761.g0e0b3e54be
> 
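
The only arithmetic in the combined function is the byte range handed to xfs_trans_log_buf(): the inode's offset within the cluster buffer plus the field offset, covering exactly one agino. A userspace sketch of that calculation follows; toy_dinode is a placeholder layout (not the real struct xfs_dinode), and log_range and toy_next_unlinked_range are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder inode layout, only here so offsetof() has something to use. */
struct toy_dinode {
	uint8_t  other_fields[96];
	uint32_t di_next_unlinked;
};

/* Inclusive byte range within the cluster buffer. */
struct log_range { size_t first, last; };

/*
 * Compute the range to log for one inode's next-unlinked field, given the
 * inode's byte offset inside its cluster buffer (models imap->im_boffset).
 */
static struct log_range toy_next_unlinked_range(size_t im_boffset)
{
	struct log_range r;

	r.first = im_boffset + offsetof(struct toy_dinode, di_next_unlinked);
	r.last = r.first + sizeof(uint32_t) - 1;
	return r;
}
```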


* Re: [PATCH 12/13] xfs: add in-memory iunlink log item
  2020-08-12  9:25 ` [PATCH 12/13] xfs: add in-memory iunlink log item Dave Chinner
@ 2020-08-19  0:35   ` Darrick J. Wong
  0 siblings, 0 replies; 51+ messages in thread
From: Darrick J. Wong @ 2020-08-19  0:35 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Aug 12, 2020 at 07:25:55PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Now that we have a clean operation to update the di_next_unlinked
> field of inode cluster buffers, we can easily defer this operation
> to transaction commit time so we can order the inode cluster buffer
> locking consistently.
> 
> To do this, we introduce a new in-memory log item to track the
> unlinked list item modification that we are going to make. This
> follows the same observations as the in-memory double linked list
> used to track unlinked inodes in that the inodes on the list are
> pinned in memory and cannot go away, and hence we can simply
> reference them for the duration of the transaction without needing
> to take active references or pin them or look them up.
> 
> This allows us to pass the xfs_inode to the transaction commit code
> along with the modification to be made, and then order the logged
> modifications via the ->iop_sort and ->iop_precommit operations
> for the new log item type. As this is an in-memory log item, it
> doesn't have formatting, CIL or AIL operational hooks - it exists
> purely to run the inode unlink modifications and is then removed
> from the transaction item list and freed once the precommit
> operation has run.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

/me would sorta like it if Brian took another look at the patch that
adds the precommit hooks, log item sorting, and whatnot, but this part
at least looks fairly simple.

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D
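
For readers following along, the sort-then-precommit idea from the commit
message can be sketched in user space roughly as below. This is only an
illustration under assumptions: struct item, run_precommit() and
record_precommit() are hypothetical names, not the kernel's xfs_log_item
or transaction commit code, and stopping at the first error is a choice
made for this sketch.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a log item; not struct xfs_log_item. */
struct item {
	uint64_t	sort_key;	/* e.g. an inode number */
	int		(*precommit)(struct item *it);
};

/* qsort() comparator: ascending sort key. */
static int item_cmp(const void *a, const void *b)
{
	const struct item *ia = *(struct item * const *)a;
	const struct item *ib = *(struct item * const *)b;

	if (ia->sort_key < ib->sort_key)
		return -1;
	return ia->sort_key > ib->sort_key;
}

/*
 * Sort the items by key, then run each precommit hook in that order so
 * that whatever the hooks lock is always locked in a consistent order.
 * Stops at the first hook that returns an error.
 */
static int run_precommit(struct item **items, size_t nr)
{
	qsort(items, nr, sizeof(*items), item_cmp);

	for (size_t i = 0; i < nr; i++) {
		int error = items[i]->precommit(items[i]);

		if (error)
			return error;
	}
	return 0;
}

/* Example hook: record the order in which items were processed. */
static uint64_t visited[8];
static size_t nr_visited;

static int record_precommit(struct item *it)
{
	if (nr_visited < 8)
		visited[nr_visited++] = it->sort_key;
	return 0;
}
```

Handing run_precommit() three items with keys 30, 10 and 20 runs the
hooks in the order 10, 20, 30 regardless of the order they were added.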

> ---
>  fs/xfs/Makefile           |   1 +
>  fs/xfs/xfs_inode.c        |  61 ++------------
>  fs/xfs/xfs_iunlink_item.c | 168 ++++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_iunlink_item.h |  25 ++++++
>  fs/xfs/xfs_super.c        |  10 +++
>  5 files changed, 209 insertions(+), 56 deletions(-)
>  create mode 100644 fs/xfs/xfs_iunlink_item.c
>  create mode 100644 fs/xfs/xfs_iunlink_item.h
> 
> diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
> index 04611a1068b4..febdf034ca94 100644
> --- a/fs/xfs/Makefile
> +++ b/fs/xfs/Makefile
> @@ -105,6 +105,7 @@ xfs-y				+= xfs_log.o \
>  				   xfs_icreate_item.o \
>  				   xfs_inode_item.o \
>  				   xfs_inode_item_recover.o \
> +				   xfs_iunlink_item.o \
>  				   xfs_refcount_item.o \
>  				   xfs_rmap_item.o \
>  				   xfs_log_recover.o \
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 82242d15b1d7..ce128ff12762 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -36,6 +36,7 @@
>  #include "xfs_log_priv.h"
>  #include "xfs_bmap_btree.h"
>  #include "xfs_reflink.h"
> +#include "xfs_iunlink_item.h"
>  
>  kmem_zone_t *xfs_inode_zone;
>  
> @@ -1972,51 +1973,6 @@ xfs_iunlink_update_bucket(
>  	return 0;
>  }
>  
> -/*
> - * Look up the inode cluster buffer and log the on-disk unlinked inode change
> - * we need to make.
> - */
> -STATIC int
> -xfs_iunlink_log_inode(
> -	struct xfs_trans	*tp,
> -	struct xfs_inode	*ip,
> -	xfs_agnumber_t		agno,
> -	xfs_agino_t		old_agino,
> -	xfs_agino_t		next_agino)
> -{
> -	struct xfs_mount	*mp = tp->t_mountp;
> -	struct xfs_dinode	*dip;
> -	struct xfs_buf		*ibp;
> -	int			offset;
> -	int			error;
> -
> -	ASSERT(xfs_verify_agino_or_null(mp, agno, next_agino));
> -
> -	error = xfs_imap_to_bp(mp, tp, &ip->i_imap, &dip, &ibp, 0);
> -	if (error)
> -		return error;
> -
> -	if (be32_to_cpu(dip->di_next_unlinked) != old_agino) {
> -		xfs_inode_verifier_error(ip, -EFSCORRUPTED, __func__, dip,
> -					sizeof(*dip), __this_address);
> -		xfs_trans_brelse(tp, ibp);
> -		return -EFSCORRUPTED;
> -	}
> -
> -	trace_xfs_iunlink_update_dinode(mp, agno,
> -			XFS_INO_TO_AGINO(mp, ip->i_ino),
> -			be32_to_cpu(dip->di_next_unlinked), next_agino);
> -
> -	dip->di_next_unlinked = cpu_to_be32(next_agino);
> -	offset = ip->i_imap.im_boffset +
> -			offsetof(struct xfs_dinode, di_next_unlinked);
> -
> -	xfs_dinode_calc_crc(mp, dip);
> -	xfs_trans_inode_buf(tp, ibp);
> -	xfs_trans_log_buf(tp, ibp, offset, offset + sizeof(xfs_agino_t) - 1);
> -	return 0;
> -}
> -
>  static int
>  xfs_iunlink_insert_inode(
>  	struct xfs_trans	*tp,
> @@ -2028,7 +1984,6 @@ xfs_iunlink_insert_inode(
>  	xfs_agino_t		next_agino = NULLAGINO;
>  	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
>  	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
> -	int			error;
>  
>  	nip = list_first_entry_or_null(&agibp->b_pag->pag_ici_unlink_list,
>  					struct xfs_inode, i_unlink);
> @@ -2039,10 +1994,7 @@ xfs_iunlink_insert_inode(
>  		 * inode to the current head of the list.
>  		 */
>  		next_agino = XFS_INO_TO_AGINO(mp, nip->i_ino);
> -		error = xfs_iunlink_log_inode(tp, ip, agno, NULLAGINO,
> -						 next_agino);
> -		if (error)
> -			return error;
> +		xfs_iunlink_log(tp, ip, NULLAGINO, next_agino);
>  	}
>  
>  	/* Point the head of the list to point to this inode. */
> @@ -2098,7 +2050,6 @@ xfs_iunlink_remove_inode(
>  	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
>  	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
>  	xfs_agino_t		next_agino = NULLAGINO;
> -	int			error;
>  
>  	/*
>  	 * Get the next agino in the list. If we are at the end of the list,
> @@ -2112,16 +2063,14 @@ xfs_iunlink_remove_inode(
>  	}
>  
>  	/* Clear the on disk next unlinked pointer for this inode. */
> -	error = xfs_iunlink_log_inode(tp, ip, agno, next_agino, NULLAGINO);
> -	if (error)
> -		return error;
> -
> +	xfs_iunlink_log(tp, ip, next_agino, NULLAGINO);
>  
>  	if (ip != list_first_entry(&agibp->b_pag->pag_ici_unlink_list,
>  					struct xfs_inode, i_unlink)) {
>  		struct xfs_inode *pip = list_prev_entry(ip, i_unlink);
>  
> -		return xfs_iunlink_log_inode(tp, pip, agno, agino, next_agino);
> +		xfs_iunlink_log(tp, pip, agino, next_agino);
> +		return 0;
>  	}
>  
>  	/* Point the head of the list to the next unlinked inode. */
> diff --git a/fs/xfs/xfs_iunlink_item.c b/fs/xfs/xfs_iunlink_item.c
> new file mode 100644
> index 000000000000..2ee05f98aa97
> --- /dev/null
> +++ b/fs/xfs/xfs_iunlink_item.c
> @@ -0,0 +1,168 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (c) 2020, Red Hat, Inc.
> + * All Rights Reserved.
> + */
> +#include "xfs.h"
> +#include "xfs_fs.h"
> +#include "xfs_shared.h"
> +#include "xfs_format.h"
> +#include "xfs_log_format.h"
> +#include "xfs_trans_resv.h"
> +#include "xfs_mount.h"
> +#include "xfs_inode.h"
> +#include "xfs_trans.h"
> +#include "xfs_trans_priv.h"
> +#include "xfs_iunlink_item.h"
> +#include "xfs_trace.h"
> +#include "xfs_error.h"
> +
> +struct kmem_cache	*xfs_iunlink_zone;
> +
> +static inline struct xfs_iunlink_item *IUL_ITEM(struct xfs_log_item *lip)
> +{
> +	return container_of(lip, struct xfs_iunlink_item, iu_item);
> +}
> +
> +static void
> +xfs_iunlink_item_release(
> +	struct xfs_log_item	*lip)
> +{
> +	kmem_cache_free(xfs_iunlink_zone, IUL_ITEM(lip));
> +}
> +
> +
> +static uint64_t
> +xfs_iunlink_item_sort(
> +	struct xfs_log_item	*lip)
> +{
> +	return IUL_ITEM(lip)->iu_ip->i_ino;
> +}
> +
> +/*
> + * Look up the inode cluster buffer and log the on-disk unlinked inode change
> + * we need to make.
> + */
> +static int
> +xfs_iunlink_log_inode(
> +	struct xfs_trans	*tp,
> +	struct xfs_inode	*ip,
> +	xfs_agino_t		old_agino,
> +	xfs_agino_t		next_agino)
> +{
> +	struct xfs_mount	*mp = tp->t_mountp;
> +	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
> +	struct xfs_dinode	*dip;
> +	struct xfs_buf		*ibp;
> +	int			offset;
> +	int			error;
> +
> +	ASSERT(xfs_verify_agino_or_null(mp, agno, next_agino));
> +
> +	error = xfs_imap_to_bp(mp, tp, &ip->i_imap, &dip, &ibp, 0);
> +	if (error)
> +		return error;
> +
> +	/*
> +	 * Don't bother updating the unlinked field on stale buffers as
> +	 * it will never get to disk anyway.
> +	 */
> +	if (ibp->b_flags & XBF_STALE)
> +		return 0;
> +
> +	if (be32_to_cpu(dip->di_next_unlinked) != old_agino) {
> +		xfs_inode_verifier_error(ip, -EFSCORRUPTED, __func__, dip,
> +					sizeof(*dip), __this_address);
> +		xfs_trans_brelse(tp, ibp);
> +		return -EFSCORRUPTED;
> +	}
> +
> +	trace_xfs_iunlink_update_dinode(mp, agno,
> +			XFS_INO_TO_AGINO(mp, ip->i_ino),
> +			be32_to_cpu(dip->di_next_unlinked), next_agino);
> +
> +	dip->di_next_unlinked = cpu_to_be32(next_agino);
> +	offset = ip->i_imap.im_boffset +
> +			offsetof(struct xfs_dinode, di_next_unlinked);
> +
> +	xfs_dinode_calc_crc(mp, dip);
> +	xfs_trans_inode_buf(tp, ibp);
> +	xfs_trans_log_buf(tp, ibp, offset, offset + sizeof(xfs_agino_t) - 1);
> +	return 0;
> +}
> +
> +/*
> + * On precommit, we grab the inode cluster buffer for the inode number
> + * we were passed, then update the next unlinked field for that inode in
> + * the buffer and log the buffer. This ensures that the inode cluster buffer
> + * was logged in the correct order w.r.t. other inode cluster buffers.
> + *
> + * Note: if the inode cluster buffer is marked stale, this transaction is
> + * actually freeing the inode cluster. In that case, do not relog the buffer
> + * as this removes the stale state from it. That then causes the post-commit
> + * processing that is dependent on the cluster buffer being stale to go wrong
> + * and we'll leave stale inodes in the AIL that cannot be removed, hanging the
> + * log.
> + */
> +static int
> +xfs_iunlink_item_precommit(
> +	struct xfs_trans	*tp,
> +	struct xfs_log_item	*lip)
> +{
> +	struct xfs_iunlink_item	*iup = IUL_ITEM(lip);
> +	int			error;
> +
> +	error = xfs_iunlink_log_inode(tp, iup->iu_ip, iup->iu_old_agino,
> +					iup->iu_next_agino);
> +
> +	/*
> +	 * This log item only exists to perform this action. We now remove
> +	 * it from the transaction and free it as it should never reach the
> +	 * CIL.
> +	 */
> +	list_del(&lip->li_trans);
> +	xfs_iunlink_item_release(lip);
> +	return error;
> +}
> +
> +static const struct xfs_item_ops xfs_iunlink_item_ops = {
> +	.iop_release	= xfs_iunlink_item_release,
> +	.iop_sort	= xfs_iunlink_item_sort,
> +	.iop_precommit	= xfs_iunlink_item_precommit,
> +};
> +
> +
> +/*
> + * Initialize the inode log item for a newly allocated (in-core) inode.
> + *
> + * Inode extents can only reside within an AG. Hence specify the starting
> + * block for the inode chunk by offset within an AG as well as the
> + * length of the allocated extent.
> + *
> + * This joins the item to the transaction and marks it dirty so
> + * that we don't need a separate call to do this, nor does the
> + * caller need to know anything about the iunlink item.
> + */
> +void
> +xfs_iunlink_log(
> +	struct xfs_trans	*tp,
> +	struct xfs_inode	*ip,
> +	xfs_agino_t		old_agino,
> +	xfs_agino_t		next_agino)
> +{
> +	struct xfs_iunlink_item	*iup;
> +
> +	iup = kmem_cache_zalloc(xfs_iunlink_zone, GFP_KERNEL | __GFP_NOFAIL);
> +
> +	xfs_log_item_init(tp->t_mountp, &iup->iu_item, XFS_LI_IUNLINK,
> +			  &xfs_iunlink_item_ops);
> +
> +	iup->iu_ip = ip;
> +	iup->iu_next_agino = next_agino;
> +	iup->iu_old_agino = old_agino;
> +
> +	xfs_trans_add_item(tp, &iup->iu_item);
> +	tp->t_flags |= XFS_TRANS_DIRTY;
> +	set_bit(XFS_LI_DIRTY, &iup->iu_item.li_flags);
> +}
> +
> diff --git a/fs/xfs/xfs_iunlink_item.h b/fs/xfs/xfs_iunlink_item.h
> new file mode 100644
> index 000000000000..f2b95032cf6b
> --- /dev/null
> +++ b/fs/xfs/xfs_iunlink_item.h
> @@ -0,0 +1,25 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (c) 2020, Red Hat, Inc.
> + * All Rights Reserved.
> + */
> +#ifndef XFS_IUNLINK_ITEM_H
> +#define XFS_IUNLINK_ITEM_H	1
> +
> +struct xfs_trans;
> +struct xfs_inode;
> +
> +/* in memory log item structure */
> +struct xfs_iunlink_item {
> +	struct xfs_log_item	iu_item;
> +	struct xfs_inode	*iu_ip;
> +	xfs_agino_t		iu_next_agino;
> +	xfs_agino_t		iu_old_agino;
> +};
> +
> +extern kmem_zone_t *xfs_iunlink_zone;
> +
> +void xfs_iunlink_log(struct xfs_trans *tp, struct xfs_inode *ip,
> +			xfs_agino_t old_agino, xfs_agino_t next_agino);
> +
> +#endif	/* XFS_IUNLINK_ITEM_H */
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 68ec8db12cc7..b8f66ccc7090 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -35,6 +35,7 @@
>  #include "xfs_refcount_item.h"
>  #include "xfs_bmap_item.h"
>  #include "xfs_reflink.h"
> +#include "xfs_iunlink_item.h"
>  
>  #include <linux/magic.h>
>  #include <linux/fs_context.h>
> @@ -1969,8 +1970,16 @@ xfs_init_zones(void)
>  	if (!xfs_bui_zone)
>  		goto out_destroy_bud_zone;
>  
> +	xfs_iunlink_zone = kmem_cache_create("xfs_iul_item",
> +					     sizeof(struct xfs_iunlink_item),
> +					     0, 0, NULL);
> +	if (!xfs_iunlink_zone)
> +		goto out_destroy_bui_zone;
> +
>  	return 0;
>  
> + out_destroy_bui_zone:
> +	kmem_cache_destroy(xfs_bui_zone);
>   out_destroy_bud_zone:
>  	kmem_cache_destroy(xfs_bud_zone);
>   out_destroy_cui_zone:
> @@ -2017,6 +2026,7 @@ xfs_destroy_zones(void)
>  	 * destroy caches.
>  	 */
>  	rcu_barrier();
> +	kmem_cache_destroy(xfs_iunlink_zone);
>  	kmem_cache_destroy(xfs_bui_zone);
>  	kmem_cache_destroy(xfs_bud_zone);
>  	kmem_cache_destroy(xfs_cui_zone);
> -- 
> 2.26.2.761.g0e0b3e54be
> 


* Re: [PATCH 04/13] xfs: arrange all unlinked inodes into one list
  2020-08-18 23:59   ` Darrick J. Wong
@ 2020-08-19  0:45     ` Dave Chinner
  2020-08-19  0:58     ` Gao Xiang
  1 sibling, 0 replies; 51+ messages in thread
From: Dave Chinner @ 2020-08-19  0:45 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Aug 18, 2020 at 04:59:59PM -0700, Darrick J. Wong wrote:
> On Wed, Aug 12, 2020 at 07:25:47PM +1000, Dave Chinner wrote:
> > From: Gao Xiang <hsiangkao@redhat.com>
> > 
> > We currently keep unlinked lists short on disk by hashing the inodes
> > > across multiple buckets. We don't need to keep them short anymore
> > > as we no longer need to traverse the entire list to remove an inode from
> > it. The in-memory back reference index provides the previous inode
> > in the list for us instead.
> > 
> > Log recovery still has to handle existing filesystems that use all
> > 64 on-disk buckets so we detect and handle this case specially so
> > > that inode eviction can still work properly in recovery.
> > 
> > [dchinner: imported into parent patch series early on and modified
> > to fit cleanly. ]
> > 
> > Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_inode.c | 49 +++++++++++++++++++++++++++-------------------
> >  1 file changed, 29 insertions(+), 20 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > index f2f502b65691..fa92bdf6e0da 100644
> > --- a/fs/xfs/xfs_inode.c
> > +++ b/fs/xfs/xfs_inode.c
> > @@ -33,6 +33,7 @@
> >  #include "xfs_symlink.h"
> >  #include "xfs_trans_priv.h"
> >  #include "xfs_log.h"
> > +#include "xfs_log_priv.h"
> >  #include "xfs_bmap_btree.h"
> >  #include "xfs_reflink.h"
> >  
> > @@ -2092,25 +2093,32 @@ xfs_iunlink_update_bucket(
> >  	struct xfs_trans	*tp,
> >  	xfs_agnumber_t		agno,
> >  	struct xfs_buf		*agibp,
> > -	unsigned int		bucket_index,
> > +	xfs_agino_t		old_agino,
> >  	xfs_agino_t		new_agino)
> >  {
> > +	struct xlog		*log = tp->t_mountp->m_log;
> >  	struct xfs_agi		*agi = agibp->b_addr;
> >  	xfs_agino_t		old_value;
> > -	int			offset;
> > +	unsigned int		bucket_index;
> > +	int                     offset;
> >  
> >  	ASSERT(xfs_verify_agino_or_null(tp->t_mountp, agno, new_agino));
> >  
> > +	bucket_index = 0;
> > +	/* During recovery, the old multiple bucket index can be applied */
> > +	if (!log || log->l_flags & XLOG_RECOVERY_NEEDED) {
> 
> Does the flag test need parentheses?

Yes, will fix.
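
As a side note on the precedence question: in C, '&' binds tighter than
'||', so the unparenthesised test already evaluates as intended and the
parentheses are there for readability (and to keep compilers quiet).
A small user-space sketch, where struct fake_log and the RECOVERY_NEEDED
value are made up for illustration and are not the kernel definitions:

```c
#include <stdbool.h>
#include <stddef.h>

#define RECOVERY_NEEDED	0x4	/* hypothetical flag value */

/* Hypothetical stand-in for struct xlog. */
struct fake_log {
	unsigned int	flags;
};

/* As written in the patch: relies on '&' binding tighter than '||'. */
static bool use_old_buckets_implicit(const struct fake_log *log)
{
	return !log || log->flags & RECOVERY_NEEDED;
}

/* With explicit parentheses: same result, clearer intent. */
static bool use_old_buckets_explicit(const struct fake_log *log)
{
	return !log || (log->flags & RECOVERY_NEEDED);
}
```

Both helpers give the same answer for a NULL log, for a log with the
flag set, and for a log with only other flags set.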

> It feels a little funny that we pass in old_agino (having gotten it from
> agi_unlinked) and then compare it with agi_unlinked, but as the commit
> log points out, I guess this is a wart of having to support the old
> unlinked list behavior.  It makes sense to me that if we're going to
> change the unlinked list behavior we could be a little more careful
> about double-checking things.
> 
> Question: if a newer kernel crashes with a super-long unlinked list and
> the fs gets recovered on an old kernel, will this lead to insanely high
> recovery times?  I think the answer is no, because recovery is single
> threaded and the hash only existed to reduce AGI contention during
> normal unlinking operations?

Right, the answer is no because log recovery even on old kernels
always recovers the inode at the head of the list. It does no
traversal, so it doesn't matter if it's recovering one list or 64
lists, the recovery time is the same.

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 13/13] xfs: reorder iunlink remove operation in xfs_ifree
  2020-08-12  9:25 ` [PATCH 13/13] xfs: reorder iunlink remove operation in xfs_ifree Dave Chinner
  2020-08-12 11:12   ` Gao Xiang
@ 2020-08-19  0:45   ` Darrick J. Wong
  1 sibling, 0 replies; 51+ messages in thread
From: Darrick J. Wong @ 2020-08-19  0:45 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Aug 12, 2020 at 07:25:56PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The O_TMPFILE creation implementation creates a specific order of
> operations for inode allocation/freeing and unlinked list
> modification. Currently both are serialised by the AGI, so the order
> doesn't strictly matter as long as they are both in the same
> transaction.
> 
> However, if we want to move the unlinked list insertions largely
> out from under the AGI lock, then we have to be concerned about the
> order in which we do unlinked list modification operations.
> O_TMPFILE creation tells us this order is inode allocation/free,
> then unlinked list modification.
> 
> Change xfs_ifree() to use this same ordering on unlinked list
> removal. THis way we always guarantee that when we enter the

"This"...

> iunlinked list removal code from this path, we have the already

"have the already locked" ... what do we have locked?  The AGI?

> locked and we don't have to worry about lock nesting AGI reads
> inside unlink list locks because it's already locked and attached to
> the transaction.
> 
> We can do this safely as the inode freeing and unlinked list removal
> are done in the same transaction and hence are atomic operations
> with resepect to log recovery.

"respect"...

With the commit log edited a bit,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_inode.c | 22 ++++++++++++----------
>  1 file changed, 12 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index ce128ff12762..7ee778bcde06 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -2283,14 +2283,13 @@ xfs_ifree_cluster(
>  }
>  
>  /*
> - * This is called to return an inode to the inode free list.
> - * The inode should already be truncated to 0 length and have
> - * no pages associated with it.  This routine also assumes that
> - * the inode is already a part of the transaction.
> + * This is called to return an inode to the inode free list.  The inode should
> + * already be truncated to 0 length and have no pages associated with it.  This
> + * routine also assumes that the inode is already a part of the transaction.
>   *
> - * The on-disk copy of the inode will have been added to the list
> - * of unlinked inodes in the AGI. We need to remove the inode from
> - * that list atomically with respect to freeing it here.
> + * The on-disk copy of the inode will have been added to the list of unlinked
> + * inodes in the AGI. We need to remove the inode from that list atomically with
> + * respect to freeing it here.
>   */
>  int
>  xfs_ifree(
> @@ -2308,13 +2307,16 @@ xfs_ifree(
>  	ASSERT(ip->i_d.di_nblocks == 0);
>  
>  	/*
> -	 * Pull the on-disk inode from the AGI unlinked list.
> +	 * Free the inode first so that we guarantee that the AGI lock is going
> +	 * to be taken before we remove the inode from the unlinked list. This
> +	 * makes the AGI lock -> unlinked list modification order the same as
> +	 * used in O_TMPFILE creation.
>  	 */
> -	error = xfs_iunlink_remove(tp, ip);
> +	error = xfs_difree(tp, ip->i_ino, &xic);
>  	if (error)
>  		return error;
>  
> -	error = xfs_difree(tp, ip->i_ino, &xic);
> +	error = xfs_iunlink_remove(tp, ip);
>  	if (error)
>  		return error;
>  
> -- 
> 2.26.2.761.g0e0b3e54be
> 


* Re: [PATCH 05/13] xfs: add unlink list pointers to xfs_inode
  2020-08-19  0:02   ` Darrick J. Wong
@ 2020-08-19  0:47     ` Dave Chinner
  0 siblings, 0 replies; 51+ messages in thread
From: Dave Chinner @ 2020-08-19  0:47 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Aug 18, 2020 at 05:02:51PM -0700, Darrick J. Wong wrote:
> On Wed, Aug 12, 2020 at 07:25:48PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > To move away from using the on disk inode buffers to track and log
> > unlinked inodes, we need pointers to track them in memory. Because
> > we have arbitrary remove order from the list, it needs to be a
> > double linked list.
> > 
> > We start by noting that inodes are always in memory when they are
> > active on the unlinked list, and hence we can track these inodes
> > without needing to take references to the inodes or store them in
> > the list. We cannot, however, use inode locks to protect the inodes
> > on the list - the list needs an external lock to serialise all
> > inserts and removals. We can use the existing AGI buffer lock for
> > this right now as that already serialises all unlinked list
> > traversals and modifications.
> > 
> > Hence we can convert the in-memory unlinked list to a simple
> > list_head list in the perag. We can use list_empty() to detect an
> > empty unlinked list, likewise we can detect the end of the list when
> > the inode next pointer points back to the perag list_head. This
> > > makes insert, remove and traversal simple.
> > 
> > The only complication here is log recovery of old filesystems that
> > have multiple lists. These always remove from the head of the list,
> > so we can easily construct just enough of the unlinked list for
> > recovery from any list to work correctly.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> 
> Hm.  This is orthogonal to this patch, but should we get meaner about
> failing the mount if the AGI read fails or the unlinked walk fails?

I don't think it matters. We can leak the unlinked lists and still
have the filesystem operate correctly, so I don't think failing the
mount is necessary or desirable (i.e. how many existing filesystems
will now suddenly refuse to mount?)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 06/13] xfs: replace iunlink backref lookups with list lookups
  2020-08-19  0:13   ` Darrick J. Wong
@ 2020-08-19  0:52     ` Dave Chinner
  0 siblings, 0 replies; 51+ messages in thread
From: Dave Chinner @ 2020-08-19  0:52 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Aug 18, 2020 at 05:13:22PM -0700, Darrick J. Wong wrote:
> On Wed, Aug 12, 2020 at 07:25:49PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Now we have an in memory linked list of all the inodes on the
> > unlinked list, use that to look up inodes in the list that we need
> > to modify when adding or removing from the list.
> > 
> > This means we are no longer using the backref cache to maintain the
> > previous inode lookups, so we can remove all that infrastructure
> > now.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>

....

> > @@ -2354,70 +2186,24 @@ xfs_iunlink_map_prev(
> >  	xfs_agnumber_t		agno,
> >  	xfs_agino_t		head_agino,
> >  	xfs_agino_t		target_agino,
> > -	xfs_agino_t		*agino,
> > +	xfs_agino_t		agino,
> >  	struct xfs_imap		*imap,
> >  	struct xfs_dinode	**dipp,
> >  	struct xfs_buf		**bpp,
> >  	struct xfs_perag	*pag)
> >  {
> > -	struct xfs_mount	*mp = tp->t_mountp;
> > -	xfs_agino_t		next_agino;
> >  	int			error;
> >  
> >  	ASSERT(head_agino != target_agino);
> >  	*bpp = NULL;
> >  
> > -	/* See if our backref cache can find it faster. */
> > -	*agino = xfs_iunlink_lookup_backref(pag, target_agino);
> > -	if (*agino != NULLAGINO) {
> > -		error = xfs_iunlink_map_ino(tp, agno, *agino, imap, dipp, bpp);
> > -		if (error)
> > -			return error;
> > -
> > -		if (be32_to_cpu((*dipp)->di_next_unlinked) == target_agino)
> > -			return 0;
> > -
> > -		/*
> > -		 * If we get here the cache contents were corrupt, so drop the
> > -		 * buffer and fall back to walking the bucket list.
> > -		 */
> > -		xfs_trans_brelse(tp, *bpp);
> > -		*bpp = NULL;
> > -		WARN_ON_ONCE(1);
> > -	}
> > -
> > -	trace_xfs_iunlink_map_prev_fallback(mp, agno);
> > -
> > -	/* Otherwise, walk the entire bucket until we find it. */
> > -	next_agino = head_agino;
> > -	while (next_agino != target_agino) {
> > -		xfs_agino_t	unlinked_agino;
> > -
> > -		if (*bpp)
> > -			xfs_trans_brelse(tp, *bpp);
> > -
> > -		*agino = next_agino;
> > -		error = xfs_iunlink_map_ino(tp, agno, next_agino, imap, dipp,
> > -				bpp);
> > -		if (error)
> > -			return error;
> > -
> > -		unlinked_agino = be32_to_cpu((*dipp)->di_next_unlinked);
> > -		/*
> > -		 * Make sure this pointer is valid and isn't an obvious
> > -		 * infinite loop.
> > -		 */
> > -		if (!xfs_verify_agino(mp, agno, unlinked_agino) ||
> > -		    next_agino == unlinked_agino) {
> > -			XFS_CORRUPTION_ERROR(__func__,
> > -					XFS_ERRLEVEL_LOW, mp,
> > -					*dipp, sizeof(**dipp));
> > -			error = -EFSCORRUPTED;
> > -			return error;
> > -		}
> > -		next_agino = unlinked_agino;
> > -	}
> > +	ASSERT(agino != NULLAGINO);
> > +	error = xfs_iunlink_map_ino(tp, agno, agino, imap, dipp, bpp);
> > +	if (error)
> > +		return error;
> >  
> > +	if (be32_to_cpu((*dipp)->di_next_unlinked) != target_agino)
> > +		return -EFSCORRUPTED;
> 
> Why drop the corruption report here?

For simplicity of refactoring. It comes back later on in the series
as this check gets recombined with another similar function. i.e.
See xfs_iunlink_log_inode() at the end of the series....
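
To illustrate the shape of the check being discussed, here is a rough
user-space sketch of removing an entry from a doubly linked list where
the predecessor is found directly and its "on-disk" next pointer is
verified before being changed. All names here (struct node, NULL_AGINO,
ERR_CORRUPT, unlinked_insert(), unlinked_remove()) are hypothetical, and
treating the list head as if it carried the AGI bucket value is a
simplification for the sketch, not how the patches structure it.

```c
#include <stddef.h>
#include <stdint.h>

#define NULL_AGINO	UINT32_MAX	/* stand-in for NULLAGINO */
#define ERR_CORRUPT	(-117)		/* stand-in for -EFSCORRUPTED */

struct node {
	struct node	*prev, *next;	/* in-memory list linkage */
	uint32_t	agino;
	uint32_t	ondisk_next;	/* models di_next_unlinked */
};

/* An empty circular list: the head points at itself. */
static void list_init(struct node *head)
{
	head->prev = head->next = head;
	head->agino = NULL_AGINO;
	head->ondisk_next = NULL_AGINO;
}

/* Insert at the head; the new node's on-disk pointer names the old head. */
static void unlinked_insert(struct node *head, struct node *n)
{
	n->ondisk_next = head->next->agino;	/* NULL_AGINO if list was empty */
	head->ondisk_next = n->agino;

	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

/*
 * Remove @n. The predecessor comes straight from the in-memory list, so
 * no traversal is needed; its on-disk pointer is checked before use.
 */
static int unlinked_remove(struct node *n)
{
	struct node	*prev = n->prev;

	if (prev->ondisk_next != n->agino)
		return ERR_CORRUPT;

	prev->ondisk_next = n->ondisk_next;
	n->ondisk_next = NULL_AGINO;

	prev->next = n->next;
	n->next->prev = prev;
	n->prev = n->next = n;
	return 0;
}
```

A mismatch between the in-memory predecessor and its on-disk pointer is
reported as corruption rather than being papered over by a list walk.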

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 04/13] xfs: arrange all unlinked inodes into one list
  2020-08-18 23:59   ` Darrick J. Wong
  2020-08-19  0:45     ` Dave Chinner
@ 2020-08-19  0:58     ` Gao Xiang
  2020-08-22  9:01       ` Christoph Hellwig
  1 sibling, 1 reply; 51+ messages in thread
From: Gao Xiang @ 2020-08-19  0:58 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs

On Tue, Aug 18, 2020 at 04:59:59PM -0700, Darrick J. Wong wrote:

...

> > +	bucket_index = 0;
> > +	/* During recovery, the old multiple bucket index can be applied */
> > +	if (!log || log->l_flags & XLOG_RECOVERY_NEEDED) {
> 
> Does the flag test need parentheses?

Yeah, that would be better.

> 
> It feels a little funny that we pass in old_agino (having gotten it from
> agi_unlinked) and then compare it with agi_unlinked, but as the commit
> log points out, I guess this is a wart of having to support the old
> unlinked list behavior.  It makes sense to me that if we're going to
> change the unlinked list behavior we could be a little more careful
> about double-checking things.
> 
> Question: if a newer kernel crashes with a super-long unlinked list and
> the fs gets recovered on an old kernel, will this lead to insanely high
> recovery times?  I think the answer is no, because recovery is single
> threaded and the hash only existed to reduce AGI contention during
> normal unlinking operations?

btw, if my understanding is correct, and as I have mentioned since my v1,
this new feature isn't forward compatible: old kernels hardcode
agino % XFS_AGI_UNLINKED_BUCKETS in their log recovery code rather than
tracking the original bucket_index. So yeah, it's a bit awkward given the
original design...
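
A tiny sketch of the mismatch, assuming the 64 buckets mentioned in the
commit message; old_bucket(), new_bucket() and buckets_agree() are
hypothetical helpers for illustration, not kernel functions:

```c
#include <stdint.h>

#define AGI_UNLINKED_BUCKETS	64	/* bucket count from the commit message */

/* What an old kernel computes: hash the agino across all buckets. */
static unsigned int old_bucket(uint32_t agino)
{
	return agino % AGI_UNLINKED_BUCKETS;
}

/* What this series does at runtime: everything goes in bucket 0. */
static unsigned int new_bucket(uint32_t agino)
{
	(void)agino;
	return 0;
}

/* Nonzero only if both schemes pick the same bucket for this agino. */
static int buckets_agree(uint32_t agino)
{
	return old_bucket(agino) == new_bucket(agino);
}
```

For agino 130 the old scheme looks in bucket 2 while the inode actually
sits on the bucket 0 list, which is the incompatibility described above.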

Thanks,
Gao Xiang



* Re: [PATCH 10/13] xfs: re-order AGI updates in unlink list updates
  2020-08-19  0:29   ` Darrick J. Wong
@ 2020-08-19  1:01     ` Dave Chinner
  0 siblings, 0 replies; 51+ messages in thread
From: Dave Chinner @ 2020-08-19  1:01 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Aug 18, 2020 at 05:29:48PM -0700, Darrick J. Wong wrote:
> On Wed, Aug 12, 2020 at 07:25:53PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > We always access and check the AGI bucket entry for the unlinked
> > list even if we are not going to need it either for lookup or remove
> > purposes. Move the code that accesses the AGI to the code that
> > > modifies the AGI, hence keeping the AGI accesses local to the code
> > that needs to modify it.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_inode.c | 84 ++++++++++++++++------------------------------
> >  1 file changed, 28 insertions(+), 56 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > index b098e5df07e7..4f616e1b64dc 100644
> > --- a/fs/xfs/xfs_inode.c
> > +++ b/fs/xfs/xfs_inode.c
> > @@ -1918,44 +1918,53 @@ xfs_inactive(
> >   */
> >  
> >  /*
> > - * Point the AGI unlinked bucket at an inode and log the results.  The caller
> > - * is responsible for validating the old value.
> > + * Point the AGI unlinked bucket at an inode and log the results. The caller
> > + * passes in the expected current agino the bucket points at via @cur_agino so
> > + * we can validate that we are about to remove the inode we expect to be
> > + * removing from the AGI bucket.
> >   */
> > -STATIC int
> > +static int
> >  xfs_iunlink_update_bucket(
> >  	struct xfs_trans	*tp,
> >  	xfs_agnumber_t		agno,
> >  	struct xfs_buf		*agibp,
> > -	xfs_agino_t		old_agino,
> > +	xfs_agino_t		cur_agino,
> 
> Hm.  So I think I understand the new role of this function better now
> that this patch moves into this function the checking of the bucket
> pointer and whatnot.  Would it be difficult to merge this patch with
> patch 4?

I really didn't want to remove the code that used the "head_agino"
for verification until I had moved all the list traversal
functionality to use the in memory unlinked list and had verified
that was correct....

I think merging them could be done, but it will most likely
result in having to rebase and retest every subsequent patch...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 01/13] xfs: xfs_iflock is no longer a completion
  2020-08-12  9:25 ` [PATCH 01/13] xfs: xfs_iflock is no longer a completion Dave Chinner
  2020-08-18 23:44   ` Darrick J. Wong
@ 2020-08-22  7:41   ` Christoph Hellwig
  1 sibling, 0 replies; 51+ messages in thread
From: Christoph Hellwig @ 2020-08-22  7:41 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Aug 12, 2020 at 07:25:44PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> With the recent rework of the inode cluster flushing, we no longer
> ever wait on the inode flush "lock". It was never a lock in the
> first place, just a completion to allow callers to wait for inode IO
> to complete. We now never wait for flush completion as all inode
> flushing is non-blocking. Hence we can get rid of all the iflock
> infrastructure and instead just set and check a state flag.
> 
> Rename the XFS_IFLOCK flag to XFS_IFLUSHING, convert all the
> xfs_iflock_nowait() calls to test-and-set operations on that flag, and
> replace all the xfs_ifunlock() calls with clear operations.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_icache.c     | 17 ++++------
>  fs/xfs/xfs_inode.c      | 73 +++++++++++++++--------------------------
>  fs/xfs/xfs_inode.h      | 33 +------------------
>  fs/xfs/xfs_inode_item.c | 15 ++++-----
>  fs/xfs/xfs_inode_item.h |  4 +--
>  fs/xfs/xfs_mount.c      | 11 ++++---
>  fs/xfs/xfs_super.c      | 10 +++---
>  7 files changed, 55 insertions(+), 108 deletions(-)
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 101028ebb571..aa6aad258670 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -52,7 +52,6 @@ xfs_inode_alloc(
>  
>  	XFS_STATS_INC(mp, vn_active);
>  	ASSERT(atomic_read(&ip->i_pincount) == 0);
> -	ASSERT(!xfs_isiflocked(ip));
>  	ASSERT(ip->i_ino == 0);
>  
>  	/* initialise the xfs inode */
> @@ -123,7 +122,7 @@ void
>  xfs_inode_free(
>  	struct xfs_inode	*ip)
>  {
> -	ASSERT(!xfs_isiflocked(ip));
> +	ASSERT(!xfs_iflags_test(ip, XFS_IFLUSHING));
>  
>  	/*
>  	 * Because we use RCU freeing we need to ensure the inode always
> @@ -1035,23 +1034,21 @@ xfs_reclaim_inode(
>  
>  	if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
>  		goto out;
> -	if (!xfs_iflock_nowait(ip))
> +	if (xfs_iflags_test_and_set(ip, XFS_IFLUSHING))
>  		goto out_iunlock;
>  
>  	if (XFS_FORCED_SHUTDOWN(ip->i_mount)) {
>  		xfs_iunpin_wait(ip);
> -		/* xfs_iflush_abort() drops the flush lock */

Maybe keep this as

		/* xfs_iflush_abort() clears XFS_IFLUSHING */

> @@ -3661,7 +3643,6 @@ xfs_iflush_cluster(
>  		 */
>  		if (XFS_FORCED_SHUTDOWN(mp)) {
>  			xfs_iunpin_wait(ip);
> -			/* xfs_iflush_abort() drops the flush lock */

Same here.

Otherwise looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>
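
For readers following the XFS_IFLUSHING conversion, here is a minimal userspace sketch of the "test-and-set a state flag instead of taking a flush lock" pattern. It is not the kernel code: all `demo_` names are hypothetical, and it uses C11 atomics as a stand-in for whatever synchronisation the real xfs_iflags_test_and_set() helper uses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the XFS_IFLUSHING state bit. */
#define DEMO_IFLUSHING	(1u << 0)

/* Hypothetical stand-in for the flags word of an inode. */
struct demo_flush_state {
	atomic_uint	flags;
};

/*
 * Atomically set @flag and report whether it was already set.
 * Returns true when someone else is already flushing, so the caller
 * backs off instead of waiting (there is nothing to wait on).
 */
static inline bool
demo_flags_test_and_set(struct demo_flush_state *st, unsigned int flag)
{
	return (atomic_fetch_or(&st->flags, flag) & flag) != 0;
}

/* Clear @flag once the flush has completed or been aborted. */
static inline void
demo_flags_clear(struct demo_flush_state *st, unsigned int flag)
{
	atomic_fetch_and(&st->flags, ~flag);
}
```

The caller treats a true return as "already being flushed, skip it", which matches the shape of the `if (xfs_iflags_test_and_set(ip, XFS_IFLUSHING)) goto out_iunlock;` hunk quoted above.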

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 03/13] xfs: factor the xfs_iunlink functions
  2020-08-12  9:25 ` [PATCH 03/13] xfs: factor the xfs_iunlink functions Dave Chinner
  2020-08-18 23:49   ` Darrick J. Wong
@ 2020-08-22  7:45   ` Christoph Hellwig
  1 sibling, 0 replies; 51+ messages in thread
From: Christoph Hellwig @ 2020-08-22  7:45 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 04/13] xfs: arrange all unlinked inodes into one list
  2020-08-19  0:58     ` Gao Xiang
@ 2020-08-22  9:01       ` Christoph Hellwig
  2020-08-23 17:24         ` Gao Xiang
  0 siblings, 1 reply; 51+ messages in thread
From: Christoph Hellwig @ 2020-08-22  9:01 UTC (permalink / raw)
  To: Gao Xiang; +Cc: Darrick J. Wong, Dave Chinner, linux-xfs

On Wed, Aug 19, 2020 at 08:58:30AM +0800, Gao Xiang wrote:
> btw, if my understanding is correct, as I mentioned starting from my v1,
> this new feature isn't forward compatible since old kernel hardcode
> agino % XFS_AGI_UNLINKED_BUCKETS but not tracing original bucket_index
> from its logging recovery code. So yeah, a bit awkward from its original
> design...

I think we should add a log_incompat feature just to be safe.
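
To illustrate the compatibility problem being discussed, here is a small userspace sketch. It assumes XFS_AGI_UNLINKED_BUCKETS is 64 and that the single-list scheme always uses bucket 0 (as the agi_unlinked[0] check in the series suggests); the `demo_` helpers are hypothetical and only model the bucket arithmetic, not recovery itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed to match the on-disk format definition. */
#define XFS_AGI_UNLINKED_BUCKETS	64

/* Bucket an older kernel expects an unlinked inode to be in. */
static inline unsigned int
demo_old_bucket(uint32_t agino)
{
	return agino % XFS_AGI_UNLINKED_BUCKETS;
}

/* Bucket used when every unlinked inode goes onto one list. */
static inline unsigned int
demo_single_list_bucket(uint32_t agino)
{
	(void)agino;	/* the inode number no longer matters */
	return 0;
}

/*
 * True only when an older kernel would look in the bucket that the
 * single-list scheme actually used for this inode.
 */
static inline bool
demo_old_kernel_finds_inode(uint32_t agino)
{
	return demo_old_bucket(agino) == demo_single_list_bucket(agino);
}
```

Under these assumptions only inodes whose agino is a multiple of 64 end up where an older kernel expects them, which is why a feature flag is being suggested.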

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 05/13] xfs: add unlink list pointers to xfs_inode
  2020-08-12  9:25 ` [PATCH 05/13] xfs: add unlink list pointers to xfs_inode Dave Chinner
  2020-08-19  0:02   ` Darrick J. Wong
@ 2020-08-22  9:03   ` Christoph Hellwig
  2020-08-25  5:17     ` Dave Chinner
  1 sibling, 1 reply; 51+ messages in thread
From: Christoph Hellwig @ 2020-08-22  9:03 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Aug 12, 2020 at 07:25:48PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> To move away from using the on disk inode buffers to track and log
> unlinked inodes, we need pointers to track them in memory. Because
> we have arbitrary remove order from the list, it needs to be a
> double linked list.
> 
> We start by noting that inodes are always in memory when they are
> active on the unlinked list, and hence we can track these inodes
> without needing to take references to the inodes or store them in
> the list. We cannot, however, use inode locks to protect the inodes
> on the list - the list needs an external lock to serialise all
> inserts and removals. We can use the existing AGI buffer lock for
> this right now as that already serialises all unlinked list
> traversals and modifications.
> 
> Hence we can convert the in-memory unlinked list to a simple
> list_head list in the perag. We can use list_empty() to detect an
> empty unlinked list, likewise we can detect the end of the list when
> the inode next pointer points back to the perag list_head. This
> makes insert, remove and traversal.
> 
> The only complication here is log recovery of old filesystems that
> have multiple lists. These always remove from the head of the list,
> so we can easily construct just enough of the unlinked list for
> recovery from any list to work correctly.

I'd much prefer not bloating the inode for the relatively rare case of
an inode unlinked while still open.  Can't we just allocate a temporary
structure with the list_head and inode pointer instead?

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 02/13] xfs: add log item precommit operation
  2020-08-12  9:25 ` [PATCH 02/13] xfs: add log item precommit operation Dave Chinner
@ 2020-08-22  9:06   ` Christoph Hellwig
  0 siblings, 0 replies; 51+ messages in thread
From: Christoph Hellwig @ 2020-08-22  9:06 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

While looking to find how this is used, I noticed that it only
starts to get used in patch 12.  Maybe move it just before that
to help the reviewers keep context? 

> +	struct xfs_log_item	*lia = container_of(a,
> +					struct xfs_log_item, li_trans);
> +	struct xfs_log_item	*lib = container_of(b,
> +					struct xfs_log_item, li_trans);

lib as a variable name reads a little strange :)

What about li1 and li2?

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 04/13] xfs: arrange all unlinked inodes into one list
  2020-08-22  9:01       ` Christoph Hellwig
@ 2020-08-23 17:24         ` Gao Xiang
  2020-08-24  8:19           ` [RFC PATCH] xfs: use log_incompat feature instead of speculate matching Gao Xiang
  0 siblings, 1 reply; 51+ messages in thread
From: Gao Xiang @ 2020-08-23 17:24 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Darrick J. Wong, Dave Chinner, linux-xfs

Hi Christoph,

On Sat, Aug 22, 2020 at 10:01:45AM +0100, Christoph Hellwig wrote:
> On Wed, Aug 19, 2020 at 08:58:30AM +0800, Gao Xiang wrote:
> > btw, if my understanding is correct, as I mentioned starting from my v1,
> > this new feature isn't forward compatible since old kernel hardcode
> > agino % XFS_AGI_UNLINKED_BUCKETS but not tracing original bucket_index
> > from its logging recovery code. So yeah, a bit awkward from its original
> > design...
> 
> I think we should add a log_incompat feature just to be safe.

Thanks for your suggestion.
Okay, if no other concern, I will try to look into that tomorrow...

Thanks,
Gao Xiang

> 


^ permalink raw reply	[flat|nested] 51+ messages in thread

* [RFC PATCH] xfs: use log_incompat feature instead of speculate matching
  2020-08-23 17:24         ` Gao Xiang
@ 2020-08-24  8:19           ` Gao Xiang
  2020-08-24  8:34             ` Gao Xiang
  0 siblings, 1 reply; 51+ messages in thread
From: Gao Xiang @ 2020-08-24  8:19 UTC (permalink / raw)
  To: linux-xfs, Christoph Hellwig; +Cc: Dave Chinner, Darrick J. Wong, Gao Xiang

From: Gao Xiang <hsiangkao@redhat.com>

Use a log_incompat feature just to be safe.
If the current mount is in RO state, it will defer
to next RW remount.

Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
---

After some careful thinking, I think this probably won't work for the
still-supported V4 XFS filesystems. So I think we'd probably have to
stick with the previous approach (correct me if I'm wrong)...

(since xfs_sb_to_disk() refuses to set up any feature bits for non-V5
 filesystems. That is another awkward situation here (it doesn't write
 out or check feature bits for V4 even though it uses V4 sb reserved
 fields), unless we make V4 completely RO from this commit on.)

Just send out as a RFC patch. Not fully tested after I thought as above.

 fs/xfs/libxfs/xfs_format.h | 4 +++-
 fs/xfs/xfs_inode.c         | 3 ++-
 fs/xfs/xfs_mount.c         | 9 +++++++++
 3 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 31b7ece985bb..9f6c2766f6a6 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -479,7 +479,9 @@ xfs_sb_has_incompat_feature(
 	return (sbp->sb_features_incompat & feature) != 0;
 }
 
-#define XFS_SB_FEAT_INCOMPAT_LOG_ALL 0
+#define XFS_SB_FEAT_INCOMPAT_LOG_NEW_UNLINK	(1 << 0)
+#define XFS_SB_FEAT_INCOMPAT_LOG_ALL	\
+		(XFS_SB_FEAT_INCOMPAT_LOG_NEW_UNLINK)
 #define XFS_SB_FEAT_INCOMPAT_LOG_UNKNOWN	~XFS_SB_FEAT_INCOMPAT_LOG_ALL
 static inline bool
 xfs_sb_has_incompat_log_feature(
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 7ee778bcde06..e8eee1437611 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1952,7 +1952,8 @@ xfs_iunlink_update_bucket(
 	if (!log || log->l_flags & XLOG_RECOVERY_NEEDED) {
 		ASSERT(cur_agino != NULLAGINO);
 
-		if (be32_to_cpu(agi->agi_unlinked[0]) != cur_agino)
+		if (!(mp->m_sb.sb_features_log_incompat &
+		      XFS_SB_FEAT_INCOMPAT_LOG_NEW_UNLINK))
 			bucket_index = cur_agino % XFS_AGI_UNLINKED_BUCKETS;
 	}
 
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index f28c969af272..91d8b22524c6 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -836,6 +836,15 @@ xfs_mountfs(
 		goto out_fail_wait;
 	}
 
+	if (!(sbp->sb_features_log_incompat &
+	      XFS_SB_FEAT_INCOMPAT_LOG_NEW_UNLINK) &&
+	    !(mp->m_flags & XFS_MOUNT_RDONLY)) {
+		xfs_warn(mp, "will switch to long iunlinked list on r/w");
+		sbp->sb_features_log_incompat |=
+				XFS_SB_FEAT_INCOMPAT_LOG_NEW_UNLINK;
+		mp->m_update_sb = true;
+	}
+
 	/* Make sure the summary counts are ok. */
 	error = xfs_check_summary_counts(mp);
 	if (error)
-- 
2.24.0


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [RFC PATCH] xfs: use log_incompat feature instead of speculate matching
  2020-08-24  8:19           ` [RFC PATCH] xfs: use log_incompat feature instead of speculate matching Gao Xiang
@ 2020-08-24  8:34             ` Gao Xiang
  2020-08-24 15:08               ` Darrick J. Wong
  0 siblings, 1 reply; 51+ messages in thread
From: Gao Xiang @ 2020-08-24  8:34 UTC (permalink / raw)
  To: linux-xfs, Christoph Hellwig; +Cc: Dave Chinner, Darrick J. Wong

On Mon, Aug 24, 2020 at 04:19:00PM +0800, Gao Xiang wrote:
> From: Gao Xiang <hsiangkao@redhat.com>
> 
> Use a log_incompat feature just to be safe.
> If the current mount is in RO state, it will defer
> to next RW remount.
> 
> Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
> ---
> 
> After some careful thinking, I think it's probably not working for
> supported V4 XFS filesystem. So, I think we'd probably insist on the
> previous way (correct me if I'm wrong)...
> 
> (since xfs_sb_to_disk() refuses to set up any feature bits for non V5
>  fses. That is another awkward setting here (doesn't write out/check
>  feature bits for V4 even though using V4 sb reserved fields) and
>  unless let V4 completely RO since this commit. )
> 
> Just send out as a RFC patch. Not fully tested after I thought as above.

Unless we also use sb_features2 for V4 filesystem to entirely
refuse to mount such V4 filesystem...
Some more opinions on this?

Thanks,
Gao Xiang


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC PATCH] xfs: use log_incompat feature instead of speculate matching
  2020-08-24  8:34             ` Gao Xiang
@ 2020-08-24 15:08               ` Darrick J. Wong
  2020-08-24 15:41                 ` Gao Xiang
  0 siblings, 1 reply; 51+ messages in thread
From: Darrick J. Wong @ 2020-08-24 15:08 UTC (permalink / raw)
  To: Gao Xiang; +Cc: linux-xfs, Christoph Hellwig, Dave Chinner

On Mon, Aug 24, 2020 at 04:34:02PM +0800, Gao Xiang wrote:
> On Mon, Aug 24, 2020 at 04:19:00PM +0800, Gao Xiang wrote:
> > From: Gao Xiang <hsiangkao@redhat.com>
> > 
> > Use a log_incompat feature just to be safe.
> > If the current mount is in RO state, it will defer
> > to next RW remount.
> > 
> > Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
> > ---
> > 
> > After some careful thinking, I think it's probably not working for
> > supported V4 XFS filesystem. So, I think we'd probably insist on the
> > previous way (correct me if I'm wrong)...
> > 
> > (since xfs_sb_to_disk() refuses to set up any feature bits for non V5
> >  fses. That is another awkward setting here (doesn't write out/check
> >  feature bits for V4 even though using V4 sb reserved fields) and
> >  unless let V4 completely RO since this commit. )
> > 
> > Just send out as a RFC patch. Not fully tested after I thought as above.
> 
> Unless we also use sb_features2 for V4 filesystem to entirely
> refuse to mount such V4 filesystem...
> Some more opinions on this?

Frankly, V4 is pretty old, so I wouldn't bother.  We only build new
features for V5 format.

--D

> Thanks,
> Gao Xiang
> 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC PATCH] xfs: use log_incompat feature instead of speculate matching
  2020-08-24 15:08               ` Darrick J. Wong
@ 2020-08-24 15:41                 ` Gao Xiang
  2020-08-25 10:06                   ` [PATCH] " Gao Xiang
  0 siblings, 1 reply; 51+ messages in thread
From: Gao Xiang @ 2020-08-24 15:41 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, Christoph Hellwig, Dave Chinner

On Mon, Aug 24, 2020 at 08:08:32AM -0700, Darrick J. Wong wrote:
> On Mon, Aug 24, 2020 at 04:34:02PM +0800, Gao Xiang wrote:
> > On Mon, Aug 24, 2020 at 04:19:00PM +0800, Gao Xiang wrote:
> > > From: Gao Xiang <hsiangkao@redhat.com>
> > > 
> > > Use a log_incompat feature just to be safe.
> > > If the current mount is in RO state, it will defer
> > > to next RW remount.
> > > 
> > > Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
> > > ---
> > > 
> > > After some careful thinking, I think it's probably not working for
> > > supported V4 XFS filesystem. So, I think we'd probably insist on the
> > > previous way (correct me if I'm wrong)...
> > > 
> > > (since xfs_sb_to_disk() refuses to set up any feature bits for non V5
> > >  fses. That is another awkward setting here (doesn't write out/check
> > >  feature bits for V4 even though using V4 sb reserved fields) and
> > >  unless let V4 completely RO since this commit. )
> > > 
> > > Just send out as a RFC patch. Not fully tested after I thought as above.
> > 
> > Unless we also use sb_features2 for V4 filesystem to entirely
> > refuse to mount such V4 filesystem...
> > Some more opinions on this?
> 
> Frankly, V4 is pretty old, so I wouldn't bother.  We only build new
> features for V5 format.

Okay, let me go further with log_incompat (v5) and sb_features2 (v4)
later.

Thanks,
Gao Xiang

> 
> --D
> 
> > Thanks,
> > Gao Xiang
> > 
> 


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 05/13] xfs: add unlink list pointers to xfs_inode
  2020-08-22  9:03   ` Christoph Hellwig
@ 2020-08-25  5:17     ` Dave Chinner
  2020-08-27  7:21       ` Christoph Hellwig
  0 siblings, 1 reply; 51+ messages in thread
From: Dave Chinner @ 2020-08-25  5:17 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Sat, Aug 22, 2020 at 10:03:39AM +0100, Christoph Hellwig wrote:
> On Wed, Aug 12, 2020 at 07:25:48PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > To move away from using the on disk inode buffers to track and log
> > unlinked inodes, we need pointers to track them in memory. Because
> > we have arbitrary remove order from the list, it needs to be a
> > double linked list.
> > 
> > We start by noting that inodes are always in memory when they are
> > active on the unlinked list, and hence we can track these inodes
> > without needing to take references to the inodes or store them in
> > the list. We cannot, however, use inode locks to protect the inodes
> > on the list - the list needs an external lock to serialise all
> > inserts and removals. We can use the existing AGI buffer lock for
> > this right now as that already serialises all unlinked list
> > traversals and modifications.
> > 
> > Hence we can convert the in-memory unlinked list to a simple
> > list_head list in the perag. We can use list_empty() to detect an
> > empty unlinked list, likewise we can detect the end of the list when
> > the inode next pointer points back to the perag list_head. This
> > makes insert, remove and traversal.
> > 
> > The only complication here is log recovery of old filesystems that
> > have multiple lists. These always remove from the head of the list,
> > so we can easily construct just enough of the unlinked list for
> > recovery from any list to work correctly.
> 
> I'd much prefer not bloating the inode for the relatively rare case of
> an inode unlinked while still open.  Can't we just allocate a temporary
> structure with the list_head and inode pointer instead?

That's precisely the complexity this code gets rid of. i.e. the
complex reverse pointer mapping rhashtable that had to be able to
handle rhashtable memory allocation failures and so required
fallbacks to straight buffer based unlink list walking. I much
prefer that we burn a little bit more memory for *much* simpler,
faster and more flexible code....
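
As a rough illustration of the trade-off, here is a userspace sketch of an intrusive doubly linked list with the node embedded in the inode, so no separate allocation or reverse mapping is needed. This is not the kernel's list_head implementation: the `demo_` types and helpers are hypothetical and only mirror the idea of a list head in the perag with arbitrary-order removal.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal circular doubly linked list node (hypothetical). */
struct demo_list {
	struct demo_list	*next, *prev;
};

static void demo_list_init(struct demo_list *h)
{
	h->next = h;
	h->prev = h;
}

static bool demo_list_empty(const struct demo_list *h)
{
	return h->next == h;
}

/* Insert @entry right after @head. */
static void demo_list_add(struct demo_list *entry, struct demo_list *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

/* O(1) removal from any position; no allocation can fail here. */
static void demo_list_del(struct demo_list *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	demo_list_init(entry);
}

struct demo_perag { struct demo_list unlinked; };
struct demo_inode { unsigned int agino; struct demo_list unlink_node; };

/* Recover the inode from its embedded list node. */
#define demo_inode_of(ptr) \
	((struct demo_inode *)((char *)(ptr) - \
		offsetof(struct demo_inode, unlink_node)))

/* End of list: the next pointer points back at the perag head. */
static bool demo_is_last(const struct demo_perag *pag,
			 const struct demo_inode *ip)
{
	return ip->unlink_node.next == &pag->unlinked;
}
```

The cost is two pointers per in-memory inode; in exchange, removal from the middle of the list needs no lookup and has no memory-allocation failure path to fall back from.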

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH] xfs: use log_incompat feature instead of speculate matching
  2020-08-24 15:41                 ` Gao Xiang
@ 2020-08-25 10:06                   ` Gao Xiang
  2020-08-25 14:54                     ` Darrick J. Wong
  0 siblings, 1 reply; 51+ messages in thread
From: Gao Xiang @ 2020-08-25 10:06 UTC (permalink / raw)
  To: linux-xfs; +Cc: Dave Chinner, Darrick J. Wong, Christoph Hellwig, Gao Xiang

Add a log_incompat (v5) or sb_features2 (v4) feature
of a single long iunlinked list just to be safe. Hence,
older kernels will refuse to replay log for v5 images
or mount entirely for v4 images.

If the current mount is in RO state, it will defer
to the next RW (re)mount to add such flag instead.

Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
---
Different combinations have been tested (v4/v5 and before/after patch).

Based on the top of
`[PATCH 13/13] xfs: reorder iunlink remove operation in xfs_ifree`
https://lore.kernel.org/r/20200812092556.2567285-14-david@fromorbit.com

Either folding or rearranging this patch would be okay.

Maybe xfsprogs could also be patched to change the default
feature setting, but let me send this out first...

(It's possible that I'm still missing something...
 Kindly point out any time.)

 fs/xfs/libxfs/xfs_format.h | 29 +++++++++++++++++++++++++++--
 fs/xfs/xfs_inode.c         |  2 +-
 fs/xfs/xfs_mount.c         |  6 ++++++
 3 files changed, 34 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 31b7ece985bb..a859fe601f6e 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -79,12 +79,14 @@ struct xfs_ifork;
 #define XFS_SB_VERSION2_PROJID32BIT	0x00000080	/* 32 bit project id */
 #define XFS_SB_VERSION2_CRCBIT		0x00000100	/* metadata CRCs */
 #define XFS_SB_VERSION2_FTYPE		0x00000200	/* inode type in dir */
+#define XFS_SB_VERSION2_NEW_IUNLINK	0x00000400	/* (v4) new iunlink */
 
 #define	XFS_SB_VERSION2_OKBITS		\
 	(XFS_SB_VERSION2_LAZYSBCOUNTBIT	| \
 	 XFS_SB_VERSION2_ATTR2BIT	| \
 	 XFS_SB_VERSION2_PROJID32BIT	| \
-	 XFS_SB_VERSION2_FTYPE)
+	 XFS_SB_VERSION2_FTYPE		| \
+	 XFS_SB_VERSION2_NEW_IUNLINK)
 
 /* Maximum size of the xfs filesystem label, no terminating NULL */
 #define XFSLABEL_MAX			12
@@ -479,7 +481,9 @@ xfs_sb_has_incompat_feature(
 	return (sbp->sb_features_incompat & feature) != 0;
 }
 
-#define XFS_SB_FEAT_INCOMPAT_LOG_ALL 0
+#define XFS_SB_FEAT_INCOMPAT_LOG_NEW_IUNLINK	(1 << 0)
+#define XFS_SB_FEAT_INCOMPAT_LOG_ALL	\
+		(XFS_SB_FEAT_INCOMPAT_LOG_NEW_IUNLINK)
 #define XFS_SB_FEAT_INCOMPAT_LOG_UNKNOWN	~XFS_SB_FEAT_INCOMPAT_LOG_ALL
 static inline bool
 xfs_sb_has_incompat_log_feature(
@@ -563,6 +567,27 @@ static inline bool xfs_sb_version_hasreflink(struct xfs_sb *sbp)
 		(sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_REFLINK);
 }
 
+static inline bool xfs_sb_has_new_iunlink(struct xfs_sb *sbp)
+{
+	if (XFS_SB_VERSION_NUM(sbp) == XFS_SB_VERSION_5)
+		return sbp->sb_features_log_incompat &
+			XFS_SB_FEAT_INCOMPAT_LOG_NEW_IUNLINK;
+
+	return xfs_sb_version_hasmorebits(sbp) &&
+		(sbp->sb_features2 & XFS_SB_VERSION2_NEW_IUNLINK);
+}
+
+static inline void xfs_sb_add_new_iunlink(struct xfs_sb *sbp)
+{
+	if (XFS_SB_VERSION_NUM(sbp) == XFS_SB_VERSION_5) {
+		sbp->sb_features_log_incompat |=
+			XFS_SB_FEAT_INCOMPAT_LOG_NEW_IUNLINK;
+		return;
+	}
+	sbp->sb_versionnum |= XFS_SB_VERSION_MOREBITSBIT;
+	sbp->sb_features2 |= XFS_SB_VERSION2_NEW_IUNLINK;
+}
+
 /*
  * end of superblock version macros
  */
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 7ee778bcde06..1656ed7dcadf 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1952,7 +1952,7 @@ xfs_iunlink_update_bucket(
 	if (!log || log->l_flags & XLOG_RECOVERY_NEEDED) {
 		ASSERT(cur_agino != NULLAGINO);
 
-		if (be32_to_cpu(agi->agi_unlinked[0]) != cur_agino)
+		if (!xfs_sb_has_new_iunlink(&mp->m_sb))
 			bucket_index = cur_agino % XFS_AGI_UNLINKED_BUCKETS;
 	}
 
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index f28c969af272..a3b2e3c3d32f 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -836,6 +836,12 @@ xfs_mountfs(
 		goto out_fail_wait;
 	}
 
+	if (!xfs_sb_has_new_iunlink(sbp)) {
+		xfs_warn(mp, "will switch to long iunlinked list on r/w");
+		xfs_sb_add_new_iunlink(sbp);
+		mp->m_update_sb = true;
+	}
+
 	/* Make sure the summary counts are ok. */
 	error = xfs_check_summary_counts(mp);
 	if (error)
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [PATCH] xfs: use log_incompat feature instead of speculate matching
  2020-08-25 10:06                   ` [PATCH] " Gao Xiang
@ 2020-08-25 14:54                     ` Darrick J. Wong
  2020-08-25 15:30                       ` Gao Xiang
  2020-08-27  7:19                       ` Christoph Hellwig
  0 siblings, 2 replies; 51+ messages in thread
From: Darrick J. Wong @ 2020-08-25 14:54 UTC (permalink / raw)
  To: Gao Xiang; +Cc: linux-xfs, Dave Chinner, Christoph Hellwig

On Tue, Aug 25, 2020 at 06:06:01PM +0800, Gao Xiang wrote:
> Add a log_incompat (v5) or sb_features2 (v4) feature
> of a single long iunlinked list just to be safe. Hence,
> older kernels will refuse to replay log for v5 images
> or mount entirely for v4 images.
> 
> If the current mount is in RO state, it will defer
> to the next RW (re)mount to add such flag instead.

This commit log needs to state /why/ we need a new feature flag in
addition to summarizing what is being added here.  For example,

"Introduce a new feature flag to collapse the unlinked hash to a single
bucket.  Doing so removes the need to lock the AGI in addition to the
previous and next items in the unlinked list.  Older kernels will think
that inodes are in the wrong unlinked hash bucket and declare the fs
corrupt, so the new feature is needed to prevent them from touching the
filesystem."

(or whatever the real reason is, I'm attending DebConf and LPC and
wasn't following 100%...)

Note that the above was a guess, because I actually can't tell if this
feature is needed to prevent old kernels from tripping over our new
strategy, or to prevent new kernels from running off the road if an old
kernel wrote all the hash buckets.  I would've thought both cases would
be fine...?

> Suggested-by: Christoph Hellwig <hch@infradead.org>
> Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
> ---
> Different combinations have been tested (v4/v5 and before/after patch).
> 
> Based on the top of
> `[PATCH 13/13] xfs: reorder iunlink remove operation in xfs_ifree`
> https://lore.kernel.org/r/20200812092556.2567285-14-david@fromorbit.com
> 
> Either folding or rearranging this patch would be okay.
> 
> Maybe xfsprogs could be also patched as well to change the default
> feature setting, but let me send out this first...
> 
> (It's possible that I'm still missing something...
>  Kindly point out any time.)
> 
>  fs/xfs/libxfs/xfs_format.h | 29 +++++++++++++++++++++++++++--
>  fs/xfs/xfs_inode.c         |  2 +-
>  fs/xfs/xfs_mount.c         |  6 ++++++
>  3 files changed, 34 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> index 31b7ece985bb..a859fe601f6e 100644
> --- a/fs/xfs/libxfs/xfs_format.h
> +++ b/fs/xfs/libxfs/xfs_format.h
> @@ -79,12 +79,14 @@ struct xfs_ifork;
>  #define XFS_SB_VERSION2_PROJID32BIT	0x00000080	/* 32 bit project id */
>  #define XFS_SB_VERSION2_CRCBIT		0x00000100	/* metadata CRCs */
>  #define XFS_SB_VERSION2_FTYPE		0x00000200	/* inode type in dir */
> +#define XFS_SB_VERSION2_NEW_IUNLINK	0x00000400	/* (v4) new iunlink */
>  
>  #define	XFS_SB_VERSION2_OKBITS		\
>  	(XFS_SB_VERSION2_LAZYSBCOUNTBIT	| \
>  	 XFS_SB_VERSION2_ATTR2BIT	| \
>  	 XFS_SB_VERSION2_PROJID32BIT	| \
> -	 XFS_SB_VERSION2_FTYPE)
> +	 XFS_SB_VERSION2_FTYPE		| \
> +	 XFS_SB_VERSION2_NEW_IUNLINK)

NAK on this part; as I said earlier, don't add things to V4 filesystems.

If the rest of you have compelling reasons to want V4 support, now is
the time to speak up.

>  /* Maximum size of the xfs filesystem label, no terminating NULL */
>  #define XFSLABEL_MAX			12
> @@ -479,7 +481,9 @@ xfs_sb_has_incompat_feature(
>  	return (sbp->sb_features_incompat & feature) != 0;
>  }
>  
> -#define XFS_SB_FEAT_INCOMPAT_LOG_ALL 0
> +#define XFS_SB_FEAT_INCOMPAT_LOG_NEW_IUNLINK	(1 << 0)
> +#define XFS_SB_FEAT_INCOMPAT_LOG_ALL	\
> +		(XFS_SB_FEAT_INCOMPAT_LOG_NEW_IUNLINK)

There's a trick here: Define the feature flag at the very start of your
patchset, then make the last patch in the set add it to the _ALL macro
so that people bisecting their way through the git tree (with this
feature turned on) won't unwittingly build a kernel with the feature
half built and blow their filesystem to pieces.
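
A small userspace sketch of why that ordering protects bisection, assuming the mount path rejects any log-incompat bit outside the "known" mask (in the spirit of XFS_SB_FEAT_INCOMPAT_LOG_UNKNOWN being the complement of the _ALL mask). The `DEMO_`/`demo_` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical feature bit, defined by the first patch in a series. */
#define DEMO_FEAT_LOG_NEW_IUNLINK	(1u << 0)

/*
 * True if the superblock carries a log-incompat bit that this build
 * does not claim to support, in which case it should not touch the log.
 * @known_mask plays the role of the _ALL macro at a given commit.
 */
static inline bool
demo_has_unknown_log_feature(uint32_t sb_features, uint32_t known_mask)
{
	return (sb_features & ~known_mask) != 0;
}
```

At a mid-series commit the known mask is still 0, so a filesystem with the bit set is refused; only the final patch adds the bit to the mask and lets such a filesystem through.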

>  #define XFS_SB_FEAT_INCOMPAT_LOG_UNKNOWN	~XFS_SB_FEAT_INCOMPAT_LOG_ALL
>  static inline bool
>  xfs_sb_has_incompat_log_feature(
> @@ -563,6 +567,27 @@ static inline bool xfs_sb_version_hasreflink(struct xfs_sb *sbp)
>  		(sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_REFLINK);
>  }
>  
> +static inline bool xfs_sb_has_new_iunlink(struct xfs_sb *sbp)
> +{
> +	if (XFS_SB_VERSION_NUM(sbp) == XFS_SB_VERSION_5)
> +		return sbp->sb_features_log_incompat &
> +			XFS_SB_FEAT_INCOMPAT_LOG_NEW_IUNLINK;
> +
> +	return xfs_sb_version_hasmorebits(sbp) &&
> +		(sbp->sb_features2 & XFS_SB_VERSION2_NEW_IUNLINK);
> +}
> +
> +static inline void xfs_sb_add_new_iunlink(struct xfs_sb *sbp)
> +{
> +	if (XFS_SB_VERSION_NUM(sbp) == XFS_SB_VERSION_5) {
> +		sbp->sb_features_log_incompat |=
> +			XFS_SB_FEAT_INCOMPAT_LOG_NEW_IUNLINK;
> +		return;
> +	}
> +	sbp->sb_versionnum |= XFS_SB_VERSION_MOREBITSBIT;
> +	sbp->sb_features2 |= XFS_SB_VERSION2_NEW_IUNLINK;

All metadata updates need to be logged.  Dave just spent a bunch of time
heckling me for that in the y2038 patchset. ;)

Also, I don't think it's a good idea to enable new incompat features
automatically, since this makes the fs unmountable on old kernels.

> +}
> +
>  /*
>   * end of superblock version macros
>   */
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 7ee778bcde06..1656ed7dcadf 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -1952,7 +1952,7 @@ xfs_iunlink_update_bucket(
>  	if (!log || log->l_flags & XLOG_RECOVERY_NEEDED) {
>  		ASSERT(cur_agino != NULLAGINO);
>  
> -		if (be32_to_cpu(agi->agi_unlinked[0]) != cur_agino)
> +		if (!xfs_sb_has_new_iunlink(&mp->m_sb))
>  			bucket_index = cur_agino % XFS_AGI_UNLINKED_BUCKETS;

Oh, is this the one change added by the feature? :)

>  	}
>  
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index f28c969af272..a3b2e3c3d32f 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -836,6 +836,12 @@ xfs_mountfs(
>  		goto out_fail_wait;
>  	}
>  
> +	if (!xfs_sb_has_new_iunlink(sbp)) {
> +		xfs_warn(mp, "will switch to long iunlinked list on r/w");
> +		xfs_sb_add_new_iunlink(sbp);
> +		mp->m_update_sb = true;
> +	}
> +
>  	/* Make sure the summary counts are ok. */
>  	error = xfs_check_summary_counts(mp);
>  	if (error)
> -- 
> 2.18.1
> 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH] xfs: use log_incompat feature instead of speculate matching
  2020-08-25 14:54                     ` Darrick J. Wong
@ 2020-08-25 15:30                       ` Gao Xiang
  2020-08-27  7:19                       ` Christoph Hellwig
  1 sibling, 0 replies; 51+ messages in thread
From: Gao Xiang @ 2020-08-25 15:30 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, Dave Chinner, Christoph Hellwig

Hi Darrick,

On Tue, Aug 25, 2020 at 07:54:58AM -0700, Darrick J. Wong wrote:
> On Tue, Aug 25, 2020 at 06:06:01PM +0800, Gao Xiang wrote:
> > Add a log_incompat (v5) or sb_features2 (v4) feature
> > of a single long iunlinked list just to be safe. Hence,
> > older kernels will refuse to replay log for v5 images
> > or mount entirely for v4 images.
> > 
> > If the current mount is in RO state, it will defer
> > to the next RW (re)mount to add such flag instead.
> 
> This commit log needs to state /why/ we need a new feature flag in
> addition to summarizing what is being added here.  For example,
> 
> "Introduce a new feature flag to collapse the unlinked hash to a single
> bucket.  Doing so removes the need to lock the AGI in addition to the
> previous and next items in the unlinked list.  Older kernels will think
> that inodes are in the wrong unlinked hash bucket and declare the fs
> corrupt, so the new feature is needed to prevent them from touching the
> filesystem."
> 
> (or whatever the real reason is, I'm attending DebConf and LPC and
> wasn't following 100%...)
> 
> Note that the above was a guess, because I actually can't tell if this
> feature is needed to prevent old kernels from tripping over our new
> strategy, or to prevent new kernels from running off the road if an old
> kernel wrote all the hash buckets.  I would've thought both cases would
> be fine...?

To prevent old kernels from tripping over our new strategy.

Images generated by old kernels would be fine.

> 

...

> >  #define	XFS_SB_VERSION2_OKBITS		\
> >  	(XFS_SB_VERSION2_LAZYSBCOUNTBIT	| \
> >  	 XFS_SB_VERSION2_ATTR2BIT	| \
> >  	 XFS_SB_VERSION2_PROJID32BIT	| \
> > -	 XFS_SB_VERSION2_FTYPE)
> > +	 XFS_SB_VERSION2_FTYPE		| \
> > +	 XFS_SB_VERSION2_NEW_IUNLINK)
> 
> NAK on this part; as I said earlier, don't add things to V4 filesystems.
> 
> If the rest of you have compelling reasons to want V4 support, now is
> the time to speak up.

The simple reason is that the current xfs_iunlink() code only generates
the unlinked list in the new way, without multiple buckets. So we must
make a choice for V4, since it's still supported by the current kernel:

 1) add some feature to make older kernels refuse new v4 images entirely;
 2) allow speculate matching, so older kernels would bail out with fs
    corruption (but I have no idea whether that does any harm);

> 
> >  /* Maximum size of the xfs filesystem label, no terminating NULL */
> >  #define XFSLABEL_MAX			12
> > @@ -479,7 +481,9 @@ xfs_sb_has_incompat_feature(
> >  	return (sbp->sb_features_incompat & feature) != 0;
> >  }
> >  
> > -#define XFS_SB_FEAT_INCOMPAT_LOG_ALL 0
> > +#define XFS_SB_FEAT_INCOMPAT_LOG_NEW_IUNLINK	(1 << 0)
> > +#define XFS_SB_FEAT_INCOMPAT_LOG_ALL	\
> > +		(XFS_SB_FEAT_INCOMPAT_LOG_NEW_IUNLINK)
> 
> There's a trick here: Define the feature flag at the very start of your
> patchset, then make the last patch in the set add it to the _ALL macro
> so that people bisecting their way through the git tree (with this
> feature turned on) won't unwittingly build a kernel with the feature
> half built and blow their filesystem to pieces.

Hmm... I don't quite get the point, though.
For this specific patch, I think it'll be folded into some patch or
rearranged.

It should not be a follow-up patch (we must make a decision in advance
 -- whether or not to add this feature).

> 
> >  #define XFS_SB_FEAT_INCOMPAT_LOG_UNKNOWN	~XFS_SB_FEAT_INCOMPAT_LOG_ALL
> >  static inline bool
> >  xfs_sb_has_incompat_log_feature(
> > @@ -563,6 +567,27 @@ static inline bool xfs_sb_version_hasreflink(struct xfs_sb *sbp)
> >  		(sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_REFLINK);
> >  }
> >  
> > +static inline bool xfs_sb_has_new_iunlink(struct xfs_sb *sbp)
> > +{
> > +	if (XFS_SB_VERSION_NUM(sbp) == XFS_SB_VERSION_5)
> > +		return sbp->sb_features_log_incompat &
> > +			XFS_SB_FEAT_INCOMPAT_LOG_NEW_IUNLINK;
> > +
> > +	return xfs_sb_version_hasmorebits(sbp) &&
> > +		(sbp->sb_features2 & XFS_SB_VERSION2_NEW_IUNLINK);
> > +}
> > +
> > +static inline void xfs_sb_add_new_iunlink(struct xfs_sb *sbp)
> > +{
> > +	if (XFS_SB_VERSION_NUM(sbp) == XFS_SB_VERSION_5) {
> > +		sbp->sb_features_log_incompat |=
> > +			XFS_SB_FEAT_INCOMPAT_LOG_NEW_IUNLINK;
> > +		return;
> > +	}
> > +	sbp->sb_versionnum |= XFS_SB_VERSION_MOREBITSBIT;
> > +	sbp->sb_features2 |= XFS_SB_VERSION2_NEW_IUNLINK;
> 
> All metadata updates need to be logged.  Dave just spent a bunch of time
> heckling me for that in the y2038 patchset. ;)

Hmm... xfs_sync_sb() in xfs_mountfs() will generate an sb transaction,
right? I don't see the risk here.

> 
> Also, I don't think it's a good idea to enable new incompat features
> automatically, since this makes the fs unmountable on old kernels.

As I said above, the new xfs_iunlink() doesn't support multiple buckets
anymore (they are only supported for log recovery). So this feature would
be needed.

If supporting the old multiple-bucket xfs_iunlink() is needed, that would be
quite a large modification of this entire patchset.

Thanks,
Gao Xiang



* Re: [PATCH] xfs: use log_incompat feature instead of speculate matching
  2020-08-25 14:54                     ` Darrick J. Wong
  2020-08-25 15:30                       ` Gao Xiang
@ 2020-08-27  7:19                       ` Christoph Hellwig
  1 sibling, 0 replies; 51+ messages in thread
From: Christoph Hellwig @ 2020-08-27  7:19 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Gao Xiang, linux-xfs, Dave Chinner, Christoph Hellwig

On Tue, Aug 25, 2020 at 07:54:58AM -0700, Darrick J. Wong wrote:
> > +	 XFS_SB_VERSION2_FTYPE		| \
> > +	 XFS_SB_VERSION2_NEW_IUNLINK)
> 
> NAK on this part; as I said earlier, don't add things to V4 filesystems.
> 
> If the rest of you have compelling reasons to want V4 support, now is
> the time to speak up.

I think that is because the series uses the longer unhashed chains
unconditionally.  And given that old kernels can't deal with that
at all, I suspect that needs to be changed to support the old buckets
for old file systems.  And once that is done I don't think we should
enable the long single chain without the log item anyway, which removes
another possible combination of flags.


* Re: [PATCH 05/13] xfs: add unlink list pointers to xfs_inode
  2020-08-25  5:17     ` Dave Chinner
@ 2020-08-27  7:21       ` Christoph Hellwig
  0 siblings, 0 replies; 51+ messages in thread
From: Christoph Hellwig @ 2020-08-27  7:21 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, linux-xfs

On Tue, Aug 25, 2020 at 03:17:53PM +1000, Dave Chinner wrote:
> That's precisely the complexity this code gets rid of. i.e. the
> complex reverse pointer mapping rhashtable that had to be able to
> handle rhashtable memory allocation failures and so required
> fallbacks to straight buffer based unlink list walking. I much
> prefer that we burn a little bit more memory for *much* simpler,
> faster and more flexible code....

It's not like we care about memory allocation failures elsewhere.
I suspect an xarray could work pretty well here, but I can look into
that myself once I find a little time.
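
As a rough illustration of the trade-off discussed above, the following
userspace sketch shows why list pointers embedded in each in-memory inode
remove the need for a separate reverse-pointer lookup: the predecessor of an
unlinked inode is one pointer dereference away, and linking or unlinking never
allocates memory. Everything here is a stand-in invented for the sketch
(struct demo_inode, DEMO_NULLAGINO, the demo_* helpers and the minimal
list_head); it is not the code from the patchset.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal userspace stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

static void demo_list_init(struct list_head *h)
{
	h->next = h->prev = h;
}

/* Insert 'entry' right after 'head'; no allocation, so it cannot fail. */
static void demo_list_add(struct list_head *entry, struct list_head *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

/* Unlink 'entry' in O(1) and leave it self-linked. */
static void demo_list_del(struct list_head *entry)
{
	assert(entry->next != entry);	/* entry must currently be on a list */
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	demo_list_init(entry);
}

/* Hypothetical cut-down inode: an AG inode number plus list pointers. */
struct demo_inode {
	unsigned int		agino;
	struct list_head	unlink;
};

#define DEMO_NULLAGINO	(~0u)	/* hypothetical "no inode" sentinel */

#define demo_inode_of(p) \
	((struct demo_inode *)((char *)(p) - offsetof(struct demo_inode, unlink)))

/* agino of the inode preceding 'ip', or DEMO_NULLAGINO if 'ip' is first. */
static unsigned int demo_prev_agino(struct list_head *head,
				    struct demo_inode *ip)
{
	if (ip->unlink.prev == head)
		return DEMO_NULLAGINO;
	return demo_inode_of(ip->unlink.prev)->agino;
}
```

The cost is two pointers per in-memory inode; a hash table or xarray keyed by
agino would avoid that per-inode overhead at the price of a lookup structure
that has to be maintained and whose insertions can fail.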



Thread overview: 51+ messages
2020-08-12  9:25 [PATCH 00/13] xfs: in memory inode unlink log items Dave Chinner
2020-08-12  9:25 ` [PATCH 01/13] xfs: xfs_iflock is no longer a completion Dave Chinner
2020-08-18 23:44   ` Darrick J. Wong
2020-08-22  7:41   ` Christoph Hellwig
2020-08-12  9:25 ` [PATCH 02/13] xfs: add log item precommit operation Dave Chinner
2020-08-22  9:06   ` Christoph Hellwig
2020-08-12  9:25 ` [PATCH 03/13] xfs: factor the xfs_iunlink functions Dave Chinner
2020-08-18 23:49   ` Darrick J. Wong
2020-08-22  7:45   ` Christoph Hellwig
2020-08-12  9:25 ` [PATCH 04/13] xfs: arrange all unlinked inodes into one list Dave Chinner
2020-08-18 23:59   ` Darrick J. Wong
2020-08-19  0:45     ` Dave Chinner
2020-08-19  0:58     ` Gao Xiang
2020-08-22  9:01       ` Christoph Hellwig
2020-08-23 17:24         ` Gao Xiang
2020-08-24  8:19           ` [RFC PATCH] xfs: use log_incompat feature instead of speculate matching Gao Xiang
2020-08-24  8:34             ` Gao Xiang
2020-08-24 15:08               ` Darrick J. Wong
2020-08-24 15:41                 ` Gao Xiang
2020-08-25 10:06                   ` [PATCH] " Gao Xiang
2020-08-25 14:54                     ` Darrick J. Wong
2020-08-25 15:30                       ` Gao Xiang
2020-08-27  7:19                       ` Christoph Hellwig
2020-08-12  9:25 ` [PATCH 05/13] xfs: add unlink list pointers to xfs_inode Dave Chinner
2020-08-19  0:02   ` Darrick J. Wong
2020-08-19  0:47     ` Dave Chinner
2020-08-22  9:03   ` Christoph Hellwig
2020-08-25  5:17     ` Dave Chinner
2020-08-27  7:21       ` Christoph Hellwig
2020-08-12  9:25 ` [PATCH 06/13] xfs: replace iunlink backref lookups with list lookups Dave Chinner
2020-08-19  0:13   ` Darrick J. Wong
2020-08-19  0:52     ` Dave Chinner
2020-08-12  9:25 ` [PATCH 07/13] xfs: mapping unlinked inodes is now redundant Dave Chinner
2020-08-19  0:14   ` Darrick J. Wong
2020-08-12  9:25 ` [PATCH 08/13] xfs: updating i_next_unlinked doesn't need to return old value Dave Chinner
2020-08-19  0:19   ` Darrick J. Wong
2020-08-12  9:25 ` [PATCH 09/13] xfs: validate the unlinked list pointer on update Dave Chinner
2020-08-19  0:23   ` Darrick J. Wong
2020-08-12  9:25 ` [PATCH 10/13] xfs: re-order AGI updates in unlink list updates Dave Chinner
2020-08-19  0:29   ` Darrick J. Wong
2020-08-19  1:01     ` Dave Chinner
2020-08-12  9:25 ` [PATCH 11/13] xfs: combine iunlink inode update functions Dave Chinner
2020-08-19  0:30   ` Darrick J. Wong
2020-08-12  9:25 ` [PATCH 12/13] xfs: add in-memory iunlink log item Dave Chinner
2020-08-19  0:35   ` Darrick J. Wong
2020-08-12  9:25 ` [PATCH 13/13] xfs: reorder iunlink remove operation in xfs_ifree Dave Chinner
2020-08-12 11:12   ` Gao Xiang
2020-08-19  0:45   ` Darrick J. Wong
2020-08-18 18:17 ` [PATCH 00/13] xfs: in memory inode unlink log items Darrick J. Wong
2020-08-18 20:01   ` Gao Xiang
2020-08-18 21:42   ` Dave Chinner
