* [PATCH RFC 0/4] xfs: track and skip realloc of busy inodes
@ 2022-02-17 17:25 Brian Foster
  2022-02-17 17:25 ` [PATCH RFC 1/4] xfs: require an rcu grace period before inode recycle Brian Foster
                   ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: Brian Foster @ 2022-02-17 17:25 UTC (permalink / raw)
  To: linux-xfs

Hi all,

This RFC cleans up some of the previous experimentation I was doing and
turns it into more of a usable prototype. The first couple of patches
are pretty straightforward. Patch 1 is a variant of the previously
posted patch to stamp inodes with a grace period at destroy time. Patch
2 tags inodes with still pending grace periods as busy in the radix
tree. Patch 3 is a quick hack to allow the inode selection algorithms to
fall back to chunk allocation and retry. Patch 4 updates the finobt
selection algorithms to filter records with busy inodes and fall back to
chunk allocation until a usable record is found.

The current status of this work is that it should be functionally
effective at preventing allocation of busy inodes. This can be measured
by the absence of RCU stalls, which are also identified by the
tracepoint added in patch 1 (that path should probably assert or warn
in the new inode allocation recycle case, since the goal is for it to
never happen). Performance is significantly improved over earlier tests
with only patch 1 applied, but is still reduced from mainline. Some
reduction is expected because mainline reuses inodes aggressively (and
unsafely); the goal is therefore an implementation that keeps
performance as close as possible to that of pure inode allocation.

I expect that the allocation algorithm can be adapted further to provide
incremental improvements in performance. The first and most obvious step
is to defer freeing of inode chunks to the background to mitigate
repetitive finobt search failure -> new chunk allocation retry sequences
that may be seen with the current prototype. Other improvements may be
possible to make the search algorithm itself more effective. I'm sending
this as a prototype for early feedback and thoughts on approach,
prospective improvements, etc. Thoughts, reviews, flames appreciated.

Brian

Brian Foster (4):
  xfs: require an rcu grace period before inode recycle
  xfs: tag reclaimable inodes with pending RCU grace periods as busy
  xfs: crude chunk allocation retry mechanism
  xfs: skip busy inodes on finobt inode allocation

 fs/xfs/libxfs/xfs_ialloc.c | 99 +++++++++++++++++++++++++++++++++++---
 fs/xfs/xfs_icache.c        | 55 +++++++++++++++++----
 fs/xfs/xfs_inode.h         |  3 +-
 fs/xfs/xfs_trace.h         |  8 ++-
 4 files changed, 146 insertions(+), 19 deletions(-)

-- 
2.31.1



* [PATCH RFC 1/4] xfs: require an rcu grace period before inode recycle
  2022-02-17 17:25 [PATCH RFC 0/4] xfs: track and skip realloc of busy inodes Brian Foster
@ 2022-02-17 17:25 ` Brian Foster
  2022-02-17 17:25 ` [PATCH RFC 2/4] xfs: tag reclaimable inodes with pending RCU grace periods as busy Brian Foster
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 15+ messages in thread
From: Brian Foster @ 2022-02-17 17:25 UTC (permalink / raw)
  To: linux-xfs

The XFS inode allocation algorithm aggressively reuses recently
freed inodes. This is historical behavior that has been in place for
quite some time, since XFS was imported to mainline Linux. Once the
VFS adopted RCUwalk path lookups (also some time ago), this behavior
became slightly incompatible because the inode recycle path doesn't
isolate concurrent access to the inode from the VFS.

This has recently manifested as problems in the VFS when XFS happens
to change the type or properties of a recently unlinked inode while
still involved in an RCU lookup. For example, if the VFS refers to a
previous incarnation of a symlink inode, obtains the ->get_link()
callback from inode_operations, and the latter happens to change to
a non-symlink type via a recycle event, the ->get_link() callback
pointer is reset to NULL and the lookup results in a crash.

To avoid this class of problem, isolate in-core inodes for recycling
with an RCU grace period. This is the same level of protection the
VFS expects for inactivated inodes that are never reused, and so
guarantees no further concurrent access before the type or
properties of the inode change. We don't want an unconditional
synchronize_rcu() event here because that would result in a
significant performance impact to mixed inode allocation workloads.

Fortunately, we can take advantage of the recently added deferred
inactivation mechanism to mitigate the need for an RCU wait in some
cases. Deferred inactivation queues and batches the on-disk freeing
of recently destroyed inodes, and so for non-sustained inactivation
workloads increases the likelihood that a grace period has elapsed
by the time an inode is freed and observable by the allocation code
as a recycle candidate. We have to respect these lifecycle rules
regardless of whether an inode was inactivated, because inode reinit
modifies fields that concurrent lookups may still access. Therefore,
capture the current RCU
grace period cookie as early as possible at destroy time and use it
at lookup time to conditionally sync an RCU grace period if one
hadn't elapsed before the recycle attempt. Slightly adjust struct
xfs_inode to fit the new field into padding holes that conveniently
preexist in the same cacheline as the deferred inactivation list.
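
For reference, the underlying pattern is the kernel's polled grace
period API (a minimal sketch of the pattern only; the patch stores the
cookie in the new i_destroy_gp field):

	unsigned long	gp_cookie;

	/* destroy time: snapshot the current RCU grace period state */
	gp_cookie = start_poll_synchronize_rcu();

	/* recycle time: wait only if that grace period hasn't elapsed yet */
	if (!poll_state_synchronize_rcu(gp_cookie))
		cond_synchronize_rcu(gp_cookie);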

Note that this patch alone introduces a significant negative impact
to mixed file allocation and removal workloads because the
allocation algorithm aggressively attempts to reuse recently freed
inodes. This results in frequent RCU grace period synchronization
stalls in the allocation path. This problem is mitigated by
forthcoming patches to track and avoid recycling of inodes with
pending RCU grace periods.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_icache.c | 37 ++++++++++++++++++++++++++++++++-----
 fs/xfs/xfs_inode.h  |  3 ++-
 fs/xfs/xfs_trace.h  |  8 ++++++--
 3 files changed, 40 insertions(+), 8 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 9644f938990c..693896bc690f 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -351,6 +351,19 @@ xfs_iget_recycle(
 	spin_unlock(&ip->i_flags_lock);
 	rcu_read_unlock();
 
+	/*
+	 * VFS RCU pathwalk lookups dictate the same lifecycle rules for an
+	 * inode recycle as for freeing an inode. I.e., we cannot reinit the
+	 * inode structure until a grace period has elapsed from the last
+	 * ->destroy_inode() call. In most cases a grace period has already
+	 * elapsed if the inode was inactivated, but synchronize here as a
+	 * last resort to guarantee correctness.
+	 */
+	if (!poll_state_synchronize_rcu(ip->i_destroy_gp)) {
+		cond_synchronize_rcu(ip->i_destroy_gp);
+		trace_xfs_iget_recycle_stall(ip);
+	}
+
 	ASSERT(!rwsem_is_locked(&inode->i_rwsem));
 	error = xfs_reinit_inode(mp, inode);
 	if (error) {
@@ -1789,7 +1802,8 @@ xfs_check_delalloc(
 /* Schedule the inode for reclaim. */
 static void
 xfs_inodegc_set_reclaimable(
-	struct xfs_inode	*ip)
+	struct xfs_inode	*ip,
+	unsigned long		destroy_gp)
 {
 	struct xfs_mount	*mp = ip->i_mount;
 	struct xfs_perag	*pag;
@@ -1805,6 +1819,8 @@ xfs_inodegc_set_reclaimable(
 	spin_lock(&ip->i_flags_lock);
 
 	trace_xfs_inode_set_reclaimable(ip);
+	if (destroy_gp)
+		ip->i_destroy_gp = destroy_gp;
 	ip->i_flags &= ~(XFS_NEED_INACTIVE | XFS_INACTIVATING);
 	ip->i_flags |= XFS_IRECLAIMABLE;
 	xfs_perag_set_inode_tag(pag, XFS_INO_TO_AGINO(mp, ip->i_ino),
@@ -1826,7 +1842,8 @@ xfs_inodegc_inactivate(
 {
 	trace_xfs_inode_inactivating(ip);
 	xfs_inactive(ip);
-	xfs_inodegc_set_reclaimable(ip);
+	/* inactive inodes are assigned rcu state when first queued */
+	xfs_inodegc_set_reclaimable(ip, 0);
 }
 
 void
@@ -1997,7 +2014,8 @@ xfs_inodegc_want_flush_work(
  */
 static void
 xfs_inodegc_queue(
-	struct xfs_inode	*ip)
+	struct xfs_inode	*ip,
+	unsigned long		destroy_gp)
 {
 	struct xfs_mount	*mp = ip->i_mount;
 	struct xfs_inodegc	*gc;
@@ -2007,6 +2025,7 @@ xfs_inodegc_queue(
 	trace_xfs_inode_set_need_inactive(ip);
 	spin_lock(&ip->i_flags_lock);
 	ip->i_flags |= XFS_NEED_INACTIVE;
+	ip->i_destroy_gp = destroy_gp;
 	spin_unlock(&ip->i_flags_lock);
 
 	gc = get_cpu_ptr(mp->m_inodegc);
@@ -2086,6 +2105,7 @@ xfs_inode_mark_reclaimable(
 {
 	struct xfs_mount	*mp = ip->i_mount;
 	bool			need_inactive;
+	unsigned long		destroy_gp;
 
 	XFS_STATS_INC(mp, vn_reclaim);
 
@@ -2094,15 +2114,22 @@ xfs_inode_mark_reclaimable(
 	 */
 	ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_ALL_IRECLAIM_FLAGS));
 
+	/*
+	 * Poll the RCU subsystem as early as possible and pass the grace
+	 * period cookie along to assign to the inode. This grace period must
+	 * expire before the struct inode can be recycled.
+	 */
+	destroy_gp = start_poll_synchronize_rcu();
+
 	need_inactive = xfs_inode_needs_inactive(ip);
 	if (need_inactive) {
-		xfs_inodegc_queue(ip);
+		xfs_inodegc_queue(ip, destroy_gp);
 		return;
 	}
 
 	/* Going straight to reclaim, so drop the dquots. */
 	xfs_qm_dqdetach(ip);
-	xfs_inodegc_set_reclaimable(ip);
+	xfs_inodegc_set_reclaimable(ip, destroy_gp);
 }
 
 /*
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index b7e8f14d9fca..6ca60373ff58 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -40,8 +40,9 @@ typedef struct xfs_inode {
 	/* Transaction and locking information. */
 	struct xfs_inode_log_item *i_itemp;	/* logging information */
 	mrlock_t		i_lock;		/* inode lock */
-	atomic_t		i_pincount;	/* inode pin count */
 	struct llist_node	i_gclist;	/* deferred inactivation list */
+	unsigned long		i_destroy_gp;	/* destroy rcugp cookie */
+	atomic_t		i_pincount;	/* inode pin count */
 
 	/*
 	 * Bitsets of inode metadata that have been checked and/or are sick.
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 4a8076ef8cb4..28ac861c3565 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -727,16 +727,19 @@ DECLARE_EVENT_CLASS(xfs_inode_class,
 		__field(dev_t, dev)
 		__field(xfs_ino_t, ino)
 		__field(unsigned long, iflags)
+		__field(unsigned long, destroy_gp)
 	),
 	TP_fast_assign(
 		__entry->dev = VFS_I(ip)->i_sb->s_dev;
 		__entry->ino = ip->i_ino;
 		__entry->iflags = ip->i_flags;
+		__entry->destroy_gp = ip->i_destroy_gp;
 	),
-	TP_printk("dev %d:%d ino 0x%llx iflags 0x%lx",
+	TP_printk("dev %d:%d ino 0x%llx iflags 0x%lx destroy_gp 0x%lx",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __entry->ino,
-		  __entry->iflags)
+		  __entry->iflags,
+		  __entry->destroy_gp)
 )
 
 #define DEFINE_INODE_EVENT(name) \
@@ -746,6 +749,7 @@ DEFINE_EVENT(xfs_inode_class, name, \
 DEFINE_INODE_EVENT(xfs_iget_skip);
 DEFINE_INODE_EVENT(xfs_iget_recycle);
 DEFINE_INODE_EVENT(xfs_iget_recycle_fail);
+DEFINE_INODE_EVENT(xfs_iget_recycle_stall);
 DEFINE_INODE_EVENT(xfs_iget_hit);
 DEFINE_INODE_EVENT(xfs_iget_miss);
 
-- 
2.31.1



* [PATCH RFC 2/4] xfs: tag reclaimable inodes with pending RCU grace periods as busy
  2022-02-17 17:25 [PATCH RFC 0/4] xfs: track and skip realloc of busy inodes Brian Foster
  2022-02-17 17:25 ` [PATCH RFC 1/4] xfs: require an rcu grace period before inode recycle Brian Foster
@ 2022-02-17 17:25 ` Brian Foster
  2022-02-17 23:16   ` Dave Chinner
  2022-02-17 17:25 ` [PATCH RFC 3/4] xfs: crude chunk allocation retry mechanism Brian Foster
  2022-02-17 17:25 ` [PATCH RFC 4/4] xfs: skip busy inodes on finobt inode allocation Brian Foster
  3 siblings, 1 reply; 15+ messages in thread
From: Brian Foster @ 2022-02-17 17:25 UTC (permalink / raw)
  To: linux-xfs

In order to avoid aggressive recycling of in-core xfs_inode objects with
pending grace periods and the subsequent RCU sync stalls involved with
recycling, we must be able to identify them quickly and reliably at
allocation time. Claim a new tag for the in-core inode radix tree and
tag all inodes with a still pending grace period cookie as busy at the
time they are made reclaimable.

Note that it is not necessary to maintain consistency between the tag
and grace period status once the tag is set. The busy tag simply
reflects that the grace period had not expired by the time the inode was
set reclaimable and therefore any reuse of the inode must first poll the
RCU subsystem for subsequent expiration of the grace period. Clear the
tag when the inode is recycled or reclaimed.
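
The tag itself is managed with the normal radix tree tag API on the
per-AG inode cache tree. Roughly (a sketch of the idea only; the patch
below goes through the existing xfs_perag_set/clear_inode_tag() helpers
rather than calling the radix tree directly):

	/*
	 * Inode made reclaimable with a still-pending grace period
	 * (called under pag->pag_ici_lock and ip->i_flags_lock).
	 */
	if (!poll_state_synchronize_rcu(ip->i_destroy_gp))
		radix_tree_tag_set(&pag->pag_ici_root, agino, XFS_ICI_BUSY_TAG);

	/* cleared again when the inode is recycled or reclaimed */
	radix_tree_tag_clear(&pag->pag_ici_root, agino, XFS_ICI_BUSY_TAG);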

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_icache.c | 18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 693896bc690f..245ee0f6670b 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -32,6 +32,8 @@
 #define XFS_ICI_RECLAIM_TAG	0
 /* Inode has speculative preallocations (posteof or cow) to clean. */
 #define XFS_ICI_BLOCKGC_TAG	1
+/* inode has pending RCU grace period when set reclaimable  */
+#define XFS_ICI_BUSY_TAG	2
 
 /*
  * The goal for walking incore inodes.  These can correspond with incore inode
@@ -274,7 +276,7 @@ xfs_perag_clear_inode_tag(
 	if (agino != NULLAGINO)
 		radix_tree_tag_clear(&pag->pag_ici_root, agino, tag);
 	else
-		ASSERT(tag == XFS_ICI_RECLAIM_TAG);
+		ASSERT(tag == XFS_ICI_RECLAIM_TAG || tag == XFS_ICI_BUSY_TAG);
 
 	if (tag == XFS_ICI_RECLAIM_TAG)
 		pag->pag_ici_reclaimable--;
@@ -336,6 +338,7 @@ xfs_iget_recycle(
 {
 	struct xfs_mount	*mp = ip->i_mount;
 	struct inode		*inode = VFS_I(ip);
+	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
 	int			error;
 
 	trace_xfs_iget_recycle(ip);
@@ -392,8 +395,9 @@ xfs_iget_recycle(
 	 */
 	ip->i_flags &= ~XFS_IRECLAIM_RESET_FLAGS;
 	ip->i_flags |= XFS_INEW;
-	xfs_perag_clear_inode_tag(pag, XFS_INO_TO_AGINO(mp, ip->i_ino),
-			XFS_ICI_RECLAIM_TAG);
+
+	xfs_perag_clear_inode_tag(pag, agino, XFS_ICI_BUSY_TAG);
+	xfs_perag_clear_inode_tag(pag, agino, XFS_ICI_RECLAIM_TAG);
 	inode->i_state = I_NEW;
 	spin_unlock(&ip->i_flags_lock);
 	spin_unlock(&pag->pag_ici_lock);
@@ -931,6 +935,7 @@ xfs_reclaim_inode(
 	if (!radix_tree_delete(&pag->pag_ici_root,
 				XFS_INO_TO_AGINO(ip->i_mount, ino)))
 		ASSERT(0);
+	xfs_perag_clear_inode_tag(pag, NULLAGINO, XFS_ICI_BUSY_TAG);
 	xfs_perag_clear_inode_tag(pag, NULLAGINO, XFS_ICI_RECLAIM_TAG);
 	spin_unlock(&pag->pag_ici_lock);
 
@@ -1807,6 +1812,7 @@ xfs_inodegc_set_reclaimable(
 {
 	struct xfs_mount	*mp = ip->i_mount;
 	struct xfs_perag	*pag;
+	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
 
 	if (!xfs_is_shutdown(mp) && ip->i_delayed_blks) {
 		xfs_check_delalloc(ip, XFS_DATA_FORK);
@@ -1821,10 +1827,12 @@ xfs_inodegc_set_reclaimable(
 	trace_xfs_inode_set_reclaimable(ip);
 	if (destroy_gp)
 		ip->i_destroy_gp = destroy_gp;
+	if (!poll_state_synchronize_rcu(ip->i_destroy_gp))
+		xfs_perag_set_inode_tag(pag, agino, XFS_ICI_BUSY_TAG);
+
 	ip->i_flags &= ~(XFS_NEED_INACTIVE | XFS_INACTIVATING);
 	ip->i_flags |= XFS_IRECLAIMABLE;
-	xfs_perag_set_inode_tag(pag, XFS_INO_TO_AGINO(mp, ip->i_ino),
-			XFS_ICI_RECLAIM_TAG);
+	xfs_perag_set_inode_tag(pag, agino, XFS_ICI_RECLAIM_TAG);
 
 	spin_unlock(&ip->i_flags_lock);
 	spin_unlock(&pag->pag_ici_lock);
-- 
2.31.1



* [PATCH RFC 3/4] xfs: crude chunk allocation retry mechanism
  2022-02-17 17:25 [PATCH RFC 0/4] xfs: track and skip realloc of busy inodes Brian Foster
  2022-02-17 17:25 ` [PATCH RFC 1/4] xfs: require an rcu grace period before inode recycle Brian Foster
  2022-02-17 17:25 ` [PATCH RFC 2/4] xfs: tag reclaimable inodes with pending RCU grace periods as busy Brian Foster
@ 2022-02-17 17:25 ` Brian Foster
  2022-02-17 23:20   ` Dave Chinner
  2022-02-17 17:25 ` [PATCH RFC 4/4] xfs: skip busy inodes on finobt inode allocation Brian Foster
  3 siblings, 1 reply; 15+ messages in thread
From: Brian Foster @ 2022-02-17 17:25 UTC (permalink / raw)
  To: linux-xfs

The free inode btree currently tracks all inode chunk records with
at least one free inode. This simplifies the chunk and allocation
selection algorithms as free inode availability can be guaranteed
after a few simple checks. This is no longer the case with busy
inode avoidance, however, because busy inode state is tracked in the
radix tree independent from physical allocation status.

A busy inode avoidance algorithm relies on the ability to fall back
to an inode chunk allocation one way or another in the event that
all current free inodes are busy. Hack in a crude allocation
fallback mechanism for experimental purposes. If the inode selection
algorithm is unable to locate a usable inode, allow it to return
-EAGAIN to perform another physical chunk allocation in the AG and
retry the inode allocation.

The current prototype can perform this allocation and retry sequence
repeatedly because a newly allocated chunk may still be covered by
busy in-core inodes in the radix tree (if it were recently freed,
for example). This is inefficient and temporary. It will be properly
mitigated by background chunk removal. This defers freeing of inode
chunk blocks from the free of the last used inode in the chunk to a
background task that only frees chunks once completely idle, thereby
providing a guarantee that a new chunk allocation always adds
non-busy inodes to the AG.

Not-Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_ialloc.c | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index b418fe0c0679..3eb41228e210 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -27,6 +27,7 @@
 #include "xfs_log.h"
 #include "xfs_rmap.h"
 #include "xfs_ag.h"
+#include "xfs_trans_space.h"
 
 /*
  * Lookup a record by ino in the btree given by cur.
@@ -1689,6 +1690,7 @@ xfs_dialloc_try_ag(
 			goto out_release;
 		}
 
+alloc:
 		error = xfs_ialloc_ag_alloc(*tpp, agbp, pag);
 		if (error < 0)
 			goto out_release;
@@ -1706,8 +1708,22 @@ xfs_dialloc_try_ag(
 
 	/* Allocate an inode in the found AG */
 	error = xfs_dialloc_ag(*tpp, agbp, pag, parent, &ino);
-	if (!error)
+	if (!error) {
 		*new_ino = ino;
+	} else if (error == -EAGAIN) {
+		/*
+		 * XXX: Temporary hack to retry allocs until background chunk
+		 * freeing is worked out.
+		 */
+		error = xfs_mod_fdblocks(pag->pag_mount,
+			       -((int64_t)XFS_IALLOC_SPACE_RES(pag->pag_mount)),
+			       false);
+		if (error)
+			goto out_release;
+		(*tpp)->t_blk_res += XFS_IALLOC_SPACE_RES(pag->pag_mount);
+		goto alloc;
+	}
+
 	return error;
 
 out_release:
-- 
2.31.1



* [PATCH RFC 4/4] xfs: skip busy inodes on finobt inode allocation
  2022-02-17 17:25 [PATCH RFC 0/4] xfs: track and skip realloc of busy inodes Brian Foster
                   ` (2 preceding siblings ...)
  2022-02-17 17:25 ` [PATCH RFC 3/4] xfs: crude chunk allocation retry mechanism Brian Foster
@ 2022-02-17 17:25 ` Brian Foster
  3 siblings, 0 replies; 15+ messages in thread
From: Brian Foster @ 2022-02-17 17:25 UTC (permalink / raw)
  To: linux-xfs

Experimental algorithm to skip busy inodes on finobt-based inode
allocation. This is a first-pass implementation intended primarily
for functional and performance experimentation. The logic is
intentionally kept simple to minimize the scope of changes required
to demonstrate functionality.

The existing finobt inode allocation algorithms are updated to
filter out records that are covered by at least one still-busy
inode[1] and continue scanning as appropriate based on the
allocation mode. For example, near allocation mode continues
scanning left and right until a usable record is found. newino mode
checks the target record and then scans from the first record in the
tree until a usable record is found. If the associated algorithm
cannot find a usable record, it falls back to new chunk allocation
to add non-busy inodes to the selection pool and restarts.

[1] As I write this, it occurs to me this logic could be further
improved to compare the first busy inode against the first free
inode in the record without disrupting the subsequent inode
selection logic.
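
As a rough sketch, the refinement could look something like this
(hypothetical, not implemented here; 'first_busy' stands for the lowest
busy agino found by the gang tag lookup for this record):

	/*
	 * The finobt allocation path picks the first free inode in the
	 * selected record, so the record only needs to be skipped if that
	 * particular inode is still busy.
	 */
	xfs_agino_t	first_free = rec->ir_startino +
				     xfs_lowbit64(rec->ir_free);

	if (first_busy > first_free)
		return false;	/* usable: the inode we'd pick is not busy */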

Not-Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_ialloc.c | 81 +++++++++++++++++++++++++++++++++++---
 1 file changed, 76 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 3eb41228e210..c79c85327cf4 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -1248,12 +1248,53 @@ xfs_dialloc_ag_inobt(
 	return error;
 }
 
+#define XFS_LOOKUP_BATCH	XFS_INODES_PER_CHUNK
+#define XFS_ICI_BUSY_TAG	2
+STATIC bool
+xfs_dialloc_ag_inobt_rec_busy(
+	struct xfs_perag		*pag,
+	struct xfs_inobt_rec_incore	*rec)
+{
+	struct xfs_inode		*batch[XFS_LOOKUP_BATCH];
+	struct xfs_inode		*ip;
+	int				found, i;
+	xfs_agino_t			startino = rec->ir_startino;
+	bool				busy = false;
+	unsigned long			destroy_gp;
+
+	rcu_read_lock();
+	found = radix_tree_gang_lookup_tag(&pag->pag_ici_root, (void **) batch,
+					   startino, XFS_LOOKUP_BATCH,
+					   XFS_ICI_BUSY_TAG);
+	for (i = 0; i < found; i++) {
+		ip = batch[i];
+		spin_lock(&ip->i_flags_lock);
+		if (ip->i_ino >= startino + XFS_INODES_PER_CHUNK) {
+			spin_unlock(&ip->i_flags_lock);
+			break;
+		}
+		destroy_gp = ip->i_destroy_gp;
+		spin_unlock(&ip->i_flags_lock);
+
+		if (!poll_state_synchronize_rcu(destroy_gp)) {
+			busy = true;
+			break;
+		}
+	}
+	rcu_read_unlock();
+	trace_printk("%d: agno %d startino 0x%x found %d busy %d caller %pS\n",
+		     __LINE__, pag->pag_agno, startino, found, busy,
+		     (void *) _RET_IP_);
+	return busy;
+}
+
 /*
  * Use the free inode btree to allocate an inode based on distance from the
  * parent. Note that the provided cursor may be deleted and replaced.
  */
 STATIC int
 xfs_dialloc_ag_finobt_near(
+	struct xfs_perag		*pag,
 	xfs_agino_t			pagino,
 	struct xfs_btree_cur		**ocur,
 	struct xfs_inobt_rec_incore	*rec)
@@ -1281,8 +1322,10 @@ xfs_dialloc_ag_finobt_near(
 		 * existence is enough.
 		 */
 		if (pagino >= rec->ir_startino &&
-		    pagino < (rec->ir_startino + XFS_INODES_PER_CHUNK))
-			return 0;
+		    pagino < (rec->ir_startino + XFS_INODES_PER_CHUNK)) {
+			if (!xfs_dialloc_ag_inobt_rec_busy(pag, rec))
+				return 0;
+		}
 	}
 
 	error = xfs_btree_dup_cursor(lcur, &rcur);
@@ -1306,6 +1349,21 @@ xfs_dialloc_ag_finobt_near(
 		error = -EFSCORRUPTED;
 		goto error_rcur;
 	}
+
+	while (i == 1 && xfs_dialloc_ag_inobt_rec_busy(pag, rec)) {
+		error = xfs_ialloc_next_rec(lcur, rec, &i, 1);
+		if (error)
+			goto error_rcur;
+		i = !i;
+	}
+
+	while (j == 1 && xfs_dialloc_ag_inobt_rec_busy(pag, &rrec)) {
+		error = xfs_ialloc_next_rec(rcur, &rrec, &j, 0);
+		if (error)
+			goto error_rcur;
+		j = !j;
+	}
+
 	if (i == 1 && j == 1) {
 		/*
 		 * Both the left and right records are valid. Choose the closer
@@ -1327,6 +1385,9 @@ xfs_dialloc_ag_finobt_near(
 	} else if (i == 1) {
 		/* only the left record is valid */
 		xfs_btree_del_cursor(rcur, XFS_BTREE_NOERROR);
+	} else {
+		error = -EAGAIN;
+		goto error_rcur;
 	}
 
 	return 0;
@@ -1342,6 +1403,7 @@ xfs_dialloc_ag_finobt_near(
  */
 STATIC int
 xfs_dialloc_ag_finobt_newino(
+	struct xfs_perag		*pag,
 	struct xfs_agi			*agi,
 	struct xfs_btree_cur		*cur,
 	struct xfs_inobt_rec_incore	*rec)
@@ -1360,7 +1422,8 @@ xfs_dialloc_ag_finobt_newino(
 				return error;
 			if (XFS_IS_CORRUPT(cur->bc_mp, i != 1))
 				return -EFSCORRUPTED;
-			return 0;
+			if (!xfs_dialloc_ag_inobt_rec_busy(pag, rec))
+				return 0;
 		}
 	}
 
@@ -1379,6 +1442,14 @@ xfs_dialloc_ag_finobt_newino(
 	if (XFS_IS_CORRUPT(cur->bc_mp, i != 1))
 		return -EFSCORRUPTED;
 
+	while (xfs_dialloc_ag_inobt_rec_busy(pag, rec)) {
+		error = xfs_ialloc_next_rec(cur, rec, &i, 0);
+		if (error)
+			return error;
+		if (i == 1)
+			return -EAGAIN;
+	}
+
 	return 0;
 }
 
@@ -1470,9 +1541,9 @@ xfs_dialloc_ag(
 	 * not, consider the agi hint or find the first free inode in the AG.
 	 */
 	if (pag->pag_agno == pagno)
-		error = xfs_dialloc_ag_finobt_near(pagino, &cur, &rec);
+		error = xfs_dialloc_ag_finobt_near(pag, pagino, &cur, &rec);
 	else
-		error = xfs_dialloc_ag_finobt_newino(agi, cur, &rec);
+		error = xfs_dialloc_ag_finobt_newino(pag, agi, cur, &rec);
 	if (error)
 		goto error_cur;
 
-- 
2.31.1



* Re: [PATCH RFC 2/4] xfs: tag reclaimable inodes with pending RCU grace periods as busy
  2022-02-17 17:25 ` [PATCH RFC 2/4] xfs: tag reclaimable inodes with pending RCU grace periods as busy Brian Foster
@ 2022-02-17 23:16   ` Dave Chinner
  2022-02-18 14:19     ` Brian Foster
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2022-02-17 23:16 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, Feb 17, 2022 at 12:25:16PM -0500, Brian Foster wrote:
> In order to avoid aggressive recycling of in-core xfs_inode objects with
> pending grace periods and the subsequent RCU sync stalls involved with
> recycling, we must be able to identify them quickly and reliably at
> allocation time. Claim a new tag for the in-core inode radix tree and
> tag all inodes with a still pending grace period cookie as busy at the
> time they are made reclaimable.
> 
> Note that it is not necessary to maintain consistency between the tag
> and grace period status once the tag is set. The busy tag simply
> reflects that the grace period had not expired by the time the inode was
> set reclaimable and therefore any reuse of the inode must first poll the
> RCU subsystem for subsequent expiration of the grace period. Clear the
> tag when the inode is recycled or reclaimed.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/xfs_icache.c | 18 +++++++++++++-----
>  1 file changed, 13 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 693896bc690f..245ee0f6670b 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -32,6 +32,8 @@
>  #define XFS_ICI_RECLAIM_TAG	0
>  /* Inode has speculative preallocations (posteof or cow) to clean. */
>  #define XFS_ICI_BLOCKGC_TAG	1
> +/* inode has pending RCU grace period when set reclaimable  */
> +#define XFS_ICI_BUSY_TAG	2
>  
>  /*
>   * The goal for walking incore inodes.  These can correspond with incore inode
> @@ -274,7 +276,7 @@ xfs_perag_clear_inode_tag(
>  	if (agino != NULLAGINO)
>  		radix_tree_tag_clear(&pag->pag_ici_root, agino, tag);
>  	else
> -		ASSERT(tag == XFS_ICI_RECLAIM_TAG);
> +		ASSERT(tag == XFS_ICI_RECLAIM_TAG || tag == XFS_ICI_BUSY_TAG);
>  
>  	if (tag == XFS_ICI_RECLAIM_TAG)
>  		pag->pag_ici_reclaimable--;
> @@ -336,6 +338,7 @@ xfs_iget_recycle(
>  {
>  	struct xfs_mount	*mp = ip->i_mount;
>  	struct inode		*inode = VFS_I(ip);
> +	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
>  	int			error;
>  
>  	trace_xfs_iget_recycle(ip);
> @@ -392,8 +395,9 @@ xfs_iget_recycle(
>  	 */
>  	ip->i_flags &= ~XFS_IRECLAIM_RESET_FLAGS;
>  	ip->i_flags |= XFS_INEW;
> -	xfs_perag_clear_inode_tag(pag, XFS_INO_TO_AGINO(mp, ip->i_ino),
> -			XFS_ICI_RECLAIM_TAG);
> +
> +	xfs_perag_clear_inode_tag(pag, agino, XFS_ICI_BUSY_TAG);
> +	xfs_perag_clear_inode_tag(pag, agino, XFS_ICI_RECLAIM_TAG);

This doesn't need any of the global mp->m_perag_tree tracking or
anything else to be tracked or queued - it's just a single state bit
that is associated with the inode and nothing more. The inode
allocation side of things is already per-ag, so it can check the
perag icache tree directly and hence there's no need for global
perag tree tracking.

Hence this could just be:

+	radix_tree_tag_clear(&pag->pag_ici_root, agino, XFS_ICI_BUSY_TAG);


>  	inode->i_state = I_NEW;
>  	spin_unlock(&ip->i_flags_lock);
>  	spin_unlock(&pag->pag_ici_lock);
> @@ -931,6 +935,7 @@ xfs_reclaim_inode(
>  	if (!radix_tree_delete(&pag->pag_ici_root,
>  				XFS_INO_TO_AGINO(ip->i_mount, ino)))
>  		ASSERT(0);
> +	xfs_perag_clear_inode_tag(pag, NULLAGINO, XFS_ICI_BUSY_TAG);
>  	xfs_perag_clear_inode_tag(pag, NULLAGINO, XFS_ICI_RECLAIM_TAG);
>  	spin_unlock(&pag->pag_ici_lock);

This becomes unnecessary, because radix_tree_delete() clears the
tags for the slot where the entry being deleted is found.

>  
> @@ -1807,6 +1812,7 @@ xfs_inodegc_set_reclaimable(
>  {
>  	struct xfs_mount	*mp = ip->i_mount;
>  	struct xfs_perag	*pag;
> +	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
>  
>  	if (!xfs_is_shutdown(mp) && ip->i_delayed_blks) {
>  		xfs_check_delalloc(ip, XFS_DATA_FORK);
> @@ -1821,10 +1827,12 @@ xfs_inodegc_set_reclaimable(
>  	trace_xfs_inode_set_reclaimable(ip);
>  	if (destroy_gp)
>  		ip->i_destroy_gp = destroy_gp;
> +	if (!poll_state_synchronize_rcu(ip->i_destroy_gp))
> +		xfs_perag_set_inode_tag(pag, agino, XFS_ICI_BUSY_TAG);

And this becomes:

+	if (!poll_state_synchronize_rcu(ip->i_destroy_gp))
+		radix_tree_tag_set(&pag->pag_ici_root, agino, XFS_ICI_BUSY_TAG);

---

FWIW, I have most of the inode cache life-cycle rework prototyped.
All the unlink stuff is done - unlinked inodes are freed directly
in ->destroy_inode now which gets callled on demand when the inodes
are marked clean or XFS_ISTALE cluster buffers are committed. Hence
they don't even go through an IRECLAIMABLE state anymore. I'm
currently working on the same thing for inodes that needed EOF block
trimming, and once that is done the whole XFS_IRECLAIMABLE state and
the background inode reclaim goes away. This also takes with it the 
XFS_ICI_RECLAIM_TAG, the inode cache shrinker and a few other bits
and pieces...

The prototype needs busy inode tracking now, so I was looking at
doing a simple busy inode tracking thing yesterday hence my comments
above. I'm currently just using the XFS_INACTIVE flags to delay
xfs_iget lookups from allocation until the inode has been reclaimed
(really, really slow!) and generic/001 hits this pretty hard.

I'll have a deeper look at patch 4 on Monday and chop it down to
the bare minimum tag lookups and, hopefully, I'll be able to get
rid of the need for patch 3 in this series at the same time...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH RFC 3/4] xfs: crude chunk allocation retry mechanism
  2022-02-17 17:25 ` [PATCH RFC 3/4] xfs: crude chunk allocation retry mechanism Brian Foster
@ 2022-02-17 23:20   ` Dave Chinner
  2022-02-18 14:21     ` Brian Foster
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2022-02-17 23:20 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, Feb 17, 2022 at 12:25:17PM -0500, Brian Foster wrote:
> The free inode btree currently tracks all inode chunk records with
> at least one free inode. This simplifies the chunk and allocation
> selection algorithms as free inode availability can be guaranteed
> after a few simple checks. This is no longer the case with busy
> inode avoidance, however, because busy inode state is tracked in the
> radix tree independent from physical allocation status.
> 
> A busy inode avoidance algorithm relies on the ability to fall back
> to an inode chunk allocation one way or another in the event that
> all current free inodes are busy. Hack in a crude allocation
> fallback mechanism for experimental purposes. If the inode selection
> algorithm is unable to locate a usable inode, allow it to return
> -EAGAIN to perform another physical chunk allocation in the AG and
> retry the inode allocation.
> 
> The current prototype can perform this allocation and retry sequence
> repeatedly because a newly allocated chunk may still be covered by
> busy in-core inodes in the radix tree (if it were recently freed,
> for example). This is inefficient and temporary. It will be properly
> mitigated by background chunk removal. This defers freeing of inode
> chunk blocks from the free of the last used inode in the chunk to a
> background task that only frees chunks once completely idle, thereby
> providing a guarantee that a new chunk allocation always adds
> non-busy inodes to the AG.

I think you can get rid of this simply by checking the radix tree
tags for busy inodes at the location of the new inode chunk before
we do the cluster allocation. If there are busy inodes in the range
of the chunk (pure gang tag lookup, don't need to dereference any of
the inodes), just skip to the next chunk offset and try that. Hence
we only ever end up allocating a chunk that we know there are no
busy inodes in and this retry mechanism is unnecessary.
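
Something like the following, perhaps (purely illustrative - nothing
like this exists in the series; 'newino' is the candidate chunk's first
agino and 'pag' is the target AG):

	struct radix_tree_iter	iter;
	void			**slot;
	bool			busy = false;

	rcu_read_lock();
	radix_tree_for_each_tagged(slot, &pag->pag_ici_root, &iter, newino,
				   XFS_ICI_BUSY_TAG) {
		/* only busy inodes inside the candidate chunk matter */
		if (iter.index >= newino + XFS_INODES_PER_CHUNK)
			break;
		busy = true;
		break;
	}
	rcu_read_unlock();

	/* if busy, skip to the next chunk-aligned offset and try again */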

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH RFC 2/4] xfs: tag reclaimable inodes with pending RCU grace periods as busy
  2022-02-17 23:16   ` Dave Chinner
@ 2022-02-18 14:19     ` Brian Foster
  0 siblings, 0 replies; 15+ messages in thread
From: Brian Foster @ 2022-02-18 14:19 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Feb 18, 2022 at 10:16:49AM +1100, Dave Chinner wrote:
> On Thu, Feb 17, 2022 at 12:25:16PM -0500, Brian Foster wrote:
> > In order to avoid aggressive recycling of in-core xfs_inode objects with
> > pending grace periods and the subsequent RCU sync stalls involved with
> > recycling, we must be able to identify them quickly and reliably at
> > allocation time. Claim a new tag for the in-core inode radix tree and
> > tag all inodes with a still pending grace period cookie as busy at the
> > time they are made reclaimable.
> > 
> > Note that it is not necessary to maintain consistency between the tag
> > and grace period status once the tag is set. The busy tag simply
> > reflects that the grace period had not expired by the time the inode was
> > set reclaimable and therefore any reuse of the inode must first poll the
> > RCU subsystem for subsequent expiration of the grace period. Clear the
> > tag when the inode is recycled or reclaimed.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >  fs/xfs/xfs_icache.c | 18 +++++++++++++-----
> >  1 file changed, 13 insertions(+), 5 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > index 693896bc690f..245ee0f6670b 100644
> > --- a/fs/xfs/xfs_icache.c
> > +++ b/fs/xfs/xfs_icache.c
> > @@ -32,6 +32,8 @@
> >  #define XFS_ICI_RECLAIM_TAG	0
> >  /* Inode has speculative preallocations (posteof or cow) to clean. */
> >  #define XFS_ICI_BLOCKGC_TAG	1
> > +/* inode has pending RCU grace period when set reclaimable  */
> > +#define XFS_ICI_BUSY_TAG	2
> >  
> >  /*
> >   * The goal for walking incore inodes.  These can correspond with incore inode
> > @@ -274,7 +276,7 @@ xfs_perag_clear_inode_tag(
> >  	if (agino != NULLAGINO)
> >  		radix_tree_tag_clear(&pag->pag_ici_root, agino, tag);
> >  	else
> > -		ASSERT(tag == XFS_ICI_RECLAIM_TAG);
> > +		ASSERT(tag == XFS_ICI_RECLAIM_TAG || tag == XFS_ICI_BUSY_TAG);
> >  
> >  	if (tag == XFS_ICI_RECLAIM_TAG)
> >  		pag->pag_ici_reclaimable--;
> > @@ -336,6 +338,7 @@ xfs_iget_recycle(
> >  {
> >  	struct xfs_mount	*mp = ip->i_mount;
> >  	struct inode		*inode = VFS_I(ip);
> > +	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
> >  	int			error;
> >  
> >  	trace_xfs_iget_recycle(ip);
> > @@ -392,8 +395,9 @@ xfs_iget_recycle(
> >  	 */
> >  	ip->i_flags &= ~XFS_IRECLAIM_RESET_FLAGS;
> >  	ip->i_flags |= XFS_INEW;
> > -	xfs_perag_clear_inode_tag(pag, XFS_INO_TO_AGINO(mp, ip->i_ino),
> > -			XFS_ICI_RECLAIM_TAG);
> > +
> > +	xfs_perag_clear_inode_tag(pag, agino, XFS_ICI_BUSY_TAG);
> > +	xfs_perag_clear_inode_tag(pag, agino, XFS_ICI_RECLAIM_TAG);
> 
> This doesn't need any of the global mp->m_perag_tree tracking or
> anything else to be tracked or queued - it's just a single state bit
> that is associated with the inode and nothing more. The inode
> allocation side of things is already per-ag, so it can check the
> perag icache tree directly and hence there's no need for global
> perag tree tracking.
> 

Seems reasonable, though I'll probably worry about this sort of cleanup
after the bigger picture things are worked out. (I like having the
existing tracepoints available, if nothing else).

> Hence this could just be:
> 
> +	radix_tree_tag_clear(&pag->pag_ici_root, agino, XFS_ICI_BUSY_TAG);
> 
> 
> >  	inode->i_state = I_NEW;
> >  	spin_unlock(&ip->i_flags_lock);
> >  	spin_unlock(&pag->pag_ici_lock);
> > @@ -931,6 +935,7 @@ xfs_reclaim_inode(
> >  	if (!radix_tree_delete(&pag->pag_ici_root,
> >  				XFS_INO_TO_AGINO(ip->i_mount, ino)))
> >  		ASSERT(0);
> > +	xfs_perag_clear_inode_tag(pag, NULLAGINO, XFS_ICI_BUSY_TAG);
> >  	xfs_perag_clear_inode_tag(pag, NULLAGINO, XFS_ICI_RECLAIM_TAG);
> >  	spin_unlock(&pag->pag_ici_lock);
> 
> This becomes unnecessary, because radix_tree_delete() clears the
> tags for the slot where the entry being deleted is found.
> 
> >  
> > @@ -1807,6 +1812,7 @@ xfs_inodegc_set_reclaimable(
> >  {
> >  	struct xfs_mount	*mp = ip->i_mount;
> >  	struct xfs_perag	*pag;
> > +	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
> >  
> >  	if (!xfs_is_shutdown(mp) && ip->i_delayed_blks) {
> >  		xfs_check_delalloc(ip, XFS_DATA_FORK);
> > @@ -1821,10 +1827,12 @@ xfs_inodegc_set_reclaimable(
> >  	trace_xfs_inode_set_reclaimable(ip);
> >  	if (destroy_gp)
> >  		ip->i_destroy_gp = destroy_gp;
> > +	if (!poll_state_synchronize_rcu(ip->i_destroy_gp))
> > +		xfs_perag_set_inode_tag(pag, agino, XFS_ICI_BUSY_TAG);
> 
> And this becomes:
> 
> +	if (!poll_state_synchronize_rcu(ip->i_destroy_gp))
> +		radix_tree_tag_set(&pag->pag_ici_root, agino, XFS_ICI_BUSY_TAG);
> 
> ---
> 
> FWIW, I have most of the inode cache life-cycle rework prototyped.
> All the unlink stuff is done - unlinked inodes are freed directly
> in ->destroy_inode now, which gets called on demand when the inodes
> are marked clean or XFS_ISTALE cluster buffers are committed. Hence
> they don't even go through an IRECLAIMABLE state anymore. I'm
> currently working on the same thing for inodes that needed EOF block
> trimming, and once that is done the whole XFS_IRECLAIMABLE state and
> the background inode reclaim goes away. This also takes with it the 
> XFS_ICI_RECLAIM_TAG, the inode cache shrinker and a few other bits
> and pieces...
> 
> The prototype needs busy inode tracking now, so I was looking at
> doing a simple busy inode tracking thing yesterday hence my comments
> above. I'm currently just using the XFS_INACTIVE flags to delay
> xfs_iget lookups from allocation until the inode has been reclaimed
> (really, really slow!) and generic/001 hits this pretty hard.
> 

Ok. As shown here, the tracking bits are fairly trivial either way. The
allocation side is a bit more involved..

> I'll have a deeper look at patch 4 on Monday and chop it down to
> the bare minimum tag lookups and, hopefully, I'll be able to get
> rid of the need for patch 3 in this series at the same time...
> 

Patch 3 is just an incremental hack to implement a fallback to chunk
allocation when the inode selection algorithm decides it's warranted. I
was expecting (hoping) this would get cleaned up properly by reworking
some of the selection code so we can continue to make this allocation
decision up front (including some busy state in the logic), but hadn't
got to that point yet. Is that what you're referring to here, or are you
trying to implement some other type of fallback behavior in its place..?

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 



* Re: [PATCH RFC 3/4] xfs: crude chunk allocation retry mechanism
  2022-02-17 23:20   ` Dave Chinner
@ 2022-02-18 14:21     ` Brian Foster
  2022-02-18 22:54       ` Dave Chinner
  0 siblings, 1 reply; 15+ messages in thread
From: Brian Foster @ 2022-02-18 14:21 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Feb 18, 2022 at 10:20:33AM +1100, Dave Chinner wrote:
> On Thu, Feb 17, 2022 at 12:25:17PM -0500, Brian Foster wrote:
> > The free inode btree currently tracks all inode chunk records with
> > at least one free inode. This simplifies the chunk and allocation
> > selection algorithms as free inode availability can be guaranteed
> > after a few simple checks. This is no longer the case with busy
> > inode avoidance, however, because busy inode state is tracked in the
> > radix tree independent from physical allocation status.
> > 
> > A busy inode avoidance algorithm relies on the ability to fall back
> > to an inode chunk allocation one way or another in the event that
> > all current free inodes are busy. Hack in a crude allocation
> > fallback mechanism for experimental purposes. If the inode selection
> > algorithm is unable to locate a usable inode, allow it to return
> > -EAGAIN to perform another physical chunk allocation in the AG and
> > retry the inode allocation.
> > 
> > The current prototype can perform this allocation and retry sequence
> > repeatedly because a newly allocated chunk may still be covered by
> > busy in-core inodes in the radix tree (if it were recently freed,
> > for example). This is inefficient and temporary. It will be properly
> > mitigated by background chunk removal. This defers freeing of inode
> > chunk blocks from the free of the last used inode in the chunk to a
> > background task that only frees chunks once completely idle, thereby
> > providing a guarantee that a new chunk allocation always adds
> > non-busy inodes to the AG.
> 
> I think you can get rid of this simply by checking the radix tree
> tags for busy inodes at the location of the new inode chunk before
> we do the cluster allocation. If there are busy inodes in the range
> of the chunk (pure gang tag lookup, don't need to dereference any of
> the inodes), just skip to the next chunk offset and try that. Hence
> we only ever end up allocating a chunk that we know there are no
> busy inodes in and this retry mechanism is unnecessary.
> 

The retry mechanism exists in this series due to the factoring of the
inode allocation code moreso than whether the fallback is guaranteed to
provide a fully non-busy chunk or not. As the prototype is written, the
inode scan still needs to fall back at least once even with such a
guarantee (see my reply on the previous patch around cleaning up that
particular wart).

With regard to checking busy inode state, that is pretty much what I was
referring to by filtering or hinting the block allocation when we
discussed this on IRC. I'm explicitly trying to avoid that because for
one it unnecessarily spreads concern about busy inodes across layers. On
top of that, it assumes that there will always be some usable physical
block range available without busy inodes, which is not the case. That
means we now need to consider the fact that chunk allocation might fail
for reasons other than -ENOSPC and factor that into the inode allocation
algorithm. IOW, ISTM this just unnecessarily complicates things for
minimal benefit.

The point of background freeing inode chunks was that it makes this
problem go away because then we ensure that inode chunks aren't freed
until all associated busy inodes are cleared, and so we preserve the
historical behavior that an inode chunk allocation guarantees immediate
ability to allocate an inode. I thought we agreed in the previous
discussion that this was the right approach since it seemed to be in the
long term direction for XFS anyways.. hm?

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 



* Re: [PATCH RFC 3/4] xfs: crude chunk allocation retry mechanism
  2022-02-18 14:21     ` Brian Foster
@ 2022-02-18 22:54       ` Dave Chinner
  2022-02-20 18:48         ` Brian Foster
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2022-02-18 22:54 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Fri, Feb 18, 2022 at 09:21:40AM -0500, Brian Foster wrote:
> On Fri, Feb 18, 2022 at 10:20:33AM +1100, Dave Chinner wrote:
> > On Thu, Feb 17, 2022 at 12:25:17PM -0500, Brian Foster wrote:
> > > The free inode btree currently tracks all inode chunk records with
> > > at least one free inode. This simplifies the chunk and allocation
> > > selection algorithms as free inode availability can be guaranteed
> > > after a few simple checks. This is no longer the case with busy
> > > inode avoidance, however, because busy inode state is tracked in the
> > > radix tree independent from physical allocation status.
> > > 
> > > A busy inode avoidance algorithm relies on the ability to fall back
> > > to an inode chunk allocation one way or another in the event that
> > > all current free inodes are busy. Hack in a crude allocation
> > > fallback mechanism for experimental purposes. If the inode selection
> > > algorithm is unable to locate a usable inode, allow it to return
> > > -EAGAIN to perform another physical chunk allocation in the AG and
> > > retry the inode allocation.
> > > 
> > > The current prototype can perform this allocation and retry sequence
> > > repeatedly because a newly allocated chunk may still be covered by
> > > busy in-core inodes in the radix tree (if it were recently freed,
> > > for example). This is inefficient and temporary. It will be properly
> > > mitigated by background chunk removal. This defers freeing of inode
> > > chunk blocks from the free of the last used inode in the chunk to a
> > > background task that only frees chunks once completely idle, thereby
> > > providing a guarantee that a new chunk allocation always adds
> > > non-busy inodes to the AG.
> > 
> > I think you can get rid of this simply by checking the radix tree
> > tags for busy inodes at the location of the new inode chunk before
> > we do the cluster allocation. If there are busy inodes in the range
> > of the chunk (pure gang tag lookup, don't need to dereference any of
> > the inodes), just skip to the next chunk offset and try that. Hence
> > we only ever end up allocating a chunk that we know there are no
> > busy inodes in and this retry mechanism is unnecessary.
> > 
> 
> The retry mechanism exists in this series due to the factoring of the
> inode allocation code moreso than whether the fallback is guaranteed to
> provide a fully non-busy chunk or not. As the prototype is written, the
> inode scan still needs to fall back at least once even with such a
> guarantee (see my reply on the previous patch around cleaning up that
> particular wart).
> 
> With regard to checking busy inode state, that is pretty much what I was
> referring to by filtering or hinting the block allocation when we
> discussed this on IRC. I'm explicitly trying to avoid that because for
> one it unnecessarily spreads concern about busy inodes across layers. On
> top of that, it assumes that there will always be some usable physical
> block range available without busy inodes, which is not the case. That
> means we now need to consider the fact that chunk allocation might fail
> for reasons other than -ENOSPC and factor that into the inode allocation
> algorithm. IOW, ISTM this just unnecessarily complicates things for
> minimal benefit.

For the moment, if inode allocation fails because we have busy
inodes after chunk allocation has already been done, then we are
hitting a corner case that isn't typical fast path operation. I
think that we should not complicate things by trying to optimise
this case unnecessarily.

I'd just expedite reclaim using synchronize_rcu() and re-run the
finobt scan as it will always succeed the second time because we
haven't dropped the AGI log and all freed inodes have now passed
through a grace period. Indeed, we need this expedited reclaim path
anyway because if we fail to allocate a new chunk and there are busy
inodes, we need to wait for busy inodes to become unbusy to avoid
premature ENOSPC while there are still available free inodes.
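
As a rough sketch of that fallback (hypothetical code reusing the
-EAGAIN return from patch 3, not something in this series):

	error = xfs_dialloc_ag(*tpp, agbp, pag, parent, &ino);
	if (error == -EAGAIN) {
		/*
		 * All free inodes were busy. Force pending grace periods
		 * to complete and rescan; the AGI is still locked, so the
		 * free inodes we saw cannot have been taken by anyone else.
		 */
		synchronize_rcu();
		error = xfs_dialloc_ag(*tpp, agbp, pag, parent, &ino);
	}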

In the case of the updated inode lifecycle stuff I'm working on, a
log force will replace the synchronize_rcu() call because the inodes
will be XFS_ISTALE and journal IO completion of the cluster buffers
will trigger the inodes to be reclaimed immediately as writeback is
elided for XFS_ISTALE inodes. We may need an AIL push in other
cases, but I'll cross that river when I get to it.

> The point of background freeing inode chunks was that it makes this
> problem go away because then we ensure that inode chunks aren't freed
> until all associated busy inodes are cleared, and so we preserve the
> historical behavior that an inode chunk allocation guarantees immediate
> ability to allocate an inode. I thought we agreed in the previous
> discussion that this was the right approach since it seemed to be in the
> long term direction for XFS anyways.. hm?

Long term, yes, but we need something that works effectively and
efficiently now, with minimal additional overhead, because we're
going to have to live with this code in the allocation fast path for
some time yet.

Really, I want foreground inode allocation to know nothing about
inode chunk allocation. If there are no inodes available for
allocation, it kicks background inode chunk management and sleeps
waiting for to be given an allocated inode it can use. It shouldn't
even have to know about busy inodes - just work from an in-memory
per-ag free list of inode numbers that can be immediately allocated.

In this situation, inodes that have been recently unlinked don't
show up on that list until they can be reallocated safely. This
is all managed asynchronously in the background by the inodegc state
machine (what I'm currently working on) and when the inode is
finally reclaimed it is moved into the free list and allowed to be
reallocated.

IOWs, the long term direction is to make sure that the
foreground inode allocator doesn't even know about the existence of
busy inodes and it gets faster and simpler as we push all the mess
into the background that runs all the slow path allocation and
freeing algorithms.
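
To illustrate the shape of that direction (entirely hypothetical; the
names below are made up and nothing like this exists in the tree yet):

	/* per-AG pool of immediately allocatable inode numbers */
	struct xfs_perag_ifree {
		spinlock_t		lock;		/* protects freelist */
		struct list_head	freelist;	/* reclaimed, reusable inodes */
	};

Foreground allocation would pop the head of that list (or kick
background chunk management and sleep if it is empty), and background
inodegc would only add an inode back once it has been fully reclaimed,
so the allocator never sees a busy inode.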

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH RFC 3/4] xfs: crude chunk allocation retry mechanism
  2022-02-18 22:54       ` Dave Chinner
@ 2022-02-20 18:48         ` Brian Foster
  2022-02-23  7:00           ` Dave Chinner
  0 siblings, 1 reply; 15+ messages in thread
From: Brian Foster @ 2022-02-20 18:48 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Sat, Feb 19, 2022 at 09:54:40AM +1100, Dave Chinner wrote:
> On Fri, Feb 18, 2022 at 09:21:40AM -0500, Brian Foster wrote:
> > On Fri, Feb 18, 2022 at 10:20:33AM +1100, Dave Chinner wrote:
> > > On Thu, Feb 17, 2022 at 12:25:17PM -0500, Brian Foster wrote:
> > > > The free inode btree currently tracks all inode chunk records with
> > > > at least one free inode. This simplifies the chunk and allocation
> > > > selection algorithms as free inode availability can be guaranteed
> > > > after a few simple checks. This is no longer the case with busy
> > > > inode avoidance, however, because busy inode state is tracked in the
> > > > radix tree independent from physical allocation status.
> > > > 
> > > > A busy inode avoidance algorithm relies on the ability to fall back
> > > > to an inode chunk allocation one way or another in the event that
> > > > all current free inodes are busy. Hack in a crude allocation
> > > > fallback mechanism for experimental purposes. If the inode selection
> > > > algorithm is unable to locate a usable inode, allow it to return
> > > > -EAGAIN to perform another physical chunk allocation in the AG and
> > > > retry the inode allocation.
> > > > 
> > > > The current prototype can perform this allocation and retry sequence
> > > > repeatedly because a newly allocated chunk may still be covered by
> > > > busy in-core inodes in the radix tree (if it were recently freed,
> > > > for example). This is inefficient and temporary. It will be properly
> > > > mitigated by background chunk removal. This defers freeing of inode
> > > > chunk blocks from the free of the last used inode in the chunk to a
> > > > background task that only frees chunks once completely idle, thereby
> > > > providing a guarantee that a new chunk allocation always adds
> > > > non-busy inodes to the AG.
> > > 
> > > I think you can get rid of this simply by checking the radix tree
> > > tags for busy inodes at the location of the new inode chunk before
> > > we do the cluster allocation. If there are busy inodes in the range
> > > of the chunk (pure gang tag lookup, don't need to dereference any of
> > > the inodes), just skip to the next chunk offset and try that. Hence
> > > we only ever end up allocating a chunk that we know there are no
> > > busy inodes in and this retry mechanism is unnecessary.
> > > 
> > 
> > The retry mechanism exists in this series due to the factoring of the
> > inode allocation code moreso than whether the fallback is guaranteed to
> > provide a fully non-busy chunk or not. As the prototype is written, the
> > inode scan still needs to fall back at least once even with such a
> > guarantee (see my reply on the previous patch around cleaning up that
> > particular wart).
> > 
> > With regard to checking busy inode state, that is pretty much what I was
> > referring to by filtering or hinting the block allocation when we
> > discussed this on IRC. I'm explicitly trying to avoid that because for
> > one it unnecessarily spreads concern about busy inodes across layers. On
> > top of that, it assumes that there will always be some usable physical
> > block range available without busy inodes, which is not the case. That
> > means we now need to consider the fact that chunk allocation might fail
> > for reasons other than -ENOSPC and factor that into the inode allocation
> > algorithm. IOW, ISTM this just unnecessarily complicates things for
> > minimal benefit.
> 
> For the moment, if inode allocation fails because we have busy
> inodes after chunk allocation has already been done, then we are
> hitting a corner case that isn't typical fast path operation. I
> think that we should not complicate things by trying to optimise
> this case unnecessarily.
> 

Not really, it's one of the first problems I ran into with the chunk
allocation fallback in place.

> I'd just expedite reclaim using synchronize_rcu() and re-run the
> finobt scan as it will always succeed the second time because we
> haven't dropped the AGI log and all freed inodes have now passed
> through a grace period. Indeed, we need this expedited reclaim path
> anyway because if we fail to allocate a new chunk and there are busy
> inodes, we need to wait for busy inodes to become unbusy to avoid
> premature ENOSPC while there are still available free inodes.
> 

My experiments to this point suggest that would be extremely slow, to
the point where there's not much value to anything beyond patch 1. The
finobt is going to be relatively small with any sort of mixed alloc/free
workload as records cycle in and out, so a "sync per batch" is
effectively the behavior I already see with patch 1 alone and it's a
performance killer (with non-expedited grace periods, at least). We do
need the rcu sync -ENOSPC fallback as you say, but IMO it should be a
last resort.

> In the case of the updated inode lifecycle stuff I'm working on, a
> log force will replace the synchronize_rcu() call because the inodes
> will be XFS_ISTALE and journal IO completion of the cluster buffers
> will trigger the inodes to be reclaimed immediately as writeback is
> elided for XFS_ISTALE inodes. We may need an AIL push in other
> cases, but I'll cross that river when I get to it.
> 

Ok.

> > The point of background freeing inode chunks was that it makes this
> > problem go away because then we ensure that inode chunks aren't freed
> > until all associated busy inodes are cleared, and so we preserve the
> > historical behavior that an inode chunk allocation guarantees immediate
> > ability to allocate an inode. I thought we agreed in the previous
> > discussion that this was the right approach since it seemed to be in the
> > long term direction for XFS anyways.. hm?
> 
> Long term, yes, but we need something that works effectively and
> efficiently now, with minimal additional overhead, because we're
> going to have to live with this code in the allocation fast path for
> some time yet.
> 

Right, but I thought this is why we were only going to do the background
freeing part of the whole "background inode management" thing?

Short of that, I'm not aware of a really great option atm. IMO, pushing
explicit busy inode state/checking down into the block allocator is kind
of a gross layering violation. The approach this series currently uses
is simple and effective, but it's an unbounded retry mechanism that just
continues to allocate chunks until we get one we can use, which is too
crude for production.

Perhaps a simple enough short term option is to use the existing block
alloc min/max range mechanisms (as mentioned on IRC). For example:

- Use the existing min/max_agbno allocation arg input values to attempt
  one or two chunk allocs outside of the known range of busy inodes for
  the AG (i.e., allocate blocks higher than the max busy agino or lower
  than the min busy agino).
- If success, then we know we've got a chunk w/o busy inodes.
- If failure, fall back to the existing chunk alloc calls, take whatever
  we get and retry the finobt scan (perhaps more aggressively checking
  each record) hoping we got a usable new record.
- If that still fails, then fall back to synchronize_rcu() as a last
  resort and grab one of the previously busy inodes.

I couldn't say for sure if that would be effective enough without
playing with it a bit, but that would sort of emulate an algorithm that
filtered chunk block allocations with at least one busy inode without
having to modify block allocation code. If it avoids an RCU sync in the
majority of cases it might be effective enough as a stopgap until
background freeing exists. Thoughts?
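
Roughly the kind of thing I have in mind, purely as a sketch (the
min_agbno field of struct xfs_alloc_arg is real; the per-AG busy agino
tracking assumed here is hypothetical):

	/*
	 * Sketch only: bias the inode chunk allocation away from the range
	 * of known busy inodes in this AG. pag->pagi_busy_count/_max are
	 * hypothetical fields tracking busy inodes per AG.
	 */
	if (pag->pagi_busy_count) {
		/* first attempt: only blocks above the highest busy inode */
		args.min_agbno = XFS_AGINO_TO_AGBNO(args.mp,
						    pag->pagi_busy_max) + 1;
	}
	error = xfs_alloc_vextent(&args);
	if (error)
		return error;
	if (args.fsbno == NULLFSBLOCK) {
		/*
		 * No extent clear of busy inodes; fall back to an
		 * unconstrained allocation, recheck the new finobt record
		 * and leave synchronize_rcu() as the last resort.
		 */
		args.min_agbno = 0;
		error = xfs_alloc_vextent(&args);
	}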

> Really, I want foreground inode allocation to know nothing about
> inode chunk allocation. If there are no inodes available for
> allocation, it kicks background inode chunk management and sleeps
> waiting to be given an allocated inode it can use. It shouldn't
> even have to know about busy inodes - just work from an in-memory
> per-ag free list of inode numbers that can be immediately allocated.
> 
> In this situation, inodes that have been recently unlinked don't
> show up on that list until they can be reallocated safely. This
> is all managed asynchronously in the background by the inodegc state
> machine (what I'm currently working on) and when the inode is
> finally reclaimed it is moved into the free list and allowed to be
> reallocated.
> 

I think that makes a lot of sense. That's quite similar to another
experiment I was playing with that essentially populated a capped size
pool of background inactivated inodes that the allocation side could
pull directly from (i.e., so allocation actually becomes a radix tree
lookup instead of a filtered btree scan), but so far I was kind of
fighting with the existing mechanisms, trying not to perturb sustained
remove performance, etc., and hadn't been able to produce a performance
benefit yet. Perhaps this will work out better with the bigger picture
changes to inode lifecycle and background inode management in place..
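
FWIW, the rough shape of that experiment was something like the
following (entirely hypothetical, nothing like this has been posted):

/*
 * Hypothetical: a small, capped per-AG pool of inode numbers whose
 * inactivation and reclaim have fully completed, so the allocator can
 * take one directly without a btree scan or any RCU concerns.
 */
#define XFS_IFREE_POOL_MAX	64

struct xfs_ifree_pool {
	spinlock_t	lock;
	unsigned int	count;
	xfs_agino_t	inos[XFS_IFREE_POOL_MAX];
};

static bool
xfs_ifree_pool_get(
	struct xfs_ifree_pool	*pool,
	xfs_agino_t		*aginop)
{
	bool	found = false;

	spin_lock(&pool->lock);
	if (pool->count) {
		*aginop = pool->inos[--pool->count];
		found = true;
	}
	spin_unlock(&pool->lock);
	return found;
}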

Brian

> IOWs, the long term direction is to make sure that the
> foreground inode allocator doesn't even know about the existence of
> busy inodes and it gets faster and simpler as we push all the mess
> into the background that runs all the slow path allocation and
> freeing algorithms.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH RFC 3/4] xfs: crude chunk allocation retry mechanism
  2022-02-20 18:48         ` Brian Foster
@ 2022-02-23  7:00           ` Dave Chinner
  2022-02-28 21:45             ` Brian Foster
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2022-02-23  7:00 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Sun, Feb 20, 2022 at 01:48:10PM -0500, Brian Foster wrote:
> On Sat, Feb 19, 2022 at 09:54:40AM +1100, Dave Chinner wrote:
> > On Fri, Feb 18, 2022 at 09:21:40AM -0500, Brian Foster wrote:
> > > The point of background freeing inode chunks was that it makes this
> > > problem go away because then we ensure that inode chunks aren't freed
> > > until all associated busy inodes are cleared, and so we preserve the
> > > historical behavior that an inode chunk allocation guarantees immediate
> > > ability to allocate an inode. I thought we agreed in the previous
> > > discussion that this was the right approach since it seemed to be in the
> > > long term direction for XFS anyways.. hm?
> > 
> > Long term, yes, but we need something that works effectively and
> > efficiently now, with minimal additional overhead, because we're
> > going to have to live with this code in the allocation fast path for
> > some time yet.
> > 
> 
> Right, but I thought this is why we were only going to do the background
> freeing part of the whole "background inode management" thing?
> 
> Short of that, I'm not aware of a really great option atm. IMO, pushing
> explicit busy inode state/checking down into the block allocator is kind
> of a gross layering violation. The approach this series currently uses
> is simple and effective, but it's an unbounded retry mechanism that just
> continues to allocate chunks until we get one we can use, which is too
> crude for production.

*nod*

Right, that's why I want to get this whole mechanism completely
detached from the VFS inode RCU life cycle rules and instead
synchronised by internal IO operations such as log forces.

The code I currently have is based on your changes, just without the
fallback chunk allocation. I'm not really even scanning the irecs;
I just look at the number of free inodes and count the number of
busy inodes over the range of the record. If they aren't the same,
we've got inodes in that record we can allocate from. I only do a
free inode-by-free inode busy check once the irec we are going to
allocate from has been selected.
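
In other words, the record-level filter is just something like this
(illustrative only - xfs_inobt_rec_incore is the existing structure, but
the busy inode counting helper is a stand-in for whatever tracking we
settle on):

static bool
xfs_inobt_rec_allocatable(
	struct xfs_perag		*pag,
	struct xfs_inobt_rec_incore	*rec)
{
	unsigned int	nbusy;

	/* busy inodes are a subset of the free inodes in the chunk */
	nbusy = xfs_icache_count_busy_inodes(pag, rec->ir_startino,
					     XFS_INODES_PER_CHUNK);
	return nbusy < rec->ir_freecount;
}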

Essentially, I'm just scanning records in the finobt to find the
first with a non-busy inode. If we fall off the end of the finobt,
it issues a log force and kicks the AIL and then retries the
allocation from the finobt. There isn't any attempt to allocate new
inode chunks in that case, but it may end up being necessary and can
be done without falling back to the outer loops.

i.e. as long as we track whether we've allocated a new inode chunk
or not, we can bound the finobt search to a single retry. If we
allocated a new chunk before entering the finobt search, then all we
need is a log force because the busy inodes, by definition, are
XFS_ISTALE at this point and waiting for a CIL push before they can
be reclaimed. At this point a retry of the finobt scan will find
those inodes that were busy now available for allocation.

If we haven't allocated a new chunk, then we can do so immediately
and retry the allocation. If we still find them all busy, we force
the log...

IOWs, once we isolate this busy inode tracking from the VFS inode
RCU requirements, we can bound the allocation behaviour because
chunk allocation and log forces provide us with a guarantee that the
newly allocated inode chunks contain inodes that can be immediately
reallocated without having to care about where the new inode chunk
is located....
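
In rough pseudo-code the bounded retry looks something like this (names
are approximate and the finobt scan helper is assumed, not the actual
patch):

retry:
	error = xfs_dialloc_scan_finobt(pag, &ino);	/* assumed helper */
	if (error != -EAGAIN)
		return error;		/* got an inode, or a real error */

	/* every free inode currently in the finobt is busy */
	if (!alloced_chunk) {
		error = xfs_ialloc_ag_alloc(tp, agbp, pag);
		if (!error) {
			alloced_chunk = true;
			goto retry;
		}
		if (error != -ENOSPC)
			return error;
	}
	if (forced)
		return -ENOSPC;		/* genuinely out of inodes */

	/*
	 * Busy inodes here are XFS_ISTALE and waiting on a CIL push; force
	 * the log and push the AIL so they can be reclaimed, then scan
	 * once more.
	 */
	xfs_log_force(mp, XFS_LOG_SYNC);
	xfs_ail_push_all_sync(mp->m_ail);
	forced = true;
	goto retry;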

> Perhaps a simple enough short term option is to use the existing block
> alloc min/max range mechanisms (as mentioned on IRC). For example:
> 
> - Use the existing min/max_agbno allocation arg input values to attempt
>   one or two chunk allocs outside of the known range of busy inodes for
>   the AG (i.e., allocate blocks higher than the max busy agino or lower
>   than the min busy agino).
> - If success, then we know we've got a chunk w/o busy inodes.
> - If failure, fall back to the existing chunk alloc calls, take whatever
>   we get and retry the finobt scan (perhaps more aggressively checking
>   each record) hoping we got a usable new record.
> - If that still fails, then fall back to synchronize_rcu() as a last
>   resort and grab one of the previously busy inodes.
> 
> I couldn't say for sure if that would be effective enough without
> playing with it a bit, but that would sort of emulate an algorithm that
> filtered chunk block allocations with at least one busy inode without
> having to modify block allocation code. If it avoids an RCU sync in the
> majority of cases it might be effective enough as a stopgap until
> background freeing exists. Thoughts?

It might work, but I'm not a fan of how many hoops we are considering
jumping through to avoid getting tangled up in the RCU requirements
for VFS inode life cycles. I'd much prefer just being able to say
"all inodes busy, log force, try again" like we do with busy extent
limited block allocation...

That said, the complexity gets moved elsewhere (the VFS inode
lifecycle management) rather than into the inode allocator, but I
think that's a valid trade-off because the RCU requirements for
inode reallocation come from the VFS inode lifecycle constraints.

> > Really, I want foreground inode allocation to know nothing about
> > inode chunk allocation. If there are no inodes available for
> > allocation, it kicks background inode chunk management and sleeps
> > waiting to be given an allocated inode it can use. It shouldn't
> > even have to know about busy inodes - just work from an in-memory
> > per-ag free list of inode numbers that can be immediately allocated.
> > 
> > In this situation, inodes that have been recently unlinked don't
> > show up on that list until they can be reallocated safely. This
> > is all managed asynchronously in the background by the inodegc state
> > machine (what I'm currently working on) and when the inode is
> > finally reclaimed it is moved into the free list and allowed to be
> > reallocated.
> > 
> 
> I think that makes a lot of sense. That's quite similar to another
> experiment I was playing with that essentially populated a capped size
> pool of background inactivated inodes that the allocation side could
> pull directly from (i.e., so allocation actually becomes a radix tree
> lookup instead of a filtered btree scan), but so far I was kind of
> fighting with the existing mechanisms, trying not to perturb sustained
> remove performance, etc., and hadn't been able to produce a performance
> benefit yet. Perhaps this will work out better with the bigger picture
> changes to inode lifecycle and background inode management in place..

*nod*

The "don't have a perf impact" side of thigns is generally why I
test all my new code with fsmark and other scalability tests before
I go anywhere near fstests. If it doesn't perform, it's not worth
trying to make it work correctly...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH RFC 3/4] xfs: crude chunk allocation retry mechanism
  2022-02-23  7:00           ` Dave Chinner
@ 2022-02-28 21:45             ` Brian Foster
  2022-02-28 22:55               ` Dave Chinner
  0 siblings, 1 reply; 15+ messages in thread
From: Brian Foster @ 2022-02-28 21:45 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Feb 23, 2022 at 06:00:58PM +1100, Dave Chinner wrote:
> On Sun, Feb 20, 2022 at 01:48:10PM -0500, Brian Foster wrote:
> > On Sat, Feb 19, 2022 at 09:54:40AM +1100, Dave Chinner wrote:
> > > On Fri, Feb 18, 2022 at 09:21:40AM -0500, Brian Foster wrote:
> > > > The point of background freeing inode chunks was that it makes this
> > > > problem go away because then we ensure that inode chunks aren't freed
> > > > until all associated busy inodes are cleared, and so we preserve the
> > > > historical behavior that an inode chunk allocation guarantees immediate
> > > > ability to allocate an inode. I thought we agreed in the previous
> > > > discussion that this was the right approach since it seemed to be in the
> > > > long term direction for XFS anyways.. hm?
> > > 
> > > Long term, yes, but we need something that works effectively and
> > > efficiently now, with minimal additional overhead, because we're
> > > going to have to live with this code in the allocation fast path for
> > > some time yet.
> > > 
> > 
> > Right, but I thought this is why we were only going to do the background
> > freeing part of the whole "background inode management" thing?
> > 
> > Short of that, I'm not aware of a really great option atm. IMO, pushing
> > explicit busy inode state/checking down into the block allocator is kind
> > of a gross layering violation. The approach this series currently uses
> > is simple and effective, but it's an unbounded retry mechanism that just
> > continues to allocate chunks until we get one we can use, which is too
> > crude for production.
> 
> *nod*
> 
> Right, that's why I want to get this whole mechanism completely
> detached from the VFS inode RCU life cycle rules and instead
> synchronised by internal IO operations such as log forces.
> 
> The code I currently have is based on your changes, just without the
> fallback chunk allocation. I'm not really even scanning the irecs;
> I just look at the number of free inodes and count the number of
> busy inodes over the range of the record. If they aren't the same,
> we've got inodes in that record we can allocate from. I only do a
> free inode-by-free inode busy check once the irec we are going to
> allocate from has been selected.
> 
> Essentially, I'm just scanning records in the finobt to find the
> first with a non-busy inode. If we fall off the end of the finobt,
> it issues a log force and kicks the AIL and then retries the
> allocation from the finobt. There isn't any attempt to allocate new
> inode chunks in that case, but it may end up being necessary and can
> be done without falling back to the outer loops.
> 

Sure, that was just done for expediency to test the approach. That said,
as a first step I wouldn't be opposed to something that similarly falls
back to the outer loop so long as it incorporates a check like described
below to restrict the retry attempt(s). The logic is simple enough that
more extensive cleanups could come separately..

> i.e. as long as we track whether we've allocated a new inode chunk
> or not, we can bound the finobt search to a single retry. If we
> allocated a new chunk before entering the finobt search, then all we
> need is a log force because the busy inodes, by definition, are
> XFS_ISTALE at this point and waiting for a CIL push before they can
> be reclaimed. At this point a retry of the finobt scan will find
> those inodes that were busy now available for allocation.
> 

Remind me what aspect of the prospective VFS changes prevents inode
allocation from seeing a free inode with not-yet-elapsed RCU grace
period..? Will this delay freeing (or evicting) the inode until a grace
period elapses from when the last reference was dropped?

> If we haven't allocated a new chunk, then we can do so immediately
> and retry the allocation. If we still find them all busy, we force
> the log...
> 
> IOWs, once we isolate this busy inode tracking from the VFS inode
> RCU requirements, we can bound the allocation behaviour because
> chunk allocation and log forces provide us with a guarantee that the
> newly allocated inode chunks contain inodes that can be immediately
> reallocated without having to care about where the new inode chunk
> is located....
> 
> > Perhaps a simple enough short term option is to use the existing block
> > alloc min/max range mechanisms (as mentioned on IRC). For example:
> > 
> > - Use the existing min/max_agbno allocation arg input values to attempt
> >   one or two chunk allocs outside of the known range of busy inodes for
> >   the AG (i.e., allocate blocks higher than the max busy agino or lower
> >   than the min busy agino).
> > - If success, then we know we've got a chunk w/o busy inodes.
> > - If failure, fall back to the existing chunk alloc calls, take whatever
> >   we get and retry the finobt scan (perhaps more aggressively checking
> >   each record) hoping we got a usable new record.
> > - If that still fails, then fall back to synchronize_rcu() as a last
> >   resort and grab one of the previously busy inodes.
> > 
> > I couldn't say for sure if that would be effective enough without
> > playing with it a bit, but that would sort of emulate an algorithm that
> > filtered chunk block allocations with at least one busy inode without
> > having to modify block allocation code. If it avoids an RCU sync in the
> > majority of cases it might be effective enough as a stopgap until
> > background freeing exists. Thoughts?
> 
> It might work, but I'm not a fan of how many hoops we are considering
> jumping through to avoid getting tangled up in the RCU requirements
> for VFS inode life cycles. I'd much prefer just being able to say
> "all inodes busy, log force, try again" like we do with busy extent
> limited block allocation...
> 

Well presumably that works fine for your implementation of busy inodes,
but not so much for the variant intended to work around the RCU inode
reuse problem. ISTM this all just depends on the goals and plan here,
and I'm not seeing clear enough reasoning to grok what that is. A
summary of the options discussed so far:

- deferred inode freeing - ideal but too involved, want something sooner
  and more simple
- hinted non-busy chunk allocation - simple but jumping through too many
  RCU requirement hoops
- sync RCU and retry - most analogous to longer term busy inode
  allocation handling (i.e. just replace rcu sync w/ log force and
  retry), but RCU sync is too performance costly as a direct fallback

ISTM the options here range from simple and slow, to moderately simple
and effective (TBD), to complex and ideal. So what is the goal for this
series? My understanding to this point is that VFS changes are a ways
out, so the first step was busy inodes == inodes with pending grace
periods and a rework of the busy inode definition comes later with the
associated VFS changes. That essentially means this series gets fixed up
and posted as a mergeable component in the meantime, albeit with a
"general direction" that is compatible with longer term changes.

However, every RCU requirement/characteristic that this series has to
deal with is never going to be 100% perfectly aligned with the longer
term busy inode requirements because the implementations/definitions
will differ, particularly if the RCU handling is pushed up into the VFS.
So if we're concerned about that, the alternative approach is to skip
incrementally addressing the RCU inode reuse problem and just fold
whatever bits of useful logic from here into your inode lifecycle work
as needed and drop this as a mergeable component.

Brian

> That said, the complexity gets moved elsewhere (the VFS inode
> lifecycle management) rather than into the inode allocator, but I
> think that's a valid trade-off because the RCU requirements for
> inode reallocation come from the VFS inode lifecycle constraints.
> 
> > > Really, I want foreground inode allocation to know nothing about
> > > inode chunk allocation. If there are no inodes available for
> > > allocation, it kicks background inode chunk management and sleeps
> > > waiting to be given an allocated inode it can use. It shouldn't
> > > even have to know about busy inodes - just work from an in-memory
> > > per-ag free list of inode numbers that can be immediately allocated.
> > > 
> > > In this situation, inodes that have been recently unlinked don't
> > > show up on that list until they can be reallocated safely. This
> > > is all managed asynchronously in the background by the inodegc state
> > > machine (what I'm currently working on) and when the inode is
> > > finally reclaimed it is moved into the free list and allowed to be
> > > reallocated.
> > > 
> > 
> > I think that makes a lot of sense. That's quite similar to another
> > experiment I was playing with that essentially populated a capped size
> > pool of background inactivated inodes that the allocation side could
> > pull directly from (i.e., so allocation actually becomes a radix tree
> > lookup instead of a filtered btree scan), but so far I was kind of
> > fighting with the existing mechanisms, trying not to perturb sustained
> > remove performance, etc., and hadn't been able to produce a performance
> > benefit yet. Perhaps this will work out better with the bigger picture
> > changes to inode lifecycle and background inode management in place..
> 
> *nod*
> 
> The "don't have a perf impact" side of thigns is generally why I
> test all my new code with fsmark and other scalability tests before
> I go anywhere near fstests. If it doesn't perform, it's not worth
> trying to make it work correctly...
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH RFC 3/4] xfs: crude chunk allocation retry mechanism
  2022-02-28 21:45             ` Brian Foster
@ 2022-02-28 22:55               ` Dave Chinner
  2022-03-01 15:05                 ` Brian Foster
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2022-02-28 22:55 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Mon, Feb 28, 2022 at 04:45:49PM -0500, Brian Foster wrote:
> On Wed, Feb 23, 2022 at 06:00:58PM +1100, Dave Chinner wrote:
> > i.e. as long as we track whether we've allocated a new inode chunk
> > or not, we can bound the finobt search to a single retry. If we
> > allocated a new chunk before entering the finobt search, then all we
> > need is a log force because the busy inodes, by definition, are
> > XFS_ISTALE at this point and waiting for a CIL push before they can
> > be reclaimed. At this point an retry of the finobt scan will find
> > those inodes that were busy now available for allocation.
> 
> Remind me what aspect of the prospective VFS changes prevents inode
> allocation from seeing a free inode with not-yet-elapsed RCU grace
> period..? Will this delay freeing (or evicting) the inode until a grace
> period elapses from when the last reference was dropped?

It will delay it until the inode has been reclaimed, in which case
reallocating it will result in a new struct xfs_inode being
allocated from slab, and we're all good. Any lookup that finds the
old inode will see either I_WILL_FREE, I_FREEING, I_CLEAR,
XFS_NEED_INACTIVE, XFS_IRECLAIM or ip->i_ino == 0 depending on what
point the lookup occurs. Hence the lookup will be unable to take a
new reference to the unlinked inode, and the lookup retry should
then get the newly allocated xfs_inode once it has been inserted
into the radix tree...

IOWs, it gets rid of recycling altogether, and that's how we avoid
the RCU grace period issues with recycled inodes.
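
i.e. the cache hit path ends up doing a check along these lines
(simplified illustration of the intent, not the actual patch):

	/*
	 * Anything on its way to being freed is skipped and the lookup
	 * retried rather than the inode being recycled.
	 */
	spin_lock(&ip->i_flags_lock);
	if (ip->i_ino != ino ||
	    (ip->i_flags & (XFS_NEED_INACTIVE | XFS_IRECLAIM)) ||
	    (VFS_I(ip)->i_state & (I_WILL_FREE | I_FREEING | I_CLEAR))) {
		spin_unlock(&ip->i_flags_lock);
		rcu_read_unlock();
		delay(1);
		goto again;	/* retry the radix tree lookup */
	}
	spin_unlock(&ip->i_flags_lock);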

> > > Perhaps a simple enough short term option is to use the existing block
> > > alloc min/max range mechanisms (as mentioned on IRC). For example:
> > > 
> > > - Use the existing min/max_agbno allocation arg input values to attempt
> > >   one or two chunk allocs outside of the known range of busy inodes for
> > >   the AG (i.e., allocate blocks higher than the max busy agino or lower
> > >   than the min busy agino).
> > > - If success, then we know we've got a chunk w/o busy inodes.
> > > - If failure, fall back to the existing chunk alloc calls, take whatever
> > >   we get and retry the finobt scan (perhaps more aggressively checking
> > >   each record) hoping we got a usable new record.
> > > - If that still fails, then fall back to synchronize_rcu() as a last
> > >   resort and grab one of the previously busy inodes.
> > > 
> > > I couldn't say for sure if that would be effective enough without
> > > playing with it a bit, but that would sort of emulate an algorithm that
> > > filtered chunk block allocations with at least one busy inode without
> > > having to modify block allocation code. If it avoids an RCU sync in the
> > > majority of cases it might be effective enough as a stopgap until
> > > background freeing exists. Thoughts?
> > 
> > It might work, but I'm not a fan of how many hoops we are considering
> > jumping through to avoid getting tangled up in the RCU requirements
> > for VFS inode life cycles. I'd much prefer just being able to say
> > "all inodes busy, log force, try again" like we do with busy extent
> > limited block allocation...
> > 
> 
> Well presumably that works fine for your implementation of busy inodes,
> but not so much for the variant intended to work around the RCU inode
> reuse problem. ISTM this all just depends on the goals and plan here,
> and I'm not seeing clear enough reasoning to grok what that is. A
> summary of the options discussed so far:
> 
> - deferred inode freeing - ideal but too involved, want something sooner
>   and more simple
> - hinted non-busy chunk allocation - simple but jumping through too many
>   RCU requirement hoops
> - sync RCU and retry - most analogous to longer term busy inode
>   allocation handling (i.e. just replace rcu sync w/ log force and
>   retry), but RCU sync is too performance costly as a direct fallback
> 
> ISTM the options here range from simple and slow, to moderately simple
> and effective (TBD), to complex and ideal. So what is the goal for this
> series?

To find out if we can do something fast and effective (and without
hacks that might bite us unexepectedly) with RCU grace periods that
is less intrusive than changing inode life cycles. If it turns out
that we can change the inode life cycle faster than we can come up
with a RCU grace period based solution, then we should just go with
inode life cycle changes and forget about the interim RCU grace
period detection changes...

> My understanding to this point is that VFS changes are a ways
> out, so the first step was busy inodes == inodes with pending grace
> periods and a rework of the busy inode definition comes later with the
> associated VFS changes. That essentially means this series gets fixed up
> and posted as a mergeable component in the meantime, albeit with a
> "general direction" that is compatible with longer term changes.

It seems to me that the RCU detection code is also a ways out,
so I'm trying to keep our options open and not have us duplicate
work unnecessarily.

> However, every RCU requirement/characteristic that this series has to
> deal with is never going to be 100% perfectly aligned with the longer
> term busy inode requirements because the implementations/definitions
> will differ, particularly if the RCU handling is pushed up into the VFS.
> So if we're concerned about that, the alternative approach is to skip
> incrementally addressing the RCU inode reuse problem and just fold
> whatever bits of useful logic from here into your inode lifecycle work
> as needed and drop this as a mergeable component.

*nod*

The signs are pointing somewhat that way, but I'm still finding
the occasional unexpected gotcha in the code that I have to work
around. I think I've got them all, but until I've got it working it
pays to keep our options open....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH RFC 3/4] xfs: crude chunk allocation retry mechanism
  2022-02-28 22:55               ` Dave Chinner
@ 2022-03-01 15:05                 ` Brian Foster
  0 siblings, 0 replies; 15+ messages in thread
From: Brian Foster @ 2022-03-01 15:05 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Mar 01, 2022 at 09:55:56AM +1100, Dave Chinner wrote:
> On Mon, Feb 28, 2022 at 04:45:49PM -0500, Brian Foster wrote:
> > On Wed, Feb 23, 2022 at 06:00:58PM +1100, Dave Chinner wrote:
> > > i.e. as long as we track whether we've allocated a new inode chunk
> > > or not, we can bound the finobt search to a single retry. If we
> > > allocated a new chunk before entering the finobt search, then all we
> > > need is a log force because the busy inodes, by definition, are
> > > XFS_ISTALE at this point and waiting for a CIL push before they can
> > > be reclaimed. At this point an retry of the finobt scan will find
> > > those inodes that were busy now available for allocation.
> > 
> > Remind me what aspect of the prospective VFS changes prevents inode
> > allocation from seeing a free inode with not-yet-elapsed RCU grace
> > period..? Will this delay freeing (or evicting) the inode until a grace
> > period elapses from when the last reference was dropped?
> 
> It will delay it until the inode has been reclaimed, in which case
> reallocating it will result in a new struct xfs_inode being
> allocated from slab, and we're all good. Any lookup that finds the
> old inode will see either I_WILL_FREE, I_FREEING, I_CLEAR,
> XFS_NEED_INACTIVE, XFS_IRECLAIM or ip->i_ino == 0 depending on what
> point the lookup occurs. Hence the lookup will be unable to take a
> new reference to the unlinked inode, and the lookup retry should
> then get the newly allocated xfs_inode once it has been inserted
> into the radix tree...
> 
> IOWs, it gets rid of recycling altogether, and that's how we avoid
> the RCU grace period issues with recycled inodes.
> 
> > > > Perhaps a simple enough short term option is to use the existing block
> > > > alloc min/max range mechanisms (as mentioned on IRC). For example:
> > > > 
> > > > - Use the existing min/max_agbno allocation arg input values to attempt
> > > >   one or two chunk allocs outside of the known range of busy inodes for
> > > >   the AG (i.e., allocate blocks higher than the max busy agino or lower
> > > >   than the min busy agino).
> > > > - If success, then we know we've got a chunk w/o busy inodes.
> > > > - If failure, fall back to the existing chunk alloc calls, take whatever
> > > >   we get and retry the finobt scan (perhaps more aggressively checking
> > > >   each record) hoping we got a usable new record.
> > > > - If that still fails, then fall back to synchronize_rcu() as a last
> > > >   resort and grab one of the previously busy inodes.
> > > > 
> > > > I couldn't say for sure if that would be effective enough without
> > > > playing with it a bit, but that would sort of emulate an algorithm that
> > > > filtered chunk block allocations with at least one busy inode without
> > > > having to modify block allocation code. If it avoids an RCU sync in the
> > > > majority of cases it might be effective enough as a stopgap until
> > > > background freeing exists. Thoughts?
> > > 
> > > It might work, but I'm not a fan of how many hoops we are considering
> > > jumping through to avoid getting tangled up in the RCU requirements
> > > for VFS inode life cycles. I'd much prefer just being able to say
> > > "all inodes busy, log force, try again" like we do with busy extent
> > > limited block allocation...
> > > 
> > 
> > Well presumably that works fine for your implementation of busy inodes,
> > but not so much for the variant intended to work around the RCU inode
> > reuse problem. ISTM this all just depends on the goals and plan here,
> > and I'm not seeing clear enough reasoning to grok what that is. A
> > summary of the options discussed so far:
> > 
> > - deferred inode freeing - ideal but too involved, want something sooner
> >   and more simple
> > - hinted non-busy chunk allocation - simple but jumping through too many
> >   RCU requirement hoops
> > - sync RCU and retry - most analogous to longer term busy inode
> >   allocation handling (i.e. just replace rcu sync w/ log force and
> >   retry), but RCU sync is too performance costly as a direct fallback
> > 
> > ISTM the options here range from simple and slow, to moderately simple
> > and effective (TBD), to complex and ideal. So what is the goal for this
> > series?
> 
> To find out if we can do something fast and effective (and without
> hacks that might bite us unexpectedly) with RCU grace periods that
> is less intrusive than changing inode life cycles. If it turns out
> that we can change the inode life cycle faster than we can come up
> with a RCU grace period based solution, then we should just go with
> inode life cycle changes and forget about the interim RCU grace
> period detection changes...
> 
> > My understanding to this point is that VFS changes are a ways
> > out, so the first step was busy inodes == inodes with pending grace
> > periods and a rework of the busy inode definition comes later with the
> > associated VFS changes. That essentially means this series gets fixed up
> > and posted as a mergeable component in the meantime, albeit with a
> > "general direction" that is compatible with longer term changes.
> 
> It seems to me that the RCU detection code is also a ways out,
> so I'm trying to keep our options open and not have us duplicate
> work unnecessarily.
> 

It depends..? All discussions to this point around the "proper" fix to
the RCU inode recycle / lifecycle issue suggested a timeline of a year
or longer. Hence my initial thought process around the busy inode +
deferred inode freeing being a reasonable strategy. If the chunk
allocation hint thing is effective, then it's really just a matter of
factoring that in along with some allocation cleanups (i.e.
one chunk alloc retry limit, refining the !busy inode selection logic,
etc.) and then perhaps a few weeks or so of review/test/dev cycles on
top of that.

I've no real sense of where the lifecycle stuff is at or how invasive it
is, so have no real reference to compare against. In any event, my
attempts to progress this have kind of tripped my early whack-a-mole
detector so I'll leave this alone for now and we can come back to it as
an incremental step if the need arises.

Brian

> > However, every RCU requirement/characteristic that this series has to
> > deal with is never going to be 100% perfectly aligned with the longer
> > term busy inode requirements because the implementations/definitions
> > will differ, particularly if the RCU handling is pushed up into the VFS.
> > So if we're concerned about that, the alternative approach is to skip
> > incrementally addressing the RCU inode reuse problem and just fold
> > whatever bits of useful logic from here into your inode lifecycle work
> > as needed and drop this as a mergeable component.
> 
> *nod*
> 
> The signs are pointing somewhat that way, but I'm still finding
> the occasional unexpected gotcha in the code that I have to work
> around. I think I've got them all, but until I've got it working it
> pays to keep our options open....
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2022-03-01 15:06 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-17 17:25 [PATCH RFC 0/4] xfs: track and skip realloc of busy inodes Brian Foster
2022-02-17 17:25 ` [PATCH RFC 1/4] xfs: require an rcu grace period before inode recycle Brian Foster
2022-02-17 17:25 ` [PATCH RFC 2/4] xfs: tag reclaimable inodes with pending RCU grace periods as busy Brian Foster
2022-02-17 23:16   ` Dave Chinner
2022-02-18 14:19     ` Brian Foster
2022-02-17 17:25 ` [PATCH RFC 3/4] xfs: crude chunk allocation retry mechanism Brian Foster
2022-02-17 23:20   ` Dave Chinner
2022-02-18 14:21     ` Brian Foster
2022-02-18 22:54       ` Dave Chinner
2022-02-20 18:48         ` Brian Foster
2022-02-23  7:00           ` Dave Chinner
2022-02-28 21:45             ` Brian Foster
2022-02-28 22:55               ` Dave Chinner
2022-03-01 15:05                 ` Brian Foster
2022-02-17 17:25 ` [PATCH RFC 4/4] xfs: skip busy inodes on finobt inode allocation Brian Foster
