* [PATCH 0/2 V2] xfs: non-blocking inodegc pushes
@ 2022-06-15 22:04 Dave Chinner
  2022-06-15 22:04 ` [PATCH 1/2] xfs: bound maximum wait time for inodegc work Dave Chinner
  2022-06-15 22:04 ` [PATCH 2/2] xfs: introduce xfs_inodegc_push() Dave Chinner
  0 siblings, 2 replies; 16+ messages in thread
From: Dave Chinner @ 2022-06-15 22:04 UTC (permalink / raw)
  To: linux-xfs

Hi folks,

These patches introduce non-blocking inodegc pushes to fix long
hold-offs in statfs() operations when inodegc is performing long
running inode inactivation operations.

The first patch introduces a bound on the maximum work start time for
the inodegc queues - it's short, only 1 jiffy (10ms) - because we
don't want to delay inodegc for an arbitrarily long period of time.
This means that work always starts quickly, which reduces the need
for statfs() to wait for background inodegc to start and complete
before it can see space "freed" by recent unlinks.
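
As a rough illustration of that pattern (a sketch with made-up names,
not code from this series - the real queueing logic is in patch 1
below):

#include <linux/workqueue.h>

static void example_worker(struct work_struct *work)
{
	/* drain whatever has accumulated since the last run */
}

static DECLARE_DELAYED_WORK(example_work, example_worker);

static void example_queue(struct workqueue_struct *wq, bool run_now)
{
	/*
	 * 1 jiffy bound: the work always starts soon even if nothing
	 * else triggers it, and a depth threshold can collapse the
	 * delay to zero.  mod_delayed_work() adjusts the timer if the
	 * work is already pending.
	 */
	mod_delayed_work(wq, &example_work, run_now ? 0 : 1);
}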

The second patch converts statfs to use a "push" rather than a
"flush". The push simply schedules any pending work that hasn't yet
timed out to run immediately and returns. It does not wait for the
inodegc work to complete - that's what a flush does, and that's what
caused all the problems for statfs(). Hence statfs() is converted to
push semantics at the same time, thereby removing the blocking
behaviour it currently has.
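
The distinction, again only as a sketch with invented names building
on the one above (see patch 2 for the actual implementation), is
simply whether we wait for the workqueue to drain:

static void example_push(struct workqueue_struct *wq,
		struct delayed_work *dwork)
{
	/* expedite: schedule pending work to run now and return */
	mod_delayed_work(wq, dwork, 0);
}

static void example_flush(struct workqueue_struct *wq,
		struct delayed_work *dwork)
{
	/* flush = push + block until all queued work has completed */
	example_push(wq, dwork);
	flush_workqueue(wq);
}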

This should prevent most of the issues we have been seeing with lots
of processes stuck in statfs() - that will no longer happen. The only
time user processes should get stuck now is when the inodegc throttle
kicks in (unlinks only at this point) or if we are waiting for a lock
held by a long running inodegc operation to be released. We had those
specific problems before background inodegc - they manifested as
unkillable unlink operations that had everything backed up on them
instead of background inodegc having everything backed up on it.

This patchset has been running in my test environment for nearly a
month now without regressions. While there are likely
still going to be inodegc flush related hold-offs in certain
circumstances, nothing appears to be impacting the correctness of
fstests tests or creating new issues. The 0-day kernel testing bot
also indicates that certain benchmarks (such as aim7 and
stress-ng.rename) run significantly faster with bound maximum delays
and non-blocking statfs operations.

Comments, thoughts and testing appreciated.

-Dave.

Version 2:
- Also convert quota reporting inodegc flushes to a push.




* [PATCH 1/2] xfs: bound maximum wait time for inodegc work
  2022-06-15 22:04 [PATCH 0/2 V2] xfs: non-blocking inodegc pushes Dave Chinner
@ 2022-06-15 22:04 ` Dave Chinner
  2022-06-17 16:34   ` Brian Foster
  2022-06-15 22:04 ` [PATCH 2/2] xfs: introduce xfs_inodegc_push() Dave Chinner
  1 sibling, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2022-06-15 22:04 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Currently inodegc work can sit queued on the per-cpu queue until
the workqueue is either flushed or the queue reaches a depth that
triggers work queuing (and later throttling). This means that we
could queue work that waits for a long time for some other event to
trigger flushing.

Hence instead of just queueing work at a specific depth, use a
delayed work that queues the work at a bound time. We can still
schedule the work immediately at a given depth, but we no longer need
to worry about leaving a number of items on the list that won't get
processed until external events prevail.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_icache.c | 36 ++++++++++++++++++++++--------------
 fs/xfs/xfs_mount.h  |  2 +-
 fs/xfs/xfs_super.c  |  2 +-
 3 files changed, 24 insertions(+), 16 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 374b3bafaeb0..46b30ecf498c 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -442,7 +442,7 @@ xfs_inodegc_queue_all(
 	for_each_online_cpu(cpu) {
 		gc = per_cpu_ptr(mp->m_inodegc, cpu);
 		if (!llist_empty(&gc->list))
-			queue_work_on(cpu, mp->m_inodegc_wq, &gc->work);
+			mod_delayed_work_on(cpu, mp->m_inodegc_wq, &gc->work, 0);
 	}
 }
 
@@ -1844,8 +1844,8 @@ void
 xfs_inodegc_worker(
 	struct work_struct	*work)
 {
-	struct xfs_inodegc	*gc = container_of(work, struct xfs_inodegc,
-							work);
+	struct xfs_inodegc	*gc = container_of(to_delayed_work(work),
+						struct xfs_inodegc, work);
 	struct llist_node	*node = llist_del_all(&gc->list);
 	struct xfs_inode	*ip, *n;
 
@@ -2017,6 +2017,7 @@ xfs_inodegc_queue(
 	struct xfs_inodegc	*gc;
 	int			items;
 	unsigned int		shrinker_hits;
+	unsigned long		queue_delay = 1;
 
 	trace_xfs_inode_set_need_inactive(ip);
 	spin_lock(&ip->i_flags_lock);
@@ -2028,19 +2029,26 @@ xfs_inodegc_queue(
 	items = READ_ONCE(gc->items);
 	WRITE_ONCE(gc->items, items + 1);
 	shrinker_hits = READ_ONCE(gc->shrinker_hits);
-	put_cpu_ptr(gc);
 
-	if (!xfs_is_inodegc_enabled(mp))
+	/*
+	 * We queue the work while holding the current CPU so that the work
+	 * is scheduled to run on this CPU.
+	 */
+	if (!xfs_is_inodegc_enabled(mp)) {
+		put_cpu_ptr(gc);
 		return;
-
-	if (xfs_inodegc_want_queue_work(ip, items)) {
-		trace_xfs_inodegc_queue(mp, __return_address);
-		queue_work(mp->m_inodegc_wq, &gc->work);
 	}
 
+	if (xfs_inodegc_want_queue_work(ip, items))
+		queue_delay = 0;
+
+	trace_xfs_inodegc_queue(mp, __return_address);
+	mod_delayed_work(mp->m_inodegc_wq, &gc->work, queue_delay);
+	put_cpu_ptr(gc);
+
 	if (xfs_inodegc_want_flush_work(ip, items, shrinker_hits)) {
 		trace_xfs_inodegc_throttle(mp, __return_address);
-		flush_work(&gc->work);
+		flush_delayed_work(&gc->work);
 	}
 }
 
@@ -2057,7 +2065,7 @@ xfs_inodegc_cpu_dead(
 	unsigned int		count = 0;
 
 	dead_gc = per_cpu_ptr(mp->m_inodegc, dead_cpu);
-	cancel_work_sync(&dead_gc->work);
+	cancel_delayed_work_sync(&dead_gc->work);
 
 	if (llist_empty(&dead_gc->list))
 		return;
@@ -2076,12 +2084,12 @@ xfs_inodegc_cpu_dead(
 	llist_add_batch(first, last, &gc->list);
 	count += READ_ONCE(gc->items);
 	WRITE_ONCE(gc->items, count);
-	put_cpu_ptr(gc);
 
 	if (xfs_is_inodegc_enabled(mp)) {
 		trace_xfs_inodegc_queue(mp, __return_address);
-		queue_work(mp->m_inodegc_wq, &gc->work);
+		mod_delayed_work(mp->m_inodegc_wq, &gc->work, 0);
 	}
+	put_cpu_ptr(gc);
 }
 
 /*
@@ -2176,7 +2184,7 @@ xfs_inodegc_shrinker_scan(
 			unsigned int	h = READ_ONCE(gc->shrinker_hits);
 
 			WRITE_ONCE(gc->shrinker_hits, h + 1);
-			queue_work_on(cpu, mp->m_inodegc_wq, &gc->work);
+			mod_delayed_work_on(cpu, mp->m_inodegc_wq, &gc->work, 0);
 			no_items = false;
 		}
 	}
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index ba5d42abf66e..d2eaebd85abf 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -61,7 +61,7 @@ struct xfs_error_cfg {
  */
 struct xfs_inodegc {
 	struct llist_head	list;
-	struct work_struct	work;
+	struct delayed_work	work;
 
 	/* approximate count of inodes in the list */
 	unsigned int		items;
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 48a7239ed1b2..651ae75a7e23 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1075,7 +1075,7 @@ xfs_inodegc_init_percpu(
 		gc = per_cpu_ptr(mp->m_inodegc, cpu);
 		init_llist_head(&gc->list);
 		gc->items = 0;
-		INIT_WORK(&gc->work, xfs_inodegc_worker);
+		INIT_DELAYED_WORK(&gc->work, xfs_inodegc_worker);
 	}
 	return 0;
 }
-- 
2.35.1



* [PATCH 2/2] xfs: introduce xfs_inodegc_push()
  2022-06-15 22:04 [PATCH 0/2 V2] xfs: non-blocking inodegc pushes Dave Chinner
  2022-06-15 22:04 ` [PATCH 1/2] xfs: bound maximum wait time for inodegc work Dave Chinner
@ 2022-06-15 22:04 ` Dave Chinner
  2022-06-22  5:21   ` Darrick J. Wong
  1 sibling, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2022-06-15 22:04 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

The current blocking mechanism for pushing the inodegc queue out to
disk can result in systems becoming unusable when there is a long
running inodegc operation. This is because the statfs()
implementation currently issues a blocking flush of the inodegc
queue and a significant number of common system utilities will call
statfs() to discover something about the underlying filesystem.

This can result in userspace operations getting stuck on inodegc
progress, and when trying to remove a heavily reflinked file on slow
storage with a full journal, this can result in delays measuring in
hours.

Avoid this problem by adding a "push" function that expedites the
flushing of the inodegc queue, but doesn't wait for it to complete.

Convert xfs_fs_statfs() and xfs_qm_scall_getquota() to use this
mechanism so they don't block but still ensure that queued
operations are expedited.

Fixes: ab23a7768739 ("xfs: per-cpu deferred inode inactivation queues")
Reported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_icache.c      | 20 +++++++++++++++-----
 fs/xfs/xfs_icache.h      |  1 +
 fs/xfs/xfs_qm_syscalls.c |  7 +++++--
 fs/xfs/xfs_super.c       |  7 +++++--
 fs/xfs/xfs_trace.h       |  1 +
 5 files changed, 27 insertions(+), 9 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 46b30ecf498c..aef4097ffd3e 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1865,19 +1865,29 @@ xfs_inodegc_worker(
 }
 
 /*
- * Force all currently queued inode inactivation work to run immediately and
- * wait for the work to finish.
+ * Expedite all pending inodegc work to run immediately. This does not wait for
+ * completion of the work.
  */
 void
-xfs_inodegc_flush(
+xfs_inodegc_push(
 	struct xfs_mount	*mp)
 {
 	if (!xfs_is_inodegc_enabled(mp))
 		return;
+	trace_xfs_inodegc_push(mp, __return_address);
+	xfs_inodegc_queue_all(mp);
+}
 
+/*
+ * Force all currently queued inode inactivation work to run immediately and
+ * wait for the work to finish.
+ */
+void
+xfs_inodegc_flush(
+	struct xfs_mount	*mp)
+{
+	xfs_inodegc_push(mp);
 	trace_xfs_inodegc_flush(mp, __return_address);
-
-	xfs_inodegc_queue_all(mp);
 	flush_workqueue(mp->m_inodegc_wq);
 }
 
diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
index 2e4cfddf8b8e..6cd180721659 100644
--- a/fs/xfs/xfs_icache.h
+++ b/fs/xfs/xfs_icache.h
@@ -76,6 +76,7 @@ void xfs_blockgc_stop(struct xfs_mount *mp);
 void xfs_blockgc_start(struct xfs_mount *mp);
 
 void xfs_inodegc_worker(struct work_struct *work);
+void xfs_inodegc_push(struct xfs_mount *mp);
 void xfs_inodegc_flush(struct xfs_mount *mp);
 void xfs_inodegc_stop(struct xfs_mount *mp);
 void xfs_inodegc_start(struct xfs_mount *mp);
diff --git a/fs/xfs/xfs_qm_syscalls.c b/fs/xfs/xfs_qm_syscalls.c
index 74ac9ca9e119..a30f4d067746 100644
--- a/fs/xfs/xfs_qm_syscalls.c
+++ b/fs/xfs/xfs_qm_syscalls.c
@@ -454,9 +454,12 @@ xfs_qm_scall_getquota(
 	struct xfs_dquot	*dqp;
 	int			error;
 
-	/* Flush inodegc work at the start of a quota reporting scan. */
+	/*
+	 * Expedite pending inodegc work at the start of a quota reporting
+	 * scan but don't block waiting for it to complete.
+	 */
 	if (id == 0)
-		xfs_inodegc_flush(mp);
+		xfs_inodegc_push(mp);
 
 	/*
 	 * Try to get the dquot. We don't want it allocated on disk, so don't
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 651ae75a7e23..4edee1d3784a 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -798,8 +798,11 @@ xfs_fs_statfs(
 	xfs_extlen_t		lsize;
 	int64_t			ffree;
 
-	/* Wait for whatever inactivations are in progress. */
-	xfs_inodegc_flush(mp);
+	/*
+	 * Expedite background inodegc but don't wait. We do not want to block
+	 * here waiting hours for a billion extent file to be truncated.
+	 */
+	xfs_inodegc_push(mp);
 
 	statp->f_type = XFS_SUPER_MAGIC;
 	statp->f_namelen = MAXNAMELEN - 1;
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index d32026585c1b..0fa1b7a2918c 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -240,6 +240,7 @@ DEFINE_EVENT(xfs_fs_class, name,					\
 	TP_PROTO(struct xfs_mount *mp, void *caller_ip), \
 	TP_ARGS(mp, caller_ip))
 DEFINE_FS_EVENT(xfs_inodegc_flush);
+DEFINE_FS_EVENT(xfs_inodegc_push);
 DEFINE_FS_EVENT(xfs_inodegc_start);
 DEFINE_FS_EVENT(xfs_inodegc_stop);
 DEFINE_FS_EVENT(xfs_inodegc_queue);
-- 
2.35.1



* Re: [PATCH 1/2] xfs: bound maximum wait time for inodegc work
  2022-06-15 22:04 ` [PATCH 1/2] xfs: bound maximum wait time for inodegc work Dave Chinner
@ 2022-06-17 16:34   ` Brian Foster
  2022-06-17 21:52     ` Dave Chinner
  0 siblings, 1 reply; 16+ messages in thread
From: Brian Foster @ 2022-06-17 16:34 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Jun 16, 2022 at 08:04:15AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Currently inodegc work can sit queued on the per-cpu queue until
> the workqueue is either flushed of the queue reaches a depth that
> triggers work queuing (and later throttling). This means that we
> could queue work that waits for a long time for some other event to
> trigger flushing.
> 
> Hence instead of just queueing work at a specific depth, use a
> delayed work that queues the work at a bound time. We can still
> schedule the work immediately at a given depth, but we no long need
> to worry about leaving a number of items on the list that won't get
> processed until external events prevail.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_icache.c | 36 ++++++++++++++++++++++--------------
>  fs/xfs/xfs_mount.h  |  2 +-
>  fs/xfs/xfs_super.c  |  2 +-
>  3 files changed, 24 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 374b3bafaeb0..46b30ecf498c 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
...
> @@ -2176,7 +2184,7 @@ xfs_inodegc_shrinker_scan(
>  			unsigned int	h = READ_ONCE(gc->shrinker_hits);
>  
>  			WRITE_ONCE(gc->shrinker_hits, h + 1);
> -			queue_work_on(cpu, mp->m_inodegc_wq, &gc->work);
> +			mod_delayed_work_on(cpu, mp->m_inodegc_wq, &gc->work, 0);
>  			no_items = false;
>  		}

This all seems reasonable to me, but is there much practical benefit to
shrinker infra/feedback just to expedite a delayed work item by one
jiffy? Maybe there's a use case to continue to trigger throttling..? If
so, it looks like decent enough overhead to cycle through every cpu in
both callbacks that it might be worth spelling out more clearly in the
top-level comment.

Brian

>  	}
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index ba5d42abf66e..d2eaebd85abf 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -61,7 +61,7 @@ struct xfs_error_cfg {
>   */
>  struct xfs_inodegc {
>  	struct llist_head	list;
> -	struct work_struct	work;
> +	struct delayed_work	work;
>  
>  	/* approximate count of inodes in the list */
>  	unsigned int		items;
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 48a7239ed1b2..651ae75a7e23 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1075,7 +1075,7 @@ xfs_inodegc_init_percpu(
>  		gc = per_cpu_ptr(mp->m_inodegc, cpu);
>  		init_llist_head(&gc->list);
>  		gc->items = 0;
> -		INIT_WORK(&gc->work, xfs_inodegc_worker);
> +		INIT_DELAYED_WORK(&gc->work, xfs_inodegc_worker);
>  	}
>  	return 0;
>  }
> -- 
> 2.35.1
> 



* Re: [PATCH 1/2] xfs: bound maximum wait time for inodegc work
  2022-06-17 16:34   ` Brian Foster
@ 2022-06-17 21:52     ` Dave Chinner
  2022-06-22  5:20       ` Darrick J. Wong
  0 siblings, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2022-06-17 21:52 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Fri, Jun 17, 2022 at 12:34:38PM -0400, Brian Foster wrote:
> On Thu, Jun 16, 2022 at 08:04:15AM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Currently inodegc work can sit queued on the per-cpu queue until
> > the workqueue is either flushed of the queue reaches a depth that
> > triggers work queuing (and later throttling). This means that we
> > could queue work that waits for a long time for some other event to
> > trigger flushing.
> > 
> > Hence instead of just queueing work at a specific depth, use a
> > delayed work that queues the work at a bound time. We can still
> > schedule the work immediately at a given depth, but we no long need
> > to worry about leaving a number of items on the list that won't get
> > processed until external events prevail.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_icache.c | 36 ++++++++++++++++++++++--------------
> >  fs/xfs/xfs_mount.h  |  2 +-
> >  fs/xfs/xfs_super.c  |  2 +-
> >  3 files changed, 24 insertions(+), 16 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > index 374b3bafaeb0..46b30ecf498c 100644
> > --- a/fs/xfs/xfs_icache.c
> > +++ b/fs/xfs/xfs_icache.c
> ...
> > @@ -2176,7 +2184,7 @@ xfs_inodegc_shrinker_scan(
> >  			unsigned int	h = READ_ONCE(gc->shrinker_hits);
> >  
> >  			WRITE_ONCE(gc->shrinker_hits, h + 1);
> > -			queue_work_on(cpu, mp->m_inodegc_wq, &gc->work);
> > +			mod_delayed_work_on(cpu, mp->m_inodegc_wq, &gc->work, 0);
> >  			no_items = false;
> >  		}
> 
> This all seems reasonable to me, but is there much practical benefit to
> shrinker infra/feedback just to expedite a delayed work item by one
> jiffy? Maybe there's a use case to continue to trigger throttling..?

I haven't really considered doing anything other than fixing the
reported bug. That just requires an API conversion for the existing
"queue immediately" semantics and is the safest minimum change
to fix the issue at hand.

So, yes, the shrinker code may (or may not) be superfluous now, but
I haven't looked at it and done analysis of the behaviour without
the shrinkers enabled. I'll do that in a completely separate
patchset if it turns out that it is not needed now.

> If
> so, it looks like decent enough overhead to cycle through every cpu in
> both callbacks that it might be worth spelling out more clearly in the
> top-level comment.

I'm not sure what you are asking here - mod_delayed_work_on() has
pretty much the same overhead and behaviour as queue_work() in this
case, so... ?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 1/2] xfs: bound maximum wait time for inodegc work
  2022-06-17 21:52     ` Dave Chinner
@ 2022-06-22  5:20       ` Darrick J. Wong
  2022-06-22 16:13         ` Brian Foster
  0 siblings, 1 reply; 16+ messages in thread
From: Darrick J. Wong @ 2022-06-22  5:20 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Brian Foster, linux-xfs

On Sat, Jun 18, 2022 at 07:52:45AM +1000, Dave Chinner wrote:
> On Fri, Jun 17, 2022 at 12:34:38PM -0400, Brian Foster wrote:
> > On Thu, Jun 16, 2022 at 08:04:15AM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > Currently inodegc work can sit queued on the per-cpu queue until
> > > the workqueue is either flushed of the queue reaches a depth that
> > > triggers work queuing (and later throttling). This means that we
> > > could queue work that waits for a long time for some other event to
> > > trigger flushing.
> > > 
> > > Hence instead of just queueing work at a specific depth, use a
> > > delayed work that queues the work at a bound time. We can still
> > > schedule the work immediately at a given depth, but we no long need
> > > to worry about leaving a number of items on the list that won't get
> > > processed until external events prevail.
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >  fs/xfs/xfs_icache.c | 36 ++++++++++++++++++++++--------------
> > >  fs/xfs/xfs_mount.h  |  2 +-
> > >  fs/xfs/xfs_super.c  |  2 +-
> > >  3 files changed, 24 insertions(+), 16 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > > index 374b3bafaeb0..46b30ecf498c 100644
> > > --- a/fs/xfs/xfs_icache.c
> > > +++ b/fs/xfs/xfs_icache.c
> > ...
> > > @@ -2176,7 +2184,7 @@ xfs_inodegc_shrinker_scan(
> > >  			unsigned int	h = READ_ONCE(gc->shrinker_hits);
> > >  
> > >  			WRITE_ONCE(gc->shrinker_hits, h + 1);
> > > -			queue_work_on(cpu, mp->m_inodegc_wq, &gc->work);
> > > +			mod_delayed_work_on(cpu, mp->m_inodegc_wq, &gc->work, 0);
> > >  			no_items = false;
> > >  		}
> > 
> > This all seems reasonable to me, but is there much practical benefit to
> > shrinker infra/feedback just to expedite a delayed work item by one
> > jiffy? Maybe there's a use case to continue to trigger throttling..?
> 
> I haven't really considered doing anything other than fixing the
> reported bug. That just requires an API conversion for the existing
> "queue immediately" semantics and is the safest minimum change
> to fix the issue at hand.
> 
> So, yes, the shrinker code may (or may not) be superfluous now, but
> I haven't looked at it and done analysis of the behaviour without
> the shrinkers enabled. I'll do that in a completely separate
> patchset if it turns out that it is not needed now.

I think the shrinker part is still necessary -- bulkstat and xfs_scrub
on a very low memory machine (~560M RAM) opening and closing tens of
millions of files can still OOM the machine if one doesn't have a means
to slow down ->destroy_inode (and hence the next open()) when reclaim
really starts to dig in.  Without the shrinker bits, it's even easier to
trigger OOM storms when xfs has timer-delayed inactivation... which is
something that Brian pointed out a year ago when we were reviewing the
initial inodegc patchset.

> > If
> > so, it looks like decent enough overhead to cycle through every cpu in
> > both callbacks that it might be worth spelling out more clearly in the
> > top-level comment.
> 
> I'm not sure what you are asking here - mod_delayed_work_on() has
> pretty much the same overhead and behaviour as queue_work() in this
> case, so... ?

<shrug> Looks ok to me, since djwong-dev has had some variant of timer
delayed inactivation in it longer than it hasn't:

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: [PATCH 2/2] xfs: introduce xfs_inodegc_push()
  2022-06-15 22:04 ` [PATCH 2/2] xfs: introduce xfs_inodegc_push() Dave Chinner
@ 2022-06-22  5:21   ` Darrick J. Wong
  0 siblings, 0 replies; 16+ messages in thread
From: Darrick J. Wong @ 2022-06-22  5:21 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Jun 16, 2022 at 08:04:16AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The current blocking mechanism for pushing the inodegc queue out to
> disk can result in systems becoming unusable when there is a long
> running inodegc operation. This is because the statfs()
> implementation currently issues a blocking flush of the inodegc
> queue and a significant number of common system utilities will call
> statfs() to discover something about the underlying filesystem.
> 
> This can result in userspace operations getting stuck on inodegc
> progress, and when trying to remove a heavily reflinked file on slow
> storage with a full journal, this can result in delays measuring in
> hours.
> 
> Avoid this problem by adding "push" function that expedites the
> flushing of the inodegc queue, but doesn't wait for it to complete.
> 
> Convert xfs_fs_statfs() and xfs_qm_scall_getquota() to use this
> mechanism so they don't block but still ensure that queued
> operations are expedited.
> 
> Fixes: ab23a7768739 ("xfs: per-cpu deferred inode inactivation queues")
> Reported-by: Chris Dunlop <chris@onthe.net.au>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Looks good,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_icache.c      | 20 +++++++++++++++-----
>  fs/xfs/xfs_icache.h      |  1 +
>  fs/xfs/xfs_qm_syscalls.c |  7 +++++--
>  fs/xfs/xfs_super.c       |  7 +++++--
>  fs/xfs/xfs_trace.h       |  1 +
>  5 files changed, 27 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 46b30ecf498c..aef4097ffd3e 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -1865,19 +1865,29 @@ xfs_inodegc_worker(
>  }
>  
>  /*
> - * Force all currently queued inode inactivation work to run immediately and
> - * wait for the work to finish.
> + * Expedite all pending inodegc work to run immediately. This does not wait for
> + * completion of the work.
>   */
>  void
> -xfs_inodegc_flush(
> +xfs_inodegc_push(
>  	struct xfs_mount	*mp)
>  {
>  	if (!xfs_is_inodegc_enabled(mp))
>  		return;
> +	trace_xfs_inodegc_push(mp, __return_address);
> +	xfs_inodegc_queue_all(mp);
> +}
>  
> +/*
> + * Force all currently queued inode inactivation work to run immediately and
> + * wait for the work to finish.
> + */
> +void
> +xfs_inodegc_flush(
> +	struct xfs_mount	*mp)
> +{
> +	xfs_inodegc_push(mp);
>  	trace_xfs_inodegc_flush(mp, __return_address);
> -
> -	xfs_inodegc_queue_all(mp);
>  	flush_workqueue(mp->m_inodegc_wq);
>  }
>  
> diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
> index 2e4cfddf8b8e..6cd180721659 100644
> --- a/fs/xfs/xfs_icache.h
> +++ b/fs/xfs/xfs_icache.h
> @@ -76,6 +76,7 @@ void xfs_blockgc_stop(struct xfs_mount *mp);
>  void xfs_blockgc_start(struct xfs_mount *mp);
>  
>  void xfs_inodegc_worker(struct work_struct *work);
> +void xfs_inodegc_push(struct xfs_mount *mp);
>  void xfs_inodegc_flush(struct xfs_mount *mp);
>  void xfs_inodegc_stop(struct xfs_mount *mp);
>  void xfs_inodegc_start(struct xfs_mount *mp);
> diff --git a/fs/xfs/xfs_qm_syscalls.c b/fs/xfs/xfs_qm_syscalls.c
> index 74ac9ca9e119..a30f4d067746 100644
> --- a/fs/xfs/xfs_qm_syscalls.c
> +++ b/fs/xfs/xfs_qm_syscalls.c
> @@ -454,9 +454,12 @@ xfs_qm_scall_getquota(
>  	struct xfs_dquot	*dqp;
>  	int			error;
>  
> -	/* Flush inodegc work at the start of a quota reporting scan. */
> +	/*
> +	 * Expedite pending inodegc work at the start of a quota reporting
> +	 * scan but don't block waiting for it to complete.
> +	 */
>  	if (id == 0)
> -		xfs_inodegc_flush(mp);
> +		xfs_inodegc_push(mp);
>  
>  	/*
>  	 * Try to get the dquot. We don't want it allocated on disk, so don't
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 651ae75a7e23..4edee1d3784a 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -798,8 +798,11 @@ xfs_fs_statfs(
>  	xfs_extlen_t		lsize;
>  	int64_t			ffree;
>  
> -	/* Wait for whatever inactivations are in progress. */
> -	xfs_inodegc_flush(mp);
> +	/*
> +	 * Expedite background inodegc but don't wait. We do not want to block
> +	 * here waiting hours for a billion extent file to be truncated.
> +	 */
> +	xfs_inodegc_push(mp);
>  
>  	statp->f_type = XFS_SUPER_MAGIC;
>  	statp->f_namelen = MAXNAMELEN - 1;
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index d32026585c1b..0fa1b7a2918c 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -240,6 +240,7 @@ DEFINE_EVENT(xfs_fs_class, name,					\
>  	TP_PROTO(struct xfs_mount *mp, void *caller_ip), \
>  	TP_ARGS(mp, caller_ip))
>  DEFINE_FS_EVENT(xfs_inodegc_flush);
> +DEFINE_FS_EVENT(xfs_inodegc_push);
>  DEFINE_FS_EVENT(xfs_inodegc_start);
>  DEFINE_FS_EVENT(xfs_inodegc_stop);
>  DEFINE_FS_EVENT(xfs_inodegc_queue);
> -- 
> 2.35.1
> 


* Re: [PATCH 1/2] xfs: bound maximum wait time for inodegc work
  2022-06-22  5:20       ` Darrick J. Wong
@ 2022-06-22 16:13         ` Brian Foster
  2022-06-23  0:25           ` Darrick J. Wong
  0 siblings, 1 reply; 16+ messages in thread
From: Brian Foster @ 2022-06-22 16:13 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs

On Tue, Jun 21, 2022 at 10:20:46PM -0700, Darrick J. Wong wrote:
> On Sat, Jun 18, 2022 at 07:52:45AM +1000, Dave Chinner wrote:
> > On Fri, Jun 17, 2022 at 12:34:38PM -0400, Brian Foster wrote:
> > > On Thu, Jun 16, 2022 at 08:04:15AM +1000, Dave Chinner wrote:
> > > > From: Dave Chinner <dchinner@redhat.com>
> > > > 
> > > > Currently inodegc work can sit queued on the per-cpu queue until
> > > > the workqueue is either flushed of the queue reaches a depth that
> > > > triggers work queuing (and later throttling). This means that we
> > > > could queue work that waits for a long time for some other event to
> > > > trigger flushing.
> > > > 
> > > > Hence instead of just queueing work at a specific depth, use a
> > > > delayed work that queues the work at a bound time. We can still
> > > > schedule the work immediately at a given depth, but we no long need
> > > > to worry about leaving a number of items on the list that won't get
> > > > processed until external events prevail.
> > > > 
> > > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > > ---
> > > >  fs/xfs/xfs_icache.c | 36 ++++++++++++++++++++++--------------
> > > >  fs/xfs/xfs_mount.h  |  2 +-
> > > >  fs/xfs/xfs_super.c  |  2 +-
> > > >  3 files changed, 24 insertions(+), 16 deletions(-)
> > > > 
> > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > > > index 374b3bafaeb0..46b30ecf498c 100644
> > > > --- a/fs/xfs/xfs_icache.c
> > > > +++ b/fs/xfs/xfs_icache.c
> > > ...
> > > > @@ -2176,7 +2184,7 @@ xfs_inodegc_shrinker_scan(
> > > >  			unsigned int	h = READ_ONCE(gc->shrinker_hits);
> > > >  
> > > >  			WRITE_ONCE(gc->shrinker_hits, h + 1);
> > > > -			queue_work_on(cpu, mp->m_inodegc_wq, &gc->work);
> > > > +			mod_delayed_work_on(cpu, mp->m_inodegc_wq, &gc->work, 0);
> > > >  			no_items = false;
> > > >  		}
> > > 
> > > This all seems reasonable to me, but is there much practical benefit to
> > > shrinker infra/feedback just to expedite a delayed work item by one
> > > jiffy? Maybe there's a use case to continue to trigger throttling..?
> > 
> > I haven't really considered doing anything other than fixing the
> > reported bug. That just requires an API conversion for the existing
> > "queue immediately" semantics and is the safest minimum change
> > to fix the issue at hand.
> > 
> > So, yes, the shrinker code may (or may not) be superfluous now, but
> > I haven't looked at it and done analysis of the behaviour without
> > the shrinkers enabled. I'll do that in a completely separate
> > patchset if it turns out that it is not needed now.
> 
> I think the shrinker part is still necessary -- bulkstat and xfs_scrub
> on a very low memory machine (~560M RAM) opening and closing tens of
> millions of files can still OOM the machine if one doesn't have a means
> to slow down ->destroy_inode (and hence the next open()) when reclaim
> really starts to dig in.  Without the shrinker bits, it's even easier to
> trigger OOM storms when xfs has timer-delayed inactivation... which is
> something that Brian pointed out a year ago when we were reviewing the
> initial inodegc patchset.
> 

It wouldn't surprise me if the infrastructure is still necessary for the
throttling use case. In that case, I'm more curious about things like
whether it's still as effective as intended with such a small scheduling
delay, or whether it still might be worth simplifying in various ways
(i.e., does the scheduling delay actually make a difference? do we still
need a per cpu granular throttle? etc.).

> > > If
> > > so, it looks like decent enough overhead to cycle through every cpu in
> > > both callbacks that it might be worth spelling out more clearly in the
> > > top-level comment.
> > 
> > I'm not sure what you are asking here - mod_delayed_work_on() has
> > pretty much the same overhead and behaviour as queue_work() in this
> > case, so... ?
> 

I'm just pointing out that the comment around the shrinker
infrastructure isn't very informative if the shrinker turns out to still
be necessary for reasons other than making the workers run sooner.

> <shrug> Looks ok to me, since djwong-dev has had some variant of timer
> delayed inactivation in it longer than it hasn't:
> 

Was that with a correspondingly small delay or something larger (on the
order of seconds or so)? Either way, it sounds like you have a
predictable enough workload that can actually test this continues to
work as expected..?

Brian

> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> 
> --D
> 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
> 



* Re: [PATCH 1/2] xfs: bound maximum wait time for inodegc work
  2022-06-22 16:13         ` Brian Foster
@ 2022-06-23  0:25           ` Darrick J. Wong
  2022-06-23 11:49             ` Brian Foster
  0 siblings, 1 reply; 16+ messages in thread
From: Darrick J. Wong @ 2022-06-23  0:25 UTC (permalink / raw)
  To: Brian Foster; +Cc: Dave Chinner, linux-xfs

On Wed, Jun 22, 2022 at 12:13:54PM -0400, Brian Foster wrote:
> On Tue, Jun 21, 2022 at 10:20:46PM -0700, Darrick J. Wong wrote:
> > On Sat, Jun 18, 2022 at 07:52:45AM +1000, Dave Chinner wrote:
> > > On Fri, Jun 17, 2022 at 12:34:38PM -0400, Brian Foster wrote:
> > > > On Thu, Jun 16, 2022 at 08:04:15AM +1000, Dave Chinner wrote:
> > > > > From: Dave Chinner <dchinner@redhat.com>
> > > > > 
> > > > > Currently inodegc work can sit queued on the per-cpu queue until
> > > > > the workqueue is either flushed of the queue reaches a depth that
> > > > > triggers work queuing (and later throttling). This means that we
> > > > > could queue work that waits for a long time for some other event to
> > > > > trigger flushing.
> > > > > 
> > > > > Hence instead of just queueing work at a specific depth, use a
> > > > > delayed work that queues the work at a bound time. We can still
> > > > > schedule the work immediately at a given depth, but we no long need
> > > > > to worry about leaving a number of items on the list that won't get
> > > > > processed until external events prevail.
> > > > > 
> > > > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > > > ---
> > > > >  fs/xfs/xfs_icache.c | 36 ++++++++++++++++++++++--------------
> > > > >  fs/xfs/xfs_mount.h  |  2 +-
> > > > >  fs/xfs/xfs_super.c  |  2 +-
> > > > >  3 files changed, 24 insertions(+), 16 deletions(-)
> > > > > 
> > > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > > > > index 374b3bafaeb0..46b30ecf498c 100644
> > > > > --- a/fs/xfs/xfs_icache.c
> > > > > +++ b/fs/xfs/xfs_icache.c
> > > > ...
> > > > > @@ -2176,7 +2184,7 @@ xfs_inodegc_shrinker_scan(
> > > > >  			unsigned int	h = READ_ONCE(gc->shrinker_hits);
> > > > >  
> > > > >  			WRITE_ONCE(gc->shrinker_hits, h + 1);
> > > > > -			queue_work_on(cpu, mp->m_inodegc_wq, &gc->work);
> > > > > +			mod_delayed_work_on(cpu, mp->m_inodegc_wq, &gc->work, 0);
> > > > >  			no_items = false;
> > > > >  		}
> > > > 
> > > > This all seems reasonable to me, but is there much practical benefit to
> > > > shrinker infra/feedback just to expedite a delayed work item by one
> > > > jiffy? Maybe there's a use case to continue to trigger throttling..?
> > > 
> > > I haven't really considered doing anything other than fixing the
> > > reported bug. That just requires an API conversion for the existing
> > > "queue immediately" semantics and is the safest minimum change
> > > to fix the issue at hand.
> > > 
> > > So, yes, the shrinker code may (or may not) be superfluous now, but
> > > I haven't looked at it and done analysis of the behaviour without
> > > the shrinkers enabled. I'll do that in a completely separate
> > > patchset if it turns out that it is not needed now.
> > 
> > I think the shrinker part is still necessary -- bulkstat and xfs_scrub
> > on a very low memory machine (~560M RAM) opening and closing tens of
> > millions of files can still OOM the machine if one doesn't have a means
> > to slow down ->destroy_inode (and hence the next open()) when reclaim
> > really starts to dig in.  Without the shrinker bits, it's even easier to
> > trigger OOM storms when xfs has timer-delayed inactivation... which is
> > something that Brian pointed out a year ago when we were reviewing the
> > initial inodegc patchset.
> > 
> 
> It wouldn't surprise me if the infrastructure is still necessary for the
> throttling use case. In that case, I'm more curious about things like
> whether it's still as effective as intended with such a small scheduling
> delay, or whether it still might be worth simplifying in various ways
> (i.e., does the scheduling delay actually make a difference? do we still
> need a per cpu granular throttle? etc.).

It can still be useful for certain g*dawful scenarios --

Let's say you have a horribly misconfigured cloudy system with a tiny
log, hundreds of CPUs, a memory hogging process, another process with
many hundreds of threads that are performing small appending synchronous
writes to a large number of files, and some other process repeatedly
opens and closes files.  Background writeback completion will create
enough workers to tie up the log such that writeback and inodegc contend
for log grant space and make slow progress.  If memory is also tight,
we want to slow down the file scanning process so that it doesn't shove
/more/ inodes into the cache and push the system towards OOM behavior.

Back in the old days when inodegc was a radix tree tag it was fairly
easy to get OOMs when the delay interval was long (5 seconds).  The
OOM probability went down pretty sharply as the interval approached
zero, but even at 1 jiffy I could still occasionally trip it, whereas
the pre-deferred-inactivation kernels would never OOM.

I haven't tested it all that rigorously with Dave's fancy new per-cpu
list design, but I did throw on my silly test setup (see below) and
still got it to OOM once in 3 runs with the shrinker bits turned off.

> > > > If
> > > > so, it looks like decent enough overhead to cycle through every cpu in
> > > > both callbacks that it might be worth spelling out more clearly in the
> > > > top-level comment.
> > > 
> > > I'm not sure what you are asking here - mod_delayed_work_on() has
> > > pretty much the same overhead and behaviour as queue_work() in this
> > > case, so... ?
> > 
> 
> I'm just pointing out that the comment around the shrinker
> infrastructure isn't very informative if the shrinker turns out to still
> be necessary for reasons other than making the workers run sooner.

<nod> That comment /does/ need to be updated to note the subtlety that a
lot of shrinker activity can slow down close()ing a file by making user
tasks wait for the inodegc workers to clear the backlog.

> > <shrug> Looks ok to me, since djwong-dev has had some variant of timer
> > delayed inactivation in it longer than it hasn't:
> > 
> 
> Was that with a correspondingly small delay or something larger (on the
> order of seconds or so)? Either way, it sounds like you have a
> predictable enough workload that can actually test this continues to
> work as expected..?

Yeah.  I snapshot /home (which has ~20 million inodes) then race
fsstress and xfs_scrub -n in a VM with 560MB of RAM.

--D

> Brian
> 
> > Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> > 
> > --D
> > 
> > > Cheers,
> > > 
> > > Dave.
> > > -- 
> > > Dave Chinner
> > > david@fromorbit.com
> > 
> 


* Re: [PATCH 1/2] xfs: bound maximum wait time for inodegc work
  2022-06-23  0:25           ` Darrick J. Wong
@ 2022-06-23 11:49             ` Brian Foster
  2022-06-23 19:56               ` Darrick J. Wong
  0 siblings, 1 reply; 16+ messages in thread
From: Brian Foster @ 2022-06-23 11:49 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs

On Wed, Jun 22, 2022 at 05:25:36PM -0700, Darrick J. Wong wrote:
> On Wed, Jun 22, 2022 at 12:13:54PM -0400, Brian Foster wrote:
> > On Tue, Jun 21, 2022 at 10:20:46PM -0700, Darrick J. Wong wrote:
> > > On Sat, Jun 18, 2022 at 07:52:45AM +1000, Dave Chinner wrote:
> > > > On Fri, Jun 17, 2022 at 12:34:38PM -0400, Brian Foster wrote:
> > > > > On Thu, Jun 16, 2022 at 08:04:15AM +1000, Dave Chinner wrote:
> > > > > > From: Dave Chinner <dchinner@redhat.com>
> > > > > > 
> > > > > > Currently inodegc work can sit queued on the per-cpu queue until
> > > > > > the workqueue is either flushed of the queue reaches a depth that
> > > > > > triggers work queuing (and later throttling). This means that we
> > > > > > could queue work that waits for a long time for some other event to
> > > > > > trigger flushing.
> > > > > > 
> > > > > > Hence instead of just queueing work at a specific depth, use a
> > > > > > delayed work that queues the work at a bound time. We can still
> > > > > > schedule the work immediately at a given depth, but we no long need
> > > > > > to worry about leaving a number of items on the list that won't get
> > > > > > processed until external events prevail.
> > > > > > 
> > > > > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > > > > ---
> > > > > >  fs/xfs/xfs_icache.c | 36 ++++++++++++++++++++++--------------
> > > > > >  fs/xfs/xfs_mount.h  |  2 +-
> > > > > >  fs/xfs/xfs_super.c  |  2 +-
> > > > > >  3 files changed, 24 insertions(+), 16 deletions(-)
> > > > > > 
> > > > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > > > > > index 374b3bafaeb0..46b30ecf498c 100644
> > > > > > --- a/fs/xfs/xfs_icache.c
> > > > > > +++ b/fs/xfs/xfs_icache.c
> > > > > ...
> > > > > > @@ -2176,7 +2184,7 @@ xfs_inodegc_shrinker_scan(
> > > > > >  			unsigned int	h = READ_ONCE(gc->shrinker_hits);
> > > > > >  
> > > > > >  			WRITE_ONCE(gc->shrinker_hits, h + 1);
> > > > > > -			queue_work_on(cpu, mp->m_inodegc_wq, &gc->work);
> > > > > > +			mod_delayed_work_on(cpu, mp->m_inodegc_wq, &gc->work, 0);
> > > > > >  			no_items = false;
> > > > > >  		}
> > > > > 
> > > > > This all seems reasonable to me, but is there much practical benefit to
> > > > > shrinker infra/feedback just to expedite a delayed work item by one
> > > > > jiffy? Maybe there's a use case to continue to trigger throttling..?
> > > > 
> > > > I haven't really considered doing anything other than fixing the
> > > > reported bug. That just requires an API conversion for the existing
> > > > "queue immediately" semantics and is the safest minimum change
> > > > to fix the issue at hand.
> > > > 
> > > > So, yes, the shrinker code may (or may not) be superfluous now, but
> > > > I haven't looked at it and done analysis of the behaviour without
> > > > the shrinkers enabled. I'll do that in a completely separate
> > > > patchset if it turns out that it is not needed now.
> > > 
> > > I think the shrinker part is still necessary -- bulkstat and xfs_scrub
> > > on a very low memory machine (~560M RAM) opening and closing tens of
> > > millions of files can still OOM the machine if one doesn't have a means
> > > to slow down ->destroy_inode (and hence the next open()) when reclaim
> > > really starts to dig in.  Without the shrinker bits, it's even easier to
> > > trigger OOM storms when xfs has timer-delayed inactivation... which is
> > > something that Brian pointed out a year ago when we were reviewing the
> > > initial inodegc patchset.
> > > 
> > 
> > It wouldn't surprise me if the infrastructure is still necessary for the
> > throttling use case. In that case, I'm more curious about things like
> > whether it's still as effective as intended with such a small scheduling
> > delay, or whether it still might be worth simplifying in various ways
> > (i.e., does the scheduling delay actually make a difference? do we still
> > need a per cpu granular throttle? etc.).
> 
> It can still be useful for certain g*dawful scenarios --
> 
> Let's say you have a horribly misconfigured cloudy system with a tiny
> log, hundreds of CPUs, a memory hogging process, another process with
> many hundreds of threads that are performing small appending synchronous
> writes to a large number of files, and some other process repeatedly
> opens and closes files.  Background writeback completion will create
> enough workers to tie up the log such that writeback and inodegc contend
> for log grant space and make slow progress.  If memory is also tight,
> we want to slow down the file scanning process so that it doesn't shove
> /more/ inodes into the cache and push the system towards OOM behavior.
> 

Yeah, I get the general idea/purpose of the throttling. What I'm probing
at here is whether a case like this is still handled effectively with
such a short scheduling delay. Presumably there is some window before
list size based throttling triggers for which the shrinker is expected
to cover, so that implies the shrinker historically is able to detect
and push populated queues and trigger throttling from the point it is
invoked (whether directly via repeated shrinker invocations or
indirectly via causing larger queue sizes is not clear to me).

The thing that stands out to me as a question wrt to this change is that
the trigger for shrinker induced throttling is the list size at the time
of the callback(s), and that goes from having a lifecycle associated
with the size-oriented scheduling algorithm to a time-based scheduling
lifecycle of one jiffy (also noting that the inodegc worker resets
shrinker_hits before it begins to process inodes). So with that in mind,
how reliable is this lowmem signal based on the list size back to the
tasks creating more work and memory pressure? Once a shrinker invocation
occurs, what are the odds that the callback is able to detect a
populated list and act accordingly?

These questions are somewhat rhetorical because this all seems rather
unpredictable when we consider varying resource availability. The
relevant question for this patch is probably just that somebody has
tested and confirmed that the shrinker hasn't been subtly or indirectly
broken in cases like the one you describe above (where perhaps we might
not have many shrinker callback opportunities to act on before OOM).

> Back in the old days when inodegc was a radix tree tag it was fairly
> easy to get OOMs when the delay interval was long (5 seconds).  The
> OOM probability went down pretty sharply as the interval approached
> zero, but even at 1 jiffy I could still occasionally trip it, whereas
> the pre-deferred-inactivation kernels would never OOM.
> 
> I haven't tested it all that rigorously with Dave's fancy new per-cpu
> list design, but I did throw on my silly test setup (see below) and
> still got it to OOM once in 3 runs with the shrinker bits turned off.
> 

Ok.. so that implies we still need throttling, but I'm not sure what
"fancy percpu list design" refers to. If you have a good test case, I
think the interesting immediate question is: are those OOMs avoided with
this patch but the shrinker infrastructure still in place?

If not, then I wonder if something is going wonky there. If so, I'm
still a bit curious what the behavior looks like and whether it can be
simplified in light of this change, but that's certainly beyond the
scope of this patch.

> > > > > If
> > > > > so, it looks like decent enough overhead to cycle through every cpu in
> > > > > both callbacks that it might be worth spelling out more clearly in the
> > > > > top-level comment.
> > > > 
> > > > I'm not sure what you are asking here - mod_delayed_work_on() has
> > > > pretty much the same overhead and behaviour as queue_work() in this
> > > > case, so... ?
> > > 
> > 
> > I'm just pointing out that the comment around the shrinker
> > infrastructure isn't very informative if the shrinker turns out to still
> > be necessary for reasons other than making the workers run sooner.
> 
> <nod> That comment /does/ need to be updated to note the subtlety that a
> lot of shrinker activity can slow down close()ing a file by making user
> tasks wait for the inodegc workers to clear the backlog.
> 
> > > <shrug> Looks ok to me, since djwong-dev has had some variant of timer
> > > delayed inactivation in it longer than it hasn't:
> > > 
> > 
> > Was that with a correspondingly small delay or something larger (on the
> > order of seconds or so)? Either way, it sounds like you have a
> > predictable enough workload that can actually test this continues to
> > work as expected..?
> 
> Yeah.  I snapshot /home (which has ~20 million inodes) then race
> fsstress and xfs_scrub -n in a VM with 560MB of RAM.
> 

Yeah small delay or yeah large delay?

Brian

> --D
> 
> > Brian
> > 
> > > Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > --D
> > > 
> > > > Cheers,
> > > > 
> > > > Dave.
> > > > -- 
> > > > Dave Chinner
> > > > david@fromorbit.com
> > > 
> > 
> 



* Re: [PATCH 1/2] xfs: bound maximum wait time for inodegc work
  2022-06-23 11:49             ` Brian Foster
@ 2022-06-23 19:56               ` Darrick J. Wong
  2022-06-24 12:39                 ` Brian Foster
  0 siblings, 1 reply; 16+ messages in thread
From: Darrick J. Wong @ 2022-06-23 19:56 UTC (permalink / raw)
  To: Brian Foster; +Cc: Dave Chinner, linux-xfs

On Thu, Jun 23, 2022 at 07:49:46AM -0400, Brian Foster wrote:
> On Wed, Jun 22, 2022 at 05:25:36PM -0700, Darrick J. Wong wrote:
> > On Wed, Jun 22, 2022 at 12:13:54PM -0400, Brian Foster wrote:
> > > On Tue, Jun 21, 2022 at 10:20:46PM -0700, Darrick J. Wong wrote:
> > > > On Sat, Jun 18, 2022 at 07:52:45AM +1000, Dave Chinner wrote:
> > > > > On Fri, Jun 17, 2022 at 12:34:38PM -0400, Brian Foster wrote:
> > > > > > On Thu, Jun 16, 2022 at 08:04:15AM +1000, Dave Chinner wrote:
> > > > > > > From: Dave Chinner <dchinner@redhat.com>
> > > > > > > 
> > > > > > > Currently inodegc work can sit queued on the per-cpu queue until
> > > > > > > the workqueue is either flushed of the queue reaches a depth that
> > > > > > > triggers work queuing (and later throttling). This means that we
> > > > > > > could queue work that waits for a long time for some other event to
> > > > > > > trigger flushing.
> > > > > > > 
> > > > > > > Hence instead of just queueing work at a specific depth, use a
> > > > > > > delayed work that queues the work at a bound time. We can still
> > > > > > > schedule the work immediately at a given depth, but we no long need
> > > > > > > to worry about leaving a number of items on the list that won't get
> > > > > > > processed until external events prevail.
> > > > > > > 
> > > > > > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > > > > > ---
> > > > > > >  fs/xfs/xfs_icache.c | 36 ++++++++++++++++++++++--------------
> > > > > > >  fs/xfs/xfs_mount.h  |  2 +-
> > > > > > >  fs/xfs/xfs_super.c  |  2 +-
> > > > > > >  3 files changed, 24 insertions(+), 16 deletions(-)
> > > > > > > 
> > > > > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > > > > > > index 374b3bafaeb0..46b30ecf498c 100644
> > > > > > > --- a/fs/xfs/xfs_icache.c
> > > > > > > +++ b/fs/xfs/xfs_icache.c
> > > > > > ...
> > > > > > > @@ -2176,7 +2184,7 @@ xfs_inodegc_shrinker_scan(
> > > > > > >  			unsigned int	h = READ_ONCE(gc->shrinker_hits);
> > > > > > >  
> > > > > > >  			WRITE_ONCE(gc->shrinker_hits, h + 1);
> > > > > > > -			queue_work_on(cpu, mp->m_inodegc_wq, &gc->work);
> > > > > > > +			mod_delayed_work_on(cpu, mp->m_inodegc_wq, &gc->work, 0);
> > > > > > >  			no_items = false;
> > > > > > >  		}
> > > > > > 
> > > > > > This all seems reasonable to me, but is there much practical benefit to
> > > > > > shrinker infra/feedback just to expedite a delayed work item by one
> > > > > > jiffy? Maybe there's a use case to continue to trigger throttling..?
> > > > > 
> > > > > I haven't really considered doing anything other than fixing the
> > > > > reported bug. That just requires an API conversion for the existing
> > > > > "queue immediately" semantics and is the safest minimum change
> > > > > to fix the issue at hand.
> > > > > 
> > > > > So, yes, the shrinker code may (or may not) be superfluous now, but
> > > > > I haven't looked at it and done analysis of the behaviour without
> > > > > the shrinkers enabled. I'll do that in a completely separate
> > > > > patchset if it turns out that it is not needed now.
> > > > 
> > > > I think the shrinker part is still necessary -- bulkstat and xfs_scrub
> > > > on a very low memory machine (~560M RAM) opening and closing tens of
> > > > millions of files can still OOM the machine if one doesn't have a means
> > > > to slow down ->destroy_inode (and hence the next open()) when reclaim
> > > > really starts to dig in.  Without the shrinker bits, it's even easier to
> > > > trigger OOM storms when xfs has timer-delayed inactivation... which is
> > > > something that Brian pointed out a year ago when we were reviewing the
> > > > initial inodegc patchset.
> > > > 
> > > 
> > > It wouldn't surprise me if the infrastructure is still necessary for the
> > > throttling use case. In that case, I'm more curious about things like
> > > whether it's still as effective as intended with such a small scheduling
> > > delay, or whether it still might be worth simplifying in various ways
> > > (i.e., does the scheduling delay actually make a difference? do we still
> > > need a per cpu granular throttle? etc.).
> > 
> > It can still be useful for certain g*dawful scenarios --
> > 
> > Let's say you have a horribly misconfigured cloudy system with a tiny
> > log, hundreds of CPUs, a memory hogging process, another process with
> > many hundreds of threads that are performing small appending synchronous
> > writes to a large number of files, and some other process repeatedly
> > opens and closes files.  Background writeback completion will create
> > enough workers to tie up the log such that writeback and inodegc contend
> > for log grant space and make slow progress.  If memory is also tight,
> > we want to slow down the file scanning process so that it doesn't shove
> > /more/ inodes into the cache and push the system towards OOM behavior.
> > 
> 
> Yeah, I get the general idea/purpose of the throttling. What I'm probing
> at here is whether a case like this is still handled effectively with
> such a short scheduling delay.

Given the trace_xfs_inodegc_shrinker_* data I've collected on my
simulator, I think the 1 jiffy delay is handling this well enough that
we only trip that tracepoint about a half dozen times in 2h of trying
to exercise the inode caches.  That said, I haven't really had time to
try this on (say) a 20CPU machine with ridiculously small memory to see
if I get different results.

(Note: I'm still trying to figure out why generic/522 reports
corruption, and hoping that willy's folios pull today just fixes it
magically...)

> Presumably there is some window before
> list size based throttling triggers for which the shrinker is expected
> to cover, so that implies the shrinker historically is able to detect
> and push populated queues and trigger throttling from the point it is
> invoked (whether directly via repeated shrinker invocations or
> indirectly via causing larger queue sizes is not clear to me).

Back when I merged the inodegc series, it was fairly easy to get the
shrinker to trip dozens of times during a stress run, even when the only
throttling criteria was the queue size.  I think the reflink/rmap
transaction reservation optimizations have made the log grant
bottlenecks much less severe, since I now see a lot less contention on
the grant heads.

> The thing that stands out to me as a question wrt to this change is that
> the trigger for shrinker induced throttling is the list size at the time
> of the callback(s), and that goes from having a lifecycle associated
> with the size-oriented scheduling algorithm to a time-based scheduling
> lifecycle of one jiffy (also noting that the inodegc worker resets
> shrinker_hits before it begins to process inodes). So with that in mind,
> how reliable is this lowmem signal based on the list size back to the
> tasks creating more work and memory pressure? Once a shrinker invocation
> occurs, what are the odds that the callback is able to detect a
> populated list and act accordingly?

Hrmmm.  Decent, I think?  If the list is populated but the inactivation
worker is not running, then we queue the worker and clear the list; if
the worker is already running when the shrinker gets called, we might
end up requeuing it unnecessarily, in which case it'll just clear more.

> These questions are somewhat rhetorical because this all seems rather
> unpredictable when we consider varying resource availability.

Not to mention the shrinker itself undergoing various behavioral changes
over the years... :(

> The
> relevant question for this patch is probably just that somebody has
> tested and confirmed that the shrinker hasn't been subtly or indirectly
> broken in cases like the one you describe above (where perhaps we might
> not have many shrinker callback opportunities to act on before OOM).

FWIW, the only time I /ever/ saw OOMs (either now with 5.19 or ages ago
with 5.9) was with extreme lowmem testing.  On VMs with more than a
gigabyte or so of memory, I notice that we usually hit the time/length
thresholds without shrinkers getting involved.  Granted, I did try to
pick the shrinker values so that we only get called on the *second*
level of shrinking, which is after we've freed some but not enough
memory.

If my responses seem a bit like handwaving, they are, because (at the
moment) this is well off in the weeds.

> > Back in the old days when inodegc was a radix tree tag it was fairly
> > easy to get OOMs when the delay interval was long (5 seconds).  The
> > OOM probability went down pretty sharply as the interval approached
> > zero, but even at 1 jiffy I could still occasionally trip it, whereas
> > the pre-deferred-inactivation kernels would never OOM.
> > 
> > I haven't tested it all that rigorously with Dave's fancy new per-cpu
> > list design, but I did throw on my silly test setup (see below) and
> > still got it to OOM once in 3 runs with the shrinker bits turned off.
> > 
> 
> Ok.. so that implies we still need throttling, but I'm not sure what
> "fancy percpu list design" refers to. If you have a good test case, I

You might recall that the original versions of deferred inode
inactivation would set radix tree tags, and a per-AG workqueue function
would scan that AG's radix tree for tagged inodes and inactivate them.
Dave observed that letting the cpu caches grow cold led to performance
problems, and replaced the tagging mechanism with per-cpu lists, which
is what we have now.

> think the interesting immediate question is: are those OOMs avoided with
> this patch but the shrinker infrastructure still in place?

That's the $64000 question that I don't know definitively yet.

> If not, then I wonder if something is going wonky there. If so, I'm
> still a bit curious what the behavior looks like and whether it can be
> simplified in light of this change, but that's certainly beyond the
> scope of this patch.

<nod> For now I'd like to get this going for 5.19 to fix the original
complaint about statfs blocking in D state while waiting for inodegc
after deleting ~100m file extents[1], and defer the decision of
whether or not we /really/ need the shrinker to a later time (like
5.20).

[1] https://lore.kernel.org/linux-xfs/20220509024659.GA62606@onthe.net.au/
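
(For anyone reading this sub-thread in isolation: the statfs fix in
patch 2 replaces the flush with a push, i.e. "schedule whatever is
queued, don't wait for it".  A sketch of that idea, written from the
series description rather than copied from the actual patch, so the
details may differ:)

void
xfs_inodegc_push(
        struct xfs_mount        *mp)
{
        if (!xfs_is_inodegc_enabled(mp))
                return;
        trace_xfs_inodegc_push(mp, __return_address);
        xfs_inodegc_queue_all(mp);      /* kick pending work, no waiting */
}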

> 
> > > > > > If
> > > > > > so, it looks like decent enough overhead to cycle through every cpu in
> > > > > > both callbacks that it might be worth spelling out more clearly in the
> > > > > > top-level comment.
> > > > > 
> > > > > I'm not sure what you are asking here - mod_delayed_work_on() has
> > > > > pretty much the same overhead and behaviour as queue_work() in this
> > > > > case, so... ?
> > > > 
> > > 
> > > I'm just pointing out that the comment around the shrinker
> > > infrastructure isn't very informative if the shrinker turns out to still
> > > be necessary for reasons other than making the workers run sooner.
> > 
> > <nod> That comment /does/ need to be updated to note the subtlety that a
> > lot of shrinker activity can slow down close()ing a file by making user
> > tasks wait for the inodegc workers to clear the backlog.
> > 
> > > > <shrug> Looks ok to me, since djwong-dev has had some variant of timer
> > > > delayed inactivation in it longer than it hasn't:
> > > > 
> > > 
> > > Was that with a correspondingly small delay or something larger (on the
> > > order of seconds or so)? Either way, it sounds like you have a
> > > predictable enough workload that can actually test this continues to
> > > work as expected..?
> > 
> > Yeah.  I snapshot /home (which has ~20 million inodes) then race
> > fsstress and xfs_scrub -n in a VM with 560MB of RAM.
> > 
> 
> Yeah small delay or yeah large delay?

Both -- with large inactivation delays and no shrinker, OOMs happen
fairly frequently; with a short delay and no shrinker, they're harder
(but still not impossible) to trigger.

Granted ... even the overall frequency of OOMs with large inactivation
delays seems to have gone down a bit from when I was more actively
testing in the 5.9 era.

--D

> Brian
> 
> > --D
> > 
> > > Brian
> > > 
> > > > Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> > > > 
> > > > --D
> > > > 
> > > > > Cheers,
> > > > > 
> > > > > Dave.
> > > > > -- 
> > > > > Dave Chinner
> > > > > david@fromorbit.com
> > > > 
> > > 
> > 
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/2] xfs: bound maximum wait time for inodegc work
  2022-06-23 19:56               ` Darrick J. Wong
@ 2022-06-24 12:39                 ` Brian Foster
  2022-06-25  1:03                   ` Darrick J. Wong
  0 siblings, 1 reply; 16+ messages in thread
From: Brian Foster @ 2022-06-24 12:39 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs

On Thu, Jun 23, 2022 at 12:56:40PM -0700, Darrick J. Wong wrote:
> On Thu, Jun 23, 2022 at 07:49:46AM -0400, Brian Foster wrote:
> > On Wed, Jun 22, 2022 at 05:25:36PM -0700, Darrick J. Wong wrote:
> > > On Wed, Jun 22, 2022 at 12:13:54PM -0400, Brian Foster wrote:
> > > > On Tue, Jun 21, 2022 at 10:20:46PM -0700, Darrick J. Wong wrote:
> > > > > On Sat, Jun 18, 2022 at 07:52:45AM +1000, Dave Chinner wrote:
> > > > > > On Fri, Jun 17, 2022 at 12:34:38PM -0400, Brian Foster wrote:
> > > > > > > On Thu, Jun 16, 2022 at 08:04:15AM +1000, Dave Chinner wrote:
> > > > > > > > From: Dave Chinner <dchinner@redhat.com>
> > > > > > > > 
> > > > > > > > Currently inodegc work can sit queued on the per-cpu queue until
> > > > > > > > the workqueue is either flushed of the queue reaches a depth that
> > > > > > > > triggers work queuing (and later throttling). This means that we
> > > > > > > > could queue work that waits for a long time for some other event to
> > > > > > > > trigger flushing.
> > > > > > > > 
> > > > > > > > Hence instead of just queueing work at a specific depth, use a
> > > > > > > > delayed work that queues the work at a bound time. We can still
> > > > > > > > schedule the work immediately at a given depth, but we no long need
> > > > > > > > to worry about leaving a number of items on the list that won't get
> > > > > > > > processed until external events prevail.
> > > > > > > > 
> > > > > > > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > > > > > > ---
> > > > > > > >  fs/xfs/xfs_icache.c | 36 ++++++++++++++++++++++--------------
> > > > > > > >  fs/xfs/xfs_mount.h  |  2 +-
> > > > > > > >  fs/xfs/xfs_super.c  |  2 +-
> > > > > > > >  3 files changed, 24 insertions(+), 16 deletions(-)
> > > > > > > > 
> > > > > > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > > > > > > > index 374b3bafaeb0..46b30ecf498c 100644
> > > > > > > > --- a/fs/xfs/xfs_icache.c
> > > > > > > > +++ b/fs/xfs/xfs_icache.c
> > > > > > > ...
> > > > > > > > @@ -2176,7 +2184,7 @@ xfs_inodegc_shrinker_scan(
> > > > > > > >  			unsigned int	h = READ_ONCE(gc->shrinker_hits);
> > > > > > > >  
> > > > > > > >  			WRITE_ONCE(gc->shrinker_hits, h + 1);
> > > > > > > > -			queue_work_on(cpu, mp->m_inodegc_wq, &gc->work);
> > > > > > > > +			mod_delayed_work_on(cpu, mp->m_inodegc_wq, &gc->work, 0);
> > > > > > > >  			no_items = false;
> > > > > > > >  		}
> > > > > > > 
> > > > > > > This all seems reasonable to me, but is there much practical benefit to
> > > > > > > shrinker infra/feedback just to expedite a delayed work item by one
> > > > > > > jiffy? Maybe there's a use case to continue to trigger throttling..?
> > > > > > 
> > > > > > I haven't really considered doing anything other than fixing the
> > > > > > reported bug. That just requires an API conversion for the existing
> > > > > > "queue immediately" semantics and is the safest minimum change
> > > > > > to fix the issue at hand.
> > > > > > 
> > > > > > So, yes, the shrinker code may (or may not) be superfluous now, but
> > > > > > I haven't looked at it and done analysis of the behaviour without
> > > > > > the shrinkers enabled. I'll do that in a completely separate
> > > > > > patchset if it turns out that it is not needed now.
> > > > > 
> > > > > I think the shrinker part is still necessary -- bulkstat and xfs_scrub
> > > > > on a very low memory machine (~560M RAM) opening and closing tens of
> > > > > millions of files can still OOM the machine if one doesn't have a means
> > > > > to slow down ->destroy_inode (and hence the next open()) when reclaim
> > > > > really starts to dig in.  Without the shrinker bits, it's even easier to
> > > > > trigger OOM storms when xfs has timer-delayed inactivation... which is
> > > > > something that Brian pointed out a year ago when we were reviewing the
> > > > > initial inodegc patchset.
> > > > > 
> > > > 
> > > > It wouldn't surprise me if the infrastructure is still necessary for the
> > > > throttling use case. In that case, I'm more curious about things like
> > > > whether it's still as effective as intended with such a small scheduling
> > > > delay, or whether it still might be worth simplifying in various ways
> > > > (i.e., does the scheduling delay actually make a difference? do we still
> > > > need a per cpu granular throttle? etc.).
> > > 
> > > It can still be useful for certain g*dawful scenarios --
> > > 
> > > Let's say you have a horribly misconfigured cloudy system with a tiny
> > > log, hundreds of CPUs, a memory hogging process, another process with
> > > many hundreds of threads that are performing small appending synchronous
> > > writes to a large number of files, and some other process repeatedly
> > > opens and closes files.  Background writeback completion will create
> > > enough workers to tie up the log such that writeback and inodegc contend
> > > for log grant space and make slow progress.  If memory is also tight,
> > > we want to slow down the file scanning process so that it doesn't shove
> > > /more/ inodes into the cache and push the system towards OOM behavior.
> > > 
> > 
> > Yeah, I get the general idea/purpose of the throttling. What I'm probing
> > at here is whether a case like this is still handled effectively with
> > such a short scheduling delay.
> 
> Given the trace_xfs_inodegc_shrinker_* data I've collected on my
> simulator, I think the 1 jiffy delay is handling this well enough that
> we only trip that tracepoint about a half dozen times in 2h of trying to
> exercise the inode caches.  That said, I haven't really had time to
> try this on (say) a 20CPU machine with ridiculously small memory to see
> if I get different results.
> 
> (Note: I'm still trying to figure out why generic/522 reports
> corruption, and hoping that willy's folios pull today just fixes it
> magically...)
> 
> > Presumably there is some window before
> > list size based throttling triggers for which the shrinker is expected
> > to cover, so that implies the shrinker historically is able to detect
> > and push populated queues and trigger throttling from the point it is
> > invoked (whether directly via repeated shrinker invocations or
> > indirectly via causing larger queue sizes is not clear to me).
> 
> Back when I merged the inodegc series, it was fairly easy to get the
> shrinker to trip dozens of times during a stress run, even when the only
> throttling criteria was the queue size.  I think the reflink/rmap
> transaction reservation optimizations have made the log grant
> bottlenecks much less severe, since I now see a lot less contention on
> the grant heads.
> 
> > The thing that stands out to me as a question wrt to this change is that
> > the trigger for shrinker induced throttling is the list size at the time
> > of the callback(s), and that goes from having a lifecycle associated
> > with the size-oriented scheduling algorithm to a time-based scheduling
> > lifecycle of one jiffy (also noting that the inodegc worker resets
> > shrinker_hits before it begins to process inodes). So with that in mind,
> > how reliable is this lowmem signal based on the list size back to the
> > tasks creating more work and memory pressure? Once a shrinker invocation
> > occurs, what are the odds that the callback is able to detect a
> > populated list and act accordingly?
> 
> Hrmmm.  Decent, I think?  If the list is populated but the inactivation
> worker is not running, then we queue the worker and clear the list; if
> the worker is already running when the shrinker gets called, we might
> end up requeuing it unnecessarily, in which case it'll just clear more.
> 

Sure, that sounds harmless, but I was thinking more about the likelihood
of missing throttles than spurious work cycles and such..

Generally with this change, the worker is pretty much always scheduled
once an inode is queued. It executes once a tick passes where another
destroy hasn't otherwise come along to kick it out by another tick. So
if I run a sustained removal, I see similar batching/scheduling as
before this change. If I run something with a mix of removes and
$something_else, then it doesn't seem to take much to skip batching and
run a worker cycle per inode.

So what happens if the shrinker scan sees one or more workers that
happen to execute before the scan callback is able to actually trigger
throttling? Is that critical, or is this enough of a fallback mechanism
that it's unlikely to be important? Similarly if one is running a
workload that results in behavior as described above, is this sort of
"oneshot shrinker to worker throttle" actually still going to slow down
the worker tasks? Is shrinker callback frequency high enough to actually
make an impact? Etc.

These are questions I don't really have enough context to answer atm,
nor can I say whether or not they may result in problems...

> > These questions are somewhat rhetorical because this all seems rather
> > unpredictable when we consider varying resource availability.
> 
> Not to mention the shrinker itself undergoing various behavioral changes
> over the years... :(
> 
> > The
> > relevant question for this patch is probably just that somebody has
> > tested and confirmed that the shrinker hasn't been subtly or indirectly
> > broken in cases like the one you describe above (where perhaps we might
> > not have many shrinker callback opportunities to act on before OOM).
> 
> FWIW, the only time I /ever/ saw OOMs (either now with 5.19 or ages ago
> with 5.9) was with extreme lowmem testing.  On VMs with more than a
> gigabyte or so of memory, I notice that we usually hit the time/length
> thresholds without shrinkers getting involved.  Granted, I did try to
> pick the shrinker values so that we only get called on the *second*
> level of shrinking, which is after we've freed some but not enough
> memory.
> 
> If my responses seem a bit like handwaving, they are, because (at the
> moment) this is well off in the weeds.
> 
> > > Back in the old days when inodegc was a radix tree tag it was fairly
> > > easy to get OOMs when the delay interval was long (5 seconds).  The
> > > OOM probability went down pretty sharply as the interval approached
> > > zero, but even at 1 jiffy I could still occasionally trip it, whereas
> > > the pre-deferred-inactivation kernels would never OOM.
> > > 
> > > I haven't tested it all that rigorously with Dave's fancy new per-cpu
> > > list design, but I did throw on my silly test setup (see below) and
> > > still got it to OOM once in 3 runs with the shrinker bits turned off.
> > > 
> > 
> > Ok.. so that implies we still need throttling, but I'm not sure what
> > "fancy percpu list design" refers to. If you have a good test case, I
> 
> You might recall that the original versions of deferred inode
> inactivation would set radix tree tags, and a per-AG workqueue function
> would scan that AG's radix tree for tagged inodes and inactivate them.
> Dave observed that letting the cpu caches grow cold led to performance
> problems, and replaced the tagging mechanism with per-cpu lists, which
> is what we have now.
> 
> > think the interesting immediate question is: are those OOMs avoided with
> > this patch but the shrinker infrastructure still in place?
> 
> That's the $64000 question that I don't know definitively yet.
> 

Ok. If the answer to the above questions is "don't really know, it might
be broken, but historical testing shows the shrinker is much less
important with smaller queuing delays such that it's unlikely to cause
any new problems on anything but the most pathological workloads," then
that sounds reasonable to me.

I don't mean to harp on it. I just want to make sure the
question/concern gets across clearly enough that somebody who knows this
system better than I can grok it and make a more intelligent assessment.

Brian

> > If not, then I wonder if something is going wonky there. If so, I'm
> > still a bit curious what the behavior looks like and whether it can be
> > simplified in light of this change, but that's certainly beyond the
> > scope of this patch.
> 
> <nod> For now I'd like to get this going for 5.19 to fix the original
> complaint about statfs blocking in D state while waiting for inodegc
> after deleting ~100m file extents[1], and defer the decision of
> whether or not we /really/ need the shrinker to a later time (like
> 5.20).
> 
> [1] https://lore.kernel.org/linux-xfs/20220509024659.GA62606@onthe.net.au/
> 
> > 
> > > > > > > If
> > > > > > > so, it looks like decent enough overhead to cycle through every cpu in
> > > > > > > both callbacks that it might be worth spelling out more clearly in the
> > > > > > > top-level comment.
> > > > > > 
> > > > > > I'm not sure what you are asking here - mod_delayed_work_on() has
> > > > > > pretty much the same overhead and behaviour as queue_work() in this
> > > > > > case, so... ?
> > > > > 
> > > > 
> > > > I'm just pointing out that the comment around the shrinker
> > > > infrastructure isn't very informative if the shrinker turns out to still
> > > > be necessary for reasons other than making the workers run sooner.
> > > 
> > > <nod> That comment /does/ need to be updated to note the subtlety that a
> > > lot of shrinker activity can slow down close()ing a file by making user
> > > tasks wait for the inodegc workers to clear the backlog.
> > > 
> > > > > <shrug> Looks ok to me, since djwong-dev has had some variant of timer
> > > > > delayed inactivation in it longer than it hasn't:
> > > > > 
> > > > 
> > > > Was that with a correspondingly small delay or something larger (on the
> > > > order of seconds or so)? Either way, it sounds like you have a
> > > > predictable enough workload that can actually test this continues to
> > > > work as expected..?
> > > 
> > > Yeah.  I snapshot /home (which has ~20 million inodes) then race
> > > fsstress and xfs_scrub -n in a VM with 560MB of RAM.
> > > 
> > 
> > Yeah small delay or yeah large delay?
> 
> Both -- with large inactivation delays and no shrinker, OOMs happen
> fairly frequently; with a short delay and no shrinker, they're harder
> (but still not impossible) to trigger.
> 
> Granted ... even the overall frequency of OOMs with large inactivation
> delays seems to have gone down a bit from when I was more actively
> testing in the 5.9 era.
> 
> --D
> 
> > Brian
> > 
> > > --D
> > > 
> > > > Brian
> > > > 
> > > > > Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> > > > > 
> > > > > --D
> > > > > 
> > > > > > Cheers,
> > > > > > 
> > > > > > Dave.
> > > > > > -- 
> > > > > > Dave Chinner
> > > > > > david@fromorbit.com
> > > > > 
> > > > 
> > > 
> > 
> 


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/2] xfs: bound maximum wait time for inodegc work
  2022-06-24 12:39                 ` Brian Foster
@ 2022-06-25  1:03                   ` Darrick J. Wong
  0 siblings, 0 replies; 16+ messages in thread
From: Darrick J. Wong @ 2022-06-25  1:03 UTC (permalink / raw)
  To: Brian Foster; +Cc: Dave Chinner, linux-xfs

On Fri, Jun 24, 2022 at 08:39:32AM -0400, Brian Foster wrote:
> On Thu, Jun 23, 2022 at 12:56:40PM -0700, Darrick J. Wong wrote:
> > On Thu, Jun 23, 2022 at 07:49:46AM -0400, Brian Foster wrote:
> > > On Wed, Jun 22, 2022 at 05:25:36PM -0700, Darrick J. Wong wrote:
> > > > On Wed, Jun 22, 2022 at 12:13:54PM -0400, Brian Foster wrote:
> > > > > On Tue, Jun 21, 2022 at 10:20:46PM -0700, Darrick J. Wong wrote:
> > > > > > On Sat, Jun 18, 2022 at 07:52:45AM +1000, Dave Chinner wrote:
> > > > > > > On Fri, Jun 17, 2022 at 12:34:38PM -0400, Brian Foster wrote:
> > > > > > > > On Thu, Jun 16, 2022 at 08:04:15AM +1000, Dave Chinner wrote:
> > > > > > > > > From: Dave Chinner <dchinner@redhat.com>
> > > > > > > > > 
> > > > > > > > > Currently inodegc work can sit queued on the per-cpu queue until
> > > > > > > > > the workqueue is either flushed of the queue reaches a depth that
> > > > > > > > > triggers work queuing (and later throttling). This means that we
> > > > > > > > > could queue work that waits for a long time for some other event to
> > > > > > > > > trigger flushing.
> > > > > > > > > 
> > > > > > > > > Hence instead of just queueing work at a specific depth, use a
> > > > > > > > > delayed work that queues the work at a bound time. We can still
> > > > > > > > > schedule the work immediately at a given depth, but we no long need
> > > > > > > > > to worry about leaving a number of items on the list that won't get
> > > > > > > > > processed until external events prevail.
> > > > > > > > > 
> > > > > > > > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > > > > > > > ---
> > > > > > > > >  fs/xfs/xfs_icache.c | 36 ++++++++++++++++++++++--------------
> > > > > > > > >  fs/xfs/xfs_mount.h  |  2 +-
> > > > > > > > >  fs/xfs/xfs_super.c  |  2 +-
> > > > > > > > >  3 files changed, 24 insertions(+), 16 deletions(-)
> > > > > > > > > 
> > > > > > > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > > > > > > > > index 374b3bafaeb0..46b30ecf498c 100644
> > > > > > > > > --- a/fs/xfs/xfs_icache.c
> > > > > > > > > +++ b/fs/xfs/xfs_icache.c
> > > > > > > > ...
> > > > > > > > > @@ -2176,7 +2184,7 @@ xfs_inodegc_shrinker_scan(
> > > > > > > > >  			unsigned int	h = READ_ONCE(gc->shrinker_hits);
> > > > > > > > >  
> > > > > > > > >  			WRITE_ONCE(gc->shrinker_hits, h + 1);
> > > > > > > > > -			queue_work_on(cpu, mp->m_inodegc_wq, &gc->work);
> > > > > > > > > +			mod_delayed_work_on(cpu, mp->m_inodegc_wq, &gc->work, 0);
> > > > > > > > >  			no_items = false;
> > > > > > > > >  		}
> > > > > > > > 
> > > > > > > > This all seems reasonable to me, but is there much practical benefit to
> > > > > > > > shrinker infra/feedback just to expedite a delayed work item by one
> > > > > > > > jiffy? Maybe there's a use case to continue to trigger throttling..?
> > > > > > > 
> > > > > > > I haven't really considered doing anything other than fixing the
> > > > > > > reported bug. That just requires an API conversion for the existing
> > > > > > > "queue immediately" semantics and is the safest minimum change
> > > > > > > to fix the issue at hand.
> > > > > > > 
> > > > > > > So, yes, the shrinker code may (or may not) be superfluous now, but
> > > > > > > I haven't looked at it and done analysis of the behaviour without
> > > > > > > the shrinkers enabled. I'll do that in a completely separate
> > > > > > > patchset if it turns out that it is not needed now.
> > > > > > 
> > > > > > I think the shrinker part is still necessary -- bulkstat and xfs_scrub
> > > > > > on a very low memory machine (~560M RAM) opening and closing tens of
> > > > > > millions of files can still OOM the machine if one doesn't have a means
> > > > > > to slow down ->destroy_inode (and hence the next open()) when reclaim
> > > > > > really starts to dig in.  Without the shrinker bits, it's even easier to
> > > > > > trigger OOM storms when xfs has timer-delayed inactivation... which is
> > > > > > something that Brian pointed out a year ago when we were reviewing the
> > > > > > initial inodegc patchset.
> > > > > > 
> > > > > 
> > > > > It wouldn't surprise me if the infrastructure is still necessary for the
> > > > > throttling use case. In that case, I'm more curious about things like
> > > > > whether it's still as effective as intended with such a small scheduling
> > > > > delay, or whether it still might be worth simplifying in various ways
> > > > > (i.e., does the scheduling delay actually make a difference? do we still
> > > > > need a per cpu granular throttle? etc.).
> > > > 
> > > > It can still be useful for certain g*dawful scenarios --
> > > > 
> > > > Let's say you have a horribly misconfigured cloudy system with a tiny
> > > > log, hundreds of CPUs, a memory hogging process, another process with
> > > > many hundreds of threads that are performing small appending synchronous
> > > > writes to a large number of files, and some other process repeatedly
> > > > opens and closes files.  Background writeback completion will create
> > > > enough workers to tie up the log such that writeback and inodegc contend
> > > > for log grant space and make slow progress.  If memory is also tight,
> > > > we want to slow down the file scanning process so that it doesn't shove
> > > > /more/ inodes into the cache and push the system towards OOM behavior.
> > > > 
> > > 
> > > Yeah, I get the general idea/purpose of the throttling. What I'm probing
> > > at here is whether a case like this is still handled effectively with
> > > such a short scheduling delay.
> > 
> > Given the trace_xfs_inodegc_shrinker_* data I've collected on my
> > simulator, I think the 1 jiffy delay is handling this well enough that
> > we only trip that tracepoint about a half dozen times in 2h of trying to
> > exercise the inode caches.  That said, I haven't really had time to
> > try this on (say) a 20CPU machine with ridiculously small memory to see
> > if I get different results.
> > 
> > (Note: I'm still trying to figure out why generic/522 reports
> > corruption, and hoping that willy's folios pull today just fixes it
> > magically...)
> > 
> > > Presumably there is some window before
> > > list size based throttling triggers for which the shrinker is expected
> > > to cover, so that implies the shrinker historically is able to detect
> > > and push populated queues and trigger throttling from the point it is
> > > invoked (whether directly via repeated shrinker invocations or
> > > indirectly via causing larger queue sizes is not clear to me).
> > 
> > Back when I merged the inodegc series, it was fairly easy to get the
> > shrinker to trip dozens of times during a stress run, even when the only
> > throttling criteria was the queue size.  I think the reflink/rmap
> > transaction reservation optimizations have made the log grant
> > bottlenecks much less severe, since I now see a lot less contention on
> > the grant heads.
> > 
> > > The thing that stands out to me as a question wrt to this change is that
> > > the trigger for shrinker induced throttling is the list size at the time
> > > of the callback(s), and that goes from having a lifecycle associated
> > > with the size-oriented scheduling algorithm to a time-based scheduling
> > > lifecycle of one jiffy (also noting that the inodegc worker resets
> > > shrinker_hits before it begins to process inodes). So with that in mind,
> > > how reliable is this lowmem signal based on the list size back to the
> > > tasks creating more work and memory pressure? Once a shrinker invocation
> > > occurs, what are the odds that the callback is able to detect a
> > > populated list and act accordingly?
> > 
> > Hrmmm.  Decent, I think?  If the list is populated but the inactivation
> > worker is not running, then we queue the worker and clear the list; if
> > the worker is already running when the shrinker gets called, we might
> > end up requeuing it unnecessarily, in which case it'll just clear more.
> > 
> 
> Sure, that sounds harmless, but I was thinking more about the likelihood
> of missing throttles than spurious work cycles and such..
> 
> Generally with this change, the worker is pretty much always scheduled
> once an inode is queued. It executes once a tick passes where another
> destroy hasn't otherwise come along to kick it out by another tick. So
> if I run a sustained removal, I see similar batching/scheduling as
> before this change. If I run something with a mix of removes and
> $something_else, then it doesn't seem to take much to skip batching and
> run a worker cycle per inode.
> 
> So what happens if the shrinker scan sees one or more workers that

Er... by "shrinker scan", do you mean xfs_inodegc_shrinker_count?
And by "scan callback", do you mean xfs_inodegc_shrinker_scan?

> happen to execute before the scan callback is able to actually trigger
> throttling? Is that critical, or is this enough of a fallback mechanism
> that it's unlikely to be important?

I /think/ it's possible to miss throttling opportunities, but I also
suspect it's relatively unimportant because of the low maximum queue
size.

Let's say _count runs, discovers non-empty lists, and triggers the
callback.  If the workers run after _count returns but before _scan gets
a chance to run, then they'll have processed the queued inodes and _scan
won't set shrinker_hits anywhere.  Yes, a _destroy_inode won't get
throttled, but the system is at least making some forward progress
clearing the backlog and (hopefully) freeing memory.

If there are a /lot/ of destroy_inode calls, they'll eventually hit the
queue size limit (32) and throttle on the workers.  If there aren't any
further destroy_inode calls, there's nothing to throttle.  If however
there's some destroy_inode calls but not enough to hit the queue limit,
then I suppose we're reliant on the shrinker to get called again.  This
is of course trivially satisfied if there are other processes that also
enter the shrinker code and trigger more scans.

However, I suppose there /is/ a small hole here -- if there was only
that one process in the shrinker and _scan doesn't find anything, the
shrinker will not call us back.  In that case, the only way we trigger
throttling is if there's enough destroy_inode activity to hit a queue
limit.

Dave and I have talked about closing this hole by queuing the worker
immediately if the inode being destroyed has more than a certain number
of extents or something that acts as a proxy for memory consumption, but
so far as we can tell, shrinker invocations are rare and missed
throttling hasn't shown up in a noticeable way.
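
To make that idea a bit more concrete, such a heuristic might look
something like the sketch below.  This is purely illustrative -- the
threshold is a made-up number, the hook point is a guess, and nothing
like this is part of the series:

/*
 * Hypothetical: treat extent-heavy inodes as expensive to inactivate and
 * expedite the worker so their resources are freed sooner.  The threshold
 * is an arbitrary illustrative value, not a tuned number.
 */
#define XFS_INODEGC_EXPENSIVE_EXTENTS   1024

static inline bool
xfs_inodegc_want_queue_work(
        struct xfs_inode        *ip,
        unsigned int            items)
{
        if (ip->i_df.if_nextents > XFS_INODEGC_EXPENSIVE_EXTENTS)
                return true;

        /* ... the existing queue-depth based checks would remain here ... */

        return false;
}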

> Similarly if one is running a
> workload that results in behavior as described above, is this sort of
> "oneshot shrinker to worker throttle" actually still going to slow down
> the worker tasks? Is shrinker callback frequency high enough to actually
> make an impact? Etc.

It seems to, but with a smallish sample size on a normal system, which
means I've only seen the shrinker throttle show up in the trace output
if I use the insane lowmem scenario or artificially crank up the delay
to 30 seconds.  The good news is that in those scenarios, it /does/ slow
things down enough that we mostly don't OOM.

(Though at this point it's really hard to go back for an apples-apples
comparison with the pre-inodegc code.)

> These are questions I don't really have enough context to answer atm,
> nor can I say whether or not they may result in problems...
> 
> > > These questions are somewhat rhetorical because this all seems rather
> > > unpredictable when we consider varying resource availability.
> > 
> > Not to mention the shrinker itself undergoing various behavioral changes
> > over the years... :(
> > 
> > > The
> > > relevant question for this patch is probably just that somebody has
> > > tested and confirmed that the shrinker hasn't been subtly or indirectly
> > > broken in cases like the one you describe above (where perhaps we might
> > > not have many shrinker callback opportunities to act on before OOM).
> > 
> > FWIW, the only time I /ever/ saw OOMs (either now with 5.19 or ages ago
> > with 5.9) was with extreme lowmem testing.  On VMs with more than a
> > gigabyte or so of memory, I notice that we usually hit the time/length
> > thresholds without shrinkers getting involved.  Granted, I did try to
> > pick the shrinker values so that we only get called on the *second*
> > level of shrinking, which is after we've freed some but not enough
> > memory.
> > 
> > If my responses seem a bit like handwaving, they are, because (at the
> > moment) this is well off in the weeds.
> > 
> > > > Back in the old days when inodegc was a radix tree tag it was fairly
> > > > easy to get OOMs when the delay interval was long (5 seconds).  The
> > > > OOM probability went down pretty sharply as the interval approached
> > > > zero, but even at 1 jiffy I could still occasionally trip it, whereas
> > > > the pre-deferred-inactivation kernels would never OOM.
> > > > 
> > > > I haven't tested it all that rigorously with Dave's fancy new per-cpu
> > > > list design, but I did throw on my silly test setup (see below) and
> > > > still got it to OOM once in 3 runs with the shrinker bits turned off.
> > > > 
> > > 
> > > Ok.. so that implies we still need throttling, but I'm not sure what
> > > "fancy percpu list design" refers to. If you have a good test case, I
> > 
> > You might recall that the original versions of deferred inode
> > inactivation would set radix tree tags, and a per-AG workqueue function
> > would scan that AG's radix tree for tagged inodes and inactivate them.
> > Dave observed that letting the cpu caches grow cold led to performance
> > problems, and replaced the tagging mechanism with per-cpu lists, which
> > is what we have now.
> > 
> > > think the interesting immediate question is: are those OOMs avoided with
> > > this patch but the shrinker infrastructure still in place?
> > 
> > That's the $64000 question that I don't know definitively yet.
> > 
> 
> Ok. If the answer to the above questions is "don't really know, it might
> be broken, but historical testing shows the shrinker is much less
> important with smaller queuing delays such that it's unlikely to cause
> any new problems on anything but the most pathological workloads," then
> that sounds reasonable to me.

Yep, pretty much. :/

> I don't mean to harp on it. I just want to make sure the
> question/concern gets across clearly enough that somebody who knows this
> system better than I can grok it and make a more intelligent assessment.

Oh, I'm glad you're harping on it -- anything involving shrinkers and
deferred work ought to get a closer look.

--D

> Brian
> 
> > > If not, then I wonder if something is going wonky there. If so, I'm
> > > still a bit curious what the behavior looks like and whether it can be
> > > simplified in light of this change, but that's certainly beyond the
> > > scope of this patch.
> > 
> > <nod> For now I'd like to get this going for 5.19 to fix the original
> > complaint about statfs blocking in D state while waiting for inodegc
> > after deleting ~100m file extents[1], and defer the decision of
> > whether or not we /really/ need the shrinker to a later time (like
> > 5.20).
> > 
> > [1] https://lore.kernel.org/linux-xfs/20220509024659.GA62606@onthe.net.au/
> > 
> > > 
> > > > > > > > If
> > > > > > > > so, it looks like decent enough overhead to cycle through every cpu in
> > > > > > > > both callbacks that it might be worth spelling out more clearly in the
> > > > > > > > top-level comment.
> > > > > > > 
> > > > > > > I'm not sure what you are asking here - mod_delayed_work_on() has
> > > > > > > pretty much the same overhead and behaviour as queue_work() in this
> > > > > > > case, so... ?
> > > > > > 
> > > > > 
> > > > > I'm just pointing out that the comment around the shrinker
> > > > > infrastructure isn't very informative if the shrinker turns out to still
> > > > > be necessary for reasons other than making the workers run sooner.
> > > > 
> > > > <nod> That comment /does/ need to be updated to note the subtlety that a
> > > > lot of shrinker activity can slow down close()ing a file by making user
> > > > tasks wait for the inodegc workers to clear the backlog.
> > > > 
> > > > > > <shrug> Looks ok to me, since djwong-dev has had some variant of timer
> > > > > > delayed inactivation in it longer than it hasn't:
> > > > > > 
> > > > > 
> > > > > Was that with a correspondingly small delay or something larger (on the
> > > > > order of seconds or so)? Either way, it sounds like you have a
> > > > > predictable enough workload that can actually test this continues to
> > > > > work as expected..?
> > > > 
> > > > Yeah.  I snapshot /home (which has ~20 million inodes) then race
> > > > fsstress and xfs_scrub -n in a VM with 560MB of RAM.
> > > > 
> > > 
> > > Yeah small delay or yeah large delay?
> > 
> > Both -- with large inactivation delays and no shrinker, OOMs happen
> > fairly frequently; with a short delay and no shrinker, they're harder
> > (but still not impossible) to trigger.
> > 
> > Granted ... even the overall frequency of OOMs with large inactivation
> > delays seems to have gone down a bit from when I was more actively
> > testing in the 5.9 era.
> > 
> > --D
> > 
> > > Brian
> > > 
> > > > --D
> > > > 
> > > > > Brian
> > > > > 
> > > > > > Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> > > > > > 
> > > > > > --D
> > > > > > 
> > > > > > > Cheers,
> > > > > > > 
> > > > > > > Dave.
> > > > > > > -- 
> > > > > > > Dave Chinner
> > > > > > > david@fromorbit.com
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/2] xfs: bound maximum wait time for inodegc work
  2022-05-24 16:54   ` Darrick J. Wong
@ 2022-05-24 23:03     ` Dave Chinner
  0 siblings, 0 replies; 16+ messages in thread
From: Dave Chinner @ 2022-05-24 23:03 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, chris

On Tue, May 24, 2022 at 09:54:51AM -0700, Darrick J. Wong wrote:
> On Tue, May 24, 2022 at 04:38:01PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Currently inodegc work can sit queued on the per-cpu queue until
> > the workqueue is either flushed of the queue reaches a depth that
> > triggers work queuing (and later throttling). This means that we
> > could queue work that waits for a long time for some other event to
> > trigger flushing.
> > 
> > Hence instead of just queueing work at a specific depth, use a
> > delayed work that queues the work at a bound time. We can still
> > schedule the work immediately at a given depth, but we no long need
> 
> Nit: "no longer need..."
> 
> > to worry about leaving a number of items on the list that won't get
> > processed until external events prevail.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_icache.c | 36 ++++++++++++++++++++++--------------
> >  fs/xfs/xfs_mount.h  |  2 +-
> >  fs/xfs/xfs_super.c  |  2 +-
> >  3 files changed, 24 insertions(+), 16 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > index 5269354b1b69..786702273621 100644
> > --- a/fs/xfs/xfs_icache.c
> > +++ b/fs/xfs/xfs_icache.c
> > @@ -440,7 +440,7 @@ xfs_inodegc_queue_all(
> >  	for_each_online_cpu(cpu) {
> >  		gc = per_cpu_ptr(mp->m_inodegc, cpu);
> >  		if (!llist_empty(&gc->list))
> > -			queue_work_on(cpu, mp->m_inodegc_wq, &gc->work);
> > +			mod_delayed_work_on(cpu, mp->m_inodegc_wq, &gc->work, 0);
> >  	}
> >  }
> >  
> > @@ -1841,8 +1841,8 @@ void
> >  xfs_inodegc_worker(
> >  	struct work_struct	*work)
> >  {
> > -	struct xfs_inodegc	*gc = container_of(work, struct xfs_inodegc,
> > -							work);
> > +	struct xfs_inodegc	*gc = container_of(to_delayed_work(work),
> > +						struct xfs_inodegc, work);
> >  	struct llist_node	*node = llist_del_all(&gc->list);
> >  	struct xfs_inode	*ip, *n;
> >  
> > @@ -2014,6 +2014,7 @@ xfs_inodegc_queue(
> >  	struct xfs_inodegc	*gc;
> >  	int			items;
> >  	unsigned int		shrinker_hits;
> > +	unsigned long		queue_delay = 1;
> 
> A default delay of one clock tick, correct?
> 
> Just out of curiosity, how does this shake out wrt fstests that do a
> thing and then measure free space?

No regressions on a 5.18+for-next kernel on the two machines (one
ramdisk, one SSD) I ran it on yesterday. The runs were clean, which
is why I posted it for comments.

> I have a dim recollection of a bug that I found in one of the
> preproduction iterations of inodegc back when I used delayed_work to
> schedule the background gc.  If memory serves, calling mod_delayed_work
> on a delayed_work object that is currently running does /not/ cause the
> delayed_work object to be requeued, even if delay==0.

I don't think that is correct - I actually went through the code to
check this because I wanted to be certain that it behaved the way I
needed it to. Indeed, the documented behaviour of
mod_delayed_work_on() is:

 * If @dwork is idle, equivalent to queue_delayed_work_on(); otherwise,
 * modify @dwork's timer so that it expires after @delay.  If @delay is
 * zero, @work is guaranteed to be scheduled immediately regardless of its
 * current state.

In terms of the implementation, try_to_grab_pending() will grab the
delayed work and set/steal the WORK_STRUCT_PENDING_BIT, and
mod_delayed_work_on() will loop until it owns the bit or the dwork
is cancelled. Once it owns the PENDING bit, it will call
__queue_delayed_work(), which either queues the work immediately
(delay = 0) or sets up a timer to expire in delay ticks. 

The PENDING bit is cleared by the kworker thread before it calls the
work->current_func() to execute the work, so if the work is
currently running, try_to_grab_pending() will be able to set/steal
the WORK_STRUCT_PENDING_BIT without issues, and so even if the work
is currently running, we should be able to queue it again via
mod_delayed_work_on().

So, AFAICT, the comment and behaviour match, and mod_delayed_work()
will result in queuing of the dwork even if it is currently running.
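
Condensed, the relevant control flow in kernel/workqueue.c is roughly
the following (paraphrased from memory and simplified, so check the
source for the exact details):

bool
mod_delayed_work_on(int cpu, struct workqueue_struct *wq,
                    struct delayed_work *dwork, unsigned long delay)
{
        unsigned long flags;
        int ret;

        /*
         * Loop until we own WORK_STRUCT_PENDING_BIT, stealing it from a
         * pending timer if necessary, or the work is being cancelled.
         */
        do {
                ret = try_to_grab_pending(&dwork->work, true, &flags);
        } while (unlikely(ret == -EAGAIN));

        if (likely(ret >= 0)) {
                /*
                 * delay == 0 queues the work immediately; otherwise the
                 * timer is (re)armed to expire after @delay ticks.
                 */
                __queue_delayed_work(cpu, wq, dwork, delay);
                local_irq_restore(flags);
        }

        return ret;
}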

> Aha, I found a description in my notes.  I've adapted them to the
> current patchset, since in those days inodegc used a radix tree tag
> and per-AG workers instead of a lockless list and per-cpu workers.
> If the following occurs:
> 
> Worker 1			Thread 2
> 
> xfs_inodegc_worker
> <starts up>
> node = llist_del_all()
> <start processing inodes>
> <block on something, yield>
> 				xfs_irele
> 				xfs_inode_mark_reclaimable
> 				llist_add
> 				mod_delayed_work()
> 				<exit>
> <process the rest of nodelist>
> return
> 
> Then at the end of this sequence, we'll end up with thread 2's inode
> queued to the gc list but the delayed work is /not/ queued.  That inode
> remains on the gc list (and unprocessed) until someone comes along to
> push that CPU's gc list, whether it's a statfs, or an unmount, or
> someone hitting ENOSPC and triggering blockgc.

Right, if mod_delayed_work() didn't queue the work then this would
be an issue, but AFAICT mod_delayed_work() will requeue in this
case and the work will not get hung up. I certainly haven't
seen any evidence that it's not working as I expected (so far).

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/2] xfs: bound maximum wait time for inodegc work
  2022-05-24  6:38 ` [PATCH 1/2] xfs: bound maximum wait time for inodegc work Dave Chinner
@ 2022-05-24 16:54   ` Darrick J. Wong
  2022-05-24 23:03     ` Dave Chinner
  0 siblings, 1 reply; 16+ messages in thread
From: Darrick J. Wong @ 2022-05-24 16:54 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, chris

On Tue, May 24, 2022 at 04:38:01PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Currently inodegc work can sit queued on the per-cpu queue until
> the workqueue is either flushed of the queue reaches a depth that
> triggers work queuing (and later throttling). This means that we
> could queue work that waits for a long time for some other event to
> trigger flushing.
> 
> Hence instead of just queueing work at a specific depth, use a
> delayed work that queues the work at a bound time. We can still
> schedule the work immediately at a given depth, but we no long need

Nit: "no longer need..."

> to worry about leaving a number of items on the list that won't get
> processed until external events prevail.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_icache.c | 36 ++++++++++++++++++++++--------------
>  fs/xfs/xfs_mount.h  |  2 +-
>  fs/xfs/xfs_super.c  |  2 +-
>  3 files changed, 24 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 5269354b1b69..786702273621 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -440,7 +440,7 @@ xfs_inodegc_queue_all(
>  	for_each_online_cpu(cpu) {
>  		gc = per_cpu_ptr(mp->m_inodegc, cpu);
>  		if (!llist_empty(&gc->list))
> -			queue_work_on(cpu, mp->m_inodegc_wq, &gc->work);
> +			mod_delayed_work_on(cpu, mp->m_inodegc_wq, &gc->work, 0);
>  	}
>  }
>  
> @@ -1841,8 +1841,8 @@ void
>  xfs_inodegc_worker(
>  	struct work_struct	*work)
>  {
> -	struct xfs_inodegc	*gc = container_of(work, struct xfs_inodegc,
> -							work);
> +	struct xfs_inodegc	*gc = container_of(to_delayed_work(work),
> +						struct xfs_inodegc, work);
>  	struct llist_node	*node = llist_del_all(&gc->list);
>  	struct xfs_inode	*ip, *n;
>  
> @@ -2014,6 +2014,7 @@ xfs_inodegc_queue(
>  	struct xfs_inodegc	*gc;
>  	int			items;
>  	unsigned int		shrinker_hits;
> +	unsigned long		queue_delay = 1;

A default delay of one clock tick, correct?

Just out of curiosity, how does this shake out wrt fstests that do a
thing and then measure free space?

I have a dim recollection of a bug that I found in one of the
preproduction iterations of inodegc back when I used delayed_work to
schedule the background gc.  If memory serves, calling mod_delayed_work
on a delayed_work object that is currently running does /not/ cause the
delayed_work object to be requeued, even if delay==0.

Aha, I found a description in my notes.  I've adapted them to the
current patchset, since in those days inodegc used a radix tree tag
and per-AG workers instead of a lockless list and per-cpu workers.
If the following occurs:

Worker 1			Thread 2

xfs_inodegc_worker
<starts up>
node = llist_del_all()
<start processing inodes>
<block on something, yield>
				xfs_irele
				xfs_inode_mark_reclaimable
				llist_add
				mod_delayed_work()
				<exit>
<process the rest of nodelist>
return

Then at the end of this sequence, we'll end up with thread 2's inode
queued to the gc list but the delayed work is /not/ queued.  That inode
remains on the gc list (and unprocessed) until someone comes along to
push that CPU's gc list, whether it's a statfs, or an unmount, or
someone hitting ENOSPC and triggering blockgc.

I observed this bug while digging into online repair occasionally
stalling for a long time or erroring out during inode scans.  If you'll
recall, earlier inodegc iterations allowed iget to recycle inodes that
were queued for inactivation, but later iterations didn't, so it became
the responsibility of the online repair's inode scanner to push the
inodegc workers when iget found an inode that was queued for
inactivation.

(The current online repair inode scanner is smarter in the sense that it
will try inodegc_flush a few times before backing out to userspace, and
if it does, xfs_scrub will generally requeue the entire scrub
operation.)

--D

>  
>  	trace_xfs_inode_set_need_inactive(ip);
>  	spin_lock(&ip->i_flags_lock);
> @@ -2025,19 +2026,26 @@ xfs_inodegc_queue(
>  	items = READ_ONCE(gc->items);
>  	WRITE_ONCE(gc->items, items + 1);
>  	shrinker_hits = READ_ONCE(gc->shrinker_hits);
> -	put_cpu_ptr(gc);
>  
> -	if (!xfs_is_inodegc_enabled(mp))
> +	/*
> +	 * We queue the work while holding the current CPU so that the work
> +	 * is scheduled to run on this CPU.
> +	 */
> +	if (!xfs_is_inodegc_enabled(mp)) {
> +		put_cpu_ptr(gc);
>  		return;
> -
> -	if (xfs_inodegc_want_queue_work(ip, items)) {
> -		trace_xfs_inodegc_queue(mp, __return_address);
> -		queue_work(mp->m_inodegc_wq, &gc->work);
>  	}
>  
> +	if (xfs_inodegc_want_queue_work(ip, items))
> +		queue_delay = 0;
> +
> +	trace_xfs_inodegc_queue(mp, __return_address);
> +	mod_delayed_work(mp->m_inodegc_wq, &gc->work, queue_delay);
> +	put_cpu_ptr(gc);
> +
>  	if (xfs_inodegc_want_flush_work(ip, items, shrinker_hits)) {
>  		trace_xfs_inodegc_throttle(mp, __return_address);
> -		flush_work(&gc->work);
> +		flush_delayed_work(&gc->work);
>  	}
>  }
>  
> @@ -2054,7 +2062,7 @@ xfs_inodegc_cpu_dead(
>  	unsigned int		count = 0;
>  
>  	dead_gc = per_cpu_ptr(mp->m_inodegc, dead_cpu);
> -	cancel_work_sync(&dead_gc->work);
> +	cancel_delayed_work_sync(&dead_gc->work);
>  
>  	if (llist_empty(&dead_gc->list))
>  		return;
> @@ -2073,12 +2081,12 @@ xfs_inodegc_cpu_dead(
>  	llist_add_batch(first, last, &gc->list);
>  	count += READ_ONCE(gc->items);
>  	WRITE_ONCE(gc->items, count);
> -	put_cpu_ptr(gc);
>  
>  	if (xfs_is_inodegc_enabled(mp)) {
>  		trace_xfs_inodegc_queue(mp, __return_address);
> -		queue_work(mp->m_inodegc_wq, &gc->work);
> +		mod_delayed_work(mp->m_inodegc_wq, &gc->work, 0);
>  	}
> +	put_cpu_ptr(gc);
>  }
>  
>  /*
> @@ -2173,7 +2181,7 @@ xfs_inodegc_shrinker_scan(
>  			unsigned int	h = READ_ONCE(gc->shrinker_hits);
>  
>  			WRITE_ONCE(gc->shrinker_hits, h + 1);
> -			queue_work_on(cpu, mp->m_inodegc_wq, &gc->work);
> +			mod_delayed_work_on(cpu, mp->m_inodegc_wq, &gc->work, 0);
>  			no_items = false;
>  		}
>  	}
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index 8c42786e4942..377c5e59f6a0 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -61,7 +61,7 @@ struct xfs_error_cfg {
>   */
>  struct xfs_inodegc {
>  	struct llist_head	list;
> -	struct work_struct	work;
> +	struct delayed_work	work;
>  
>  	/* approximate count of inodes in the list */
>  	unsigned int		items;
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 51ce127a0cc6..62f6b97355a2 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1073,7 +1073,7 @@ xfs_inodegc_init_percpu(
>  		gc = per_cpu_ptr(mp->m_inodegc, cpu);
>  		init_llist_head(&gc->list);
>  		gc->items = 0;
> -		INIT_WORK(&gc->work, xfs_inodegc_worker);
> +		INIT_DELAYED_WORK(&gc->work, xfs_inodegc_worker);
>  	}
>  	return 0;
>  }
> -- 
> 2.35.1
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 1/2] xfs: bound maximum wait time for inodegc work
  2022-05-24  6:38 [RFC PATCH 0/2] xfs: non-blocking inodegc pushes Dave Chinner
@ 2022-05-24  6:38 ` Dave Chinner
  2022-05-24 16:54   ` Darrick J. Wong
  0 siblings, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2022-05-24  6:38 UTC (permalink / raw)
  To: linux-xfs; +Cc: chris

From: Dave Chinner <dchinner@redhat.com>

Currently inodegc work can sit queued on the per-cpu queue until
the workqueue is either flushed of the queue reaches a depth that
triggers work queuing (and later throttling). This means that we
could queue work that waits for a long time for some other event to
trigger flushing.

Hence instead of just queueing work at a specific depth, use a
delayed work that queues the work at a bound time. We can still
schedule the work immediately at a given depth, but we no long need
to worry about leaving a number of items on the list that won't get
processed until external events prevail.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_icache.c | 36 ++++++++++++++++++++++--------------
 fs/xfs/xfs_mount.h  |  2 +-
 fs/xfs/xfs_super.c  |  2 +-
 3 files changed, 24 insertions(+), 16 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 5269354b1b69..786702273621 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -440,7 +440,7 @@ xfs_inodegc_queue_all(
 	for_each_online_cpu(cpu) {
 		gc = per_cpu_ptr(mp->m_inodegc, cpu);
 		if (!llist_empty(&gc->list))
-			queue_work_on(cpu, mp->m_inodegc_wq, &gc->work);
+			mod_delayed_work_on(cpu, mp->m_inodegc_wq, &gc->work, 0);
 	}
 }
 
@@ -1841,8 +1841,8 @@ void
 xfs_inodegc_worker(
 	struct work_struct	*work)
 {
-	struct xfs_inodegc	*gc = container_of(work, struct xfs_inodegc,
-							work);
+	struct xfs_inodegc	*gc = container_of(to_delayed_work(work),
+						struct xfs_inodegc, work);
 	struct llist_node	*node = llist_del_all(&gc->list);
 	struct xfs_inode	*ip, *n;
 
@@ -2014,6 +2014,7 @@ xfs_inodegc_queue(
 	struct xfs_inodegc	*gc;
 	int			items;
 	unsigned int		shrinker_hits;
+	unsigned long		queue_delay = 1;
 
 	trace_xfs_inode_set_need_inactive(ip);
 	spin_lock(&ip->i_flags_lock);
@@ -2025,19 +2026,26 @@ xfs_inodegc_queue(
 	items = READ_ONCE(gc->items);
 	WRITE_ONCE(gc->items, items + 1);
 	shrinker_hits = READ_ONCE(gc->shrinker_hits);
-	put_cpu_ptr(gc);
 
-	if (!xfs_is_inodegc_enabled(mp))
+	/*
+	 * We queue the work while holding the current CPU so that the work
+	 * is scheduled to run on this CPU.
+	 */
+	if (!xfs_is_inodegc_enabled(mp)) {
+		put_cpu_ptr(gc);
 		return;
-
-	if (xfs_inodegc_want_queue_work(ip, items)) {
-		trace_xfs_inodegc_queue(mp, __return_address);
-		queue_work(mp->m_inodegc_wq, &gc->work);
 	}
 
+	if (xfs_inodegc_want_queue_work(ip, items))
+		queue_delay = 0;
+
+	trace_xfs_inodegc_queue(mp, __return_address);
+	mod_delayed_work(mp->m_inodegc_wq, &gc->work, queue_delay);
+	put_cpu_ptr(gc);
+
 	if (xfs_inodegc_want_flush_work(ip, items, shrinker_hits)) {
 		trace_xfs_inodegc_throttle(mp, __return_address);
-		flush_work(&gc->work);
+		flush_delayed_work(&gc->work);
 	}
 }
 
@@ -2054,7 +2062,7 @@ xfs_inodegc_cpu_dead(
 	unsigned int		count = 0;
 
 	dead_gc = per_cpu_ptr(mp->m_inodegc, dead_cpu);
-	cancel_work_sync(&dead_gc->work);
+	cancel_delayed_work_sync(&dead_gc->work);
 
 	if (llist_empty(&dead_gc->list))
 		return;
@@ -2073,12 +2081,12 @@ xfs_inodegc_cpu_dead(
 	llist_add_batch(first, last, &gc->list);
 	count += READ_ONCE(gc->items);
 	WRITE_ONCE(gc->items, count);
-	put_cpu_ptr(gc);
 
 	if (xfs_is_inodegc_enabled(mp)) {
 		trace_xfs_inodegc_queue(mp, __return_address);
-		queue_work(mp->m_inodegc_wq, &gc->work);
+		mod_delayed_work(mp->m_inodegc_wq, &gc->work, 0);
 	}
+	put_cpu_ptr(gc);
 }
 
 /*
@@ -2173,7 +2181,7 @@ xfs_inodegc_shrinker_scan(
 			unsigned int	h = READ_ONCE(gc->shrinker_hits);
 
 			WRITE_ONCE(gc->shrinker_hits, h + 1);
-			queue_work_on(cpu, mp->m_inodegc_wq, &gc->work);
+			mod_delayed_work_on(cpu, mp->m_inodegc_wq, &gc->work, 0);
 			no_items = false;
 		}
 	}
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 8c42786e4942..377c5e59f6a0 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -61,7 +61,7 @@ struct xfs_error_cfg {
  */
 struct xfs_inodegc {
 	struct llist_head	list;
-	struct work_struct	work;
+	struct delayed_work	work;
 
 	/* approximate count of inodes in the list */
 	unsigned int		items;
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 51ce127a0cc6..62f6b97355a2 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1073,7 +1073,7 @@ xfs_inodegc_init_percpu(
 		gc = per_cpu_ptr(mp->m_inodegc, cpu);
 		init_llist_head(&gc->list);
 		gc->items = 0;
-		INIT_WORK(&gc->work, xfs_inodegc_worker);
+		INIT_DELAYED_WORK(&gc->work, xfs_inodegc_worker);
 	}
 	return 0;
 }
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2022-06-25  1:04 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-15 22:04 [PATCH 0/2 V2] xfs: xfs: non-blocking inodegc pushes Dave Chinner
2022-06-15 22:04 ` [PATCH 1/2] xfs: bound maximum wait time for inodegc work Dave Chinner
2022-06-17 16:34   ` Brian Foster
2022-06-17 21:52     ` Dave Chinner
2022-06-22  5:20       ` Darrick J. Wong
2022-06-22 16:13         ` Brian Foster
2022-06-23  0:25           ` Darrick J. Wong
2022-06-23 11:49             ` Brian Foster
2022-06-23 19:56               ` Darrick J. Wong
2022-06-24 12:39                 ` Brian Foster
2022-06-25  1:03                   ` Darrick J. Wong
2022-06-15 22:04 ` [PATCH 2/2] xfs: introduce xfs_inodegc_push() Dave Chinner
2022-06-22  5:21   ` Darrick J. Wong
  -- strict thread matches above, loose matches on Subject: below --
2022-05-24  6:38 [RFC PATCH 0/2] xfs: non-blocking inodegc pushes Dave Chinner
2022-05-24  6:38 ` [PATCH 1/2] xfs: bound maximum wait time for inodegc work Dave Chinner
2022-05-24 16:54   ` Darrick J. Wong
2022-05-24 23:03     ` Dave Chinner
