* [PATCH 0/2 V2] xfs: xfs: non-blocking inodegc pushes @ 2022-06-15 22:04 Dave Chinner 2022-06-15 22:04 ` [PATCH 1/2] xfs: bound maximum wait time for inodegc work Dave Chinner 2022-06-15 22:04 ` [PATCH 2/2] xfs: introduce xfs_inodegc_push() Dave Chinner 0 siblings, 2 replies; 16+ messages in thread From: Dave Chinner @ 2022-06-15 22:04 UTC (permalink / raw) To: linux-xfs Hi folks, These patches introduce non-blocking inodegc pushes to fix long hold-offs in statfs() operations when inodegc is performing long running inode inactivation operations. The first patch introduces the bound maximum work start time for the inodegc queues - it's short, only 1 jiffie (10ms) - but we don't want to delay inodegc for an arbitrarily long period of time. However, it means that work always starts quickly and so that reduces the need for statfs() to have to wait for background inodegc to start and complete to catch space "freed" by recent unlinks. The second patch converts statfs to use a "push" rather than a "flush". The push simply schedules any pending work that hasn't yet timed out to run immediately and returns. It does not wait for the inodegc work to complete - that's what a flush does, and that's what caused all the problems for statfs(). Hence statfs() is converted to push semantics at the same time, thereby removing the blocking behaviour it currently has. This should prevent a large amount of the issues that have been seeing with lots of processes stuck in statfs() - that will no long happen. The only time user processes should get stuck now is when the inodegc throttle kicks in (unlinks only at this point) or if we are waiting for a lock a long running inodegc operation holds to be released. We had those specific problems before background inodegc - they manifested as unkillable unlink operations that had every backed up on them instead of background inodegc that has everything backed up on them. This patch has been running in my test environment for nearly a month now without regressions occurring. While there are likely still going to be inodegc flush related hold-offs in certain circumstances, nothing appears to be impacting the correctness of fstests tests or creating new issues. The 0-day kernel testing bot also indicates that certain benchmarks (such as aim7 and stress-ng.rename) run significantly faster with bound maximum delays and non-blocking statfs operations. Comments, thoughts and testing appreciated. -Dave. Version 2: - Also convert quota reportting inodegc flushes to a push. ^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH 1/2] xfs: bound maximum wait time for inodegc work 2022-06-15 22:04 [PATCH 0/2 V2] xfs: xfs: non-blocking inodegc pushes Dave Chinner @ 2022-06-15 22:04 ` Dave Chinner 2022-06-17 16:34 ` Brian Foster 2022-06-15 22:04 ` [PATCH 2/2] xfs: introduce xfs_inodegc_push() Dave Chinner 1 sibling, 1 reply; 16+ messages in thread From: Dave Chinner @ 2022-06-15 22:04 UTC (permalink / raw) To: linux-xfs From: Dave Chinner <dchinner@redhat.com> Currently inodegc work can sit queued on the per-cpu queue until the workqueue is either flushed of the queue reaches a depth that triggers work queuing (and later throttling). This means that we could queue work that waits for a long time for some other event to trigger flushing. Hence instead of just queueing work at a specific depth, use a delayed work that queues the work at a bound time. We can still schedule the work immediately at a given depth, but we no long need to worry about leaving a number of items on the list that won't get processed until external events prevail. Signed-off-by: Dave Chinner <dchinner@redhat.com> --- fs/xfs/xfs_icache.c | 36 ++++++++++++++++++++++-------------- fs/xfs/xfs_mount.h | 2 +- fs/xfs/xfs_super.c | 2 +- 3 files changed, 24 insertions(+), 16 deletions(-) diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c index 374b3bafaeb0..46b30ecf498c 100644 --- a/fs/xfs/xfs_icache.c +++ b/fs/xfs/xfs_icache.c @@ -442,7 +442,7 @@ xfs_inodegc_queue_all( for_each_online_cpu(cpu) { gc = per_cpu_ptr(mp->m_inodegc, cpu); if (!llist_empty(&gc->list)) - queue_work_on(cpu, mp->m_inodegc_wq, &gc->work); + mod_delayed_work_on(cpu, mp->m_inodegc_wq, &gc->work, 0); } } @@ -1844,8 +1844,8 @@ void xfs_inodegc_worker( struct work_struct *work) { - struct xfs_inodegc *gc = container_of(work, struct xfs_inodegc, - work); + struct xfs_inodegc *gc = container_of(to_delayed_work(work), + struct xfs_inodegc, work); struct llist_node *node = llist_del_all(&gc->list); struct xfs_inode *ip, *n; @@ -2017,6 +2017,7 @@ xfs_inodegc_queue( struct xfs_inodegc *gc; int items; unsigned int shrinker_hits; + unsigned long queue_delay = 1; trace_xfs_inode_set_need_inactive(ip); spin_lock(&ip->i_flags_lock); @@ -2028,19 +2029,26 @@ xfs_inodegc_queue( items = READ_ONCE(gc->items); WRITE_ONCE(gc->items, items + 1); shrinker_hits = READ_ONCE(gc->shrinker_hits); - put_cpu_ptr(gc); - if (!xfs_is_inodegc_enabled(mp)) + /* + * We queue the work while holding the current CPU so that the work + * is scheduled to run on this CPU. + */ + if (!xfs_is_inodegc_enabled(mp)) { + put_cpu_ptr(gc); return; - - if (xfs_inodegc_want_queue_work(ip, items)) { - trace_xfs_inodegc_queue(mp, __return_address); - queue_work(mp->m_inodegc_wq, &gc->work); } + if (xfs_inodegc_want_queue_work(ip, items)) + queue_delay = 0; + + trace_xfs_inodegc_queue(mp, __return_address); + mod_delayed_work(mp->m_inodegc_wq, &gc->work, queue_delay); + put_cpu_ptr(gc); + if (xfs_inodegc_want_flush_work(ip, items, shrinker_hits)) { trace_xfs_inodegc_throttle(mp, __return_address); - flush_work(&gc->work); + flush_delayed_work(&gc->work); } } @@ -2057,7 +2065,7 @@ xfs_inodegc_cpu_dead( unsigned int count = 0; dead_gc = per_cpu_ptr(mp->m_inodegc, dead_cpu); - cancel_work_sync(&dead_gc->work); + cancel_delayed_work_sync(&dead_gc->work); if (llist_empty(&dead_gc->list)) return; @@ -2076,12 +2084,12 @@ xfs_inodegc_cpu_dead( llist_add_batch(first, last, &gc->list); count += READ_ONCE(gc->items); WRITE_ONCE(gc->items, count); - put_cpu_ptr(gc); if (xfs_is_inodegc_enabled(mp)) { trace_xfs_inodegc_queue(mp, __return_address); - queue_work(mp->m_inodegc_wq, &gc->work); + mod_delayed_work(mp->m_inodegc_wq, &gc->work, 0); } + put_cpu_ptr(gc); } /* @@ -2176,7 +2184,7 @@ xfs_inodegc_shrinker_scan( unsigned int h = READ_ONCE(gc->shrinker_hits); WRITE_ONCE(gc->shrinker_hits, h + 1); - queue_work_on(cpu, mp->m_inodegc_wq, &gc->work); + mod_delayed_work_on(cpu, mp->m_inodegc_wq, &gc->work, 0); no_items = false; } } diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h index ba5d42abf66e..d2eaebd85abf 100644 --- a/fs/xfs/xfs_mount.h +++ b/fs/xfs/xfs_mount.h @@ -61,7 +61,7 @@ struct xfs_error_cfg { */ struct xfs_inodegc { struct llist_head list; - struct work_struct work; + struct delayed_work work; /* approximate count of inodes in the list */ unsigned int items; diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c index 48a7239ed1b2..651ae75a7e23 100644 --- a/fs/xfs/xfs_super.c +++ b/fs/xfs/xfs_super.c @@ -1075,7 +1075,7 @@ xfs_inodegc_init_percpu( gc = per_cpu_ptr(mp->m_inodegc, cpu); init_llist_head(&gc->list); gc->items = 0; - INIT_WORK(&gc->work, xfs_inodegc_worker); + INIT_DELAYED_WORK(&gc->work, xfs_inodegc_worker); } return 0; } -- 2.35.1 ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH 1/2] xfs: bound maximum wait time for inodegc work 2022-06-15 22:04 ` [PATCH 1/2] xfs: bound maximum wait time for inodegc work Dave Chinner @ 2022-06-17 16:34 ` Brian Foster 2022-06-17 21:52 ` Dave Chinner 0 siblings, 1 reply; 16+ messages in thread From: Brian Foster @ 2022-06-17 16:34 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs On Thu, Jun 16, 2022 at 08:04:15AM +1000, Dave Chinner wrote: > From: Dave Chinner <dchinner@redhat.com> > > Currently inodegc work can sit queued on the per-cpu queue until > the workqueue is either flushed of the queue reaches a depth that > triggers work queuing (and later throttling). This means that we > could queue work that waits for a long time for some other event to > trigger flushing. > > Hence instead of just queueing work at a specific depth, use a > delayed work that queues the work at a bound time. We can still > schedule the work immediately at a given depth, but we no long need > to worry about leaving a number of items on the list that won't get > processed until external events prevail. > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > --- > fs/xfs/xfs_icache.c | 36 ++++++++++++++++++++++-------------- > fs/xfs/xfs_mount.h | 2 +- > fs/xfs/xfs_super.c | 2 +- > 3 files changed, 24 insertions(+), 16 deletions(-) > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > index 374b3bafaeb0..46b30ecf498c 100644 > --- a/fs/xfs/xfs_icache.c > +++ b/fs/xfs/xfs_icache.c ... > @@ -2176,7 +2184,7 @@ xfs_inodegc_shrinker_scan( > unsigned int h = READ_ONCE(gc->shrinker_hits); > > WRITE_ONCE(gc->shrinker_hits, h + 1); > - queue_work_on(cpu, mp->m_inodegc_wq, &gc->work); > + mod_delayed_work_on(cpu, mp->m_inodegc_wq, &gc->work, 0); > no_items = false; > } This all seems reasonable to me, but is there much practical benefit to shrinker infra/feedback just to expedite a delayed work item by one jiffy? Maybe there's a use case to continue to trigger throttling..? If so, it looks like decent enough overhead to cycle through every cpu in both callbacks that it might be worth spelling out more clearly in the top-level comment. Brian > } > diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h > index ba5d42abf66e..d2eaebd85abf 100644 > --- a/fs/xfs/xfs_mount.h > +++ b/fs/xfs/xfs_mount.h > @@ -61,7 +61,7 @@ struct xfs_error_cfg { > */ > struct xfs_inodegc { > struct llist_head list; > - struct work_struct work; > + struct delayed_work work; > > /* approximate count of inodes in the list */ > unsigned int items; > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c > index 48a7239ed1b2..651ae75a7e23 100644 > --- a/fs/xfs/xfs_super.c > +++ b/fs/xfs/xfs_super.c > @@ -1075,7 +1075,7 @@ xfs_inodegc_init_percpu( > gc = per_cpu_ptr(mp->m_inodegc, cpu); > init_llist_head(&gc->list); > gc->items = 0; > - INIT_WORK(&gc->work, xfs_inodegc_worker); > + INIT_DELAYED_WORK(&gc->work, xfs_inodegc_worker); > } > return 0; > } > -- > 2.35.1 > ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/2] xfs: bound maximum wait time for inodegc work 2022-06-17 16:34 ` Brian Foster @ 2022-06-17 21:52 ` Dave Chinner 2022-06-22 5:20 ` Darrick J. Wong 0 siblings, 1 reply; 16+ messages in thread From: Dave Chinner @ 2022-06-17 21:52 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs On Fri, Jun 17, 2022 at 12:34:38PM -0400, Brian Foster wrote: > On Thu, Jun 16, 2022 at 08:04:15AM +1000, Dave Chinner wrote: > > From: Dave Chinner <dchinner@redhat.com> > > > > Currently inodegc work can sit queued on the per-cpu queue until > > the workqueue is either flushed of the queue reaches a depth that > > triggers work queuing (and later throttling). This means that we > > could queue work that waits for a long time for some other event to > > trigger flushing. > > > > Hence instead of just queueing work at a specific depth, use a > > delayed work that queues the work at a bound time. We can still > > schedule the work immediately at a given depth, but we no long need > > to worry about leaving a number of items on the list that won't get > > processed until external events prevail. > > > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > > --- > > fs/xfs/xfs_icache.c | 36 ++++++++++++++++++++++-------------- > > fs/xfs/xfs_mount.h | 2 +- > > fs/xfs/xfs_super.c | 2 +- > > 3 files changed, 24 insertions(+), 16 deletions(-) > > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > > index 374b3bafaeb0..46b30ecf498c 100644 > > --- a/fs/xfs/xfs_icache.c > > +++ b/fs/xfs/xfs_icache.c > ... > > @@ -2176,7 +2184,7 @@ xfs_inodegc_shrinker_scan( > > unsigned int h = READ_ONCE(gc->shrinker_hits); > > > > WRITE_ONCE(gc->shrinker_hits, h + 1); > > - queue_work_on(cpu, mp->m_inodegc_wq, &gc->work); > > + mod_delayed_work_on(cpu, mp->m_inodegc_wq, &gc->work, 0); > > no_items = false; > > } > > This all seems reasonable to me, but is there much practical benefit to > shrinker infra/feedback just to expedite a delayed work item by one > jiffy? Maybe there's a use case to continue to trigger throttling..? I haven't really considered doing anything other than fixing the reported bug. That just requires an API conversion for the existing "queue immediately" semantics and is the safest minimum change to fix the issue at hand. So, yes, the shrinker code may (or may not) be superfluous now, but I haven't looked at it and done analysis of the behaviour without the shrinkers enabled. I'll do that in a completely separate patchset if it turns out that it is not needed now. > If > so, it looks like decent enough overhead to cycle through every cpu in > both callbacks that it might be worth spelling out more clearly in the > top-level comment. I'm not sure what you are asking here - mod_delayed_work_on() has pretty much the same overhead and behaviour as queue_work() in this case, so... ? Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/2] xfs: bound maximum wait time for inodegc work 2022-06-17 21:52 ` Dave Chinner @ 2022-06-22 5:20 ` Darrick J. Wong 2022-06-22 16:13 ` Brian Foster 0 siblings, 1 reply; 16+ messages in thread From: Darrick J. Wong @ 2022-06-22 5:20 UTC (permalink / raw) To: Dave Chinner; +Cc: Brian Foster, linux-xfs On Sat, Jun 18, 2022 at 07:52:45AM +1000, Dave Chinner wrote: > On Fri, Jun 17, 2022 at 12:34:38PM -0400, Brian Foster wrote: > > On Thu, Jun 16, 2022 at 08:04:15AM +1000, Dave Chinner wrote: > > > From: Dave Chinner <dchinner@redhat.com> > > > > > > Currently inodegc work can sit queued on the per-cpu queue until > > > the workqueue is either flushed of the queue reaches a depth that > > > triggers work queuing (and later throttling). This means that we > > > could queue work that waits for a long time for some other event to > > > trigger flushing. > > > > > > Hence instead of just queueing work at a specific depth, use a > > > delayed work that queues the work at a bound time. We can still > > > schedule the work immediately at a given depth, but we no long need > > > to worry about leaving a number of items on the list that won't get > > > processed until external events prevail. > > > > > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > > > --- > > > fs/xfs/xfs_icache.c | 36 ++++++++++++++++++++++-------------- > > > fs/xfs/xfs_mount.h | 2 +- > > > fs/xfs/xfs_super.c | 2 +- > > > 3 files changed, 24 insertions(+), 16 deletions(-) > > > > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > > > index 374b3bafaeb0..46b30ecf498c 100644 > > > --- a/fs/xfs/xfs_icache.c > > > +++ b/fs/xfs/xfs_icache.c > > ... > > > @@ -2176,7 +2184,7 @@ xfs_inodegc_shrinker_scan( > > > unsigned int h = READ_ONCE(gc->shrinker_hits); > > > > > > WRITE_ONCE(gc->shrinker_hits, h + 1); > > > - queue_work_on(cpu, mp->m_inodegc_wq, &gc->work); > > > + mod_delayed_work_on(cpu, mp->m_inodegc_wq, &gc->work, 0); > > > no_items = false; > > > } > > > > This all seems reasonable to me, but is there much practical benefit to > > shrinker infra/feedback just to expedite a delayed work item by one > > jiffy? Maybe there's a use case to continue to trigger throttling..? > > I haven't really considered doing anything other than fixing the > reported bug. That just requires an API conversion for the existing > "queue immediately" semantics and is the safest minimum change > to fix the issue at hand. > > So, yes, the shrinker code may (or may not) be superfluous now, but > I haven't looked at it and done analysis of the behaviour without > the shrinkers enabled. I'll do that in a completely separate > patchset if it turns out that it is not needed now. I think the shrinker part is still necessary -- bulkstat and xfs_scrub on a very low memory machine (~560M RAM) opening and closing tens of millions of files can still OOM the machine if one doesn't have a means to slow down ->destroy_inode (and hence the next open()) when reclaim really starts to dig in. Without the shrinker bits, it's even easier to trigger OOM storms when xfs has timer-delayed inactivation... which is something that Brian pointed out a year ago when we were reviewing the initial inodegc patchset. > > If > > so, it looks like decent enough overhead to cycle through every cpu in > > both callbacks that it might be worth spelling out more clearly in the > > top-level comment. > > I'm not sure what you are asking here - mod_delayed_work_on() has > pretty much the same overhead and behaviour as queue_work() in this > case, so... ? <shrug> Looks ok to me, since djwong-dev has had some variant of timer delayed inactivation in it longer than it hasn't: Reviewed-by: Darrick J. Wong <djwong@kernel.org> --D > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/2] xfs: bound maximum wait time for inodegc work 2022-06-22 5:20 ` Darrick J. Wong @ 2022-06-22 16:13 ` Brian Foster 2022-06-23 0:25 ` Darrick J. Wong 0 siblings, 1 reply; 16+ messages in thread From: Brian Foster @ 2022-06-22 16:13 UTC (permalink / raw) To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs On Tue, Jun 21, 2022 at 10:20:46PM -0700, Darrick J. Wong wrote: > On Sat, Jun 18, 2022 at 07:52:45AM +1000, Dave Chinner wrote: > > On Fri, Jun 17, 2022 at 12:34:38PM -0400, Brian Foster wrote: > > > On Thu, Jun 16, 2022 at 08:04:15AM +1000, Dave Chinner wrote: > > > > From: Dave Chinner <dchinner@redhat.com> > > > > > > > > Currently inodegc work can sit queued on the per-cpu queue until > > > > the workqueue is either flushed of the queue reaches a depth that > > > > triggers work queuing (and later throttling). This means that we > > > > could queue work that waits for a long time for some other event to > > > > trigger flushing. > > > > > > > > Hence instead of just queueing work at a specific depth, use a > > > > delayed work that queues the work at a bound time. We can still > > > > schedule the work immediately at a given depth, but we no long need > > > > to worry about leaving a number of items on the list that won't get > > > > processed until external events prevail. > > > > > > > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > > > > --- > > > > fs/xfs/xfs_icache.c | 36 ++++++++++++++++++++++-------------- > > > > fs/xfs/xfs_mount.h | 2 +- > > > > fs/xfs/xfs_super.c | 2 +- > > > > 3 files changed, 24 insertions(+), 16 deletions(-) > > > > > > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > > > > index 374b3bafaeb0..46b30ecf498c 100644 > > > > --- a/fs/xfs/xfs_icache.c > > > > +++ b/fs/xfs/xfs_icache.c > > > ... > > > > @@ -2176,7 +2184,7 @@ xfs_inodegc_shrinker_scan( > > > > unsigned int h = READ_ONCE(gc->shrinker_hits); > > > > > > > > WRITE_ONCE(gc->shrinker_hits, h + 1); > > > > - queue_work_on(cpu, mp->m_inodegc_wq, &gc->work); > > > > + mod_delayed_work_on(cpu, mp->m_inodegc_wq, &gc->work, 0); > > > > no_items = false; > > > > } > > > > > > This all seems reasonable to me, but is there much practical benefit to > > > shrinker infra/feedback just to expedite a delayed work item by one > > > jiffy? Maybe there's a use case to continue to trigger throttling..? > > > > I haven't really considered doing anything other than fixing the > > reported bug. That just requires an API conversion for the existing > > "queue immediately" semantics and is the safest minimum change > > to fix the issue at hand. > > > > So, yes, the shrinker code may (or may not) be superfluous now, but > > I haven't looked at it and done analysis of the behaviour without > > the shrinkers enabled. I'll do that in a completely separate > > patchset if it turns out that it is not needed now. > > I think the shrinker part is still necessary -- bulkstat and xfs_scrub > on a very low memory machine (~560M RAM) opening and closing tens of > millions of files can still OOM the machine if one doesn't have a means > to slow down ->destroy_inode (and hence the next open()) when reclaim > really starts to dig in. Without the shrinker bits, it's even easier to > trigger OOM storms when xfs has timer-delayed inactivation... which is > something that Brian pointed out a year ago when we were reviewing the > initial inodegc patchset. > It wouldn't surprise me if the infrastructure is still necessary for the throttling use case. In that case, I'm more curious about things like whether it's still as effective as intended with such a small scheduling delay, or whether it still might be worth simplifying in various ways (i.e., does the scheduling delay actually make a difference? do we still need a per cpu granular throttle? etc.). > > > If > > > so, it looks like decent enough overhead to cycle through every cpu in > > > both callbacks that it might be worth spelling out more clearly in the > > > top-level comment. > > > > I'm not sure what you are asking here - mod_delayed_work_on() has > > pretty much the same overhead and behaviour as queue_work() in this > > case, so... ? > I'm just pointing out that the comment around the shrinker infrastructure isn't very informative if the shrinker turns out to still be necessary for reasons other than making the workers run sooner. > <shrug> Looks ok to me, since djwong-dev has had some variant of timer > delayed inactivation in it longer than it hasn't: > Was that with a correspondingly small delay or something larger (on the order of seconds or so)? Either way, it sounds like you have a predictable enough workload that can actually test this continues to work as expected..? Brian > Reviewed-by: Darrick J. Wong <djwong@kernel.org> > > --D > > > Cheers, > > > > Dave. > > -- > > Dave Chinner > > david@fromorbit.com > ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/2] xfs: bound maximum wait time for inodegc work 2022-06-22 16:13 ` Brian Foster @ 2022-06-23 0:25 ` Darrick J. Wong 2022-06-23 11:49 ` Brian Foster 0 siblings, 1 reply; 16+ messages in thread From: Darrick J. Wong @ 2022-06-23 0:25 UTC (permalink / raw) To: Brian Foster; +Cc: Dave Chinner, linux-xfs On Wed, Jun 22, 2022 at 12:13:54PM -0400, Brian Foster wrote: > On Tue, Jun 21, 2022 at 10:20:46PM -0700, Darrick J. Wong wrote: > > On Sat, Jun 18, 2022 at 07:52:45AM +1000, Dave Chinner wrote: > > > On Fri, Jun 17, 2022 at 12:34:38PM -0400, Brian Foster wrote: > > > > On Thu, Jun 16, 2022 at 08:04:15AM +1000, Dave Chinner wrote: > > > > > From: Dave Chinner <dchinner@redhat.com> > > > > > > > > > > Currently inodegc work can sit queued on the per-cpu queue until > > > > > the workqueue is either flushed of the queue reaches a depth that > > > > > triggers work queuing (and later throttling). This means that we > > > > > could queue work that waits for a long time for some other event to > > > > > trigger flushing. > > > > > > > > > > Hence instead of just queueing work at a specific depth, use a > > > > > delayed work that queues the work at a bound time. We can still > > > > > schedule the work immediately at a given depth, but we no long need > > > > > to worry about leaving a number of items on the list that won't get > > > > > processed until external events prevail. > > > > > > > > > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > > > > > --- > > > > > fs/xfs/xfs_icache.c | 36 ++++++++++++++++++++++-------------- > > > > > fs/xfs/xfs_mount.h | 2 +- > > > > > fs/xfs/xfs_super.c | 2 +- > > > > > 3 files changed, 24 insertions(+), 16 deletions(-) > > > > > > > > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > > > > > index 374b3bafaeb0..46b30ecf498c 100644 > > > > > --- a/fs/xfs/xfs_icache.c > > > > > +++ b/fs/xfs/xfs_icache.c > > > > ... > > > > > @@ -2176,7 +2184,7 @@ xfs_inodegc_shrinker_scan( > > > > > unsigned int h = READ_ONCE(gc->shrinker_hits); > > > > > > > > > > WRITE_ONCE(gc->shrinker_hits, h + 1); > > > > > - queue_work_on(cpu, mp->m_inodegc_wq, &gc->work); > > > > > + mod_delayed_work_on(cpu, mp->m_inodegc_wq, &gc->work, 0); > > > > > no_items = false; > > > > > } > > > > > > > > This all seems reasonable to me, but is there much practical benefit to > > > > shrinker infra/feedback just to expedite a delayed work item by one > > > > jiffy? Maybe there's a use case to continue to trigger throttling..? > > > > > > I haven't really considered doing anything other than fixing the > > > reported bug. That just requires an API conversion for the existing > > > "queue immediately" semantics and is the safest minimum change > > > to fix the issue at hand. > > > > > > So, yes, the shrinker code may (or may not) be superfluous now, but > > > I haven't looked at it and done analysis of the behaviour without > > > the shrinkers enabled. I'll do that in a completely separate > > > patchset if it turns out that it is not needed now. > > > > I think the shrinker part is still necessary -- bulkstat and xfs_scrub > > on a very low memory machine (~560M RAM) opening and closing tens of > > millions of files can still OOM the machine if one doesn't have a means > > to slow down ->destroy_inode (and hence the next open()) when reclaim > > really starts to dig in. Without the shrinker bits, it's even easier to > > trigger OOM storms when xfs has timer-delayed inactivation... which is > > something that Brian pointed out a year ago when we were reviewing the > > initial inodegc patchset. > > > > It wouldn't surprise me if the infrastructure is still necessary for the > throttling use case. In that case, I'm more curious about things like > whether it's still as effective as intended with such a small scheduling > delay, or whether it still might be worth simplifying in various ways > (i.e., does the scheduling delay actually make a difference? do we still > need a per cpu granular throttle? etc.). It can still be useful for certain g*dawful scenarios -- Let's say you have a horribly misconfigured cloudy system with a tiny log, hundreds of CPUs, a memory hogging process, another process with many hundreds of threads that are performing small appending synchronous writes to a large number of files, and some other process repeatedly opens and closes files. Background writeback completion will create enough workers to tie up the log such that writeback and inodegc contend for log grant space and make slow progress. If memory is also tight, we want to slow down the file scanning process so that it doesn't shove /more/ inodes into the cache and push the system towards OOM behavior. Back in the old days when inodegc was a radix tree tag it was fairly easy to get OOMs when the delay interval was long (5 seconds). The OOM probability went down pretty sharply as the interval approached zero, but even at 1 jiffy I could still occasionally trip it, whereas the pre-deferred-inactivation kernels would never OOM. I haven't tested it all that rigorously with Dave's fancy new per-cpu list design, but I did throw on my silly test setup (see below) and still got it to OOM once in 3 runs with the shrinker bits turned off. > > > > If > > > > so, it looks like decent enough overhead to cycle through every cpu in > > > > both callbacks that it might be worth spelling out more clearly in the > > > > top-level comment. > > > > > > I'm not sure what you are asking here - mod_delayed_work_on() has > > > pretty much the same overhead and behaviour as queue_work() in this > > > case, so... ? > > > > I'm just pointing out that the comment around the shrinker > infrastructure isn't very informative if the shrinker turns out to still > be necessary for reasons other than making the workers run sooner. <nod> That comment /does/ need to be updated to note the subtlety that a lot of shrinker activity can slow down close()ing a file by making user tasks wait for the inodegc workers to clear the backlog. > > <shrug> Looks ok to me, since djwong-dev has had some variant of timer > > delayed inactivation in it longer than it hasn't: > > > > Was that with a correspondingly small delay or something larger (on the > order of seconds or so)? Either way, it sounds like you have a > predictable enough workload that can actually test this continues to > work as expected..? Yeah. I snapshot /home (which has ~20 million inodes) then race fsstress and xfs_scrub -n in a VM with 560MB of RAM. --D > Brian > > > Reviewed-by: Darrick J. Wong <djwong@kernel.org> > > > > --D > > > > > Cheers, > > > > > > Dave. > > > -- > > > Dave Chinner > > > david@fromorbit.com > > > ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/2] xfs: bound maximum wait time for inodegc work 2022-06-23 0:25 ` Darrick J. Wong @ 2022-06-23 11:49 ` Brian Foster 2022-06-23 19:56 ` Darrick J. Wong 0 siblings, 1 reply; 16+ messages in thread From: Brian Foster @ 2022-06-23 11:49 UTC (permalink / raw) To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs On Wed, Jun 22, 2022 at 05:25:36PM -0700, Darrick J. Wong wrote: > On Wed, Jun 22, 2022 at 12:13:54PM -0400, Brian Foster wrote: > > On Tue, Jun 21, 2022 at 10:20:46PM -0700, Darrick J. Wong wrote: > > > On Sat, Jun 18, 2022 at 07:52:45AM +1000, Dave Chinner wrote: > > > > On Fri, Jun 17, 2022 at 12:34:38PM -0400, Brian Foster wrote: > > > > > On Thu, Jun 16, 2022 at 08:04:15AM +1000, Dave Chinner wrote: > > > > > > From: Dave Chinner <dchinner@redhat.com> > > > > > > > > > > > > Currently inodegc work can sit queued on the per-cpu queue until > > > > > > the workqueue is either flushed of the queue reaches a depth that > > > > > > triggers work queuing (and later throttling). This means that we > > > > > > could queue work that waits for a long time for some other event to > > > > > > trigger flushing. > > > > > > > > > > > > Hence instead of just queueing work at a specific depth, use a > > > > > > delayed work that queues the work at a bound time. We can still > > > > > > schedule the work immediately at a given depth, but we no long need > > > > > > to worry about leaving a number of items on the list that won't get > > > > > > processed until external events prevail. > > > > > > > > > > > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > > > > > > --- > > > > > > fs/xfs/xfs_icache.c | 36 ++++++++++++++++++++++-------------- > > > > > > fs/xfs/xfs_mount.h | 2 +- > > > > > > fs/xfs/xfs_super.c | 2 +- > > > > > > 3 files changed, 24 insertions(+), 16 deletions(-) > > > > > > > > > > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > > > > > > index 374b3bafaeb0..46b30ecf498c 100644 > > > > > > --- a/fs/xfs/xfs_icache.c > > > > > > +++ b/fs/xfs/xfs_icache.c > > > > > ... > > > > > > @@ -2176,7 +2184,7 @@ xfs_inodegc_shrinker_scan( > > > > > > unsigned int h = READ_ONCE(gc->shrinker_hits); > > > > > > > > > > > > WRITE_ONCE(gc->shrinker_hits, h + 1); > > > > > > - queue_work_on(cpu, mp->m_inodegc_wq, &gc->work); > > > > > > + mod_delayed_work_on(cpu, mp->m_inodegc_wq, &gc->work, 0); > > > > > > no_items = false; > > > > > > } > > > > > > > > > > This all seems reasonable to me, but is there much practical benefit to > > > > > shrinker infra/feedback just to expedite a delayed work item by one > > > > > jiffy? Maybe there's a use case to continue to trigger throttling..? > > > > > > > > I haven't really considered doing anything other than fixing the > > > > reported bug. That just requires an API conversion for the existing > > > > "queue immediately" semantics and is the safest minimum change > > > > to fix the issue at hand. > > > > > > > > So, yes, the shrinker code may (or may not) be superfluous now, but > > > > I haven't looked at it and done analysis of the behaviour without > > > > the shrinkers enabled. I'll do that in a completely separate > > > > patchset if it turns out that it is not needed now. > > > > > > I think the shrinker part is still necessary -- bulkstat and xfs_scrub > > > on a very low memory machine (~560M RAM) opening and closing tens of > > > millions of files can still OOM the machine if one doesn't have a means > > > to slow down ->destroy_inode (and hence the next open()) when reclaim > > > really starts to dig in. Without the shrinker bits, it's even easier to > > > trigger OOM storms when xfs has timer-delayed inactivation... which is > > > something that Brian pointed out a year ago when we were reviewing the > > > initial inodegc patchset. > > > > > > > It wouldn't surprise me if the infrastructure is still necessary for the > > throttling use case. In that case, I'm more curious about things like > > whether it's still as effective as intended with such a small scheduling > > delay, or whether it still might be worth simplifying in various ways > > (i.e., does the scheduling delay actually make a difference? do we still > > need a per cpu granular throttle? etc.). > > It can still be useful for certain g*dawful scenarios -- > > Let's say you have a horribly misconfigured cloudy system with a tiny > log, hundreds of CPUs, a memory hogging process, another process with > many hundreds of threads that are performing small appending synchronous > writes to a large number of files, and some other process repeatedly > opens and closes files. Background writeback completion will create > enough workers to tie up the log such that writeback and inodegc contend > for log grant space and make slow progress. If memory is also tight, > we want to slow down the file scanning process so that it doesn't shove > /more/ inodes into the cache and push the system towards OOM behavior. > Yeah, I get the general idea/purpose of the throttling. What I'm probing at here is whether a case like this is still handled effectively with such a short scheduling delay. Presumably there is some window before list size based throttling triggers for which the shrinker is expected to cover, so that implies the shrinker historically is able to detect and push populated queues and trigger throttling from the point it is invoked (whether directly via repeated shrinker invocations or indirectly via causing larger queue sizes is not clear to me). The thing that stands out to me as a question wrt to this change is that the trigger for shrinker induced throttling is the list size at the time of the callback(s), and that goes from having a lifecycle associated with the size-oriented scheduling algorithm to a time-based scheduling lifecycle of one jiffy (also noting that the inodegc worker resets shrinker_hits before it begins to process inodes). So with that in mind, how reliable is this lowmem signal based on the list size back to the tasks creating more work and memory pressure? Once a shrinker invocation occurs, what are the odds that the callback is able to detect a populated list and act accordingly? These questions are somewhat rhetorical because this all seems rather unpredictable when we consider varying resource availability. The relevant question for this patch is probably just that somebody has tested and confirmed that the shrinker hasn't been subtly or indirectly broken in cases like the one you describe above (where perhaps we might not have many shrinker callback opportunities to act on before OOM). > Back in the old days when inodegc was a radix tree tag it was fairly > easy to get OOMs when the delay interval was long (5 seconds). The > OOM probability went down pretty sharply as the interval approached > zero, but even at 1 jiffy I could still occasionally trip it, whereas > the pre-deferred-inactivation kernels would never OOM. > > I haven't tested it all that rigorously with Dave's fancy new per-cpu > list design, but I did throw on my silly test setup (see below) and > still got it to OOM once in 3 runs with the shrinker bits turned off. > Ok.. so that implies we still need throttling, but I'm not sure what "fancy percpu list design" refers to. If you have a good test case, I think the interesting immediate question is: are those OOMs avoided with this patch but the shrinker infrastructure still in place? If not, then I wonder if something is going wonky there. If so, I'm still a bit curious what the behavior looks like and whether it can be simplified in light of this change, but that's certainly beyond the scope of this patch. > > > > > If > > > > > so, it looks like decent enough overhead to cycle through every cpu in > > > > > both callbacks that it might be worth spelling out more clearly in the > > > > > top-level comment. > > > > > > > > I'm not sure what you are asking here - mod_delayed_work_on() has > > > > pretty much the same overhead and behaviour as queue_work() in this > > > > case, so... ? > > > > > > > I'm just pointing out that the comment around the shrinker > > infrastructure isn't very informative if the shrinker turns out to still > > be necessary for reasons other than making the workers run sooner. > > <nod> That comment /does/ need to be updated to note the subtlety that a > lot of shrinker activity can slow down close()ing a file by making user > tasks wait for the inodegc workers to clear the backlog. > > > > <shrug> Looks ok to me, since djwong-dev has had some variant of timer > > > delayed inactivation in it longer than it hasn't: > > > > > > > Was that with a correspondingly small delay or something larger (on the > > order of seconds or so)? Either way, it sounds like you have a > > predictable enough workload that can actually test this continues to > > work as expected..? > > Yeah. I snapshot /home (which has ~20 million inodes) then race > fsstress and xfs_scrub -n in a VM with 560MB of RAM. > Yeah small delay or yeah large delay? Brian > --D > > > Brian > > > > > Reviewed-by: Darrick J. Wong <djwong@kernel.org> > > > > > > --D > > > > > > > Cheers, > > > > > > > > Dave. > > > > -- > > > > Dave Chinner > > > > david@fromorbit.com > > > > > > ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/2] xfs: bound maximum wait time for inodegc work 2022-06-23 11:49 ` Brian Foster @ 2022-06-23 19:56 ` Darrick J. Wong 2022-06-24 12:39 ` Brian Foster 0 siblings, 1 reply; 16+ messages in thread From: Darrick J. Wong @ 2022-06-23 19:56 UTC (permalink / raw) To: Brian Foster; +Cc: Dave Chinner, linux-xfs On Thu, Jun 23, 2022 at 07:49:46AM -0400, Brian Foster wrote: > On Wed, Jun 22, 2022 at 05:25:36PM -0700, Darrick J. Wong wrote: > > On Wed, Jun 22, 2022 at 12:13:54PM -0400, Brian Foster wrote: > > > On Tue, Jun 21, 2022 at 10:20:46PM -0700, Darrick J. Wong wrote: > > > > On Sat, Jun 18, 2022 at 07:52:45AM +1000, Dave Chinner wrote: > > > > > On Fri, Jun 17, 2022 at 12:34:38PM -0400, Brian Foster wrote: > > > > > > On Thu, Jun 16, 2022 at 08:04:15AM +1000, Dave Chinner wrote: > > > > > > > From: Dave Chinner <dchinner@redhat.com> > > > > > > > > > > > > > > Currently inodegc work can sit queued on the per-cpu queue until > > > > > > > the workqueue is either flushed of the queue reaches a depth that > > > > > > > triggers work queuing (and later throttling). This means that we > > > > > > > could queue work that waits for a long time for some other event to > > > > > > > trigger flushing. > > > > > > > > > > > > > > Hence instead of just queueing work at a specific depth, use a > > > > > > > delayed work that queues the work at a bound time. We can still > > > > > > > schedule the work immediately at a given depth, but we no long need > > > > > > > to worry about leaving a number of items on the list that won't get > > > > > > > processed until external events prevail. > > > > > > > > > > > > > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > > > > > > > --- > > > > > > > fs/xfs/xfs_icache.c | 36 ++++++++++++++++++++++-------------- > > > > > > > fs/xfs/xfs_mount.h | 2 +- > > > > > > > fs/xfs/xfs_super.c | 2 +- > > > > > > > 3 files changed, 24 insertions(+), 16 deletions(-) > > > > > > > > > > > > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > > > > > > > index 374b3bafaeb0..46b30ecf498c 100644 > > > > > > > --- a/fs/xfs/xfs_icache.c > > > > > > > +++ b/fs/xfs/xfs_icache.c > > > > > > ... > > > > > > > @@ -2176,7 +2184,7 @@ xfs_inodegc_shrinker_scan( > > > > > > > unsigned int h = READ_ONCE(gc->shrinker_hits); > > > > > > > > > > > > > > WRITE_ONCE(gc->shrinker_hits, h + 1); > > > > > > > - queue_work_on(cpu, mp->m_inodegc_wq, &gc->work); > > > > > > > + mod_delayed_work_on(cpu, mp->m_inodegc_wq, &gc->work, 0); > > > > > > > no_items = false; > > > > > > > } > > > > > > > > > > > > This all seems reasonable to me, but is there much practical benefit to > > > > > > shrinker infra/feedback just to expedite a delayed work item by one > > > > > > jiffy? Maybe there's a use case to continue to trigger throttling..? > > > > > > > > > > I haven't really considered doing anything other than fixing the > > > > > reported bug. That just requires an API conversion for the existing > > > > > "queue immediately" semantics and is the safest minimum change > > > > > to fix the issue at hand. > > > > > > > > > > So, yes, the shrinker code may (or may not) be superfluous now, but > > > > > I haven't looked at it and done analysis of the behaviour without > > > > > the shrinkers enabled. I'll do that in a completely separate > > > > > patchset if it turns out that it is not needed now. > > > > > > > > I think the shrinker part is still necessary -- bulkstat and xfs_scrub > > > > on a very low memory machine (~560M RAM) opening and closing tens of > > > > millions of files can still OOM the machine if one doesn't have a means > > > > to slow down ->destroy_inode (and hence the next open()) when reclaim > > > > really starts to dig in. Without the shrinker bits, it's even easier to > > > > trigger OOM storms when xfs has timer-delayed inactivation... which is > > > > something that Brian pointed out a year ago when we were reviewing the > > > > initial inodegc patchset. > > > > > > > > > > It wouldn't surprise me if the infrastructure is still necessary for the > > > throttling use case. In that case, I'm more curious about things like > > > whether it's still as effective as intended with such a small scheduling > > > delay, or whether it still might be worth simplifying in various ways > > > (i.e., does the scheduling delay actually make a difference? do we still > > > need a per cpu granular throttle? etc.). > > > > It can still be useful for certain g*dawful scenarios -- > > > > Let's say you have a horribly misconfigured cloudy system with a tiny > > log, hundreds of CPUs, a memory hogging process, another process with > > many hundreds of threads that are performing small appending synchronous > > writes to a large number of files, and some other process repeatedly > > opens and closes files. Background writeback completion will create > > enough workers to tie up the log such that writeback and inodegc contend > > for log grant space and make slow progress. If memory is also tight, > > we want to slow down the file scanning process so that it doesn't shove > > /more/ inodes into the cache and push the system towards OOM behavior. > > > > Yeah, I get the general idea/purpose of the throttling. What I'm probing > at here is whether a case like this is still handled effectively with > such a short scheduling delay. Given the trace_xfs_inodegc_shrinker_* data I've collected on my simulator, I think the 1 jiffy delay is handling this well enough that we only trip that tracepoint about a half dozen times in 2h of trying to exercising the inode caches. That said, I haven't really had time to try this on (say) a 20CPU machine with ridiculously small memory to see if I get different results. (Note: I'm still trying to figure out why generic/522 reports corruption, and hoping that willy's folios pull today just fixes it magically...) > Presumably there is some window before > list size based throttling triggers for which the shrinker is expected > to cover, so that implies the shrinker historically is able to detect > and push populated queues and trigger throttling from the point it is > invoked (whether directly via repeated shrinker invocations or > indirectly via causing larger queue sizes is not clear to me). Back when I merged the inodegc series, it was fairly easy to get the shrinker to trip dozens of times during a stress run, even when the only throttling criteria was the queue size. I think the reflink/rmap transaction reservation optimizations have made the log grant bottlenecks much less severe, since I now see a lot less contention on the grant heads. > The thing that stands out to me as a question wrt to this change is that > the trigger for shrinker induced throttling is the list size at the time > of the callback(s), and that goes from having a lifecycle associated > with the size-oriented scheduling algorithm to a time-based scheduling > lifecycle of one jiffy (also noting that the inodegc worker resets > shrinker_hits before it begins to process inodes). So with that in mind, > how reliable is this lowmem signal based on the list size back to the > tasks creating more work and memory pressure? Once a shrinker invocation > occurs, what are the odds that the callback is able to detect a > populated list and act accordingly? Hrmmm. Decent, I think? If the list is populated but the inactivation worker is not running, then we queue the worker and clear the list; if the worker is already running when the shrinker gets called, we might end up requeuing it unnecessarily, in which case it'll just clear more. > These questions are somewhat rhetorical because this all seems rather > unpredictable when we consider varying resource availability. Not to mention the shrinker itself undergoing various behavioral changes over the years... :( > The > relevant question for this patch is probably just that somebody has > tested and confirmed that the shrinker hasn't been subtly or indirectly > broken in cases like the one you describe above (where perhaps we might > not have many shrinker callback opportunities to act on before OOM). FWIW, the only time I /ever/ saw OOMs (either now with 5.19 or ages ago with 5.9) was with extreme lowmem testing. On VMs with more than a gigabyte or so of memory, I notice that we usually hit the time/length thresholds without shrinkers getting involved. Granted, I did try to pick the shrinker values so that we only get called on the *second* level of shrinking, which is after we've freed some but not enough memory. If my responses seem a bit like handwaving, they are, because (at the moment) this is well off in the weeds. > > Back in the old days when inodegc was a radix tree tag it was fairly > > easy to get OOMs when the delay interval was long (5 seconds). The > > OOM probability went down pretty sharply as the interval approached > > zero, but even at 1 jiffy I could still occasionally trip it, whereas > > the pre-deferred-inactivation kernels would never OOM. > > > > I haven't tested it all that rigorously with Dave's fancy new per-cpu > > list design, but I did throw on my silly test setup (see below) and > > still got it to OOM once in 3 runs with the shrinker bits turned off. > > > > Ok.. so that implies we still need throttling, but I'm not sure what > "fancy percpu list design" refers to. If you have a good test case, I You might recall that the original versions of deferred inode inactivation would set radix tree tags, and a per-AG workqueue function would scan that AG's radix tree for tagged inodes and inactivate them. Dave observed that letting the cpu caches grow cold led to performance problems, and replaced the tagging mechanism with per-cpu lists, which is what we have now. > think the interesting immediate question is: are those OOMs avoided with > this patch but the shrinker infrastructure still in place? That's the $64000 question that I don't know definitively yet. > If not, then I wonder if something is going wonky there. If so, I'm > still a bit curious what the behavior looks like and whether it can be > simplified in light of this change, but that's certainly beyond the > scope of this patch. <nod> For now I'd like to get this going for 5.19 to fix the original complaint about statfs blocking in D state while waiting for inodegc after deleting a ~100m file extents[1], and defer the decision of whether or not we /really/ need the shrinker to a later time (like 5.20). [1] https://lore.kernel.org/linux-xfs/20220509024659.GA62606@onthe.net.au/ > > > > > > > If > > > > > > so, it looks like decent enough overhead to cycle through every cpu in > > > > > > both callbacks that it might be worth spelling out more clearly in the > > > > > > top-level comment. > > > > > > > > > > I'm not sure what you are asking here - mod_delayed_work_on() has > > > > > pretty much the same overhead and behaviour as queue_work() in this > > > > > case, so... ? > > > > > > > > > > I'm just pointing out that the comment around the shrinker > > > infrastructure isn't very informative if the shrinker turns out to still > > > be necessary for reasons other than making the workers run sooner. > > > > <nod> That comment /does/ need to be updated to note the subtlety that a > > lot of shrinker activity can slow down close()ing a file by making user > > tasks wait for the inodegc workers to clear the backlog. > > > > > > <shrug> Looks ok to me, since djwong-dev has had some variant of timer > > > > delayed inactivation in it longer than it hasn't: > > > > > > > > > > Was that with a correspondingly small delay or something larger (on the > > > order of seconds or so)? Either way, it sounds like you have a > > > predictable enough workload that can actually test this continues to > > > work as expected..? > > > > Yeah. I snapshot /home (which has ~20 million inodes) then race > > fsstress and xfs_scrub -n in a VM with 560MB of RAM. > > > > Yeah small delay or yeah large delay? Both -- with large inactivation delays and no shrinker, OOMs happen fairly frequently; with a short delay and no shrinker, they're harder (but still not impossible) to trigger. Granted ... even the overall frequency of OOMs with large inactivation delays seems to have gone down a bit from when I was more actively testing in the 5.9 era. --D > Brian > > > --D > > > > > Brian > > > > > > > Reviewed-by: Darrick J. Wong <djwong@kernel.org> > > > > > > > > --D > > > > > > > > > Cheers, > > > > > > > > > > Dave. > > > > > -- > > > > > Dave Chinner > > > > > david@fromorbit.com > > > > > > > > > > ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/2] xfs: bound maximum wait time for inodegc work 2022-06-23 19:56 ` Darrick J. Wong @ 2022-06-24 12:39 ` Brian Foster 2022-06-25 1:03 ` Darrick J. Wong 0 siblings, 1 reply; 16+ messages in thread From: Brian Foster @ 2022-06-24 12:39 UTC (permalink / raw) To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs On Thu, Jun 23, 2022 at 12:56:40PM -0700, Darrick J. Wong wrote: > On Thu, Jun 23, 2022 at 07:49:46AM -0400, Brian Foster wrote: > > On Wed, Jun 22, 2022 at 05:25:36PM -0700, Darrick J. Wong wrote: > > > On Wed, Jun 22, 2022 at 12:13:54PM -0400, Brian Foster wrote: > > > > On Tue, Jun 21, 2022 at 10:20:46PM -0700, Darrick J. Wong wrote: > > > > > On Sat, Jun 18, 2022 at 07:52:45AM +1000, Dave Chinner wrote: > > > > > > On Fri, Jun 17, 2022 at 12:34:38PM -0400, Brian Foster wrote: > > > > > > > On Thu, Jun 16, 2022 at 08:04:15AM +1000, Dave Chinner wrote: > > > > > > > > From: Dave Chinner <dchinner@redhat.com> > > > > > > > > > > > > > > > > Currently inodegc work can sit queued on the per-cpu queue until > > > > > > > > the workqueue is either flushed of the queue reaches a depth that > > > > > > > > triggers work queuing (and later throttling). This means that we > > > > > > > > could queue work that waits for a long time for some other event to > > > > > > > > trigger flushing. > > > > > > > > > > > > > > > > Hence instead of just queueing work at a specific depth, use a > > > > > > > > delayed work that queues the work at a bound time. We can still > > > > > > > > schedule the work immediately at a given depth, but we no long need > > > > > > > > to worry about leaving a number of items on the list that won't get > > > > > > > > processed until external events prevail. > > > > > > > > > > > > > > > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > > > > > > > > --- > > > > > > > > fs/xfs/xfs_icache.c | 36 ++++++++++++++++++++++-------------- > > > > > > > > fs/xfs/xfs_mount.h | 2 +- > > > > > > > > fs/xfs/xfs_super.c | 2 +- > > > > > > > > 3 files changed, 24 insertions(+), 16 deletions(-) > > > > > > > > > > > > > > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > > > > > > > > index 374b3bafaeb0..46b30ecf498c 100644 > > > > > > > > --- a/fs/xfs/xfs_icache.c > > > > > > > > +++ b/fs/xfs/xfs_icache.c > > > > > > > ... > > > > > > > > @@ -2176,7 +2184,7 @@ xfs_inodegc_shrinker_scan( > > > > > > > > unsigned int h = READ_ONCE(gc->shrinker_hits); > > > > > > > > > > > > > > > > WRITE_ONCE(gc->shrinker_hits, h + 1); > > > > > > > > - queue_work_on(cpu, mp->m_inodegc_wq, &gc->work); > > > > > > > > + mod_delayed_work_on(cpu, mp->m_inodegc_wq, &gc->work, 0); > > > > > > > > no_items = false; > > > > > > > > } > > > > > > > > > > > > > > This all seems reasonable to me, but is there much practical benefit to > > > > > > > shrinker infra/feedback just to expedite a delayed work item by one > > > > > > > jiffy? Maybe there's a use case to continue to trigger throttling..? > > > > > > > > > > > > I haven't really considered doing anything other than fixing the > > > > > > reported bug. That just requires an API conversion for the existing > > > > > > "queue immediately" semantics and is the safest minimum change > > > > > > to fix the issue at hand. > > > > > > > > > > > > So, yes, the shrinker code may (or may not) be superfluous now, but > > > > > > I haven't looked at it and done analysis of the behaviour without > > > > > > the shrinkers enabled. I'll do that in a completely separate > > > > > > patchset if it turns out that it is not needed now. > > > > > > > > > > I think the shrinker part is still necessary -- bulkstat and xfs_scrub > > > > > on a very low memory machine (~560M RAM) opening and closing tens of > > > > > millions of files can still OOM the machine if one doesn't have a means > > > > > to slow down ->destroy_inode (and hence the next open()) when reclaim > > > > > really starts to dig in. Without the shrinker bits, it's even easier to > > > > > trigger OOM storms when xfs has timer-delayed inactivation... which is > > > > > something that Brian pointed out a year ago when we were reviewing the > > > > > initial inodegc patchset. > > > > > > > > > > > > > It wouldn't surprise me if the infrastructure is still necessary for the > > > > throttling use case. In that case, I'm more curious about things like > > > > whether it's still as effective as intended with such a small scheduling > > > > delay, or whether it still might be worth simplifying in various ways > > > > (i.e., does the scheduling delay actually make a difference? do we still > > > > need a per cpu granular throttle? etc.). > > > > > > It can still be useful for certain g*dawful scenarios -- > > > > > > Let's say you have a horribly misconfigured cloudy system with a tiny > > > log, hundreds of CPUs, a memory hogging process, another process with > > > many hundreds of threads that are performing small appending synchronous > > > writes to a large number of files, and some other process repeatedly > > > opens and closes files. Background writeback completion will create > > > enough workers to tie up the log such that writeback and inodegc contend > > > for log grant space and make slow progress. If memory is also tight, > > > we want to slow down the file scanning process so that it doesn't shove > > > /more/ inodes into the cache and push the system towards OOM behavior. > > > > > > > Yeah, I get the general idea/purpose of the throttling. What I'm probing > > at here is whether a case like this is still handled effectively with > > such a short scheduling delay. > > Given the trace_xfs_inodegc_shrinker_* data I've collected on my > simulator, I think the 1 jiffy delay is handling this well enough that > we only trip that tracepoint about a half dozen times in 2h of trying to > exercising the inode caches. That said, I haven't really had time to > try this on (say) a 20CPU machine with ridiculously small memory to see > if I get different results. > > (Note: I'm still trying to figure out why generic/522 reports > corruption, and hoping that willy's folios pull today just fixes it > magically...) > > > Presumably there is some window before > > list size based throttling triggers for which the shrinker is expected > > to cover, so that implies the shrinker historically is able to detect > > and push populated queues and trigger throttling from the point it is > > invoked (whether directly via repeated shrinker invocations or > > indirectly via causing larger queue sizes is not clear to me). > > Back when I merged the inodegc series, it was fairly easy to get the > shrinker to trip dozens of times during a stress run, even when the only > throttling criteria was the queue size. I think the reflink/rmap > transaction reservation optimizations have made the log grant > bottlenecks much less severe, since I now see a lot less contention on > the grant heads. > > > The thing that stands out to me as a question wrt to this change is that > > the trigger for shrinker induced throttling is the list size at the time > > of the callback(s), and that goes from having a lifecycle associated > > with the size-oriented scheduling algorithm to a time-based scheduling > > lifecycle of one jiffy (also noting that the inodegc worker resets > > shrinker_hits before it begins to process inodes). So with that in mind, > > how reliable is this lowmem signal based on the list size back to the > > tasks creating more work and memory pressure? Once a shrinker invocation > > occurs, what are the odds that the callback is able to detect a > > populated list and act accordingly? > > Hrmmm. Decent, I think? If the list is populated but the inactivation > worker is not running, then we queue the worker and clear the list; if > the worker is already running when the shrinker gets called, we might > end up requeuing it unnecessarily, in which case it'll just clear more. > Sure, that sounds harmless, but I was thinking more about the likelihood of missing throttles than spurious work cycles and such.. Generally with this change, the worker is pretty much always scheduled once an inode is queued. It executes once a tick passes where another destroy hasn't otherwise come along to kick it out by another tick. So if I run a sustained removal, I see similar batching/scheduling as since before this change. If I run something with a mix of removes and $something_else, then it doesn't seem to take much to skip batching and run a worker cycle per inode. So what happens if the shrinker scan sees one or more workers that happen to execute before the scan callback is able to actually trigger throttling? Is that critical, or is this enough of a fallback mechanism that it's unlikely to be important? Similarly if one is running a workload that results in behavior as described above, is this sort of "oneshot shrinker to worker throttle" actually still going to slow down the worker tasks? Is shrinker callback frequency high enough to actually make an impact? Etc. These are questions I don't really have enough context to answer atm, nor can say whether or not may result in problems... > > These questions are somewhat rhetorical because this all seems rather > > unpredictable when we consider varying resource availability. > > Not to mention the shrinker itself undergoing various behavioral changes > over the years... :( > > > The > > relevant question for this patch is probably just that somebody has > > tested and confirmed that the shrinker hasn't been subtly or indirectly > > broken in cases like the one you describe above (where perhaps we might > > not have many shrinker callback opportunities to act on before OOM). > > FWIW, the only time I /ever/ saw OOMs (either now with 5.19 or ages ago > with 5.9) was with extreme lowmem testing. On VMs with more than a > gigabyte or so of memory, I notice that we usually hit the time/length > thresholds without shrinkers getting involved. Granted, I did try to > pick the shrinker values so that we only get called on the *second* > level of shrinking, which is after we've freed some but not enough > memory. > > If my responses seem a bit like handwaving, they are, because (at the > moment) this is well off in the weeds. > > > > Back in the old days when inodegc was a radix tree tag it was fairly > > > easy to get OOMs when the delay interval was long (5 seconds). The > > > OOM probability went down pretty sharply as the interval approached > > > zero, but even at 1 jiffy I could still occasionally trip it, whereas > > > the pre-deferred-inactivation kernels would never OOM. > > > > > > I haven't tested it all that rigorously with Dave's fancy new per-cpu > > > list design, but I did throw on my silly test setup (see below) and > > > still got it to OOM once in 3 runs with the shrinker bits turned off. > > > > > > > Ok.. so that implies we still need throttling, but I'm not sure what > > "fancy percpu list design" refers to. If you have a good test case, I > > You might recall that the original versions of deferred inode > inactivation would set radix tree tags, and a per-AG workqueue function > would scan that AG's radix tree for tagged inodes and inactivate them. > Dave observed that letting the cpu caches grow cold led to performance > problems, and replaced the tagging mechanism with per-cpu lists, which > is what we have now. > > > think the interesting immediate question is: are those OOMs avoided with > > this patch but the shrinker infrastructure still in place? > > That's the $64000 question that I don't know definitively yet. > Ok. If the answer to the above questions is "don't really know, it might be broken, but historical testing shows the shrinker is much less important with smaller queuing delays such that it's unlikely to cause any new problems on anything but the most pathological workloads," then that sounds reasonable to me. I don't mean to harp on it. I just want to make sure the question/concern gets across clearly enough that somebody who knows this system better than I can grok it and make a more intelligent assessment. Brian > > If not, then I wonder if something is going wonky there. If so, I'm > > still a bit curious what the behavior looks like and whether it can be > > simplified in light of this change, but that's certainly beyond the > > scope of this patch. > > <nod> For now I'd like to get this going for 5.19 to fix the original > complaint about statfs blocking in D state while waiting for inodegc > after deleting a ~100m file extents[1], and defer the decision of > whether or not we /really/ need the shrinker to a later time (like > 5.20). > > [1] https://lore.kernel.org/linux-xfs/20220509024659.GA62606@onthe.net.au/ > > > > > > > > > > If > > > > > > > so, it looks like decent enough overhead to cycle through every cpu in > > > > > > > both callbacks that it might be worth spelling out more clearly in the > > > > > > > top-level comment. > > > > > > > > > > > > I'm not sure what you are asking here - mod_delayed_work_on() has > > > > > > pretty much the same overhead and behaviour as queue_work() in this > > > > > > case, so... ? > > > > > > > > > > > > > I'm just pointing out that the comment around the shrinker > > > > infrastructure isn't very informative if the shrinker turns out to still > > > > be necessary for reasons other than making the workers run sooner. > > > > > > <nod> That comment /does/ need to be updated to note the subtlety that a > > > lot of shrinker activity can slow down close()ing a file by making user > > > tasks wait for the inodegc workers to clear the backlog. > > > > > > > > <shrug> Looks ok to me, since djwong-dev has had some variant of timer > > > > > delayed inactivation in it longer than it hasn't: > > > > > > > > > > > > > Was that with a correspondingly small delay or something larger (on the > > > > order of seconds or so)? Either way, it sounds like you have a > > > > predictable enough workload that can actually test this continues to > > > > work as expected..? > > > > > > Yeah. I snapshot /home (which has ~20 million inodes) then race > > > fsstress and xfs_scrub -n in a VM with 560MB of RAM. > > > > > > > Yeah small delay or yeah large delay? > > Both -- with large inactivation delays and no shrinker, OOMs happen > fairly frequently; with a short delay and no shrinker, they're harder > (but still not impossible) to trigger. > > Granted ... even the overall frequency of OOMs with large inactivation > delays seems to have gone down a bit from when I was more actively > testing in the 5.9 era. > > --D > > > Brian > > > > > --D > > > > > > > Brian > > > > > > > > > Reviewed-by: Darrick J. Wong <djwong@kernel.org> > > > > > > > > > > --D > > > > > > > > > > > Cheers, > > > > > > > > > > > > Dave. > > > > > > -- > > > > > > Dave Chinner > > > > > > david@fromorbit.com > > > > > > > > > > > > > > > ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/2] xfs: bound maximum wait time for inodegc work 2022-06-24 12:39 ` Brian Foster @ 2022-06-25 1:03 ` Darrick J. Wong 0 siblings, 0 replies; 16+ messages in thread From: Darrick J. Wong @ 2022-06-25 1:03 UTC (permalink / raw) To: Brian Foster; +Cc: Dave Chinner, linux-xfs On Fri, Jun 24, 2022 at 08:39:32AM -0400, Brian Foster wrote: > On Thu, Jun 23, 2022 at 12:56:40PM -0700, Darrick J. Wong wrote: > > On Thu, Jun 23, 2022 at 07:49:46AM -0400, Brian Foster wrote: > > > On Wed, Jun 22, 2022 at 05:25:36PM -0700, Darrick J. Wong wrote: > > > > On Wed, Jun 22, 2022 at 12:13:54PM -0400, Brian Foster wrote: > > > > > On Tue, Jun 21, 2022 at 10:20:46PM -0700, Darrick J. Wong wrote: > > > > > > On Sat, Jun 18, 2022 at 07:52:45AM +1000, Dave Chinner wrote: > > > > > > > On Fri, Jun 17, 2022 at 12:34:38PM -0400, Brian Foster wrote: > > > > > > > > On Thu, Jun 16, 2022 at 08:04:15AM +1000, Dave Chinner wrote: > > > > > > > > > From: Dave Chinner <dchinner@redhat.com> > > > > > > > > > > > > > > > > > > Currently inodegc work can sit queued on the per-cpu queue until > > > > > > > > > the workqueue is either flushed of the queue reaches a depth that > > > > > > > > > triggers work queuing (and later throttling). This means that we > > > > > > > > > could queue work that waits for a long time for some other event to > > > > > > > > > trigger flushing. > > > > > > > > > > > > > > > > > > Hence instead of just queueing work at a specific depth, use a > > > > > > > > > delayed work that queues the work at a bound time. We can still > > > > > > > > > schedule the work immediately at a given depth, but we no long need > > > > > > > > > to worry about leaving a number of items on the list that won't get > > > > > > > > > processed until external events prevail. > > > > > > > > > > > > > > > > > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > > > > > > > > > --- > > > > > > > > > fs/xfs/xfs_icache.c | 36 ++++++++++++++++++++++-------------- > > > > > > > > > fs/xfs/xfs_mount.h | 2 +- > > > > > > > > > fs/xfs/xfs_super.c | 2 +- > > > > > > > > > 3 files changed, 24 insertions(+), 16 deletions(-) > > > > > > > > > > > > > > > > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > > > > > > > > > index 374b3bafaeb0..46b30ecf498c 100644 > > > > > > > > > --- a/fs/xfs/xfs_icache.c > > > > > > > > > +++ b/fs/xfs/xfs_icache.c > > > > > > > > ... > > > > > > > > > @@ -2176,7 +2184,7 @@ xfs_inodegc_shrinker_scan( > > > > > > > > > unsigned int h = READ_ONCE(gc->shrinker_hits); > > > > > > > > > > > > > > > > > > WRITE_ONCE(gc->shrinker_hits, h + 1); > > > > > > > > > - queue_work_on(cpu, mp->m_inodegc_wq, &gc->work); > > > > > > > > > + mod_delayed_work_on(cpu, mp->m_inodegc_wq, &gc->work, 0); > > > > > > > > > no_items = false; > > > > > > > > > } > > > > > > > > > > > > > > > > This all seems reasonable to me, but is there much practical benefit to > > > > > > > > shrinker infra/feedback just to expedite a delayed work item by one > > > > > > > > jiffy? Maybe there's a use case to continue to trigger throttling..? > > > > > > > > > > > > > > I haven't really considered doing anything other than fixing the > > > > > > > reported bug. That just requires an API conversion for the existing > > > > > > > "queue immediately" semantics and is the safest minimum change > > > > > > > to fix the issue at hand. > > > > > > > > > > > > > > So, yes, the shrinker code may (or may not) be superfluous now, but > > > > > > > I haven't looked at it and done analysis of the behaviour without > > > > > > > the shrinkers enabled. I'll do that in a completely separate > > > > > > > patchset if it turns out that it is not needed now. > > > > > > > > > > > > I think the shrinker part is still necessary -- bulkstat and xfs_scrub > > > > > > on a very low memory machine (~560M RAM) opening and closing tens of > > > > > > millions of files can still OOM the machine if one doesn't have a means > > > > > > to slow down ->destroy_inode (and hence the next open()) when reclaim > > > > > > really starts to dig in. Without the shrinker bits, it's even easier to > > > > > > trigger OOM storms when xfs has timer-delayed inactivation... which is > > > > > > something that Brian pointed out a year ago when we were reviewing the > > > > > > initial inodegc patchset. > > > > > > > > > > > > > > > > It wouldn't surprise me if the infrastructure is still necessary for the > > > > > throttling use case. In that case, I'm more curious about things like > > > > > whether it's still as effective as intended with such a small scheduling > > > > > delay, or whether it still might be worth simplifying in various ways > > > > > (i.e., does the scheduling delay actually make a difference? do we still > > > > > need a per cpu granular throttle? etc.). > > > > > > > > It can still be useful for certain g*dawful scenarios -- > > > > > > > > Let's say you have a horribly misconfigured cloudy system with a tiny > > > > log, hundreds of CPUs, a memory hogging process, another process with > > > > many hundreds of threads that are performing small appending synchronous > > > > writes to a large number of files, and some other process repeatedly > > > > opens and closes files. Background writeback completion will create > > > > enough workers to tie up the log such that writeback and inodegc contend > > > > for log grant space and make slow progress. If memory is also tight, > > > > we want to slow down the file scanning process so that it doesn't shove > > > > /more/ inodes into the cache and push the system towards OOM behavior. > > > > > > > > > > Yeah, I get the general idea/purpose of the throttling. What I'm probing > > > at here is whether a case like this is still handled effectively with > > > such a short scheduling delay. > > > > Given the trace_xfs_inodegc_shrinker_* data I've collected on my > > simulator, I think the 1 jiffy delay is handling this well enough that > > we only trip that tracepoint about a half dozen times in 2h of trying to > > exercising the inode caches. That said, I haven't really had time to > > try this on (say) a 20CPU machine with ridiculously small memory to see > > if I get different results. > > > > (Note: I'm still trying to figure out why generic/522 reports > > corruption, and hoping that willy's folios pull today just fixes it > > magically...) > > > > > Presumably there is some window before > > > list size based throttling triggers for which the shrinker is expected > > > to cover, so that implies the shrinker historically is able to detect > > > and push populated queues and trigger throttling from the point it is > > > invoked (whether directly via repeated shrinker invocations or > > > indirectly via causing larger queue sizes is not clear to me). > > > > Back when I merged the inodegc series, it was fairly easy to get the > > shrinker to trip dozens of times during a stress run, even when the only > > throttling criteria was the queue size. I think the reflink/rmap > > transaction reservation optimizations have made the log grant > > bottlenecks much less severe, since I now see a lot less contention on > > the grant heads. > > > > > The thing that stands out to me as a question wrt to this change is that > > > the trigger for shrinker induced throttling is the list size at the time > > > of the callback(s), and that goes from having a lifecycle associated > > > with the size-oriented scheduling algorithm to a time-based scheduling > > > lifecycle of one jiffy (also noting that the inodegc worker resets > > > shrinker_hits before it begins to process inodes). So with that in mind, > > > how reliable is this lowmem signal based on the list size back to the > > > tasks creating more work and memory pressure? Once a shrinker invocation > > > occurs, what are the odds that the callback is able to detect a > > > populated list and act accordingly? > > > > Hrmmm. Decent, I think? If the list is populated but the inactivation > > worker is not running, then we queue the worker and clear the list; if > > the worker is already running when the shrinker gets called, we might > > end up requeuing it unnecessarily, in which case it'll just clear more. > > > > Sure, that sounds harmless, but I was thinking more about the likelihood > of missing throttles than spurious work cycles and such.. > > Generally with this change, the worker is pretty much always scheduled > once an inode is queued. It executes once a tick passes where another > destroy hasn't otherwise come along to kick it out by another tick. So > if I run a sustained removal, I see similar batching/scheduling as since > before this change. If I run something with a mix of removes and > $something_else, then it doesn't seem to take much to skip batching and > run a worker cycle per inode. > > So what happens if the shrinker scan sees one or more workers that Er... by "shrinker scan", do you mean xfs_inodegc_shrinker_count? And by "scan callback", do you mean xfs_inodegc_shrinker_scan? > happen to execute before the scan callback is able to actually trigger > throttling? Is that critical, or is this enough of a fallback mechanism > that it's unlikely to be important? I /think/ it's possible to miss throttling opportunities, but I also suspect it's relatively unimportant because of the low maximum queue size. Let's say _count runs, discovers non-empty lists, and triggers the callback. If the workers run after _count returns but before _scan gets a chance to run, then they'll have processed the queued inodes and _scan won't set shrinker_hits anywhere. Yes, a _destroy_inode won't get throttled, but the system is at least making some forward progress clearing the backlog and (hopefully) freeing memory. If there are a /lot/ of destroy_inode calls, they'll eventually hit the queue size limit (32) and throttle on the workers. If there aren't any further destroy_inode calls, there's nothing to throttle. If however there's some destroy_inode calls but not enough to hit the queue limit, then I suppose we're reliant on the shrinker to get called again. This is of course trivially satisfied if there are other processes that also enter the shrinker code and trigger more scans. However, I suppose there /is/ a small hole here -- if there was only that one process in the shrinker and _scan doesn't find anything, the shrinker will not call us back. In that case, the only way we trigger throttling is if there's enough destroy_inode activity to hit a queue limit. Dave and I have talked about closing this hole by queuing the worker immediately if the inode being destroyed has more than a certain number of extents or something that acts as a proxy for memory consumption, but so far as we can tell, shrinker invocations are rare and missed throttling hasn't shown up in a noticeable way. > Similarly if one is running a > workload that results in behavior as described above, is this sort of > "oneshot shrinker to worker throttle" actually still going to slow down > the worker tasks? Is shrinker callback frequency high enough to actually > make an impact? Etc. It seems to, but with a smallish sample size on a normal system, which means I've only seen the shrinker throttle show up in the trace output if I use the insane lowmem scenario or artificially crank up the delay to 30 seconds. The good news is that in those scenarios, it /does/ slow things down enough that we mostly don't OOM. (Though at this point it's really hard to go back for an apples-apples comparison with the pre-inodegc code.) > These are questions I don't really have enough context to answer atm, > nor can say whether or not may result in problems... > > > > These questions are somewhat rhetorical because this all seems rather > > > unpredictable when we consider varying resource availability. > > > > Not to mention the shrinker itself undergoing various behavioral changes > > over the years... :( > > > > > The > > > relevant question for this patch is probably just that somebody has > > > tested and confirmed that the shrinker hasn't been subtly or indirectly > > > broken in cases like the one you describe above (where perhaps we might > > > not have many shrinker callback opportunities to act on before OOM). > > > > FWIW, the only time I /ever/ saw OOMs (either now with 5.19 or ages ago > > with 5.9) was with extreme lowmem testing. On VMs with more than a > > gigabyte or so of memory, I notice that we usually hit the time/length > > thresholds without shrinkers getting involved. Granted, I did try to > > pick the shrinker values so that we only get called on the *second* > > level of shrinking, which is after we've freed some but not enough > > memory. > > > > If my responses seem a bit like handwaving, they are, because (at the > > moment) this is well off in the weeds. > > > > > > Back in the old days when inodegc was a radix tree tag it was fairly > > > > easy to get OOMs when the delay interval was long (5 seconds). The > > > > OOM probability went down pretty sharply as the interval approached > > > > zero, but even at 1 jiffy I could still occasionally trip it, whereas > > > > the pre-deferred-inactivation kernels would never OOM. > > > > > > > > I haven't tested it all that rigorously with Dave's fancy new per-cpu > > > > list design, but I did throw on my silly test setup (see below) and > > > > still got it to OOM once in 3 runs with the shrinker bits turned off. > > > > > > > > > > Ok.. so that implies we still need throttling, but I'm not sure what > > > "fancy percpu list design" refers to. If you have a good test case, I > > > > You might recall that the original versions of deferred inode > > inactivation would set radix tree tags, and a per-AG workqueue function > > would scan that AG's radix tree for tagged inodes and inactivate them. > > Dave observed that letting the cpu caches grow cold led to performance > > problems, and replaced the tagging mechanism with per-cpu lists, which > > is what we have now. > > > > > think the interesting immediate question is: are those OOMs avoided with > > > this patch but the shrinker infrastructure still in place? > > > > That's the $64000 question that I don't know definitively yet. > > > > Ok. If the answer to the above questions is "don't really know, it might > be broken, but historical testing shows the shrinker is much less > important with smaller queuing delays such that it's unlikely to cause > any new problems on anything but the most pathological workloads," then > that sounds reasonable to me. Yep, pretty much. :/ > I don't mean to harp on it. I just want to make sure the > question/concern gets across clearly enough that somebody who knows this > system better than I can grok it and make a more intelligent assessment. Oh, I'm glad you're harping on it -- anything involving shrinkers and deferred work ought to get a closer look. --D > Brian > > > > If not, then I wonder if something is going wonky there. If so, I'm > > > still a bit curious what the behavior looks like and whether it can be > > > simplified in light of this change, but that's certainly beyond the > > > scope of this patch. > > > > <nod> For now I'd like to get this going for 5.19 to fix the original > > complaint about statfs blocking in D state while waiting for inodegc > > after deleting a ~100m file extents[1], and defer the decision of > > whether or not we /really/ need the shrinker to a later time (like > > 5.20). > > > > [1] https://lore.kernel.org/linux-xfs/20220509024659.GA62606@onthe.net.au/ > > > > > > > > > > > > > If > > > > > > > > so, it looks like decent enough overhead to cycle through every cpu in > > > > > > > > both callbacks that it might be worth spelling out more clearly in the > > > > > > > > top-level comment. > > > > > > > > > > > > > > I'm not sure what you are asking here - mod_delayed_work_on() has > > > > > > > pretty much the same overhead and behaviour as queue_work() in this > > > > > > > case, so... ? > > > > > > > > > > > > > > > > I'm just pointing out that the comment around the shrinker > > > > > infrastructure isn't very informative if the shrinker turns out to still > > > > > be necessary for reasons other than making the workers run sooner. > > > > > > > > <nod> That comment /does/ need to be updated to note the subtlety that a > > > > lot of shrinker activity can slow down close()ing a file by making user > > > > tasks wait for the inodegc workers to clear the backlog. > > > > > > > > > > <shrug> Looks ok to me, since djwong-dev has had some variant of timer > > > > > > delayed inactivation in it longer than it hasn't: > > > > > > > > > > > > > > > > Was that with a correspondingly small delay or something larger (on the > > > > > order of seconds or so)? Either way, it sounds like you have a > > > > > predictable enough workload that can actually test this continues to > > > > > work as expected..? > > > > > > > > Yeah. I snapshot /home (which has ~20 million inodes) then race > > > > fsstress and xfs_scrub -n in a VM with 560MB of RAM. > > > > > > > > > > Yeah small delay or yeah large delay? > > > > Both -- with large inactivation delays and no shrinker, OOMs happen > > fairly frequently; with a short delay and no shrinker, they're harder > > (but still not impossible) to trigger. > > > > Granted ... even the overall frequency of OOMs with large inactivation > > delays seems to have gone down a bit from when I was more actively > > testing in the 5.9 era. > > > > --D > > > > > Brian > > > > > > > --D > > > > > > > > > Brian > > > > > > > > > > > Reviewed-by: Darrick J. Wong <djwong@kernel.org> > > > > > > > > > > > > --D > > > > > > > > > > > > > Cheers, > > > > > > > > > > > > > > Dave. > > > > > > > -- > > > > > > > Dave Chinner > > > > > > > david@fromorbit.com > > > > > > > > > > > > > > > > > > > > > ^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH 2/2] xfs: introduce xfs_inodegc_push() 2022-06-15 22:04 [PATCH 0/2 V2] xfs: xfs: non-blocking inodegc pushes Dave Chinner 2022-06-15 22:04 ` [PATCH 1/2] xfs: bound maximum wait time for inodegc work Dave Chinner @ 2022-06-15 22:04 ` Dave Chinner 2022-06-22 5:21 ` Darrick J. Wong 1 sibling, 1 reply; 16+ messages in thread From: Dave Chinner @ 2022-06-15 22:04 UTC (permalink / raw) To: linux-xfs From: Dave Chinner <dchinner@redhat.com> The current blocking mechanism for pushing the inodegc queue out to disk can result in systems becoming unusable when there is a long running inodegc operation. This is because the statfs() implementation currently issues a blocking flush of the inodegc queue and a significant number of common system utilities will call statfs() to discover something about the underlying filesystem. This can result in userspace operations getting stuck on inodegc progress, and when trying to remove a heavily reflinked file on slow storage with a full journal, this can result in delays measuring in hours. Avoid this problem by adding "push" function that expedites the flushing of the inodegc queue, but doesn't wait for it to complete. Convert xfs_fs_statfs() and xfs_qm_scall_getquota() to use this mechanism so they don't block but still ensure that queued operations are expedited. Fixes: ab23a7768739 ("xfs: per-cpu deferred inode inactivation queues") Reported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Dave Chinner <dchinner@redhat.com> --- fs/xfs/xfs_icache.c | 20 +++++++++++++++----- fs/xfs/xfs_icache.h | 1 + fs/xfs/xfs_qm_syscalls.c | 7 +++++-- fs/xfs/xfs_super.c | 7 +++++-- fs/xfs/xfs_trace.h | 1 + 5 files changed, 27 insertions(+), 9 deletions(-) diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c index 46b30ecf498c..aef4097ffd3e 100644 --- a/fs/xfs/xfs_icache.c +++ b/fs/xfs/xfs_icache.c @@ -1865,19 +1865,29 @@ xfs_inodegc_worker( } /* - * Force all currently queued inode inactivation work to run immediately and - * wait for the work to finish. + * Expedite all pending inodegc work to run immediately. This does not wait for + * completion of the work. */ void -xfs_inodegc_flush( +xfs_inodegc_push( struct xfs_mount *mp) { if (!xfs_is_inodegc_enabled(mp)) return; + trace_xfs_inodegc_push(mp, __return_address); + xfs_inodegc_queue_all(mp); +} +/* + * Force all currently queued inode inactivation work to run immediately and + * wait for the work to finish. + */ +void +xfs_inodegc_flush( + struct xfs_mount *mp) +{ + xfs_inodegc_push(mp); trace_xfs_inodegc_flush(mp, __return_address); - - xfs_inodegc_queue_all(mp); flush_workqueue(mp->m_inodegc_wq); } diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h index 2e4cfddf8b8e..6cd180721659 100644 --- a/fs/xfs/xfs_icache.h +++ b/fs/xfs/xfs_icache.h @@ -76,6 +76,7 @@ void xfs_blockgc_stop(struct xfs_mount *mp); void xfs_blockgc_start(struct xfs_mount *mp); void xfs_inodegc_worker(struct work_struct *work); +void xfs_inodegc_push(struct xfs_mount *mp); void xfs_inodegc_flush(struct xfs_mount *mp); void xfs_inodegc_stop(struct xfs_mount *mp); void xfs_inodegc_start(struct xfs_mount *mp); diff --git a/fs/xfs/xfs_qm_syscalls.c b/fs/xfs/xfs_qm_syscalls.c index 74ac9ca9e119..a30f4d067746 100644 --- a/fs/xfs/xfs_qm_syscalls.c +++ b/fs/xfs/xfs_qm_syscalls.c @@ -454,9 +454,12 @@ xfs_qm_scall_getquota( struct xfs_dquot *dqp; int error; - /* Flush inodegc work at the start of a quota reporting scan. */ + /* + * Expedite pending inodegc work at the start of a quota reporting + * scan but don't block waiting for it to complete. + */ if (id == 0) - xfs_inodegc_flush(mp); + xfs_inodegc_push(mp); /* * Try to get the dquot. We don't want it allocated on disk, so don't diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c index 651ae75a7e23..4edee1d3784a 100644 --- a/fs/xfs/xfs_super.c +++ b/fs/xfs/xfs_super.c @@ -798,8 +798,11 @@ xfs_fs_statfs( xfs_extlen_t lsize; int64_t ffree; - /* Wait for whatever inactivations are in progress. */ - xfs_inodegc_flush(mp); + /* + * Expedite background inodegc but don't wait. We do not want to block + * here waiting hours for a billion extent file to be truncated. + */ + xfs_inodegc_push(mp); statp->f_type = XFS_SUPER_MAGIC; statp->f_namelen = MAXNAMELEN - 1; diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h index d32026585c1b..0fa1b7a2918c 100644 --- a/fs/xfs/xfs_trace.h +++ b/fs/xfs/xfs_trace.h @@ -240,6 +240,7 @@ DEFINE_EVENT(xfs_fs_class, name, \ TP_PROTO(struct xfs_mount *mp, void *caller_ip), \ TP_ARGS(mp, caller_ip)) DEFINE_FS_EVENT(xfs_inodegc_flush); +DEFINE_FS_EVENT(xfs_inodegc_push); DEFINE_FS_EVENT(xfs_inodegc_start); DEFINE_FS_EVENT(xfs_inodegc_stop); DEFINE_FS_EVENT(xfs_inodegc_queue); -- 2.35.1 ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH 2/2] xfs: introduce xfs_inodegc_push() 2022-06-15 22:04 ` [PATCH 2/2] xfs: introduce xfs_inodegc_push() Dave Chinner @ 2022-06-22 5:21 ` Darrick J. Wong 0 siblings, 0 replies; 16+ messages in thread From: Darrick J. Wong @ 2022-06-22 5:21 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs On Thu, Jun 16, 2022 at 08:04:16AM +1000, Dave Chinner wrote: > From: Dave Chinner <dchinner@redhat.com> > > The current blocking mechanism for pushing the inodegc queue out to > disk can result in systems becoming unusable when there is a long > running inodegc operation. This is because the statfs() > implementation currently issues a blocking flush of the inodegc > queue and a significant number of common system utilities will call > statfs() to discover something about the underlying filesystem. > > This can result in userspace operations getting stuck on inodegc > progress, and when trying to remove a heavily reflinked file on slow > storage with a full journal, this can result in delays measuring in > hours. > > Avoid this problem by adding "push" function that expedites the > flushing of the inodegc queue, but doesn't wait for it to complete. > > Convert xfs_fs_statfs() and xfs_qm_scall_getquota() to use this > mechanism so they don't block but still ensure that queued > operations are expedited. > > Fixes: ab23a7768739 ("xfs: per-cpu deferred inode inactivation queues") > Reported-by: Chris Dunlop <chris@onthe.net.au> > Signed-off-by: Dave Chinner <dchinner@redhat.com> Looks good, Reviewed-by: Darrick J. Wong <djwong@kernel.org> --D > --- > fs/xfs/xfs_icache.c | 20 +++++++++++++++----- > fs/xfs/xfs_icache.h | 1 + > fs/xfs/xfs_qm_syscalls.c | 7 +++++-- > fs/xfs/xfs_super.c | 7 +++++-- > fs/xfs/xfs_trace.h | 1 + > 5 files changed, 27 insertions(+), 9 deletions(-) > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > index 46b30ecf498c..aef4097ffd3e 100644 > --- a/fs/xfs/xfs_icache.c > +++ b/fs/xfs/xfs_icache.c > @@ -1865,19 +1865,29 @@ xfs_inodegc_worker( > } > > /* > - * Force all currently queued inode inactivation work to run immediately and > - * wait for the work to finish. > + * Expedite all pending inodegc work to run immediately. This does not wait for > + * completion of the work. > */ > void > -xfs_inodegc_flush( > +xfs_inodegc_push( > struct xfs_mount *mp) > { > if (!xfs_is_inodegc_enabled(mp)) > return; > + trace_xfs_inodegc_push(mp, __return_address); > + xfs_inodegc_queue_all(mp); > +} > > +/* > + * Force all currently queued inode inactivation work to run immediately and > + * wait for the work to finish. > + */ > +void > +xfs_inodegc_flush( > + struct xfs_mount *mp) > +{ > + xfs_inodegc_push(mp); > trace_xfs_inodegc_flush(mp, __return_address); > - > - xfs_inodegc_queue_all(mp); > flush_workqueue(mp->m_inodegc_wq); > } > > diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h > index 2e4cfddf8b8e..6cd180721659 100644 > --- a/fs/xfs/xfs_icache.h > +++ b/fs/xfs/xfs_icache.h > @@ -76,6 +76,7 @@ void xfs_blockgc_stop(struct xfs_mount *mp); > void xfs_blockgc_start(struct xfs_mount *mp); > > void xfs_inodegc_worker(struct work_struct *work); > +void xfs_inodegc_push(struct xfs_mount *mp); > void xfs_inodegc_flush(struct xfs_mount *mp); > void xfs_inodegc_stop(struct xfs_mount *mp); > void xfs_inodegc_start(struct xfs_mount *mp); > diff --git a/fs/xfs/xfs_qm_syscalls.c b/fs/xfs/xfs_qm_syscalls.c > index 74ac9ca9e119..a30f4d067746 100644 > --- a/fs/xfs/xfs_qm_syscalls.c > +++ b/fs/xfs/xfs_qm_syscalls.c > @@ -454,9 +454,12 @@ xfs_qm_scall_getquota( > struct xfs_dquot *dqp; > int error; > > - /* Flush inodegc work at the start of a quota reporting scan. */ > + /* > + * Expedite pending inodegc work at the start of a quota reporting > + * scan but don't block waiting for it to complete. > + */ > if (id == 0) > - xfs_inodegc_flush(mp); > + xfs_inodegc_push(mp); > > /* > * Try to get the dquot. We don't want it allocated on disk, so don't > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c > index 651ae75a7e23..4edee1d3784a 100644 > --- a/fs/xfs/xfs_super.c > +++ b/fs/xfs/xfs_super.c > @@ -798,8 +798,11 @@ xfs_fs_statfs( > xfs_extlen_t lsize; > int64_t ffree; > > - /* Wait for whatever inactivations are in progress. */ > - xfs_inodegc_flush(mp); > + /* > + * Expedite background inodegc but don't wait. We do not want to block > + * here waiting hours for a billion extent file to be truncated. > + */ > + xfs_inodegc_push(mp); > > statp->f_type = XFS_SUPER_MAGIC; > statp->f_namelen = MAXNAMELEN - 1; > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h > index d32026585c1b..0fa1b7a2918c 100644 > --- a/fs/xfs/xfs_trace.h > +++ b/fs/xfs/xfs_trace.h > @@ -240,6 +240,7 @@ DEFINE_EVENT(xfs_fs_class, name, \ > TP_PROTO(struct xfs_mount *mp, void *caller_ip), \ > TP_ARGS(mp, caller_ip)) > DEFINE_FS_EVENT(xfs_inodegc_flush); > +DEFINE_FS_EVENT(xfs_inodegc_push); > DEFINE_FS_EVENT(xfs_inodegc_start); > DEFINE_FS_EVENT(xfs_inodegc_stop); > DEFINE_FS_EVENT(xfs_inodegc_queue); > -- > 2.35.1 > ^ permalink raw reply [flat|nested] 16+ messages in thread
* [RFC PATCH 0/2] xfs: non-blocking inodegc pushes @ 2022-05-24 6:38 Dave Chinner 2022-05-24 6:38 ` [PATCH 1/2] xfs: bound maximum wait time for inodegc work Dave Chinner 0 siblings, 1 reply; 16+ messages in thread From: Dave Chinner @ 2022-05-24 6:38 UTC (permalink / raw) To: linux-xfs; +Cc: chris Hi folks, I've had time to forward port the non-blocking inodegc push changes I had in a different LOD to the current for-next tree. I've run it through fstests auto group a couple of times and it hasn't caused any space accounting related failures on the machines I've run it on. The first patch introduces the bound maximum work start time for the inodegc queues - it's short, only 10ms (IIRC), but we don't want to delay inodegc for an arbitrarily long period of time. However, it means that work always starts quickly and so that reduces the need for statfs() to have to wait for background inodegc to start and complete to catch space "freed" by recent unlinks. The second patch converts statfs to use a "push" rather than a "flush". The push simply schedules any pending work that hasn't yet timed out to run immediately and returns. It does not wait for the inodegc work to complete - that's what a flush does, and that's what caused all the problems for statfs(). Hence statfs() is converted to push semantics at the same time, thereby removing the blocking behaviour it currently has. This should prevent a large amount of the issues that Chris has been seeing with lots of processes stuck in statfs() - that will no long happen. The only time user processes should get stuck now is when the inodegc throttle kicks in (unlinks only at this point) or if we are waiting for a lock a long running inodegc operation holds to be released. We had those specific problems before background inodegc - they manifested as unkillable unlink operations that had every backed up on them instead of background inodegc that has everything backed up on them. Hence I think these patches largely restore the status quo that we had before the background inodegc code was added. Comments, thoughts and testing appreciated. Cheers, Dave. ^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH 1/2] xfs: bound maximum wait time for inodegc work 2022-05-24 6:38 [RFC PATCH 0/2] xfs: non-blocking inodegc pushes Dave Chinner @ 2022-05-24 6:38 ` Dave Chinner 2022-05-24 16:54 ` Darrick J. Wong 0 siblings, 1 reply; 16+ messages in thread From: Dave Chinner @ 2022-05-24 6:38 UTC (permalink / raw) To: linux-xfs; +Cc: chris From: Dave Chinner <dchinner@redhat.com> Currently inodegc work can sit queued on the per-cpu queue until the workqueue is either flushed of the queue reaches a depth that triggers work queuing (and later throttling). This means that we could queue work that waits for a long time for some other event to trigger flushing. Hence instead of just queueing work at a specific depth, use a delayed work that queues the work at a bound time. We can still schedule the work immediately at a given depth, but we no long need to worry about leaving a number of items on the list that won't get processed until external events prevail. Signed-off-by: Dave Chinner <dchinner@redhat.com> --- fs/xfs/xfs_icache.c | 36 ++++++++++++++++++++++-------------- fs/xfs/xfs_mount.h | 2 +- fs/xfs/xfs_super.c | 2 +- 3 files changed, 24 insertions(+), 16 deletions(-) diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c index 5269354b1b69..786702273621 100644 --- a/fs/xfs/xfs_icache.c +++ b/fs/xfs/xfs_icache.c @@ -440,7 +440,7 @@ xfs_inodegc_queue_all( for_each_online_cpu(cpu) { gc = per_cpu_ptr(mp->m_inodegc, cpu); if (!llist_empty(&gc->list)) - queue_work_on(cpu, mp->m_inodegc_wq, &gc->work); + mod_delayed_work_on(cpu, mp->m_inodegc_wq, &gc->work, 0); } } @@ -1841,8 +1841,8 @@ void xfs_inodegc_worker( struct work_struct *work) { - struct xfs_inodegc *gc = container_of(work, struct xfs_inodegc, - work); + struct xfs_inodegc *gc = container_of(to_delayed_work(work), + struct xfs_inodegc, work); struct llist_node *node = llist_del_all(&gc->list); struct xfs_inode *ip, *n; @@ -2014,6 +2014,7 @@ xfs_inodegc_queue( struct xfs_inodegc *gc; int items; unsigned int shrinker_hits; + unsigned long queue_delay = 1; trace_xfs_inode_set_need_inactive(ip); spin_lock(&ip->i_flags_lock); @@ -2025,19 +2026,26 @@ xfs_inodegc_queue( items = READ_ONCE(gc->items); WRITE_ONCE(gc->items, items + 1); shrinker_hits = READ_ONCE(gc->shrinker_hits); - put_cpu_ptr(gc); - if (!xfs_is_inodegc_enabled(mp)) + /* + * We queue the work while holding the current CPU so that the work + * is scheduled to run on this CPU. + */ + if (!xfs_is_inodegc_enabled(mp)) { + put_cpu_ptr(gc); return; - - if (xfs_inodegc_want_queue_work(ip, items)) { - trace_xfs_inodegc_queue(mp, __return_address); - queue_work(mp->m_inodegc_wq, &gc->work); } + if (xfs_inodegc_want_queue_work(ip, items)) + queue_delay = 0; + + trace_xfs_inodegc_queue(mp, __return_address); + mod_delayed_work(mp->m_inodegc_wq, &gc->work, queue_delay); + put_cpu_ptr(gc); + if (xfs_inodegc_want_flush_work(ip, items, shrinker_hits)) { trace_xfs_inodegc_throttle(mp, __return_address); - flush_work(&gc->work); + flush_delayed_work(&gc->work); } } @@ -2054,7 +2062,7 @@ xfs_inodegc_cpu_dead( unsigned int count = 0; dead_gc = per_cpu_ptr(mp->m_inodegc, dead_cpu); - cancel_work_sync(&dead_gc->work); + cancel_delayed_work_sync(&dead_gc->work); if (llist_empty(&dead_gc->list)) return; @@ -2073,12 +2081,12 @@ xfs_inodegc_cpu_dead( llist_add_batch(first, last, &gc->list); count += READ_ONCE(gc->items); WRITE_ONCE(gc->items, count); - put_cpu_ptr(gc); if (xfs_is_inodegc_enabled(mp)) { trace_xfs_inodegc_queue(mp, __return_address); - queue_work(mp->m_inodegc_wq, &gc->work); + mod_delayed_work(mp->m_inodegc_wq, &gc->work, 0); } + put_cpu_ptr(gc); } /* @@ -2173,7 +2181,7 @@ xfs_inodegc_shrinker_scan( unsigned int h = READ_ONCE(gc->shrinker_hits); WRITE_ONCE(gc->shrinker_hits, h + 1); - queue_work_on(cpu, mp->m_inodegc_wq, &gc->work); + mod_delayed_work_on(cpu, mp->m_inodegc_wq, &gc->work, 0); no_items = false; } } diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h index 8c42786e4942..377c5e59f6a0 100644 --- a/fs/xfs/xfs_mount.h +++ b/fs/xfs/xfs_mount.h @@ -61,7 +61,7 @@ struct xfs_error_cfg { */ struct xfs_inodegc { struct llist_head list; - struct work_struct work; + struct delayed_work work; /* approximate count of inodes in the list */ unsigned int items; diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c index 51ce127a0cc6..62f6b97355a2 100644 --- a/fs/xfs/xfs_super.c +++ b/fs/xfs/xfs_super.c @@ -1073,7 +1073,7 @@ xfs_inodegc_init_percpu( gc = per_cpu_ptr(mp->m_inodegc, cpu); init_llist_head(&gc->list); gc->items = 0; - INIT_WORK(&gc->work, xfs_inodegc_worker); + INIT_DELAYED_WORK(&gc->work, xfs_inodegc_worker); } return 0; } -- 2.35.1 ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH 1/2] xfs: bound maximum wait time for inodegc work 2022-05-24 6:38 ` [PATCH 1/2] xfs: bound maximum wait time for inodegc work Dave Chinner @ 2022-05-24 16:54 ` Darrick J. Wong 2022-05-24 23:03 ` Dave Chinner 0 siblings, 1 reply; 16+ messages in thread From: Darrick J. Wong @ 2022-05-24 16:54 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs, chris On Tue, May 24, 2022 at 04:38:01PM +1000, Dave Chinner wrote: > From: Dave Chinner <dchinner@redhat.com> > > Currently inodegc work can sit queued on the per-cpu queue until > the workqueue is either flushed of the queue reaches a depth that > triggers work queuing (and later throttling). This means that we > could queue work that waits for a long time for some other event to > trigger flushing. > > Hence instead of just queueing work at a specific depth, use a > delayed work that queues the work at a bound time. We can still > schedule the work immediately at a given depth, but we no long need Nit: "no longer need..." > to worry about leaving a number of items on the list that won't get > processed until external events prevail. > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > --- > fs/xfs/xfs_icache.c | 36 ++++++++++++++++++++++-------------- > fs/xfs/xfs_mount.h | 2 +- > fs/xfs/xfs_super.c | 2 +- > 3 files changed, 24 insertions(+), 16 deletions(-) > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > index 5269354b1b69..786702273621 100644 > --- a/fs/xfs/xfs_icache.c > +++ b/fs/xfs/xfs_icache.c > @@ -440,7 +440,7 @@ xfs_inodegc_queue_all( > for_each_online_cpu(cpu) { > gc = per_cpu_ptr(mp->m_inodegc, cpu); > if (!llist_empty(&gc->list)) > - queue_work_on(cpu, mp->m_inodegc_wq, &gc->work); > + mod_delayed_work_on(cpu, mp->m_inodegc_wq, &gc->work, 0); > } > } > > @@ -1841,8 +1841,8 @@ void > xfs_inodegc_worker( > struct work_struct *work) > { > - struct xfs_inodegc *gc = container_of(work, struct xfs_inodegc, > - work); > + struct xfs_inodegc *gc = container_of(to_delayed_work(work), > + struct xfs_inodegc, work); > struct llist_node *node = llist_del_all(&gc->list); > struct xfs_inode *ip, *n; > > @@ -2014,6 +2014,7 @@ xfs_inodegc_queue( > struct xfs_inodegc *gc; > int items; > unsigned int shrinker_hits; > + unsigned long queue_delay = 1; A default delay of one clock tick, correct? Just out of curiosity, how does this shake out wrt fstests that do a thing and then measure free space? I have a dim recollection of a bug that I found in one of the preproduction iterations of inodegc back when I used delayed_work to schedule the background gc. If memory serves, calling mod_delayed_work on a delayed_work object that is currently running does /not/ cause the delayed_work object to be requeued, even if delay==0. Aha, I found a description in my notes. I've adapted them to the current patchset, since in those days inodegc used a radix tree tag and per-AG workers instead of a locklesslist and per-cpu workers. If the following occurs: Worker 1 Thread 2 xfs_inodegc_worker <starts up> node = llist_del_all() <start processing inodes> <block on something, yield> xfs_irele xfs_inode_mark_reclaimable llist_add mod_delayed_work() <exit> <process the rest of nodelist> return Then at the end of this sequence, we'll end up with thread 2's inode queued to the gc list but the delayed work is /not/ queued. That inode remains on the gc list (and unprocessed) until someone comes along to push that CPU's gc list, whether it's a statfs, or an unmount, or someone hitting ENOSPC and triggering blockgc. I observed this bug while digging into online repair occasionally stalling for a long time or erroring out during inode scans. If you'll recall, earlier inodegc iterations allowed iget to recycle inodes that were queued for inactivation, but later iterations didn't, so it became the responsibility of the online repair's inode scanner to push the inodegc workers when iget found an inode that was queued for inactivation. (The current online repair inode scanner is smarter in the sense that it will try inodegc_flush a few times before backing out to userspace, and if it does, xfs_scrub will generally requeue the entire scrub operation.) --D > > trace_xfs_inode_set_need_inactive(ip); > spin_lock(&ip->i_flags_lock); > @@ -2025,19 +2026,26 @@ xfs_inodegc_queue( > items = READ_ONCE(gc->items); > WRITE_ONCE(gc->items, items + 1); > shrinker_hits = READ_ONCE(gc->shrinker_hits); > - put_cpu_ptr(gc); > > - if (!xfs_is_inodegc_enabled(mp)) > + /* > + * We queue the work while holding the current CPU so that the work > + * is scheduled to run on this CPU. > + */ > + if (!xfs_is_inodegc_enabled(mp)) { > + put_cpu_ptr(gc); > return; > - > - if (xfs_inodegc_want_queue_work(ip, items)) { > - trace_xfs_inodegc_queue(mp, __return_address); > - queue_work(mp->m_inodegc_wq, &gc->work); > } > > + if (xfs_inodegc_want_queue_work(ip, items)) > + queue_delay = 0; > + > + trace_xfs_inodegc_queue(mp, __return_address); > + mod_delayed_work(mp->m_inodegc_wq, &gc->work, queue_delay); > + put_cpu_ptr(gc); > + > if (xfs_inodegc_want_flush_work(ip, items, shrinker_hits)) { > trace_xfs_inodegc_throttle(mp, __return_address); > - flush_work(&gc->work); > + flush_delayed_work(&gc->work); > } > } > > @@ -2054,7 +2062,7 @@ xfs_inodegc_cpu_dead( > unsigned int count = 0; > > dead_gc = per_cpu_ptr(mp->m_inodegc, dead_cpu); > - cancel_work_sync(&dead_gc->work); > + cancel_delayed_work_sync(&dead_gc->work); > > if (llist_empty(&dead_gc->list)) > return; > @@ -2073,12 +2081,12 @@ xfs_inodegc_cpu_dead( > llist_add_batch(first, last, &gc->list); > count += READ_ONCE(gc->items); > WRITE_ONCE(gc->items, count); > - put_cpu_ptr(gc); > > if (xfs_is_inodegc_enabled(mp)) { > trace_xfs_inodegc_queue(mp, __return_address); > - queue_work(mp->m_inodegc_wq, &gc->work); > + mod_delayed_work(mp->m_inodegc_wq, &gc->work, 0); > } > + put_cpu_ptr(gc); > } > > /* > @@ -2173,7 +2181,7 @@ xfs_inodegc_shrinker_scan( > unsigned int h = READ_ONCE(gc->shrinker_hits); > > WRITE_ONCE(gc->shrinker_hits, h + 1); > - queue_work_on(cpu, mp->m_inodegc_wq, &gc->work); > + mod_delayed_work_on(cpu, mp->m_inodegc_wq, &gc->work, 0); > no_items = false; > } > } > diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h > index 8c42786e4942..377c5e59f6a0 100644 > --- a/fs/xfs/xfs_mount.h > +++ b/fs/xfs/xfs_mount.h > @@ -61,7 +61,7 @@ struct xfs_error_cfg { > */ > struct xfs_inodegc { > struct llist_head list; > - struct work_struct work; > + struct delayed_work work; > > /* approximate count of inodes in the list */ > unsigned int items; > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c > index 51ce127a0cc6..62f6b97355a2 100644 > --- a/fs/xfs/xfs_super.c > +++ b/fs/xfs/xfs_super.c > @@ -1073,7 +1073,7 @@ xfs_inodegc_init_percpu( > gc = per_cpu_ptr(mp->m_inodegc, cpu); > init_llist_head(&gc->list); > gc->items = 0; > - INIT_WORK(&gc->work, xfs_inodegc_worker); > + INIT_DELAYED_WORK(&gc->work, xfs_inodegc_worker); > } > return 0; > } > -- > 2.35.1 > ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/2] xfs: bound maximum wait time for inodegc work 2022-05-24 16:54 ` Darrick J. Wong @ 2022-05-24 23:03 ` Dave Chinner 0 siblings, 0 replies; 16+ messages in thread From: Dave Chinner @ 2022-05-24 23:03 UTC (permalink / raw) To: Darrick J. Wong; +Cc: linux-xfs, chris On Tue, May 24, 2022 at 09:54:51AM -0700, Darrick J. Wong wrote: > On Tue, May 24, 2022 at 04:38:01PM +1000, Dave Chinner wrote: > > From: Dave Chinner <dchinner@redhat.com> > > > > Currently inodegc work can sit queued on the per-cpu queue until > > the workqueue is either flushed of the queue reaches a depth that > > triggers work queuing (and later throttling). This means that we > > could queue work that waits for a long time for some other event to > > trigger flushing. > > > > Hence instead of just queueing work at a specific depth, use a > > delayed work that queues the work at a bound time. We can still > > schedule the work immediately at a given depth, but we no long need > > Nit: "no longer need..." > > > to worry about leaving a number of items on the list that won't get > > processed until external events prevail. > > > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > > --- > > fs/xfs/xfs_icache.c | 36 ++++++++++++++++++++++-------------- > > fs/xfs/xfs_mount.h | 2 +- > > fs/xfs/xfs_super.c | 2 +- > > 3 files changed, 24 insertions(+), 16 deletions(-) > > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > > index 5269354b1b69..786702273621 100644 > > --- a/fs/xfs/xfs_icache.c > > +++ b/fs/xfs/xfs_icache.c > > @@ -440,7 +440,7 @@ xfs_inodegc_queue_all( > > for_each_online_cpu(cpu) { > > gc = per_cpu_ptr(mp->m_inodegc, cpu); > > if (!llist_empty(&gc->list)) > > - queue_work_on(cpu, mp->m_inodegc_wq, &gc->work); > > + mod_delayed_work_on(cpu, mp->m_inodegc_wq, &gc->work, 0); > > } > > } > > > > @@ -1841,8 +1841,8 @@ void > > xfs_inodegc_worker( > > struct work_struct *work) > > { > > - struct xfs_inodegc *gc = container_of(work, struct xfs_inodegc, > > - work); > > + struct xfs_inodegc *gc = container_of(to_delayed_work(work), > > + struct xfs_inodegc, work); > > struct llist_node *node = llist_del_all(&gc->list); > > struct xfs_inode *ip, *n; > > > > @@ -2014,6 +2014,7 @@ xfs_inodegc_queue( > > struct xfs_inodegc *gc; > > int items; > > unsigned int shrinker_hits; > > + unsigned long queue_delay = 1; > > A default delay of one clock tick, correct? > > Just out of curiosity, how does this shake out wrt fstests that do a > thing and then measure free space? No regressions on a 5.18+for-next kernel on the two machines (one ramdisk, one SSD) I ran it on yesterday. The runs were clean, which is why I posted it for comments. > I have a dim recollection of a bug that I found in one of the > preproduction iterations of inodegc back when I used delayed_work to > schedule the background gc. If memory serves, calling mod_delayed_work > on a delayed_work object that is currently running does /not/ cause the > delayed_work object to be requeued, even if delay==0. I don't think that is correct - I actually went through the code to check this because I wanted to be certain that it behaved the way I needed it to. Indeed, the documented behaviour of mod_delayed_work_on() is: * If @dwork is idle, equivalent to queue_delayed_work_on(); otherwise, * modify @dwork's timer so that it expires after @delay. If @delay is * zero, @work is guaranteed to be scheduled immediately regardless of its * current state. In terms of the implementation, try_to_grab_pending() will grab the delayed work and set/steal the WORK_STRUCT_PENDING_BIT, and mod_delayed_work_on() will loop until it owns the bit or the dwork is cancelled. Once it owns the PENDING bit, it will call __queue_delayed_work(), which either queues the work immediately (delay = 0) or sets up a timer to expire in delay ticks. The PENDING bit is cleared by the kworker thread before it calls the work->current_func() to execute the work, so if the work is currenlty running, try_to_grab_pending() will be able to set/steal the WORK_STRUCT_PENDING_BIT without issues, and so even if the work is currently running, we should be able queue it again via mod_delayed_work_on(). So, AFAICT, the comment and behaviour match, and mod_delayed_work() will result in queuing of the dwork even if it is currently running. > Aha, I found a description in my notes. I've adapted them to the > current patchset, since in those days inodegc used a radix tree tag > and per-AG workers instead of a locklesslist and per-cpu workers. > If the following occurs: > > Worker 1 Thread 2 > > xfs_inodegc_worker > <starts up> > node = llist_del_all() > <start processing inodes> > <block on something, yield> > xfs_irele > xfs_inode_mark_reclaimable > llist_add > mod_delayed_work() > <exit> > <process the rest of nodelist> > return > > Then at the end of this sequence, we'll end up with thread 2's inode > queued to the gc list but the delayed work is /not/ queued. That inode > remains on the gc list (and unprocessed) until someone comes along to > push that CPU's gc list, whether it's a statfs, or an unmount, or > someone hitting ENOSPC and triggering blockgc. Right, if mod_delayed_work() didn't queue the work then this would be an issue, but AFAICT mod_delayed_work() will requeue in this case and it will not get hung up in this case. I certainly haven't seen any evidence that it's not working as I expected (so far). Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2022-06-25 1:04 UTC | newest] Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2022-06-15 22:04 [PATCH 0/2 V2] xfs: xfs: non-blocking inodegc pushes Dave Chinner 2022-06-15 22:04 ` [PATCH 1/2] xfs: bound maximum wait time for inodegc work Dave Chinner 2022-06-17 16:34 ` Brian Foster 2022-06-17 21:52 ` Dave Chinner 2022-06-22 5:20 ` Darrick J. Wong 2022-06-22 16:13 ` Brian Foster 2022-06-23 0:25 ` Darrick J. Wong 2022-06-23 11:49 ` Brian Foster 2022-06-23 19:56 ` Darrick J. Wong 2022-06-24 12:39 ` Brian Foster 2022-06-25 1:03 ` Darrick J. Wong 2022-06-15 22:04 ` [PATCH 2/2] xfs: introduce xfs_inodegc_push() Dave Chinner 2022-06-22 5:21 ` Darrick J. Wong -- strict thread matches above, loose matches on Subject: below -- 2022-05-24 6:38 [RFC PATCH 0/2] xfs: non-blocking inodegc pushes Dave Chinner 2022-05-24 6:38 ` [PATCH 1/2] xfs: bound maximum wait time for inodegc work Dave Chinner 2022-05-24 16:54 ` Darrick J. Wong 2022-05-24 23:03 ` Dave Chinner
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.