linux-xfs.vger.kernel.org archive mirror
From: Dave Chinner <david@fromorbit.com>
To: "Darrick J. Wong" <djwong@kernel.org>
Cc: Chris Dunlop <chris@onthe.net.au>,
	Amir Goldstein <amir73il@gmail.com>,
	linux-xfs <linux-xfs@vger.kernel.org>
Subject: Re: Highly reflinked and fragmented considered harmful?
Date: Wed, 11 May 2022 11:36:54 +1000	[thread overview]
Message-ID: <20220511013654.GC1098723@dread.disaster.area> (raw)
In-Reply-To: <20220511003708.GA27195@magnolia>

On Tue, May 10, 2022 at 05:37:08PM -0700, Darrick J. Wong wrote:
> On Wed, May 11, 2022 at 07:54:11AM +1000, Dave Chinner wrote:
> > On Tue, May 10, 2022 at 12:19:18PM -0700, Darrick J. Wong wrote:
> > > On Tue, May 10, 2022 at 06:16:32PM +1000, Dave Chinner wrote:
> > > > I think we should just accept that statfs() can never really report
> > > > exactly accurate space usage to userspace and get rid of the flush.
> > > 
> > > What about all the code that flushes gc work when we hit ENOSPC/EDQUOT?
> > > Do we let those threads stall too because the fs is out of resources and
> > > they can just wait?  Or would that also be better off with a flush
> > > timeout and userspace can just eat the ENOSPC/EDQUOT after 30 seconds?
> > 
> > 1. Not an immediate problem we need to solve.
> > 2. flush at enospc/edquot is best effort, so doesn't need to block
> >    waiting on inodegc. the enospc/edquot flush will get repeated
> >    soon enough, so that will take into account progress made by
> >    long running inodegc ops.
> > 3. we leave pending work on the per-cpu queues under the
> >    flush/throttle thresholds indefinitely.
> > 4. to be accurate, statfs() needs to flush #3.
> 
> Yes, that's the state of things currently...
> 
> > 5. While working on the rework of inode reclamation, I converted the
> >    inodegc queues to use delayed works to ensure work starts on
> >    per-cpu queues within 10ms of queueing to avoid #3 causing
> >    problems.
> > 6. I added a non-blocking async flush mechanism that works by
> >    modifying the queue timer to 0 to trigger immediate work
> >    scheduling for anything that hasn't been run.
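
[Editor's note: a minimal sketch of the mechanism described in #5 and #6,
using the stock kernel delayed-work API. The structure and names below are
illustrative, not Dave's actual patches; only the queue_delayed_work()/
mod_delayed_work() semantics matter here.]

	/*
	 * Sketch only: per-cpu inodegc queue driven by a delayed work
	 * so queued inodes are always processed within a bounded delay.
	 */
	#include <linux/jiffies.h>
	#include <linux/workqueue.h>

	struct inodegc_queue {
		struct delayed_work	work;	/* runs the gc worker */
		/* ... per-cpu list of inodes queued for gc ... */
	};

	/*
	 * #5: arm a short timer when work is queued so it always starts
	 * within ~10ms. queue_delayed_work() is a no-op if the work is
	 * already pending, so a stream of queue events cannot keep
	 * pushing the deadline back.
	 */
	static void inodegc_queue_work(struct inodegc_queue *gc)
	{
		queue_delayed_work(system_unbound_wq, &gc->work,
				   msecs_to_jiffies(10));
	}

	/*
	 * #6: non-blocking async flush. Collapsing the timer to zero
	 * schedules anything queued but not yet running immediately;
	 * the caller returns without waiting for completion.
	 */
	static void inodegc_push(struct inodegc_queue *gc)
	{
		mod_delayed_work(system_unbound_wq, &gc->work, 0);
	}
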
> 
> *OH*, you've already built those two things?  Well that makes this
> /much/ easier.  I think all we'd need to fix #4 is an xfs_inodegc_push()
> function that statfs/enospc/edquot can call to do xfs_inodegc_queue_all
> and return immediately.

*nod*.
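
[Editor's note: for reference, a minimal sketch of what such a helper could
look like, written against the existing xfs_inodegc_queue_all() helper in
the current tree; this is an assumption about the shape of the fix, not the
actual patch.]

	/*
	 * Sketch of the proposed xfs_inodegc_push(): schedule all queued
	 * inodegc work immediately and return without waiting, so that
	 * statfs/ENOSPC/EDQUOT callers never block behind a long-running
	 * gc operation. Callers that genuinely need completion would
	 * keep using the blocking xfs_inodegc_flush().
	 */
	void
	xfs_inodegc_push(
		struct xfs_mount	*mp)
	{
		if (!xfs_is_inodegc_enabled(mp))
			return;
		xfs_inodegc_queue_all(mp);
	}
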

I think that's the thing that some people have missed in this
thread - I've known for a while now the scope of the problems that
blocking flushes from statfs() can cause: any issue with background
inodegc not making progress can deadlock the filesystem. I've lost
count of the number of times I've had inodegc go wrong or crash where
the only possible recovery was a power cycle, because nothing could
be executed from the command line.

That's because statfs() appears to be used by some commonly used
library function, and so many utility programs on the system will
get stuck and be unable to run when inodegc dies, deadlocks, or
takes a really long time to make progress.
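
[Editor's note: to make the failure mode concrete - at the time of this
thread, XFS's statfs implementation did a synchronous inodegc flush, so
even a trivial statvfs() caller like the sketch below (the mount path is
illustrative) would hang until gc completed. df(1), file managers, and
shell prompts that show free space all end up in statfs(2) the same way.]

	#include <stdio.h>
	#include <sys/statvfs.h>

	int main(void)
	{
		struct statvfs st;

		/* Does not return until the inodegc flush completes. */
		if (statvfs("/mnt/scratch", &st) != 0) {
			perror("statvfs");
			return 1;
		}
		printf("free blocks: %llu\n",
		       (unsigned long long)st.f_bfree);
		return 0;
	}
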

Hence I didn't need to do any analysis of Chris' system to know
exactly what was going on - I've seen it many, many times myself and
have work in progress that avoids those issues with inodegc work
that never completes.

IOWs, everything is understood, fixes are already written, and
there's no actual threat of data loss or corruption from this issue.
It's just lots of stuff getting stuck behind a long running
operation...

All this report means is that I have to bump the priority of that
work, separate it out into its own series, and make sure it's ready
for the next cycle.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

Thread overview: 16+ messages
2022-05-09  2:46 Highly reflinked and fragmented considered harmful? Chris Dunlop
2022-05-09 23:09 ` Dave Chinner
2022-05-10  2:55   ` Chris Dunlop
2022-05-10  5:14     ` Darrick J. Wong
2022-05-10  4:07   ` Amir Goldstein
2022-05-10  5:10     ` Darrick J. Wong
2022-05-10  6:30       ` Chris Dunlop
2022-05-10  8:16         ` Dave Chinner
2022-05-10 19:19           ` Darrick J. Wong
2022-05-10 21:54             ` Dave Chinner
2022-05-11  0:37               ` Darrick J. Wong
2022-05-11  1:36                 ` Dave Chinner [this message]
2022-05-11  2:16                   ` Chris Dunlop
2022-05-11  2:52                     ` Dave Chinner
2022-05-11  3:58                       ` Chris Dunlop
2022-05-11  5:18                         ` Dave Chinner
