From: "Darrick J. Wong" <djwong@kernel.org>
To: Chris Dunlop <chris@onthe.net.au>
Cc: Dave Chinner <david@fromorbit.com>, linux-xfs@vger.kernel.org
Subject: Re: Highly reflinked and fragmented considered harmful?
Date: Mon, 9 May 2022 22:14:31 -0700
Message-ID: <20220510051431.GZ27195@magnolia>
In-Reply-To: <20220510025541.GA192172@onthe.net.au>

On Tue, May 10, 2022 at 12:55:41PM +1000, Chris Dunlop wrote:
> Hi Dave,
> 
> On Tue, May 10, 2022 at 09:09:18AM +1000, Dave Chinner wrote:
> > On Mon, May 09, 2022 at 12:46:59PM +1000, Chris Dunlop wrote:
> > > Is it to be expected that removing 29TB of highly reflinked and fragmented
> > > data could take days, the entire time blocking other tasks like "rm" and
> > > "df" on the same filesystem?
> ...
> > At some point, you have to pay the price of creating billions of
> > random fine-grained cross references in tens of TBs of data spread
> > across weeks and months of production. You don't notice the scale of
> > the cross-reference because it's taken weeks and months of normal
> > operations to get there. It's only when you finally have to perform
> > an operation that needs to iterate all those references that the
> > scale suddenly becomes apparent. XFS scales to really large numbers
> > without significant degradation, so people don't notice things like
> > object counts or cross references until something like this
> > happens.
> > 
> > I don't think there's much we can do at the filesystem level to help
> > you at this point - the inode output in the transaction dump above
> > indicates that you haven't been using extent size hints to limit
> > fragmentation or extent share/COW sizes, so the damage is already
> > present and we can't really do anything to fix that up.
> 
> Thanks for taking the time to provide a detailed and informative
> exposition; it certainly helps me understand what I'm asking of the fs,
> the areas that deserve more attention, and how to approach analyzing the
> situation.
> 
> At this point I'm about 3 days from completing copying the data (from a
> snapshot of the troubled fs mounted with 'norecovery') over to a brand new
> fs. Unfortunately the new fs is also rmapbt=1, so I'll go through all the
> copying again (under more controlled circumstances) to get onto an
> rmapbt=0 fs (losing the ability to do online repairs whenever that
> arrives - hopefully that won't come back to haunt me).
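
If you do end up copying everything onto a fresh fs again, it'd be worth
setting extent size and CoW extent size hints on the destination
directories before the copy starts, per Dave's point above, so that
allocations (and COW staging) get rounded up to something much coarser
than single blocks.  Untested sketch of the ioctl involved; the 16MB
value is only an example and wants tuning for your workload, and
"xfs_io -c 'extsize 16m' -c 'cowextsize 16m' <dir>" does the same thing
from the command line:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>	/* struct fsxattr, FS_IOC_FS[GS]ETXATTR, FS_XFLAG_* */

/*
 * Set inheritable extent size and CoW extent size hints on a directory so
 * that files created underneath it pick them up.  Sizes are in bytes and
 * must be a multiple of the fs block size.
 */
static int set_hints(const char *dir, unsigned int bytes)
{
	struct fsxattr fsx;
	int fd = open(dir, O_RDONLY | O_DIRECTORY);

	if (fd < 0) {
		perror(dir);
		return -1;
	}
	if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
		goto out_err;

	fsx.fsx_xflags |= FS_XFLAG_EXTSZINHERIT | FS_XFLAG_COWEXTSIZE;
	fsx.fsx_extsize = bytes;	/* allocation size hint */
	fsx.fsx_cowextsize = bytes;	/* COW allocation size hint */

	if (ioctl(fd, FS_IOC_FSSETXATTR, &fsx) < 0)
		goto out_err;
	close(fd);
	return 0;
out_err:
	perror(dir);
	close(fd);
	return -1;
}

int main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s <directory>\n", argv[0]);
		return 1;
	}
	/* 16MB, purely as an example hint */
	return set_hints(argv[1], 16U << 20) ? 1 : 0;
}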

Hmm.  Were most of the stuck processes blocked in xfs_inodegc_flush?  Maybe
we should try to switch that to something that stops waiting after 30s,
since most of the (non-fsfreeze) callers don't actually *require* the
work to finish; they're just trying to return accurate space accounting
to userspace.
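
Roughly what I have in mind, as an untested sketch only: the
m_inodegc_queued counter, the m_inodegc_wait waitqueue, and the function
name below are all made up for illustration and aren't what's in the
tree today.  The idea is simply that statfs-ish callers get a bounded
wait, while fsfreeze keeps the existing unbounded flush since freeze
really does need every queued inactivation to finish.

/* imagined as living in fs/xfs/xfs_icache.c */
#define XFS_INODEGC_STATFS_TIMEOUT	(30 * HZ)

static bool
xfs_inodegc_flush_timeout(
	struct xfs_mount	*mp)
{
	/*
	 * Wait for the queued inode inactivation (inodegc) work to drain,
	 * but give up after 30s so that statfs callers get control back
	 * with slightly stale free space counts instead of blocking for
	 * hours behind a huge reflinked removal.
	 */
	return wait_event_timeout(mp->m_inodegc_wait,
			atomic_read(&mp->m_inodegc_queued) == 0,
			XFS_INODEGC_STATFS_TIMEOUT) > 0;
}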

> Out of interest:
> 
> > > - with a reboot/remount, does the log replay continue from where it left
> > > off, or start again?
> 
> Sorry, if you provided an answer to this, I didn't understand it.
> 
> Basically the question is: if a recovery on mount were going to take 10
> hours, but the box rebooted and the fs was mounted again at the 8-hour
> mark, would the recovery this time take 2 hours, or once again 10 hours?

In theory, yes, it'll restart from where it left off (so closer to 2
hours than 10 in your example).  But if 10 seconds go by and the extent
count *hasn't changed*, then yikes, did we spend that entire time doing
refcount btree updates??

--D

> Cheers,
> 
> Chris
