Re: [PATCH 05/14] xfs: repair free space btrees

From: Brian Foster <bfoster@redhat.com>
To: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: linux-xfs@vger.kernel.org, david@fromorbit.com,
	allison.henderson@oracle.com
Subject: Re: [PATCH 05/14] xfs: repair free space btrees
Date: Fri, 10 Aug 2018 06:33:52 -0400
Message-ID: <20180810103351.GA27330@bfoster>
In-Reply-To: <20180809155959.GK30972@magnolia>

On Thu, Aug 09, 2018 at 08:59:59AM -0700, Darrick J. Wong wrote:
> On Thu, Aug 09, 2018 at 08:00:28AM -0400, Brian Foster wrote:
> > On Wed, Aug 08, 2018 at 03:42:32PM -0700, Darrick J. Wong wrote:
> > > On Wed, Aug 08, 2018 at 08:29:54AM -0400, Brian Foster wrote:
> > > > On Tue, Aug 07, 2018 at 04:34:58PM -0700, Darrick J. Wong wrote:
> > > > > On Fri, Aug 03, 2018 at 06:49:40AM -0400, Brian Foster wrote:
> > > > > > On Thu, Aug 02, 2018 at 12:22:05PM -0700, Darrick J. Wong wrote:
> > > > > > > On Thu, Aug 02, 2018 at 09:48:24AM -0400, Brian Foster wrote:
> > > > > > > > On Wed, Aug 01, 2018 at 11:28:45PM -0700, Darrick J. Wong wrote:
> > > > > > > > > On Wed, Aug 01, 2018 at 02:39:20PM -0400, Brian Foster wrote:
> > > > > > > > > > On Wed, Aug 01, 2018 at 09:23:16AM -0700, Darrick J. Wong wrote:
> > > > > > > > > > > On Wed, Aug 01, 2018 at 07:54:09AM -0400, Brian Foster wrote:
> > > > > > > > > > > > On Tue, Jul 31, 2018 at 03:01:25PM -0700, Darrick J. Wong wrote:
> > > > > > > > > > > > > On Tue, Jul 31, 2018 at 01:47:23PM -0400, Brian Foster wrote:
> > > > > > > > > > > > > > On Sun, Jul 29, 2018 at 10:48:21PM -0700, Darrick J. Wong wrote:
...
> > > > What _seems_ beneficial about that approach is we get (potentially
> > > > external) persistent backing and memory reclaim ability with the
> > > > traditional memory allocation model.
> > > >
> > > > ISTM that if we used a regular file, we'd need to deal with the
> > > > traditional file interface somehow or another (file read/pagecache
> > > > lookup -> record ??).
> > > 
> > > Yes, that's all neatly wrapped up in kernel_read() and kernel_write() so
> > > all we need is a (struct file *).
> > > 
> > > > We could repurpose some existing mechanism like the directory code or
> > > > quota inode mechanism to use xfs buffers for that purpose, but I think
> > > > that would require us to always use an internal inode. Allowing
> > > > userspace to pass an fd/file passes that consideration on to the user,
> > > > which might be more flexible. We could always warn about additional
> > > > limitations if that fd happens to be based on the target fs.
> > > 
> > > <nod> A second advantage of the struct file/kernel_{read,write} approach
> > > is that if we ever decide to let userspace pass in a fd, it's trivial
> > > to feed that struct file to the kernel io routines instead of a memfd
> > > one.
> > > 
> > 
> > Yeah, I like this flexibility. In fact, I'm wondering why we wouldn't do
> > something like this anyways. Could/should xfs_scrub be responsible for
> > allocating a memfd and passing along the fd? Another advantage of doing
> > that is whatever logic we may need to clean up old repair files or
> > whatever is pushed to userspace.
> 
> There are two ways we could do this -- one is to have the kernel manage
> the memfd creation internally (like my patches do now); the other is for
> xfs_scrub to pass in creat(O_TMPFILE).
> 
> When repair fputs the file (or fdputs the fd if we switch to using
> that), the kernel will perform the usual deletion of the zero-linkcount
> zero-refcount file.  We get all the "cleanup" for free by closing the
> file.
> 

Ok. FWIW, the latter approach where xfs_scrub creates a file and passes
the fd along to the kernel seems preferable to me, but perhaps others
have different opinions. We could accept a pathname from the user to
create the file or otherwise attempt to allocate a memfd by default and
pass that along.
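
To make that concrete, here is a rough userspace sketch of what the scrub
side might do. It is only an illustration: open_scratch() and the
"xfs_scrub_scratch" name are hypothetical and not part of xfs_scrub or of
this patch series, and it takes a directory rather than a full pathname so
that it can use O_TMPFILE. Both branches return an fd whose backing file has
no name, so it goes away once the last reference is dropped.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: obtain an anonymous scratch file for repair staging.
 *
 * If the caller names a directory, create an unnamed O_TMPFILE file there
 * (this fails if the underlying filesystem lacks O_TMPFILE support).
 * Otherwise fall back to a RAM/swap-backed memfd.
 *
 * Returns a file descriptor, or -1 with errno set.
 */
static int open_scratch(const char *dir)
{
	if (dir)
		return open(dir, O_TMPFILE | O_RDWR | O_CLOEXEC, 0600);

	return memfd_create("xfs_scrub_scratch", MFD_CLOEXEC);
}
```

The caller would then hand the resulting fd to the kernel through whatever
interface we end up defining for that (nothing like that exists yet), and
simply close() it afterwards.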

> One other potential complication is that a couple of the repair
> functions need two memfds.  The extended attribute repair creates a
> fixed-record array for attr keys and an xblob to hold names and values;
> each structure gets its own memfd.  The refcount repair creates two
> fixed-record arrays, one for refcount records and another to act as a
> stack of rmaps to compute reference counts.
> 

Hmm, I guess there's nothing stopping scrub from passing in two fds.
Maybe it would make more sense for the userspace option to be a path
basename or directory where scrub is allowed to create whatever scratch
files it needs.

That aside, is there any reason the repair mechanism couldn't emulate
multiple files with a single fd via a magic offset delimiter or
something? E.g., "file 1" starts at offset 0, "file 2" starts at offset
1TB, etc. (1TB is probably overkill, but you get the idea..).
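
As a userspace sketch of the offset idea (the kernel side would presumably
do the same arithmetic in front of kernel_read()/kernel_write()), something
like the following. Everything here is hypothetical naming on my part:
struct scratch_region, REGION_SIZE and the region_* helpers do not exist in
the patches, and the 1TB stride is just the number from above. It relies on
the backing file being sparse, so the gap between regions costs nothing.

```c
#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <assert.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumed stride between emulated files: 1TB. */
#define REGION_SIZE	((off_t)1 << 40)

/* One emulated "file": a window into the shared fd starting at base. */
struct scratch_region {
	int	fd;
	off_t	base;
};

/* "File" idx starts at offset idx * REGION_SIZE within the shared fd. */
static struct scratch_region scratch_region_init(int fd, unsigned int idx)
{
	struct scratch_region r = { .fd = fd, .base = (off_t)idx * REGION_SIZE };

	assert(fd >= 0);
	return r;
}

/* Store fixed-size record number recno; returns 0 on success, -1 on error. */
static int region_store(const struct scratch_region *r, uint64_t recno,
			const void *rec, size_t len)
{
	/* Refuse records that would spill into the next region. */
	if (len == 0 || recno >= (uint64_t)REGION_SIZE / len)
		return -1;

	return pwrite(r->fd, rec, len, r->base + (off_t)(recno * len)) ==
			(ssize_t)len ? 0 : -1;
}

/* Load fixed-size record number recno; returns 0 on success, -1 on error. */
static int region_load(const struct scratch_region *r, uint64_t recno,
		       void *rec, size_t len)
{
	if (len == 0 || recno >= (uint64_t)REGION_SIZE / len)
		return -1;

	return pread(r->fd, rec, len, r->base + (off_t)(recno * len)) ==
			(ssize_t)len ? 0 : -1;
}
```

With that, the refcount repair case would use region 0 for the refcount
records and region 1 for the rmap stack, both on the same fd. A short write
is treated as a failure here, which matches the "dump the memfd and error
back to userspace" behavior discussed further down.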

Brian

> (In theory the xbitmap could also be converted to use the fixed record
> array, but in practice they haven't (yet) become large enough to warrant
> it, and there's currently no way to insert or delete records from the
> middle of the array.)
> 
> > > > > > I'm not familiar with memfd. The manpage suggests it's ram backed, is it
> > > > > > swappable or something?
> > > > > 
> > > > > It's supposed to be.  The quick test I ran (allocate a memfd, write 1GB
> > > > > of junk to it on a VM with 400M of RAM) seemed to push about 980MB into
> > > > > the swap file.
> > > > > 
> > > > 
> > > > Ok.
> > > > 
> > > > > > If so, that sounds a reasonable option provided the swap space
> > > > > > requirement can be made clear to users
> > > > > 
> > > > > We can document it.  I don't think it's any worse than xfs_repair being
> > > > > able to use up all the memory + swap... and since we're probably only
> > > > > going to be repairing one thing at a time, most likely scrub won't need
> > > > > as much memory.
> > > > > 
> > > > 
> > > > Right, but as noted below, my concerns with the xfs_repair comparison
> > > > are that 1.) the kernel generally has more of a limit on anonymous
> > > > memory allocations than userspace (i.e., not swappable AFAIU?) and 2.)
> > > > it's not clear how effectively running the system out of memory via the
> > > > kernel will behave from a failure perspective.
> > > > 
> > > > IOW, xfs_repair can run the system out of memory but for the most part
> > > > that ends up being a simple problem for the system: OOM kill the bloated
> > > > xfs_repair process. For an online repair in a similar situation, I have
> > > > no idea what's going to happen.
> > > 
> > > Back in the days of the huge linked lists the oom killer would target
> > > other processes because it doesn't know that the online repair thread is
> > > sitting on a ton of pinned kernel memory...
> > > 
> > 
> > Makes sense, kind of what I'd expect...
> > 
> > > > The hope is that the online repair hits -ENOMEM and unwinds, but ISTM
> > > > we'd still be at risk of other subsystems running into memory
> > > > allocation problems, filling up swap, the OOM killer going after
> > > > unrelated processes, etc.  What if, for example, the OOM killer starts
> > > > picking off processes in service to a running online repair that
> > > > immediately consumes freed up memory until the system is borked?
> > > 
> > > Yeah.  One thing we /could/ do is register an oom notifier that would
> > > urge any running repair threads to bail out if they can.  It seems to me
> > > that the oom killer blocks on the oom_notify_list chain, so our handler
> > > could wait until at least one thread exits before returning.
> > > 
> > 
> > Ok, something like that could be useful. I agree that we probably don't
> > need to go that far until the mechanism is nailed down and testing shows
> > that OOM is a problem.
> 
> It already is a problem on my contrived "2TB hardlink/reflink farm fs" +
> "400M of RAM and no swap" scenario.  Granted, pretty much every other
> xfs utility also blows out on that so I'm not sure how hard I really
> need to try...
> 
> > > > I don't know how likely that is or if it really ends up much different
> > > > from the analogous xfs_repair situation. My only point right now is
> > > > that failure scenario is something we should explore for any solution
> > > > we ultimately consider because it may be an unexpected use case of the
> > > > underlying mechanism.
> > > 
> > > Ideally, online repair would always be the victim since we know we have
> > > a reasonable fallback.  At least for memfd, however, I think the only
> > > clues we have to decide the question "is this memfd getting in the way
> > > of other threads?" is either seeing ENOMEM, short writes, or getting
> > > kicked by an oom notification.  Maybe that'll be enough?
> > > 
> > 
> > Hm, yeah. It may be challenging to track memfd usage as such. If
> > userspace has access to the fd on an OOM notification or whatever, it
> > might be able to do more accurate analysis based on an fstat() or
> > something.
> > 
> > Related question... is the online repair sequence currently
> > interruptible, if xfs_scrub receives a fatal signal while pulling in
> > entries during an allocbt scan for example?
> 
> It's interruptible (fatal signals only) during the scan phase, but once
> it starts logging metadata updates it will run all the way to
> completion.
> 
> > > > (To the contrary, just using a cached file seems a natural fit from
> > > > that perspective.)
> > > 
> > > Same here.
> > > 
> > > > > > and the failure characteristics aren't more severe than for userspace.
> > > > > > An online repair that puts the broader system at risk of OOM as
> > > > > > opposed to predictably failing gracefully may not be the most useful
> > > > > > tool.
> > > > > 
> > > > > Agreed.  One huge downside of memfd seems to be the lack of a mechanism
> > > > > for the vm to push back on us if we successfully write all we need to
> > > > > the memfd but then other processes need some memory.  Obviously, if the
> > > > > memfd write itself comes up short or fails then we dump the memfd and
> > > > > error back to userspace.  We might simply have to free array memory
> > > > > while we iterate the records to minimize the time spent at peak memory
> > > > > usage.
> > > > > 
> > > > 
> > > > Hm, yeah. Some kind of fixed/relative size in-core memory pool approach
> > > > may simplify things because we could allocate it up front and know right
> > > > away whether we just don't have enough memory available to repair.
> > > 
> > > Hmm.  Apparently we actually /can/ call fallocate on memfd to grab all
> > > the pages at once, provided we have some guesstimate beforehand of how
> > > much space we think we'll need.
> > > 
> > > So long as my earlier statement about the memory requirements being no
> > > more than the size of the btree leaves is actually true (I haven't
> > > rigorously tried to prove it), we need about (xrep_calc_ag_resblks() *
> > > blocksize) worth of space in the memfd file.  Maybe we ask for 1.5x
> > > that and if we don't get it, we kill the memfd and exit.
> > > 
> > 
> > Indeed. It would be nice if we could do all of the file management bits
> > in userspace.
> 
> Agreed, though no file management would be even better. :)
> 
> --D
> 
> > Brian
> > 
> > > --D
> > > 
> > > > 
> > > > Brian
> > > > 
> > > > > --D
> > > > > 
> > > > > > 
> > > > > > Brian
> > > > > > 
> > > > > > > --D
> > > > > > > 
> > > > > > > > Brian
> > > > > > > > 
> > > > > > > > > --D
> > > > > > > > > 
> > > > > > > > > > Brian
> > > > > > > > > > 
> > > > > > > > > > > --D
> > > > > > > > > > > 
> > > > > > > > > > > > Brian
> > > > > > > > > > > > 
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > +done:
> > > > > > > > > > > > > > > +	/* Free all the OWN_AG blocks that are not in the rmapbt/agfl. */
> > > > > > > > > > > > > > > +	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_AG);
> > > > > > > > > > > > > > > +	return xrep_reap_extents(sc, old_allocbt_blocks, &oinfo,
> > > > > > > > > > > > > > > +			XFS_AG_RESV_NONE);
> > > > > > > > > > > > > > > +}
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > ...
> > > > > > > > > > > > > > > diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c
> > > > > > > > > > > > > > > index 0ed68379e551..82f99633a597 100644
> > > > > > > > > > > > > > > --- a/fs/xfs/xfs_extent_busy.c
> > > > > > > > > > > > > > > +++ b/fs/xfs/xfs_extent_busy.c
> > > > > > > > > > > > > > > @@ -657,3 +657,17 @@ xfs_extent_busy_ag_cmp(
> > > > > > > > > > > > > > >  		diff = b1->bno - b2->bno;
> > > > > > > > > > > > > > >  	return diff;
> > > > > > > > > > > > > > >  }
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > +/* Are there any busy extents in this AG? */
> > > > > > > > > > > > > > > +bool
> > > > > > > > > > > > > > > +xfs_extent_busy_list_empty(
> > > > > > > > > > > > > > > +	struct xfs_perag	*pag)
> > > > > > > > > > > > > > > +{
> > > > > > > > > > > > > > > +	spin_lock(&pag->pagb_lock);
> > > > > > > > > > > > > > > +	if (pag->pagb_tree.rb_node) {
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > RB_EMPTY_ROOT()?
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Good suggestion, thank you!
> > > > > > > > > > > > > 
> > > > > > > > > > > > > --D
> > > > > > > > > > > > > 
> > > > > > > > > > > > > > Brian
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > +		spin_unlock(&pag->pagb_lock);
> > > > > > > > > > > > > > > +		return false;
> > > > > > > > > > > > > > > +	}
> > > > > > > > > > > > > > > +	spin_unlock(&pag->pagb_lock);
> > > > > > > > > > > > > > > +	return true;
> > > > > > > > > > > > > > > +}
> > > > > > > > > > > > > > > diff --git a/fs/xfs/xfs_extent_busy.h b/fs/xfs/xfs_extent_busy.h
> > > > > > > > > > > > > > > index 990ab3891971..2f8c73c712c6 100644
> > > > > > > > > > > > > > > --- a/fs/xfs/xfs_extent_busy.h
> > > > > > > > > > > > > > > +++ b/fs/xfs/xfs_extent_busy.h
> > > > > > > > > > > > > > > @@ -65,4 +65,6 @@ static inline void xfs_extent_busy_sort(struct list_head *list)
> > > > > > > > > > > > > > >  	list_sort(NULL, list, xfs_extent_busy_ag_cmp);
> > > > > > > > > > > > > > >  }
> > > > > > > > > > > > > > >  
> > > > > > > > > > > > > > > +bool xfs_extent_busy_list_empty(struct xfs_perag *pag);
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > >  #endif /* __XFS_EXTENT_BUSY_H__ */
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > > > > > > > > > > > > the body of a message to majordomo@vger.kernel.org
> > > > > > > > > > > > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html