From: "Darrick J. Wong" <djwong@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org, hch@infradead.org
Subject: Re: [PATCH 2/9] xfs: deferred inode inactivation
Date: Tue, 8 Jun 2021 18:28:38 -0700
Message-ID: <20210609012838.GW2945738@locust>
In-Reply-To: <20210609010109.GN664593@dread.disaster.area>

On Wed, Jun 09, 2021 at 11:01:09AM +1000, Dave Chinner wrote:
> On Mon, Jun 07, 2021 at 09:40:51PM -0700, Darrick J. Wong wrote:
> > On Tue, Jun 08, 2021 at 10:57:53AM +1000, Dave Chinner wrote:
> > > On Mon, Jun 07, 2021 at 03:25:04PM -0700, Darrick J. Wong wrote:
> > > > +/* Queue a new inode gc pass if there are inodes needing inactivation. */
> > > > +static void
> > > > +xfs_inodegc_queue(
> > > > +	struct xfs_mount        *mp)
> > > > +{
> > > > +	if (!xfs_inodegc_running(mp))
> > > > +		return;
> > > > +
> > > > +	rcu_read_lock();
> > > > +	if (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_INODEGC_TAG))
> > > > +		queue_delayed_work(mp->m_gc_workqueue, &mp->m_inodegc_work, 0);
> > > > +	rcu_read_unlock();
> > > > +}
> > > 
> > > I have no idea why we are checking if the gc is running here. All
> > > our other background stuff runs and re-queues until it is directly
> > > stopped or there's nothing left in the tree. Hence I'm a bit
> > > clueless right now about what this semantic is for...
> > 
> > The opflag is supposed to control whether inactivation actually runs.
> > As you might guess from _running calls being scattered everywhere, it's
> > pretty ugly code.  All this crap exists because there's no easy solution
> > to shutting down background threads after we commit to freezing the fs
> > but before an fs freeze hits SB_FREEZE_FS and we can't allocate new
> > transactions.
> > 
> > Fixing inactivation to use NO_WRITECOUNT means auditing every function
> > call that xfs_inactive makes to look for an xfs_trans_alloc* call and
> > probably modifying all of them to be able to switch between regular and
> > NO_WRITECOUNT mode.  I tried that, it's ugly.
> > 
> > Another solution is to add ->freeze_super and ->thaw_super functions
> > to prevent a FITHAW caller from racing with a FIFREEZE caller and
> > accidentally rearming the inodegc while a freeze starts.
> 
> This seems like the right way to proceed.
> 
> Of course, all the freeze/thaw exclusion is in freeze_super and
> thaw_super via the umount lock, so to do this we need to factor
> freeze_super() so that we can take the active references to the
> superblock ourselves, then turn off inodegc, then run the generic
> freeze_super() code...
> 
> The unfreeze side where we'd turn the inode gc back on is already
> inside the protected region (via ->unfreeze_fs callout in
> thaw_super_locked()) so we don't need to do anything special there.

<nod> Not sure I want to go factoring vfs functions if I can avoid it,
though I guess there's always the copy-pasta approach :P

But for now, I /think/ I figured out a way to make it work and avoid all
that (see below).

> [I'll reorder bits to address all the other freeze stuff now so it's
> not all mixed up with the stat/df stuff]
> 
> > > > +/* Stop all queued inactivation work. */
> > > > +void
> > > > +xfs_inodegc_stop(
> > > > +	struct xfs_mount	*mp)
> > > > +{
> > > > +	clear_bit(XFS_OPFLAG_INODEGC_RUNNING_BIT, &mp->m_opflags);
> > > > +	cancel_delayed_work_sync(&mp->m_inodegc_work);
> > > > +}
> > > 
> > > what's to stop racing invocations of stop/start? Perhaps:
> > > 
> > > 	if (test_and_clear_bit())
> > > 		cancel_delayed_work_sync(&mp->m_inodegc_work);
> > 
> > That horrible hack below.
> >
> > > > +
> > > > +/* Schedule deferred inode inactivation work. */
> > > > +void
> > > > +xfs_inodegc_start(
> > > > +	struct xfs_mount	*mp)
> > > > +{
> > > > +	set_bit(XFS_OPFLAG_INODEGC_RUNNING_BIT, &mp->m_opflags);
> > > > +	xfs_inodegc_queue(mp);
> > > > +}
> > > 
> > > 	if (test_and_set_bit())
> > > 		xfs_inodegc_queue(mp);
> > > 
> > > So that the running state will remain in sync with the actual queue
> > > operation? Though I'm still not sure why we need the running bit...
> > 
> > (see ugly sync_fs SB_FREEZE_PAGEFAULTS hack)
> 
> I'm not sure how that addresses any sort of concurrent set/clear
> that could occur as it doesn't guarantee that the running state
> matches the opflag bit state...
> 
> > > Ok, "opflags" are undocumented as to how they work, what their
> > > consistency model is, etc. I understand you want an atomic flag to
> > > indicate that something is running, and mp->m_flags is not that
> > > (despite being used that way historically). 
> > > 
> > > I dislike the "_BIT" annotations for a variable that is only to be
> > > used as an index bit field. Or maybe it's a flag field and you
> > > haven't defined any bitwise flags for it because you're not using it
> > > that way yet.
> > > 
> > > So, is m_opflags an indexed bit field or a normal flag field like
> > > m_flags?
> > 
> > It's an indexed bit field, which is why I named it _BIT.  I'll try to
> > add more documentation around what this thing is and how the flags work:
> > 
> > struct xfs_mount {
> > 	...
> > 	/*
> > 	 * This atomic bitset controls flags that alter the behavior of
> > 	 * the filesystem.  Use only the atomic bit helper functions
> > 	 * here; see XFS_OPFLAG_* for information about the actual
> > 	 * flags.
> > 	 */
> > 	unsigned long		m_opflags;
> > 	...
> > };
> > 
> > /*
> >  * Operation flags -- each entry here is a bit index into m_opflags and
> >  * is not itself a flag value.
> >  */
> > 
> > /* Are we allowed to run the inode inactivation worker? */
> > #define XFS_OPFLAG_INODEGC_RUNNING_BIT	(0)
> 
> This doesn't really address my comments - there's still the _BIT
> annotation mixed with "flags" variables. Other examples of this are
> "operational flags" or state variables being updated via
> set/clear/test/etc bit op wrappers - page and bufferhead state bits
> and variables, for instance...

Urk, it was late last night and I forgot to update the reply after I
changed that to an enum.
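
For the record, the enum version looks more or less like this (exact
naming still subject to change):

	/*
	 * Operation state -- each value here is the bit number of a flag
	 * in m_opflags and must only be manipulated with the atomic bit
	 * helpers (set_bit/clear_bit/test_bit and friends).
	 */
	enum xfs_opflags {
		/* Are we allowed to run the inode inactivation worker? */
		XFS_OPFLAG_INODEGC_RUNNING = 0,
	};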

> I mentioned on #xfs an older patchset I had that cleaned up a lot of
> this cruft in the xfs_mount flags fields by separating feature flags
> from state flags. That can be found here:
> 
> https://lore.kernel.org/linux-xfs/20180820044851.414-1-david@fromorbit.com/
> 
> I think if we are going to introduce dynamic mount state flags, we
> need to move towards that sort of separation. So leave this patch
> set as it is now with the opflags, and I'll update my flag vs state
> rework patchset and merge this new code into it...
> 
> That all said, I still don't really see a need for a state bit here
> if we can stop the inode gc before we start the freeze process as
> via a xfs_fs_freeze_super() method.
> 
> (and that's freeze done...)
> 
> > > > @@ -947,6 +963,7 @@ xfs_mountfs(
> > > >  	 * qm_unmount_quotas and therefore rely on qm_unmount to release the
> > > >  	 * quota inodes.
> > > >  	 */
> > > > +	xfs_inodegc_flush(mp);
> > > >  	xfs_unmount_flush_inodes(mp);
> > > 
> > > Why isn't xfs_inodegc_flush() part of xfs_unmount_flush_inodes()?
> > > Because, really, xfs_unmount_flush_inodes() depends on all the
> > > inodes first being inactivated so that all transactions on inodes
> > > are complete....
> > 
> > The teardown sequence is not the same between a regular unmount and an
> > aborted mount...
> > 
> > > >   out_log_dealloc:
> > > >  	xfs_log_mount_cancel(mp);
> > > > @@ -983,6 +1000,12 @@ xfs_unmountfs(
> > > >  	uint64_t		resblks;
> > > >  	int			error;
> > > >  
> > > > +	/*
> > > > +	 * Flush all the queued inode inactivation work to disk before tearing
> > > > +	 * down rt metadata and quotas.
> > > > +	 */
> > > > +	xfs_inodegc_flush(mp);
> > > > +
> > > >  	xfs_blockgc_stop(mp);
> > > >  	xfs_fs_unreserve_ag_blocks(mp);
> > > >  	xfs_qm_unmount_quotas(mp);
> > > 
> > > FWIW, there's inconsistency in the order of operations between
> > > failure handling in xfs_mountfs() w.r.t. inode flushing and quotas
> > > vs what xfs_unmountfs() is now doing....
> > 
> > ...because during regular unmountfs, we want to inactivate inodes while
> > we still have a per-ag reservation protecting finobt expansions.  During
> > an aborted mount, we don't necessarily have the reservation set up but
> > we have to clean everything out, so the inodegc flush comes much later.
> > 
> > It's convoluted, but do you want me to clean /that/ up too?  That's a
> > pretty heavy lift; I already tried to fix those two paths, ran out of
> > brain cells, and gave up.
> 
> No, I was just noting that they are different and there was no clear
> explanation of why. A comment explaining the difference is really
> all I am looking for here...

Ok.  I'll document why xfs_unmountfs calls inodegc_flush explicitly:

	/*
	 * Perform all on-disk metadata updates required to inactivate
	 * inodes.  Freeing inodes can cause shape changes to the
	 * finobt, so take care of this before we lose the per-AG space
	 * reservations.
	 */
	xfs_inodegc_flush(mp);

It shouldn't be necessary to call it again from xfs_unmount_flush_inodes
since the only inodes that should be in memory at that point are the
quota, realtime, and root directory inodes, none of which ever go
through inactivation (quota/rt are metadata, and the root dir should
never get deleted).

However, it's a bit murkier for the failed mountfs case -- I /think/ I
put the flush in there to make sure that we always sent all inodes to
reclaim no matter what weird things happened during mount.  In theory
it's always a NOP since log recovery inactivates all the inodes it
touches (if necessary) and the only other inodes that mount touches are
the quota/rt/rootdir inodes.

> (and now df vs unlink....)
> 
> > > > @@ -80,4 +80,37 @@ int xfs_icache_inode_is_allocated(struct xfs_mount *mp, struct xfs_trans *tp,
> > > >  void xfs_blockgc_stop(struct xfs_mount *mp);
> > > >  void xfs_blockgc_start(struct xfs_mount *mp);
> > > >  
> > > > +void xfs_inodegc_worker(struct work_struct *work);
> > > > +void xfs_inodegc_flush(struct xfs_mount *mp);
> > > > +void xfs_inodegc_stop(struct xfs_mount *mp);
> > > > +void xfs_inodegc_start(struct xfs_mount *mp);
> > > > +int xfs_inodegc_free_space(struct xfs_mount *mp, struct xfs_icwalk *icw);
> > > > +
> > > > +/*
> > > > + * Process all pending inode inactivations immediately (sort of) so that a
> > > > + * resource usage report will be mostly accurate with regards to files that
> > > > + * have been unlinked recently.
> > > > + *
> > > > + * It isn't practical to maintain a count of the resources used by unlinked
> > > > + * inodes to adjust the values reported by this function.  Resources that are
> > > > + * shared (e.g. reflink) when an inode is queued for inactivation cannot be
> > > > + * counted towards the adjustment, and cross referencing data extents with the
> > > > + * refcount btree is the only way to decide if a resource is shared.  Worse,
> > > > + * unsharing of any data blocks in the system requires either a second
> > > > + * consultation with the refcount btree, or training users to deal with the
> > > > + * free space counts possibly fluctuating upwards as inactivations occur.
> > > > + *
> > > > + * Hence we guard the inactivation flush with a ratelimiter so that the counts
> > > > + * are not way out of whack while ignoring workloads that hammer us with statfs
> > > > + * calls.  Once per clock tick seems frequent enough to avoid complaints about
> > > > + * inaccurate counts.
> > > > + */
> > > > +static inline void
> > > > +xfs_inodegc_summary_flush(
> > > > +	struct xfs_mount	*mp)
> > > > +{
> > > > +	if (__ratelimit(&mp->m_inodegc_ratelimit))
> > > > +		xfs_inodegc_flush(mp);
> > > > +}
> > > 
> > > Once per clock tick is still quite fast - once a millisecond on a
> > > 1000Hz kernel. I'd prefer that we use a known timer base for this
> > > sort of thing, not something that changes with kernel config. Every
> > > 100ms, perhaps?
> > 
> > I tried a number of ratelimit values here: 1ms, 4ms, 10ms, 100ms, 500ms,
> > 2s, and 15s.  fstests and most everything else seemed to act the same up
> > to 10ms.  At 100ms, the tests that delete something and immediately run
> > df will start to fail, and above that all hell breaks loose because many
> > tests (from which I extrapolate most programmers) expect that statfs
> > won't run until unlink has deleted everything.
> 
> So the main problem I have with this is that it blocks the caller
> until inactivation is done. For something that is supposed to be
> fast and non-blocking, this is a bad thing to do.
> 
> The quota usage (called on every get/get_next syscall) is really the
> only one that should be called with any frequency - if anyone is
> calling statfs so fast that we have to rate limit gc flushes, then
> they aren't getting any useful information in the delta between
> calls a millisecond apart.
> 
> Hence I suspect that flushes and/or rate limited flushes are not
> necessary at all here. Why not just deal with it like we do the
> inode flush at ENOSPC (i.e. xfs_flush_inodes())? i.e. we try to
> flush the work first, and if that returns true we waited on a flush
> already in progress and we don't need to do our own? Indeed, why
> aren't all the inodegc flushes done this way?
> 
> For the quota case, I think doing a flush on the first get call
> would be sufficient - doing one for every "next" call doesn't make
> much sense, because we've already done a flush at the start of the
> dquot get walk. IOWs, we've done the work necessary for a point in
> time snapshot of the quota state that is largely consistent across
> all the quotas at the time the walk started. Hence I don't think we
> need to keep flushing over and over again....

<nod> The quota case might get put back, then.
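
If it does, I'll take your suggestion literally: a single flush at the
start of the Q_GETQUOTA report and nothing added to the getnext path.
Roughly this (the placement is illustrative, not final):

	/*
	 * At the top of the Q_GETQUOTA handler: process deferred
	 * inactivation once so the dquot resource counts reflect recently
	 * unlinked files.  The GETNEXTQUOTA walk relies on the flush done
	 * when the report started instead of repeating it for every id.
	 */
	xfs_inodegc_flush(mp);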

> For fs_counts, it is non-blocking, best effort only.  The percpu
> counters are read, not summed, so they are inaccurate to begin with.
> Hence there's no need to flush inactivated inodes there because the
> counters are not guaranteed accurate. If we do need a flush, then
> just do it unconditionally because anyone calling this with
> extremely high frequency really isn't going to get anything useful
> from it.
> 
> For statfs, we actually sum the percpu counters, so we should just
> flush the inodegc before this. If someone is calling statfs with
> high enough frequency that rate limiting inodegc flushes is actually
> needed, then they've already got substantial problems on large CPU
> count machines..
> 
> Hence I think we should just have flushes where they are needed, and
> change the flush to block and return if a flush is already in
> progress rather than doing an entire new flush itself. That
> effectively rate limits the flushing, but doesn't prevent a flush
> when none has been done in a while due to ratelimit state...

For now, I'm taking your suggestion to kick off the inactivation work at
the end of xfs_fs_syncfs, since regular syncfs is known to involve disk
access already, and the people who "unlink(); syncfs(); statfs();" like
they're supposed to will still get the results they expect.

I also made it so that if fdblocks is below the low space thresholds
then we'll delay the inode/blockgc workers less and less, and removed
xfs_inodegc_summary_flush completely.
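
Concretely, it's not much more than this (a sketch of the direction,
not the final patch; xfs_gc_delay_ms is a made-up name):

	/* at the tail of the ->sync_fs method, after the log force: */
	if (wait) {
		/*
		 * Push and wait for any queued inode inactivation so that
		 * "unlink; syncfs; statfs" reports the freed resources.
		 * This doesn't promise the updates are on stable storage,
		 * only that the incore counters have caught up.
		 */
		flush_delayed_work(&mp->m_inodegc_work);
	}

plus a helper that shrinks the worker delay as fdblocks drops towards
the low space thresholds:

	/* made-up helper; the thresholds are illustrative only */
	static unsigned int
	xfs_gc_delay_ms(
		struct xfs_mount	*mp)
	{
		int64_t			freesp;

		freesp = percpu_counter_read_positive(&mp->m_fdblocks);
		if (freesp < mp->m_low_space[XFS_LOWSP_1_PCNT])
			return 0;	/* <1% free: run immediately */
		if (freesp < mp->m_low_space[XFS_LOWSP_5_PCNT])
			return 10;	/* getting tight: hurry up */
		return 100;		/* plenty of space: default delay */
	}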

I /think/ that flushing inodegc during a syncfs also solves the problem
of coordinating inodegc flush and stop with the rest of freeze.  I might
still need the opflag to prevent requeuing if a destroy_inode races with
a readonly remount, but I need to recheck that a little more carefully
tomorrow when I'm not tired.

But, at worst, if that doesn't work, I can open-code freeze_super to
call the parts of XFS that we really want.
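
The wrapper I have in the back of my mind is roughly this (ignoring
the active-reference factoring you mentioned; illustrative only):

	static int
	xfs_fs_freeze_super(
		struct super_block	*sb)
	{
		struct xfs_mount	*mp = XFS_M(sb);
		int			error;

		/*
		 * Stop queued inactivation before the VFS reaches
		 * SB_FREEZE_FS and we can no longer allocate transactions.
		 */
		xfs_inodegc_stop(mp);
		error = freeze_super(sb);
		if (error)
			xfs_inodegc_start(mp);
		return error;
	}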

> 
> > > I suspect that's really going to hurt stat performance. I guess
> > > benchmarks are in order...
> > 
> > Ok, so here's the question I have: Does POSIX (or any other standard we
> > care to fit) actually define any behavior between the following
> > sequence:
> > 
> > unlink("/foo");	/* no sync calls of any kind */
> > statfs("/");
> 
> None that I know of.  The man page for statfs even says:
> 
> "buf is a pointer to a statfs structure defined approximately as
> follows"
> 
> so even the man page is extremely cagey about what is actually
> returned in the statfs buffer. Free space counters not being totally
> accurate at any specific point in time fits with the "approximately
> defined" behaviour description in the man page. As long as we do, in
> the near term, correct accounting for deferred operations, then we're
> all good.
> 
> > As I've mentioned in the past, the only reason we need these inodegc
> > flushes for summary reporting is because users expect that if they
> > delete an unshared 10GB file, they can immediately df and see that the
> > inode count went down by one and free space went up by 10GB.
> > 
> > I /guess/ we could modify every single fstest to sync before checking
> > summary counts instead of doing this, but I bet there will be some users
> > who will be surprised that xfs now has *trfs df logic.
> 
> If fstests needs accurate counters, it should use 'df --sync' and we
> should make sure that the syncfs implementation triggers a flush of
> inactive inodes before it returns. We don't have to guarantee that
> the inactivation is on stable storage, but it would make sense that
> we trigger background ops to get the free space accounting near to
> accurate...

<nod> Well let's see how it does overnight...

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
