Re: [RFC[RAP] PATCH] xfs: allow setting and clearing of log incompat feature flags

From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Brian Foster <bfoster@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>,
	Allison Henderson <allison.henderson@oracle.com>,
	xfs <linux-xfs@vger.kernel.org>
Subject: Re: [RFC[RAP] PATCH] xfs: allow setting and clearing of log incompat feature flags
Date: Fri, 11 Dec 2020 15:35:07 -0800	[thread overview]
Message-ID: <20201211233507.GP1943235@magnolia> (raw)
In-Reply-To: <20201211133901.GA2032335@bfoster>

On Fri, Dec 11, 2020 at 08:39:01AM -0500, Brian Foster wrote:
> On Fri, Dec 11, 2020 at 08:50:04AM +1100, Dave Chinner wrote:
> > On Thu, Dec 10, 2020 at 09:23:58AM -0500, Brian Foster wrote:
> > > On Thu, Dec 10, 2020 at 07:51:32AM +1100, Dave Chinner wrote:
> > > > For changing incompat log flags, covering also provides exactly what
> > > > we need - an empty log with only a dirty superblock in the journal
> > > > to recover. All that we need then is to recheck feature flags after
> > > > recovery (not just clear log incompat flags) because we might need
> > > > to use this to also set incompat feature flags dynamically.
> > > > 
> > > 
> > > I'd love it if we use a better term for a log isolated to the superblock
> > > buffer. So far we've discussed an empty log with a dummy superblock, an
> > > empty log with a dirty superblock, and I suppose we could also have an
> > > actual empty log. :P "Covered log," perhaps?
> > 
> > On terminology: "covered log" has been around for 25 years
> > in XFS, and it's very definition is "an empty, up-to-date log except
> > for a dummy object we logged without any changes to update the log
> > tail". And, by definition, that dummy object is dirty in the log

Hmm, can we capture in a comment in the logging code the precise meaning
of "covered log" in whatever patch we end up generating?  I've long
suspected that's more or less what a covered log meant (the log has one
item just to prove that head==tail and there's nothing else to see here)
but it sure would've been nice to confirm that. :)

> > even if it contains no actual modifications when it was logged.
> > So in this case, "dummy" is directly interchangable with "dirty"
> > when looking at the log contents w.r.t. recovery behaviour.
> > 
> > It's worth looking at a historical change, too. Log covering
> > originally used the root inode and not the superblock. That caused
> > problems on linux dirtying VFS inode state, so we changed it to the
> > superblock in commit 1a387d3be2b3 ("xfs: dummy transactions should
> > not dirty VFS state") about a decade ago.  Note the name of the
> > function at the time (xfs_fs_log_dummy()) and that it's description
> > explicitly mentions that a dummy transaction used for updating the
> > log tail when covering it.
> > 
> > The only difference in this discussion is fact that we may be
> > replacing the "dummy" with a modified superblock. Both are still
> > dirty superblock objects in the log, and serve the same purpose for
> > covering the log, but instead of using a dummy object we update the
> > log incompat flags so taht the only thing that gets replayed if we
> > crash is the modification to the superblock flags...
> > 
> 
> Makes sense, thanks.
> 
> > Finally, no, we can't have a truly empty log while the filesystem is
> > mounted because log transaction records must not be empty. Further,
> > we can only bring the log down to an "empty but not quite clean"
> > state while mounted, because if the log is actually marked clean and
> > we crash then log recovery will not run and so we will not clean up
> > files that were open-but-unlinked when the system crashed.

Whatever happened to Eric's patchset to make us process unlinked inodes
at mount time so that freeze images wouldn't require recovery?

> Yeah, I wasn't suggesting to do that, just surmising about terminology.
> It sounds like "covered log" refers to the current state logging an
> unmodified superblock object.
> 
> > 
> > > > > > but didn't necessarily clean the log. I wonder if we can fit that state
> > > > > > into the existing log covering mechanism by logging the update earlier
> > > > > > (or maybe via an intermediate state..)? I'll need to go stare at that
> > > > > > code a bit..
> > > > 
> > > > That's kinda what I'd like to see done - it gives us a generic way
> > > > of altering superblock incompat feature fields and filesystem
> > > > structures dynamically with no risk of log recovery needing to parse
> > > > structures it may not know about.
> > > > 
> > > 
> > > We might have to think about if we want to clear such feature bits in
> > > all log covering scenarios (idle, freeze) vs. just unmount.
> > 
> > I think clearing bits can probably be lazy. Set a state flag to tell
> > log covering that it needs to do an update, and so when the covering
> > code finally runs the feature bit modification just happens.
> > 
> 
> That's reasonable on its own, it just means we have to support dynamic
> setting of the associated bit(s) at runtime...

The one downside of clearing the log incompat flags when the log goes
idle is that right now I print a big EXPERIMENTAL warning whenever we
turn on the atomic swapext feature, and doing this will cause that
message to cycle over and over...

That's kind of a minor point though.  Either we add a superblock flag to
remember that we've already warned about the experimental feature, or we
just don't bother at all.

Next time I have spare cycles I'll look into adapting this patch to make
it clear log incompat flags at unmount and/or covering time, assuming
nobody beats me to it.

> > > My previous suggestion in the other sub-thread was to set bits on
> > > mount and clear on unmount because that (hopefully) simplifies the
> > > broader implementation.  Otherwise we need to manage dynamic
> > > setting of bits when the associated log items are active, and that
> > > still isn't truly representative because the bits would remain set
> > > long after active items fall out of the log, until the log is
> > > covered. Either way is possible, but I'm curious what others
> > > think.
> > 
> > I think that unless we are setting a log incompat flag, log covering
> > should unconditionally clear the log incompat flags. Because the log
> > is empty, and recovery of a superblock buffer should -always- be
> > possible, then by definition we have no log incompat state
> > present....
> > 
> 
> It should, but in practice will depend on whether the superblock was
> written back or not before a crash while in the covered state. So
> recovery may or may not actually work on an older kernel after the log
> is covered. I think that is fine from a correctness standpoint because
> the important part is to make sure an older kernel will not attempt to
> recover incompatible items, and we make that guarantee by covering the
> log before clearing the bit. I'm merely pointing this out because I
> still think it's more straightforward to just enforce that a kernel with
> the associated feature bit be required to recover a dirty log.

<nod>

> > As for a mechanism for dynamically adding log incompat flags?
> > Perhaps we just do that in xfs_trans_alloc() - add an log incompat
> > flags field into the transaction reservation structure, and if
> > xfs_trans_alloc() sees an incompat field set and the superblock
> > doesn't have it set, the first thing it does is run a "set log
> > incompat flag" transaction before then doing it's normal work...
> > 
> > This should be rare enough it doesn't have any measurable
> > performance overhead, and it's flexible enough to support any log
> > incompat feature we might need to implement...
> > 
> 
> But I don't think that is sufficient. As Darrick pointed out up-thread,
> the updated superblock has to be written back before we're allowed to
> commit transactions with incompatible items. Otherwise, an older kernel
> can attempt log recovery with incompatible items present if the
> filesystem crashes before the superblock is written back.
> 
> We could do some sync transaction and/or sync write dance at runtime,
> but I think the performance/overhead aspect becomes slightly less
> deterministic. It's not clear to me how many bits we'd support over
> time, and whether users would notice hiccups when running some sustained
> workload and happen to trigger sync transaction/write or AIL push
> sequences to set internal bits.

Probably few, since I imagined that most new log items are going to get
built on behalf of new disk format features.  OTOH Allison and I are
both building features that don't change the disk format...

> My question is how flexible do we really need to make incompatible log
> recovery support? Why not just commit the superblock once at mount time
> with however many bits the current kernel supports and clear them on
> unmount? (Or perhaps consider a lazy setting variant where we set all
> supported bits on the first modification..?)

I don't think setting it at mount (or first mod) time is a good idea
either, since that means that the fs cannot be recovered on an old
kernel even if we never actually used the new log items.

--D

> 
> Brian
> 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
> > 
>