All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Darrick J. Wong" <djwong@kernel.org>
To: Allison Henderson <allison.henderson@oracle.com>
Cc: Catherine Hoang <catherine.hoang@oracle.com>,
	"david@fromorbit.com" <david@fromorbit.com>,
	"willy@infradead.org" <willy@infradead.org>,
	"linux-xfs@vger.kernel.org" <linux-xfs@vger.kernel.org>,
	Chandan Babu <chandan.babu@oracle.com>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	"hch@infradead.org" <hch@infradead.org>
Subject: Re: [PATCH 09/14] xfs: document online file metadata repair code
Date: Thu, 16 Feb 2023 15:05:35 -0800	[thread overview]
Message-ID: <Y+62v5kfhv5lr/R8@magnolia> (raw)
In-Reply-To: <addbcebf5cc6b59d5910df8cce80d47f445eb891.camel@oracle.com>

On Thu, Feb 16, 2023 at 03:46:30PM +0000, Allison Henderson wrote:
> On Sun, 2022-10-02 at 11:19 -0700, Darrick J. Wong wrote:

Not sure why this reply is to the earlier submission of the design doc,
but oh well, it didn't change much between October and December of 2022.

> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Add to the fifth chapter of the online fsck design documentation,
> > where
> > we discuss the details of the data structures and algorithms used by
> > the
> > kernel to repair file metadata.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  .../filesystems/xfs-online-fsck-design.rst         |  150
> > ++++++++++++++++++++
> >  1 file changed, 150 insertions(+)
> > 
> > 
> > diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst
> > b/Documentation/filesystems/xfs-online-fsck-design.rst
> > index c41f089549a0..10709dc74dcb 100644
> > --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> > +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> > @@ -2872,3 +2872,153 @@ The allocation group free block list (AGFL)
> > is repaired as follows:
> >  4. Once the AGFL is full, reap any blocks leftover.
> >  
> >  5. The next operation to fix the freelist will right-size the list.
> > +
> > +Inode Record Repairs
> > +--------------------
> > +
> > +Inode records must be handled carefully, because they have both
> > ondisk records
> > +("dinodes") and an in-memory ("cached") representation.
> > +There is a very high potential for cache coherency issues if online
> > fsck is not
> > +careful to access the ondisk metadata *only* when the ondisk
> > metadata is so
> > +badly damaged that the filesystem cannot load the in-memory
> > representation.
> > +When online fsck wants to open a damaged file for scrubbing, it must
> > use
> > +specialized resource acquisition functions that return either the
> > in-memory
> > +representation *or* a lock on whichever object is necessary to
> > prevent any
> > +update to the ondisk location.
> > +
> > +The only repairs that should be made to the ondisk inode buffers are
> > whatever
> > +is necessary to get the in-core structure loaded.
> > +This means fixing whatever is caught by the inode cluster buffer and
> > inode fork
> > +verifiers, and retrying the ``iget`` operation.
> > +If the second ``iget`` fails, the repair has failed.
> > +
> > +Once the in-memory representation is loaded, repair can lock the
> > inode and can
> > +subject it to comprehensive checks, repairs, and optimizations.
> > +Most inode attributes are easy to check and constrain, or are user-
> > controlled
> > +arbitrary bit patterns; these are both easy to fix.
> > +Dealing with the data and attr fork extent counts and the file block
> > counts is
> > +more complicated, because computing the correct value requires
> > traversing the
> > +forks, or if that fails, leaving the fields invalid and waiting for
> > the fork
> > +fsck functions to run.
> > +
> > +The proposed patchset is the
> > +`inode
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > log/?h=repair-inodes>`_
> > +repair series.
> > +
> > +Quota Record Repairs
> > +--------------------
> > +
> > +Similar to inodes, quota records ("dquots") also have both ondisk
> > records and
> > +an in-memory representation, and hence are subject to the same cache
> > coherency
> > +issues.
> > +Somewhat confusingly, both are known as dquots in the XFS codebase.
> > +
> > +The only repairs that should be made to the ondisk quota record
> > buffers are
> > +whatever is necessary to get the in-core structure loaded.
> > +Once the in-memory representation is loaded, the only attributes
> > needing
> > +checking are obviously bad limits and timer values.
> > +
> > +Quota usage counters are checked, repaired, and discussed separately
> > in the
> > +section about :ref:`live quotacheck <quotacheck>`.
> > +
> > +The proposed patchset is the
> > +`quota
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > log/?h=repair-quota>`_
> > +repair series.
> > +
> > +.. _fscounters:
> > +
> > +Freezing to Fix Summary Counters
> > +--------------------------------
> > +
> > +Filesystem summary counters track availability of filesystem
> > resources such
> > +as free blocks, free inodes, and allocated inodes.
> > +This information could be compiled by walking the free space and
> > inode indexes,
> > +but this is a slow process, so XFS maintains a copy in the ondisk
> > superblock
> > +that should reflect the ondisk metadata, at least when the
> > filesystem has been
> > +unmounted cleanly.
> > +For performance reasons, XFS also maintains incore copies of those
> > counters,
> > +which are key to enabling resource reservations for active
> > transactions.
> > +Writer threads reserve the worst-case quantities of resources from
> > the
> > +incore counter and give back whatever they don't use at commit time.
> > +It is therefore only necessary to serialize on the superblock when
> > the
> > +superblock is being committed to disk.
> > +
> > +The lazy superblock counter feature introduced in XFS v5 took this
> > even further
> > +by training log recovery to recompute the summary counters from the
> > AG headers,
> > +which eliminated the need for most transactions even to touch the
> > superblock.
> > +The only time XFS commits the summary counters is at filesystem
> > unmount.
> > +To reduce contention even further, the incore counter is implemented
> > as a
> > +percpu counter, which means that each CPU is allocated a batch of
> > blocks from a
> > +global incore counter and can satisfy small allocations from the
> > local batch.
> > +
> > +The high-performance nature of the summary counters makes it
> > difficult for
> > +online fsck to check them, since there is no way to quiesce a percpu
> > counter
> > +while the system is running.
> > +Although online fsck can read the filesystem metadata to compute the
> > correct
> > +values of the summary counters, there's no way to hold the value of
> > a percpu
> > +counter stable, so it's quite possible that the counter will be out
> > of date by
> > +the time the walk is complete.
> > +Earlier versions of online scrub would return to userspace with an
> > incomplete
> > +scan flag, but this is not a satisfying outcome for a system
> > administrator.
> > +For repairs, the in-memory counters must be stabilize while walking
> > the
> nit: stablilized

Fixed, thank you.

> > +filesystem metadata to get an accurate reading and install it in the
> > percpu
> > +counter.
> > +
> > +To satisfy this requirement, online fsck must prevent other programs
> > in the
> > +system from initiating new writes to the filesystem, it must disable
> > background
> > +garbage collection threads, and it must wait for existing writer
> > programs to
> > +exit the kernel.
> > +Once that has been established, scrub can walk the AG free space
> > indexes, the
> > +inode btrees, and the realtime bitmap to compute the correct value
> > of all
> > +four summary counters.
> > +This is very similar to a filesystem freeze.
> > +
> > +The initial implementation used the actual VFS filesystem freeze
> > mechanism to
> > +quiesce filesystem activity.
> 
> 
> > +With the filesystem frozen, it is possible to resolve the counter
> > values with
> > +exact precision, but there are many problems with calling the VFS
> > methods
> > +directly:
> > +
> > +- Other programs can unfreeze the filesystem without our knowledge.
> > +  This leads to incorrect scan results and incorrect repairs.
> > +
> > +- Adding an extra lock to prevent others from thawing the filesystem
> > required
> > +  the addition of a ``->freeze_super`` function to wrap
> > ``freeze_fs()``.
> > +  This in turn caused other subtle problems because it turns out
> > that the VFS
> > +  ``freeze_super`` and ``thaw_super`` functions can drop the last
> > reference to
> > +  the VFS superblock, and any subsequent access becomes a UAF bug!
> > +  This can happen if the filesystem is unmounted while the
> > underlying block
> > +  device has frozen the filesystem.
> > +  This problem could be solved by grabbing extra references to the
> > superblock,
> > +  but it felt suboptimal given the other inadequacies of this
> > approach:
> > +
> > +- The log need not be quiesced to check the summary counters, but a
> > VFS freeze
> > +  initiates one anyway.
> > +  This adds unnecessary runtime to live fscounter fsck operations.
> > +
> > +- Quiescing the log means that XFS flushes the (possibly incorrect)
> > counters to
> > +  disk as part of cleaning the log.
> > +
> > +- A bug in the VFS meant that freeze could complete even when
> > sync_filesystem
> > +  fails to flush the filesystem and returns an error.
> > +  This bug was fixed in Linux 5.17.
> > +
> > +The author 
> 
> I think the above is a candidate for culling or maybe the side bar if
> you really feel the need to keep it.

I'll turn it into a sidebar, since it took me a while to figure all that
out, and maybe some day I'll need to refer to what amounts to historic
notes.

> Aside from the fact that it was
> simply bug prone, it makes sense to avoid an entire freeze if we can
> help it.  I think it would be fine to just pop into the current
> proposition:
> 
> "However, this approach was found to be both costly and unstable. 
> Later optimizations of ofsck established ..."

How about:

"...Once that has been established, scrub can walk the AG free space
indexes, the inode btrees, and the realtime bitmap to compute the
correct value of all four summary counters.  This is very similar to a
filesystem freeze, though not all of the pieces are necessary:

- "The final freeze state is set one higher than ``SB_FREEZE_COMPLETE``
  to prevent other threads from thawing the filesystem, or other scrub
  threads from initiating another fscounters freeze.

- IIt does not quiesce the log.

"With this code in place, it is now possible to pause the filesystem for
just long enough to check and correct the summary counters."

<giant historic sidebar here>

?

> > established that the only component of online fsck that requires the
> > +ability to freeze the filesystem is the fscounter scrubber, so the
> > code for
> > +this could be localized to that source file.
> > +fscounter freeze behaves the same as the VFS freeze method, except:
> > +
> > +- The final freeze state is set one higher than
> > ``SB_FREEZE_COMPLETE`` to
> > +  prevent other threads from thawing the filesystem.
> > +
> > +- It does not quiesce the log.
> > +
> > +With this code in place, it is now possible to pause the filesystem
> > for just
> > +long enough to check and correct the summary counters.
> > +
> > +The proposed patchset is the
> > +`summary counter cleanup
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > log/?h=repair-fscounters>`_
> > +series.
> > 
> 
> Otherwise looks good.  :-)

Thanks!

--D

> Allison
> 

  reply	other threads:[~2023-02-16 23:05 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-10-02 18:19 [PATCHSET v23.3 00/14] xfs: design documentation for online fsck Darrick J. Wong
2022-10-02 18:19 ` [PATCH 02/14] xfs: document the general theory underlying online fsck design Darrick J. Wong
2022-10-02 18:19 ` [PATCH 01/14] xfs: document the motivation for " Darrick J. Wong
2022-10-02 18:19 ` [PATCH 09/14] xfs: document online file metadata repair code Darrick J. Wong
2023-02-16 15:46   ` Allison Henderson
2023-02-16 23:05     ` Darrick J. Wong [this message]
2023-02-25  7:32       ` Allison Henderson
2022-10-02 18:19 ` [PATCH 06/14] xfs: document how online fsck deals with eventual consistency Darrick J. Wong
2022-10-02 18:19 ` [PATCH 03/14] xfs: document the testing plan for online fsck Darrick J. Wong
2022-10-02 18:19 ` [PATCH 07/14] xfs: document pageable kernel memory Darrick J. Wong
2022-10-02 18:19 ` [PATCH 05/14] xfs: document the filesystem metadata checking strategy Darrick J. Wong
2022-10-02 18:19 ` [PATCH 04/14] xfs: document the user interface for online fsck Darrick J. Wong
2022-10-02 18:19 ` [PATCH 08/14] xfs: document btree bulk loading Darrick J. Wong
2022-10-02 18:19 ` [PATCH 10/14] xfs: document full filesystem scans for online fsck Darrick J. Wong
2022-10-02 18:19 ` [PATCH 14/14] xfs: document future directions of " Darrick J. Wong
2022-10-02 18:19 ` [PATCH 12/14] xfs: document directory tree repairs Darrick J. Wong
2022-10-02 18:19 ` [PATCH 11/14] xfs: document metadata file repair Darrick J. Wong
2022-10-02 18:19 ` [PATCH 13/14] xfs: document the userspace fsck driver program Darrick J. Wong
  -- strict thread matches above, loose matches on Subject: below --
2023-03-07  1:30 [PATCHSET v24.3 00/14] xfs: design documentation for online fsck Darrick J. Wong
2023-03-07  1:31 ` [PATCH 09/14] xfs: document online file metadata repair code Darrick J. Wong
2022-12-30 22:10 [PATCHSET v24.0 00/14] xfs: design documentation for online fsck Darrick J. Wong
2022-12-30 22:10 ` [PATCH 09/14] xfs: document online file metadata repair code Darrick J. Wong
2022-08-07 18:30 [PATCHSET v2 00/14] xfs: design documentation for online fsck Darrick J. Wong
2022-08-07 18:30 ` [PATCH 09/14] xfs: document online file metadata repair code Darrick J. Wong

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Y+62v5kfhv5lr/R8@magnolia \
    --to=djwong@kernel.org \
    --cc=allison.henderson@oracle.com \
    --cc=catherine.hoang@oracle.com \
    --cc=chandan.babu@oracle.com \
    --cc=david@fromorbit.com \
    --cc=hch@infradead.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.