All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Darrick J. Wong" <djwong@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org, willy@infradead.org,
	chandan.babu@oracle.com, allison.henderson@oracle.com,
	linux-fsdevel@vger.kernel.org, hch@infradead.org,
	catherine.hoang@oracle.com
Subject: Re: [PATCH 05/14] xfs: document the filesystem metadata checking strategy
Date: Mon, 15 Aug 2022 19:37:03 -0700	[thread overview]
Message-ID: <YvsCz20x8MGyIVRa@magnolia> (raw)
In-Reply-To: <20220811011742.GP3600936@dread.disaster.area>

On Thu, Aug 11, 2022 at 11:17:42AM +1000, Dave Chinner wrote:
> On Sun, Aug 07, 2022 at 11:30:33AM -0700, Darrick J. Wong wrote:
> > +Reverse Mapping
> > +---------------
> > +
> > +The original design of XFS (circa 1993) is an improvement upon 1980s Unix
> > +filesystem design.
> > +In those days, storage density was expensive, CPU time was scarce, and
> > +excessive seek time could kill performance.
> > +For performance reasons, filesystem authors were reluctant to add redundancy to
> > +the filesystem, even at the cost of data integrity.
> > +Filesystems designers in the early 21st century choose different strategies to
> > +increase internal redundancy -- either storing nearly identical copies of
> > +metadata, or more space-efficient techniques such as erasure coding.
> > +Obvious corruptions are typically repaired by copying replicas or
> > +reconstructing from codes.
> > +
> > +For XFS, a different redundancy strategy was chosen to modernize the design:
> > +a secondary space usage index that maps allocated disk extents back to their
> > +owners.
> > +By adding a new index, the filesystem retains most of its ability to scale
> > +well to heavily threaded workloads involving large datasets, since the primary
> > +file metadata (the directory tree, the file block map, and the allocation
> > +groups) remain unchanged.
> > +Although the reverse-mapping feature increases overhead costs for space
> > +mapping activities just like any other system that improves redundancy, it
> > +has two critical advantages: first, the reverse index is key to enabling online
> > +fsck and other requested functionality such as filesystem reorganization,
> > +better media failure reporting, and shrinking.
> > +Second, the different ondisk storage format of the reverse mapping btree
> > +defeats device-level deduplication, because the filesystem requires real
> > +redundancy.
> 
> "defeats device-level deduplication" took me a bit of thought to
> follow what you were trying to describe here.
> 
> I think what you mean is this:
> 
> "Second, the reverse mapping metadata uses a different on-disk
> format for storage. This prevents storage device level deduplication
> from removing redundant metadata the filesystem stores for recovery
> purposes. Hence we will always have two physical copies of the
> metadata we need for recovery in the storage device, regardless of
> the technologies it implements."

Correct.  I'll work that in.

> > +A criticism of adding the secondary index is that it does nothing to improve
> > +the robustness of user data storage itself.
> > +This is a valid point, but adding a new index for file data block checksums
> > +increases write amplification and turns data overwrites into copy-writes, which
> > +age the filesystem prematurely.
> 
> And can come with a significant IO performance penalty.

Right, I shouldn't be assuming that people can draw the connection
between write amplification/aging fs metadata and slow performance.

> > +Checking Extended Attributes
> > +````````````````````````````
> > +
> > +Extended attributes implement a key-value store that enable fragments of data
> > +to be attached to any file.
> > +Both the kernel and userspace can access the keys and values, subject to
> > +namespace and privilege restrictions.
> > +Most typically these fragments are metadata about the file -- origins, security
> > +contexts, user-supplied labels, indexing information, etc.
> > +
> > +Names can be as long as 255 bytes and can exist in several different
> > +namespaces.
> > +Values can be as large as 64KB.
> > +A file's extended attributes are stored in blocks mapped by the attr fork.
> > +The mappings point to leaf blocks, remote value blocks, or dabtree blocks.
> > +Block 0 in the attribute fork is always the top of the structure, but otherwise
> > +each of the three types of blocks can be found at any offset in the attr fork.
> > +Leaf blocks contain attribute key records that point to the name and the value.
> > +Names are always stored elsewhere in the same leaf block.
> > +Values that are less than 3/4 the size of a filesystem block are also stored
> > +elsewhere in the same leaf block.
> > +Remote value blocks contain values that are too large to fit inside a leaf.
> > +If the leaf information exceeds a single filesystem block, a dabtree (also
> > +rooted at block 0) is created to map hashes of the attribute names to leaf
> > +blocks in the attr fork.
> > +
> > +Checking an extended attribute structure is not so straightfoward due to the
> > +lack of separation between attr blocks and index blocks.
> > +Scrub must read each block mapped by the attr fork and ignore the non-leaf
> > +blocks:
> > +
> > +1. Walk the dabtree in the attr fork (if present) to ensure that there are no
> > +   irregularities in the blocks or dabtree mappings that do not point to
> > +   attr leaf blocks.
> > +
> > +2. Walk the blocks of the attr fork looking for leaf blocks.
> > +   For each entry inside a leaf:
> > +
> > +   a. Validate that the name does not contain invalid characters.
> > +
> > +   b. Read the attr value.
> > +      This performs a named lookup of the attr name to ensure the correctness
> > +      of the dabtree.
> > +      If the value is stored in a remote block, this also validates the
> > +      integrity of the remote value block.
> 
> I'm assuming that checking remote xattr blocks involves checking the headers
> are all valid, the overall length is correct, they have valid rmap
> entries, etc?

Yep.  The headers are checked by reading the attr value; and the space
mappings are checked by the attr fork scanner.

> > +Checking and Cross-Referencing Directories
> > +``````````````````````````````````````````
> > +
> > +The filesystem directory tree is a directed acylic graph structure, with files
> > +constituting the nodes, and directory entries (dirents) constituting the edges.
> 
> "with hashed file names consituting the nodes" ?



> > +Directories are a special type of file containing a set of mappings from a
> > +255-byte sequence (name) to an inumber.
> > +These are called directory entries, or dirents for short.
> > +Each directory file must have exactly one directory pointing to the file.
> > +A root directory points to itself.
> > +Directory entries point to files of any type.
> 
> "point to inodes"
> 
> > +Each non-directory file may have multiple directories point to it.
> 
> "Each non directory inode may have multiple dirents point to it"

Both fixed.

> > +
> > +In XFS, directories are implemented as a file containing up to three 32GB
> > +partitions.
> 
> "implemented as offset-based file data"

Fixed.

> > +The first partition contains directory entry data blocks.
> > +Each data block contains variable-sized records associating a user-provided
> > +name with an inumber and, optionally, a file type.
> > +If the directory entry data grows beyond one block, the second partition (which
> > +exists as post-EOF extents) is populated with a block containing free space
> > +information and an index that maps hashes of the dirent names to directory data
> > +blocks in the first partition.
> > +This makes directory name lookups very fast.
> 
> "This is known as a "LEAF 1" format directory."

Good addition!

> > +If this second partition grows beyond one block, the third partition is
> > +populated with a linear array of free space information for faster
> > +expansions.
> 
> s/for faster expansions/to speed up first fit searches when inserting
> new dirents/.
> 
> "This is known as a "NODE" format directory."

Ok.

> > +If the free space has been separated and the second partition grows again
> > +beyond one block, then a dabtree is used to map hashes of dirent names to
> > +directory data blocks.
> 
> The second partition is changed to a dabtree format during NODE
> format conversion - it's just the single root block form. It grows
> as a btree from there by normal split/grow btree operations at that
> point.  Hence I'm not sure this needs to be mentioned as a separate
> step - the node form conversion should mention the second partition
> transitions to the dabtree index format, and it's all obvious from
> there...

I put that nugget in because *I* keep forgetting that there are
directories with a single block in the second partition that doesn't
have the DA3_NODE magicnum. ;)

How about:

"This is known as a 'NODE' format directory.
The second partition contains a dabtree that ranges in size from a
single block to many levels."

> > +Checking a directory is pretty straightfoward:
> > +
> > +1. Walk the dabtree in the second partition (if present) to ensure that there
> > +   are no irregularities in the blocks or dabtree mappings that do not point to
> > +   dirent blocks.
> > +
> > +2. Walk the blocks of the first partition looking for directory entries.
> > +   Each dirent is checked as follows:
> > +
> > +   a. Does the name contain no invalid characters?
> > +
> > +   b. Does the inumber correspond to an actual, allocated inode?
> > +
> > +   c. Does the child inode have a nonzero link count?
> > +
> > +   d. If a file type is included in the dirent, does it match the type of the
> > +      inode?
> > +
> > +   e. If the child is a subdirectory, does the child's dotdot pointer point
> > +      back to the parent?
> > +
> > +   f. If the directory has a second partition, perform a named lookup of the
> > +      dirent name to ensure the correctness of the dabtree.
> > +
> > +3. Walk the free space list in the third partition (if present) to ensure that
> > +   the free spaces it describes are really unused.
> 
> Actaully, they should reflect the largest free space in the given
> block, so this has to be checked against the bests array in the
> given block.
> 
> Also, if we are walking the directory blocks, we should be checking
> the empty ranges for valid tags and sizes, and that the largest 3
> holes in the block are correctly ordered in the bests array, too.

<nod> I think the code actually does that, but I got lazy with writing
here. :/

> > +
> > +Checking operations involving :ref:`parents <dirparent>` and
> > +:ref:`file link counts <nlinks>` are discussed in more detail in later
> > +sections.
> > +
> > +Checking Directory/Attribute Btrees
> > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > +
> > +As stated in previous sections, the directory/attribute btree (dabtree) index
> > +maps user-provided names to improve lookup times by avoiding linear scans.
> > +Internally, it maps a 32-bit hash of the name to a block offset within the
> > +appropriate file fork.
> > +
> > +The internal structure of a dabtree closely resembles the btrees that record
> > +fixed-size metadata records -- each dabtree block contains a magic number, a
> > +checksum, sibling pointers, a UUID, a tree level, and a log sequence number.
> > +The format of leaf and node records are the same -- each entry points to the
> > +next level down in the hierarchy, with dabtree node records pointing to dabtree
> > +leaf blocks, and dabtree leaf records pointing to non-dabtree blocks elsewhere
> > +in the fork.
> > +
> > +Checking and cross-referencing the dabtree is very similar to what is done for
> > +space btrees:
> > +
> > +- Does the type of data stored in the block match what scrub is expecting?
> > +
> > +- Does the block belong to the owning structure that asked for the read?
> > +
> > +- Do the records fit within the block?
> > +
> > +- Are the records contained inside the block free of obvious corruptions?
> > +
> > +- Are the name hashes in the correct order?
> > +
> > +- Do node pointers within the dabtree point to valid fork offsets for dabtree
> > +  blocks?
> > +
> > +- Do leaf pointers within the dabtree point to valid fork offsets for directory
> > +  or attr leaf blocks?
> > +
> > +- Do child pointers point towards the leaves?
> > +
> > +- Do sibling pointers point across the same level?
> > +
> > +- For each dabtree node record, does the record key accurate reflect the
> > +  contents of the child dabtree block?
> > +
> > +- For each dabtree leaf record, does the record key accurate reflect the
> > +  contents of the directory or attr block?
> 
> This last one is checking that they point to a valid dirent and
> check that the hash of the dirent name actually matches the hash
> that points to it?

Yes.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

  reply	other threads:[~2022-08-16  7:07 UTC|newest]

Thread overview: 27+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-08-07 18:30 [PATCHSET v2 00/14] xfs: design documentation for online fsck Darrick J. Wong
2022-08-07 18:30 ` [PATCH 01/14] xfs: document the motivation for online fsck design Darrick J. Wong
2022-08-07 18:30 ` [PATCH 02/14] xfs: document the general theory underlying " Darrick J. Wong
2022-08-07 18:30 ` [PATCH 03/14] xfs: document the testing plan for online fsck Darrick J. Wong
2022-08-11  0:09   ` Dave Chinner
2022-08-16  2:18     ` Darrick J. Wong
2022-08-07 18:30 ` [PATCH 04/14] xfs: document the user interface " Darrick J. Wong
2022-08-11  0:20   ` Dave Chinner
2022-08-16  2:30     ` Darrick J. Wong
2022-08-07 18:30 ` [PATCH 05/14] xfs: document the filesystem metadata checking strategy Darrick J. Wong
2022-08-11  1:17   ` Dave Chinner
2022-08-16  2:37     ` Darrick J. Wong [this message]
2022-08-07 18:30 ` [PATCH 06/14] xfs: document how online fsck deals with eventual consistency Darrick J. Wong
2022-08-07 18:30 ` [PATCH 07/14] xfs: document pageable kernel memory Darrick J. Wong
2022-08-07 18:30 ` [PATCH 08/14] xfs: document btree bulk loading Darrick J. Wong
2022-08-07 18:30 ` [PATCH 09/14] xfs: document online file metadata repair code Darrick J. Wong
2022-08-07 18:31 ` [PATCH 10/14] xfs: document full filesystem scans for online fsck Darrick J. Wong
2022-08-07 18:31 ` [PATCH 11/14] xfs: document metadata file repair Darrick J. Wong
2022-08-07 18:31 ` [PATCH 12/14] xfs: document directory tree repairs Darrick J. Wong
2022-08-07 18:31 ` [PATCH 13/14] xfs: document the userspace fsck driver program Darrick J. Wong
2022-08-07 18:31 ` [PATCH 14/14] xfs: document future directions of online fsck Darrick J. Wong
2022-10-02 18:19 [PATCHSET v23.3 00/14] xfs: design documentation for " Darrick J. Wong
2022-10-02 18:19 ` [PATCH 05/14] xfs: document the filesystem metadata checking strategy Darrick J. Wong
2022-12-30 22:10 [PATCHSET v24.0 00/14] xfs: design documentation for online fsck Darrick J. Wong
2022-12-30 22:10 ` [PATCH 05/14] xfs: document the filesystem metadata checking strategy Darrick J. Wong
2023-01-21  1:38   ` Allison Henderson
2023-02-02 19:04     ` Darrick J. Wong
2023-02-09  5:41       ` Allison Henderson
2023-03-07  1:30 [PATCHSET v24.3 00/14] xfs: design documentation for online fsck Darrick J. Wong
2023-03-07  1:31 ` [PATCH 05/14] xfs: document the filesystem metadata checking strategy Darrick J. Wong

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YvsCz20x8MGyIVRa@magnolia \
    --to=djwong@kernel.org \
    --cc=allison.henderson@oracle.com \
    --cc=catherine.hoang@oracle.com \
    --cc=chandan.babu@oracle.com \
    --cc=david@fromorbit.com \
    --cc=hch@infradead.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.