From: "Darrick J. Wong" <djwong@kernel.org>
To: djwong@kernel.org
Cc: linux-xfs@vger.kernel.org, willy@infradead.org,
	chandan.babu@oracle.com, allison.henderson@oracle.com,
	linux-fsdevel@vger.kernel.org, hch@infradead.org,
	catherine.hoang@oracle.com
Subject: [PATCH 7/8] xfs: document specific technical aspects of userspace driver program
Date: Mon, 06 Jun 2022 18:49:23 -0700	[thread overview]
Message-ID: <165456656324.167418.13728838498028089183.stgit@magnolia> (raw)
In-Reply-To: <165456652256.167418.912764930038710353.stgit@magnolia>

From: Darrick J. Wong <djwong@kernel.org>

Add the sixth chapter of the online fsck design documentation, where
we discuss the details of the data structures and algorithms used by the
driver program xfs_scrub.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 .../filesystems/xfs-online-fsck-design.rst         |  289 ++++++++++++++++++++
 1 file changed, 289 insertions(+)


diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst
index d9662c9653c9..47bedce146b8 100644
--- a/Documentation/filesystems/xfs-online-fsck-design.rst
+++ b/Documentation/filesystems/xfs-online-fsck-design.rst
@@ -268,6 +268,9 @@ The seven phases are as follows:
 7. The final phase re-checks the summary counters and presents the caller with
    a summary of space usage and file counts.
 
+This allocation of responsibilities will be :ref:`revisited <scrubcheck>`
+later in this document.
+
 Steps for Each Scrub Item
 -------------------------
 
@@ -3383,3 +3386,289 @@ The proposed patches are in the
 `directory repair
 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-dirs>`_
 series.
+
+Userspace Algorithms and Data Structures
+========================================
+
+In this chapter, we discuss the key algorithms and data structures employed by
+the driver program, ``xfs_scrub``, to check and repair metadata, verify file
+data, and look for other potential problems.
+
+.. _scrubcheck:
+
+Checking Metadata
+-----------------
+
+Recall the overview of the :ref:`work phases <scrubphases>` outlined earlier.
+This structure follows naturally from the data dependencies designed into the
+filesystem from its beginnings in 1993.
+In XFS, there are several groups of metadata dependencies:
+
+a. Filesystem summary counts depend on consistency within the inode indices,
+   the allocation group space btrees, and the realtime volume space information.
+
+b. Quota resource counts depend on consistency within the quota file data forks
+   and the file forks of every file on the system.
+
+c. The naming hierarchy depends on consistency within the directory and
+   extended attribute structures.
+   This includes file link counts.
+
+d. Directories, extended attributes, and file data depend on consistency within
+   the file forks that map directory and extended attribute data to physical
+   storage media.
+
+e. The file forks depend on consistency within inode records and the space
+   metadata indices of the allocation groups and the realtime volume.
+   This includes quota and realtime metadata files.
+
+f. The inode records depend on consistency within the inode metadata indices.
+
+g. The realtime space metadata depend on the data forks of the realtime
+   metadata inodes.
+
+h. The allocation group metadata indices (free space, inodes, reference count,
+   and reverse mapping btrees) depend on consistency within the AG headers and
+   between all the AG metadata btrees.
+
+i. ``xfs_scrub`` depends on the filesystem being mounted and kernel support
+   for online fsck functionality.
+
+Therefore, a metadata dependency graph is a convenient way to schedule checking
+operations in the ``xfs_scrub`` program, and that is exactly what we do:
+
+- Phase 1 checks that the provided path maps to an XFS filesystem and detects
+  the kernel's scrubbing abilities, which validates group (i).
+
+- Phase 2 scrubs groups (g) and (h) in parallel using a threaded workqueue.
+
+- Phase 3 checks groups (f), (e), and (d), in that order.
+  These groups are all file metadata, which means that we can scan the inodes
+  in parallel.
+
+- Phase 4 schedules repairs for the problems found by the previous phases;
+  it is discussed in the :ref:`repair scheduling <scrubrepair>` section below.
+
+- Phase 5 starts by checking groups (b) and (c) in parallel before moving on
+  to checking names.
+
+- Phase 6 depends on phase 2 having validated all the metadata that it uses
+  to find file data blocks to verify.
+
+- Phase 7 checks group (a), having validated everything else.
+
+Notice that the data dependencies between groups are enforced by the structure
+of the program flow.
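+
+As a concrete point of reference, the following is a minimal sketch of how a
+userspace program asks the kernel to check one AG's inode btree.
+The ioctl, structure, and flags come from the XFS uapi headers as installed by
+xfsprogs; the helper function is hypothetical and its error handling is
+simplified:
+
+.. code-block:: c
+
+   #include <xfs/xfs.h>
+   #include <sys/ioctl.h>
+   #include <errno.h>
+   #include <stdint.h>
+   #include <stdio.h>
+   #include <string.h>
+
+   /* Ask the kernel to check one AG's inode btree (group (h)). */
+   static int scrub_inobt(int fsfd, uint32_t agno)
+   {
+       struct xfs_scrub_metadata sm;
+
+       memset(&sm, 0, sizeof(sm));
+       sm.sm_type = XFS_SCRUB_TYPE_INOBT;
+       sm.sm_agno = agno;
+
+       if (ioctl(fsfd, XFS_IOC_SCRUB_METADATA, &sm) < 0)
+           return -errno;
+
+       /* The kernel reports its findings in sm_flags. */
+       if (sm.sm_flags & XFS_SCRUB_OFLAG_CORRUPT)
+           printf("AG %u inobt needs repair\n", agno);
+       else if (sm.sm_flags & XFS_SCRUB_OFLAG_PREEN)
+           printf("AG %u inobt could be optimized\n", agno);
+       return 0;
+   }
+
+Phase 2 issues calls like this one for every AG header and AG btree type;
+``fsfd`` can be any open file descriptor on the target filesystem.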
+
+Parallel Inode Scans
+--------------------
+
+An XFS filesystem can easily contain many millions of inodes.
+Given that XFS targets installations with large high-performance storage,
+we'd like to be able to scrub inodes in parallel to minimize runtime.
+This requires careful scheduling to keep the threads as evenly loaded as
+possible.
+
+Early iterations of the ``xfs_scrub`` inode scanner naïvely created a single
+workqueue and scheduled a single work item per AG.
+Each work item walked the inode btree (with ``XFS_IOC_INUMBERS``) to find inode
+chunks, called bulkstat (``XFS_IOC_BULKSTAT``) to gather enough information to
+construct file handles, and invoked a callback function with the bulkstat
+information and the handle.
+This leads to thread balancing problems in phase 3 if the filesystem contains
+one AG with a few large sparse files while the rest of the AGs contain many
+smaller files, because the per-AG part of the inode scan is serialized.
+
+Thanks to Dave Chinner, bounded workqueues in userspace enable us to fix this
+problem with ease by adding a second workqueue.
+Just like before, the first workqueue is seeded with one work item per AG, and
+it uses INUMBERS to find inode btree chunks.
+The second workqueue, however, is created with an upper bound on the number of
+items that can be pending.
+Each inode btree chunk found by the first workqueue's workers is queued as a
+work item for the second workqueue, and it is this second workqueue that
+queries BULKSTAT and actually invokes the callback.
+This doesn't completely solve the balancing problem, but reduces it enough to
+move on to more pressing issues.
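+
+A sketch of the first workqueue's worker function follows.
+The ``XFS_IOC_INUMBERS`` structures come from the XFS v5 bulkstat uapi;
+``bulkstat_wq`` and ``bulkstat_chunk`` stand in for the bounded second
+workqueue and its worker, the workqueue calls are modeled on xfsprogs'
+libfrog API, and error handling is elided:
+
+.. code-block:: c
+
+   #include <xfs/xfs.h>
+   #include <sys/ioctl.h>
+   #include <stdint.h>
+   #include <stdlib.h>
+   #include <string.h>
+   #include "libfrog/workqueue.h"
+
+   #define CHUNKS_PER_CALL 16
+
+   /* Stand-ins: the bounded second-stage queue and its worker. */
+   extern struct workqueue bulkstat_wq;
+   extern void bulkstat_chunk(struct workqueue *wq, uint32_t agno,
+                              void *arg);
+
+   /* First-stage worker: find all inode btree chunks in one AG. */
+   void scan_ag(struct workqueue *wq, uint32_t agno, void *arg)
+   {
+       int fsfd = *(int *)arg;
+       struct {
+           struct xfs_bulk_ireq  hdr;
+           struct xfs_inumbers   chunks[CHUNKS_PER_CALL];
+       } req;
+
+       memset(&req, 0, sizeof(req));
+       req.hdr.agno = agno;
+       req.hdr.flags = XFS_BULK_IREQ_AGNO;  /* stay within this AG */
+       req.hdr.icount = CHUNKS_PER_CALL;
+
+       while (ioctl(fsfd, XFS_IOC_INUMBERS, &req) == 0 &&
+              req.hdr.ocount > 0) {
+           for (uint32_t i = 0; i < req.hdr.ocount; i++) {
+               struct xfs_inumbers *chunk = malloc(sizeof(*chunk));
+
+               memcpy(chunk, &req.chunks[i], sizeof(*chunk));
+
+               /*
+                * Blocks when the bounded queue is full, which is
+                * what keeps one enormous AG from flooding the
+                * bulkstat stage.
+                */
+               workqueue_add(&bulkstat_wq, bulkstat_chunk, agno, chunk);
+           }
+           /* The kernel advances hdr.ino; just call again. */
+       }
+   }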
+
+The proposed patchsets are the scrub
+`performance tweaks
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-performance-tweaks>`_
+and the
+`inode scan rebalance
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-iscan-rebalance>`_
+series.
+
+.. _scrubrepair:
+
+Scheduling Repairs
+------------------
+
+During phase 2, corruptions and inconsistencies reported in any AGI header or
+inode btree are repaired immediately, because phase 3 relies on proper
+functioning of the inode indices to find inodes to scan.
+Failed repairs are rescheduled to phase 4.
+Problems and optimization opportunities reported in any other space metadata
+are deferred to phase 4.
+
+During phase 3, corruptions and inconsistencies reported in any part of a
+file's metadata can be repaired immediately if all space metadata were
+validated.
+Otherwise, unfixed repairs and optimization opportunities are scheduled for
+phase 4.
+
+In the original design of ``xfs_scrub``, we thought that repairs would be so
+infrequent that we could get away with using the ``struct xfs_scrub_metadata``
+that we use to communicate with the kernel as the primary object to control
+repairs.
+Unfortunately, with recent increases in the number of optimizations possible
+for a given primary filesystem object (allocation group, realtime volume, file,
+the entire filesystem), it became much more memory-efficient to track all
+eligible repairs for a given filesystem object with a single repair item.
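+
+A hypothetical sketch of such a repair item appears below; the names are
+illustrative rather than the actual ``xfs_scrub`` structures, but they convey
+the idea of one list entry per object carrying a bitmap of pending repairs:
+
+.. code-block:: c
+
+   /* Hypothetical phase 4 repair item; one per filesystem object. */
+   struct repair_item {
+       struct list_head  list;        /* phase 4 repair list */
+       uint64_t          ino;         /* inode number, if applicable */
+       uint32_t          gen;         /* inode generation */
+       uint32_t          agno;        /* AG number, if applicable */
+       uint64_t          scrub_mask;  /* one bit per XFS_SCRUB_TYPE_*
+                                       * still awaiting repair */
+   };
+
+One such item can describe many pending repairs that would otherwise each
+require a separate ``struct xfs_scrub_metadata``.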
+
+Phase 4 is responsible for scheduling a lot of repair work in as quick a
+manner as is practical.
+The :ref:`data dependencies <scrubcheck>` outlined earlier still apply, which
+means that we must try to complete the repair work scheduled by phase 2 before
+trying the work scheduled by phase 3.
+
+1. Start a round of repair with a workqueue and enough workers to keep the CPUs
+   as busy as the user desires.
+
+   a. For each repair item queued by phase 2,
+
+      i.   Ask the kernel to repair everything listed in the repair item for a
+           given filesystem object.
+
+      ii.  Make a note if the kernel made any progress in reducing the number of
+           repairs needed for this object.
+
+      iii. If the object no longer requires repairs, revalidate all metadata
+           associated with this object.
+           If the revalidation succeeds, drop the repair item.
+           If not, requeue the item for more repairs.
+
+   b. While we're making progress on repairs, jump back to 1a to retry all the
+      phase 2 items.
+
+   c. For each repair item queued by phase 3,
+
+      i.   Ask the kernel to repair everything listed in the repair item for a
+           given filesystem object.
+
+      ii.  Make a note if the kernel made any progress in reducing the number of
+           repairs needed for this object.
+
+      iii. If the object no longer requires repairs, revalidate all metadata
+           associated with this object.
+           If the revalidation succeeds, drop the repair item.
+           If not, requeue the item for more repairs.
+
+   d. While we're making progress on repairs, jump back to 1c to retry all the
+      phase 3 items.
+
+2. If step 1 made any repair progress of any kind, jump back to step 1 to start
+   another round of repair.
+
+3. If there are items left to repair, run them all serially one more time.
+   This time we complain if the repairs were not successful, since this is
+   the last chance to do any work in phase 4.
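+
+In condensed form, the retry loop looks something like the sketch below.
+All names are illustrative; ``repair_list_pass`` stands for steps 1a(i)
+through 1a(iii) applied to each item on one list, returning the number of
+repairs that made progress:
+
+.. code-block:: c
+
+   struct repair_list;  /* list of pending repair items */
+   struct scrub_ctx {
+       struct repair_list  *phase2_repairs;
+       struct repair_list  *phase3_repairs;
+   };
+
+   /* One pass over a list (steps 1a(i)-(iii)); returns # of fixes. */
+   extern unsigned long long repair_list_pass(struct scrub_ctx *ctx,
+                                              struct repair_list *list);
+   /* Final serial pass (step 3); complains about remaining problems. */
+   extern void repair_list_final(struct scrub_ctx *ctx,
+                                 struct repair_list *list);
+
+   static void phase4_repair(struct scrub_ctx *ctx)
+   {
+       unsigned long long fixed, n;
+
+       do {
+           fixed = 0;
+
+           /* Steps 1a-1b: retry the phase 2 items while they progress. */
+           while ((n = repair_list_pass(ctx, ctx->phase2_repairs)))
+               fixed += n;
+
+           /* Steps 1c-1d: then the phase 3 items. */
+           while ((n = repair_list_pass(ctx, ctx->phase3_repairs)))
+               fixed += n;
+
+           /* Step 2: if anything got fixed, start another round. */
+       } while (fixed > 0);
+
+       /* Step 3: one last serial pass; complain about the rest. */
+       repair_list_final(ctx, ctx->phase2_repairs);
+       repair_list_final(ctx, ctx->phase3_repairs);
+   }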
+
+Corruptions and inconsistencies encountered during phases 5 and 7 are repaired
+immediately.
+
+The proposed patchsets are the
+`repair warning improvements
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-better-repair-warnings>`_,
+refactoring of the
+`repair data dependency
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-repair-data-deps>`_
+and
+`object tracking
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-object-tracking>`_,
+and the
+`repair scheduling
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-repair-scheduling>`_
+improvement series.
+
+Checking Names for Confusable Unicode Sequences
+-----------------------------------------------
+
+If ``xfs_scrub`` succeeds in validating the filesystem metadata by the end of
+phase 4, it moves on to checking for suspicious looking names in the
+filesystem.
+These names consist of the filesystem label, names in directory entries, and
+the names of extended attributes.
+Like most Unix filesystems, XFS imposes the sparest of constraints on the
+contents of a name -- slashes and null bytes are not allowed in directory
+entries, and null bytes are not allowed in extended attributes or the
+filesystem label.
+However, the reality of most modern-day Linux systems is that programs work
+with Unicode character code points to support international languages.
+These programs encode those code points in UTF-8 when talking to the kernel,
+which means that in the common case, names found in an XFS filesystem are
+actually UTF-8 encoded Unicode data.
+
+To maximize its expressiveness, the Unicode standard defines separate code
+points for various characters that are rendered similarly or identically in
+writing systems around the world.
+For example, the character "Cyrillic Small Letter A" U+0430 "а" often renders
+identically to "Latin Small Letter A" U+0061 "a".
+
+The standard also permits characters to be constructed in multiple ways --
+either by using a defined code point, or by combining one code point with
+various combining marks.
+For example, the character "Angstrom Sign U+212B "Å" can also be expressed
+as "Latin Capital Letter A" U+0041 "A" followed by "Combining Ring Above"
+U+030A "◌̊".
+
+Like the standards that preceded it, Unicode also defines various control
+characters to alter the presentation of text.
+For example, the character "Right-to-Left Override" U+202E can trick some
+programs into rendering "moo\\xe2\\x80\\xaegnp.txt" as "mootxt.png".
+A second category of rendering problems involves whitespace characters.
+If the character "Zero Width Space" U+200B is encountered in a file name,
+the name will be rendered with no whitespace between the previous and the next
+character.
+If another name in the same name domain contains the same sequence of
+characters but omits the zero width space, the two renderings will be
+identical, and the user may be confused.
+The kernel, in its indifference to byte encoding schemes, permits this.
+
+As you can see, the flexibility of Unicode comes at a cost -- malicious actors
+can create names that can be used to mislead users.
+UTF-8 codecs faithfully translate whatever Unicode sequences are provided,
+which means that these sequences are persisted to disk.
+
+Techniques for detecting confusable names are explained in great detail in
+sections 4 and 5 of the
+`Unicode Security Mechanisms <https://unicode.org/reports/tr39/>`_
+document.
+``xfs_scrub``, when it detects UTF-8 encoding in use on a system, uses the
+Unicode normalization form NFD in conjunction with the confusable name
+detection component of
+`libicu <https://github.com/unicode-org/icu>`_
+to identify names within a directory or within a file's extended attributes that
+could be confused for each other.
+Names are also checked for control characters, non-rendering characters, and
+mixing of bidirectional characters.
+All of these potential issues are reported to the system administrator during
+phase 5.
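+
+As a minimal illustration of the detection primitive, the sketch below asks
+libicu's spoof checker whether two UTF-8 names are confusable.
+The real checks are more involved -- they normalize names to NFD and compare
+them across an entire name domain -- so treat this as a sketch of the
+building block rather than the actual implementation:
+
+.. code-block:: c
+
+   #include <unicode/uspoof.h>
+
+   /*
+    * Returns 1 if the two UTF-8 names render confusably, 0 if they
+    * do not, and -1 on error.
+    */
+   static int names_confusable(const char *name1, const char *name2)
+   {
+       UErrorCode err = U_ZERO_ERROR;
+       USpoofChecker *sc;
+       int32_t result;
+
+       sc = uspoof_open(&err);
+       if (U_FAILURE(err))
+           return -1;
+
+       /* A nonzero bitmask means some confusability was detected. */
+       result = uspoof_areConfusableUTF8(sc, name1, -1, name2, -1, &err);
+       uspoof_close(sc);
+       if (U_FAILURE(err))
+           return -1;
+       return result != 0;
+   }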
+
+Media Verification of File Data Extents
+---------------------------------------
+
+The system administrator can elect to initiate a media scan of all file data
+blocks.
+This scan runs after validation of all filesystem metadata (except for the
+summary counters) and constitutes phase 6.
+The scan starts by calling ``FS_IOC_GETFSMAP`` to scan the filesystem space map
+to find areas that are allocated to file data fork extents.
+Gaps between data fork extents that are smaller than 64k are treated as if
+they were data fork extents to reduce the command setup overhead.
+When the space map scan accumulates a region larger than 32MB, a media
+verification request is sent to the disk, either as a SCSI ``VERIFY`` command
+or as a directio read of the raw block device.
+
+If a verification read fails, the scan retries with single-block reads to
+narrow the failure down to the specific region of the media and records that
+region.
+When all verification requests have completed, the space map ioctl is
+consulted again to map the recorded media errors back to the metadata
+structures or files that own the space, and the data in those regions are
+reported as lost.
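+
+The following sketch shows the shape of the space map walk.
+The ioctl and structures come from ``linux/fsmap.h``; region accumulation and
+the verification I/O itself are elided, and error handling is simplified:
+
+.. code-block:: c
+
+   #include <linux/fsmap.h>
+   #include <sys/ioctl.h>
+   #include <stdint.h>
+   #include <stdlib.h>
+
+   #define NR_RECS 128
+
+   static int scan_data_extents(int fsfd)
+   {
+       struct fsmap_head *head;
+
+       head = calloc(1, sizeof(*head) + NR_RECS * sizeof(struct fsmap));
+       if (!head)
+           return -1;
+       head->fmh_count = NR_RECS;
+       /* The low key is all zeroes; set the high key to "everything". */
+       head->fmh_keys[1].fmr_device = UINT32_MAX;
+       head->fmh_keys[1].fmr_flags = UINT32_MAX;
+       head->fmh_keys[1].fmr_physical = UINT64_MAX;
+       head->fmh_keys[1].fmr_owner = UINT64_MAX;
+       head->fmh_keys[1].fmr_offset = UINT64_MAX;
+
+       while (ioctl(fsfd, FS_IOC_GETFSMAP, head) == 0 &&
+              head->fmh_entries > 0) {
+           for (uint32_t i = 0; i < head->fmh_entries; i++) {
+               struct fsmap *rec = &head->fmh_recs[i];
+
+               /* Only file data fork extents need verification. */
+               if (rec->fmr_flags & (FMR_OF_SPECIAL_OWNER |
+                                     FMR_OF_ATTR_FORK |
+                                     FMR_OF_EXTENT_MAP))
+                   continue;
+
+               /*
+                * Accumulate [fmr_physical, fmr_physical + fmr_length),
+                * merging gaps smaller than 64k, and issue a
+                * verification read once the region exceeds 32MB
+                * (elided).
+                */
+           }
+           /* Restart the next query after the last record seen. */
+           head->fmh_keys[0] = head->fmh_recs[head->fmh_entries - 1];
+       }
+       free(head);
+       return 0;
+   }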

