* [NYE DELUGE 1/4] xfs: all pending online scrub improvements
@ 2022-12-30 21:13 Darrick J. Wong
  2022-12-30 22:10 ` [PATCHSET v24.0 00/14] xfs: design documentation for online fsck Darrick J. Wong
                   ` (22 more replies)
  0 siblings, 23 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 21:13 UTC (permalink / raw)
  To: Dave Chinner, Allison Henderson, Chandan Babu R, Catherine Hoang, djwong
  Cc: xfs, greg.marsden, shirley.ma, konrad.wilk, fstests, Zorro Lang,
	Carlos Maiolino

Hi everyone,

As I've mentioned several times throughout 2022, I would like to merge
the online fsck feature in time for the 2023 LTS kernel.  The first big
step in this process is to merge all the pending bug fixes, validation
improvements, and general reorganization of the existing metadata
scrubbing functionality.

This first deluge starts with the design document for the entirety of
the online fsck feature.  The design doc should be familiar to most of
you, as it's been on the list for review for months already.  It
outlines in brief the problems we're trying to solve, the use cases and
testing plan, and the fundamental data structures and algorithms
underlying the entire feature.

After that come all the code changes to wrap up the metadata checking
part of the feature.  The biggest piece here is the scrub drains that
allow scrub to quiesce deferred ops targeting AGs so that it can
cross-reference recordsets.  Most of the rest is tweaking the btree code
so that we can do keyspace scans to look for conflicting records.

For this review, I would like people to focus on the following:

- Are the major subsystems sufficiently documented that you could figure
  out what the code does?

- Do you see any problems that are severe enough to cause long term
  support hassles? (e.g. bad API design, writing weird metadata to disk)

- Can you spot mis-interactions between the subsystems?

- What were my blind spots in devising this feature?

- Are there missing pieces that you'd like to help build?

- Can I just merge all of this?

The one thing that is /not/ in scope for this review is requests for
more refactoring of existing subsystems.  While there are usually valid
arguments for performing such cleanups, those are separate tasks to be
prioritized separately.  I will get to them after merging online fsck.

I've been running daily online scrubs of every computer I own for the
last five years, which has helped me iron out real problems in (limited
scope) production.  All issues observed in that time have been corrected
in this submission.

As a warning, the patches will likely take several days to trickle in.
All four patch deluges are based off kernel 6.2-rc1, xfsprogs 6.1, and
fstests 2022-12-25.

Thank you all for your participation in the XFS community.  Have a safe
New Year's, and I'll see you all next year!

--D


* [PATCHSET v24.0 00/14] xfs: design documentation for online fsck
  2022-12-30 21:13 [NYE DELUGE 1/4] xfs: all pending online scrub improvements Darrick J. Wong
@ 2022-12-30 22:10 ` Darrick J. Wong
  2022-12-30 22:10   ` [PATCH 02/14] xfs: document the general theory underlying online fsck design Darrick J. Wong
                     ` (15 more replies)
  2022-12-30 22:10 ` [PATCHSET v24.0 0/8] xfs: variable and naming cleanups for intent items Darrick J. Wong
                   ` (21 subsequent siblings)
  22 siblings, 16 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:10 UTC (permalink / raw)
  To: djwong
  Cc: linux-xfs, willy, chandan.babu, allison.henderson, linux-fsdevel,
	hch, catherine.hoang, david

Hi all,

After six years of development and a nearly two year hiatus from
patchbombing, I think it is time to resume the process of merging the
online fsck feature into XFS.  The full submission comprises 105 separate
patchsets that capture 470 patches across the kernel, xfsprogs, and
fstests projects.

I would like to merge this feature into upstream in time for the 2023
LTS kernel.  As of 5.15 (aka last year's LTS), we have merged all
generally useful infrastructure improvements into the regular
filesystem.  The only changes to the core filesystem that remain are the
ones that are only useful to online fsck itself.  In other words, the
vast majority of the new code in the patchsets comprising the online
fsck feature is mostly self-contained and can be turned off via
Kconfig.

Many of you readers might be wondering -- why have I chosen to make one
large submission with 100+ patchsets comprising ~500 patches?  Why
didn't I merge small pieces of functionality bit by bit and revise
common code as necessary?  Well, the simple answer is that in the past
six years, the fundamental algorithms have been revised repeatedly as
I've built out the functionality.  In other words, the codebase as it is
now reflects the benefit of knowing every piece that's necessary to get
the job done in a reasonable manner and within the constraints laid out
by community reviews.  I believe this has reduced code churn in mainline
and freed up my time so that I can iterate faster.

As a concession to the mail servers, I'm breaking up the submission into
smaller pieces; I'm only pushing the design document and the revisions
to the existing scrub code, which is the first 20% of the patches.
Also, I'm arbitrarily restarting the version numbering by reversioning
all patchsets from version 22 to epoch 23, version 1.

The big question to everyone reading this is: How might I convince you
that there is more merit in merging the whole feature and dealing with
the consequences than continuing to maintain it out of tree?

---------

To prepare the XFS community and potential patch reviewers for the
upstream submission of the online fsck feature, I decided to write a
document capturing the broader picture behind the online repair
development effort.  The document begins by defining the problems that
online fsck aims to solve and outlining specific use cases for the
functionality.

Using that as a base, the rest of the design document presents the high
level algorithms that fulfill the goals set out at the start and the
interactions between the large pieces of the system.  Case studies round
out the design documentation by adding the details of exactly how
specific parts of the online fsck code integrate the algorithms with the
filesystem.

The goal of this effort is to help the XFS community understand how the
gigantic online repair patchset works.  The questions I submit to the
community reviewers are:

1. As you read the design doc (and later the code), do you feel that you
   understand what's going on well enough to try to fix a bug if you
   found one?

2. What sorts of interactions between systems (or between scrub and the
   rest of the kernel) am I missing?

3. Do you feel confident enough in the implementation as it is now that
   the benefits of merging the feature (as EXPERIMENTAL) outweigh any
   potential disruptions to XFS at large?

4. Are there problematic interactions between subsystems that ought to
   be cleared up before merging?

5. Can I just merge all of this?

I intend to commit this document to the kernel's documentation directory
when we start merging the patchset, albeit without the links to
git.kernel.org.  A much more readable version of this is posted at:
https://djwong.org/docs/xfs-online-fsck-design/

v2: add missing sections about: all the in-kernel data structures and
    new apis that the scrub and repair functions use; how xattrs and
    directories are checked; how space btree records are checked; and
    add more details to the parts where all these bits tie together.
    Proofread for verb tense inconsistencies and eliminate vague 'we'
    usage.  Move all the discussion of what we can do with pageable
    kernel memory into a single source file and section.  Document where
    log incompat feature locks fit into the locking model.

v3: resync with 6.0, fix a few typos, begin discussion of the merging
    plan for this megapatchset.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=online-fsck-design
---
 Documentation/filesystems/index.rst                |    1 
 .../filesystems/xfs-online-fsck-design.rst         | 4975 ++++++++++++++++++++
 .../filesystems/xfs-self-describing-metadata.rst   |    1 
 3 files changed, 4977 insertions(+)
 create mode 100644 Documentation/filesystems/xfs-online-fsck-design.rst



* [PATCH 01/14] xfs: document the motivation for online fsck design
  2022-12-30 22:10 ` [PATCHSET v24.0 00/14] xfs: design documentation for online fsck Darrick J. Wong
  2022-12-30 22:10   ` [PATCH 02/14] xfs: document the general theory underlying online fsck design Darrick J. Wong
@ 2022-12-30 22:10   ` Darrick J. Wong
  2023-01-07  5:01     ` Allison Henderson
  2022-12-30 22:10   ` [PATCH 08/14] xfs: document btree bulk loading Darrick J. Wong
                     ` (13 subsequent siblings)
  15 siblings, 1 reply; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:10 UTC (permalink / raw)
  To: djwong
  Cc: linux-xfs, willy, chandan.babu, allison.henderson, linux-fsdevel,
	hch, catherine.hoang, david

From: Darrick J. Wong <djwong@kernel.org>

Start the first chapter of the online fsck design documentation.
This covers the motivations for creating this in the first place.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 Documentation/filesystems/index.rst                |    1 
 .../filesystems/xfs-online-fsck-design.rst         |  199 ++++++++++++++++++++
 2 files changed, 200 insertions(+)
 create mode 100644 Documentation/filesystems/xfs-online-fsck-design.rst


diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst
index bee63d42e5ec..fbb2b5ada95b 100644
--- a/Documentation/filesystems/index.rst
+++ b/Documentation/filesystems/index.rst
@@ -123,4 +123,5 @@ Documentation for filesystem implementations.
    vfat
    xfs-delayed-logging-design
    xfs-self-describing-metadata
+   xfs-online-fsck-design
    zonefs
diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst
new file mode 100644
index 000000000000..25717ebb5f80
--- /dev/null
+++ b/Documentation/filesystems/xfs-online-fsck-design.rst
@@ -0,0 +1,199 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. _xfs_online_fsck_design:
+
+..
+        Mapping of heading styles within this document:
+        Heading 1 uses "====" above and below
+        Heading 2 uses "===="
+        Heading 3 uses "----"
+        Heading 4 uses "````"
+        Heading 5 uses "^^^^"
+        Heading 6 uses "~~~~"
+        Heading 7 uses "...."
+
+        Sections are manually numbered because apparently that's what everyone
+        does in the kernel.
+
+======================
+XFS Online Fsck Design
+======================
+
+This document captures the design of the online filesystem check feature for
+XFS.
+The purpose of this document is threefold:
+
+- To help kernel distributors understand exactly what the XFS online fsck
+  feature is, and the issues of which they should be aware.
+
+- To help people reading the code to familiarize themselves with the relevant
+  concepts and design points before they start digging into the code.
+
+- To help developers maintain the system by capturing the reasons supporting
+  higher level decision making.
+
+As the online fsck code is merged, the links in this document to topic branches
+will be replaced with links to code.
+
+This document is licensed under the terms of the GNU Public License, v2.
+The primary author is Darrick J. Wong.
+
+This design document is split into seven parts.
+Part 1 defines what fsck tools are and the motivations for writing a new one.
+Parts 2 and 3 present a high level overview of how the online fsck process
+works and how it is tested to ensure correct functionality.
+Part 4 discusses the user interface and the intended usage modes of the new
+program.
+Parts 5 and 6 show off the high level components and how they fit together, and
+then present case studies of how each repair function actually works.
+Part 7 sums up what has been discussed so far and speculates about what else
+might be built atop online fsck.
+
+.. contents:: Table of Contents
+   :local:
+
+1. What is a Filesystem Check?
+==============================
+
+A Unix filesystem has three main jobs: to provide a hierarchy of names through
+which application programs can associate arbitrary blobs of data for any
+length of time, to virtualize physical storage media across those names, and
+to retrieve the named data blobs at any time.
+The filesystem check (fsck) tool examines all the metadata in a filesystem
+to look for errors.
+Simple tools only check for obvious corruptions, but the more sophisticated
+ones cross-reference metadata records to look for inconsistencies.
+People do not like losing data, so most fsck tools also contain some ability
+to deal with any problems found.
+As a word of caution -- the primary goal of most Linux fsck tools is to restore
+the filesystem metadata to a consistent state, not to maximize the data
+recovered.
+That precedent will not be challenged here.
+
+Filesystems of the 20th century generally lacked any redundancy in the ondisk
+format, which means that fsck can only respond to errors by erasing files until
+errors are no longer detected.
+System administrators avoid data loss by increasing the number of separate
+storage systems through the creation of backups; and they avoid downtime by
+increasing the redundancy of each storage system through the creation of RAID.
+More recent filesystem designs contain enough redundancy in their metadata that
+it is now possible to regenerate data structures when non-catastrophic errors
+occur; this capability aids both strategies.
+Over the past few years, XFS has added a storage space reverse mapping index to
+make it easy to find which files or metadata objects think they own a
+particular range of storage.
+Efforts are under way to develop a similar reverse mapping index for the naming
+hierarchy, which will involve storing directory parent pointers in each file.
+With these two pieces in place, XFS uses secondary information to perform more
+sophisticated repairs.
+
+TLDR; Show Me the Code!
+-----------------------
+
+Code is posted to the kernel.org git trees as follows:
+`kernel changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-symlink>`_,
+`userspace changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-scan-service>`_, and
+`QA test changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-dirs>`_.
+Each kernel patchset adding an online repair function will use the same branch
+name across the kernel, xfsprogs, and fstests git repos.
+
+Existing Tools
+--------------
+
+The online fsck tool described here will be the third tool in the history of
+XFS (on Linux) to check and repair filesystems.
+Two programs precede it:
+
+The first program, ``xfs_check``, was created as part of the XFS debugger
+(``xfs_db``) and can only be used with unmounted filesystems.
+It walks all metadata in the filesystem looking for inconsistencies, though
+it lacks any ability to repair what it finds.
+Due to its high memory requirements and inability to repair things, this
+program is now deprecated and will not be discussed further.
+
+The second program, ``xfs_repair``, was created to be faster and more robust
+than the first program.
+Like its predecessor, it can only be used with unmounted filesystems.
+It uses extent-based in-memory data structures to reduce memory consumption,
+and tries to schedule readahead I/O appropriately to reduce I/O waiting time
+while it scans the metadata of the entire filesystem.
+The most important feature of this tool is its ability to respond to
+inconsistencies in file metadata and the directory tree by erasing things as
+needed to eliminate problems.
+Space usage metadata are rebuilt from the observed file metadata.
+
+Problem Statement
+-----------------
+
+The current XFS tools leave several problems unsolved:
+
+1. **User programs** suddenly **lose access** to information in the computer
+   when unexpected shutdowns occur as a result of silent corruptions in the
+   filesystem metadata.
+   These occur **unpredictably** and often without warning.
+
+2. **Users** experience a **total loss of service** during the recovery period
+   after an **unexpected shutdown** occurs.
+
+3. **Users** experience a **total loss of service** if the filesystem is taken
+   offline to **look for problems** proactively.
+
+4. **Data owners** cannot **check the integrity** of their stored data without
+   reading all of it.
+   This may expose them to substantial billing costs when a linear media scan
+   might suffice.
+
+5. **System administrators** cannot **schedule** a maintenance window to deal
+   with corruptions if they **lack the means** to assess filesystem health
+   while the filesystem is online.
+
+6. **Fleet monitoring tools** cannot **automate periodic checks** of filesystem
+   health when doing so requires **manual intervention** and downtime.
+
+7. **Users** can be tricked into **doing things they do not desire** when
+   malicious actors **exploit quirks of Unicode** to place misleading names
+   in directories.
+
+Given this definition of the problems to be solved and the actors who would
+benefit, the proposed solution is a third fsck tool that acts on a running
+filesystem.
+
+This new third program has three components: an in-kernel facility to check
+metadata, an in-kernel facility to repair metadata, and a userspace driver
+program to drive fsck activity on a live filesystem.
+``xfs_scrub`` is the name of the driver program.
+The rest of this document presents the goals and use cases of the new fsck
+tool, describes its major design points in connection to those goals, and
+discusses the similarities and differences with existing tools.
+
++--------------------------------------------------------------------------+
+| **Note**:                                                                |
++--------------------------------------------------------------------------+
+| Throughout this document, the existing offline fsck tool can also be     |
+| referred to by its current name "``xfs_repair``".                        |
+| The userspace driver program for the new online fsck tool can be         |
+| referred to as "``xfs_scrub``".                                          |
+| The kernel portion of online fsck that validates metadata is called      |
+| "online scrub", and portion of the kernel that fixes metadata is called  |
+| "online repair".                                                         |
++--------------------------------------------------------------------------+
+
+Secondary metadata indices enable the reconstruction of parts of a damaged
+primary metadata object from secondary information.
+XFS filesystems shard themselves into multiple primary objects to enable better
+performance on highly threaded systems and to contain the blast radius when
+problems happen.
+The naming hierarchy is broken up into objects known as directories and files;
+and the physical space is split into pieces known as allocation groups.
+The division of the filesystem into principal objects (allocation groups and
+inodes) means that there are ample opportunities to perform targeted checks and
+repairs on a subset of the filesystem.
+While this is going on, other parts continue processing IO requests.
+Even if a piece of filesystem metadata can only be regenerated by scanning the
+entire system, the scan can still be done in the background while other file
+operations continue.
+
+In summary, online fsck takes advantage of resource sharding and redundant
+metadata to enable targeted checking and repair operations while the system
+is running.
+This capability will be coupled to automatic system management so that
+autonomous self-healing of XFS maximizes service availability.



* [PATCH 02/14] xfs: document the general theory underlying online fsck design
  2022-12-30 22:10 ` [PATCHSET v24.0 00/14] xfs: design documentation for online fsck Darrick J. Wong
@ 2022-12-30 22:10   ` Darrick J. Wong
  2023-01-11  1:25     ` Allison Henderson
  2022-12-30 22:10   ` [PATCH 01/14] xfs: document the motivation for " Darrick J. Wong
                     ` (14 subsequent siblings)
  15 siblings, 1 reply; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:10 UTC (permalink / raw)
  To: djwong
  Cc: linux-xfs, willy, chandan.babu, allison.henderson, linux-fsdevel,
	hch, catherine.hoang, david

From: Darrick J. Wong <djwong@kernel.org>

Start the second chapter of the online fsck design documentation.
This covers the general theory underlying how online fsck works.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 .../filesystems/xfs-online-fsck-design.rst         |  366 ++++++++++++++++++++
 1 file changed, 366 insertions(+)


diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst
index 25717ebb5f80..a03a7b9f0250 100644
--- a/Documentation/filesystems/xfs-online-fsck-design.rst
+++ b/Documentation/filesystems/xfs-online-fsck-design.rst
@@ -197,3 +197,369 @@ metadata to enable targeted checking and repair operations while the system
 is running.
 This capability will be coupled to automatic system management so that
 autonomous self-healing of XFS maximizes service availability.
+
+2. Theory of Operation
+======================
+
+Because it is necessary for online fsck to lock and scan live metadata objects,
+online fsck consists of three separate code components.
+The first is the userspace driver program ``xfs_scrub``, which is responsible
+for identifying individual metadata items, scheduling work items for them,
+reacting to the outcomes appropriately, and reporting results to the system
+administrator.
+The second and third are in the kernel, which implements functions to check
+and repair each type of online fsck work item.
+
++------------------------------------------------------------------+
+| **Note**:                                                        |
++------------------------------------------------------------------+
+| For brevity, this document shortens the phrase "online fsck work |
+| item" to "scrub item".                                           |
++------------------------------------------------------------------+
+
+Scrub item types are delineated in a manner consistent with the Unix design
+philosophy, which is to say that each item should handle one aspect of a
+metadata structure, and handle it well.
+
+Scope
+-----
+
+In principle, online fsck should be able to check and to repair everything that
+the offline fsck program can handle.
+However, the adjective *online* brings with it the limitation that online fsck
+cannot deal with anything that prevents the filesystem from going online, i.e.
+mounting.
+This limitation means that maintenance of the offline fsck tool will continue.
+A second limitation of online fsck is that it must follow the same resource
+sharing and lock acquisition rules as the regular filesystem.
+This means that scrub cannot take *any* shortcuts to save time, because doing
+so could lead to concurrency problems.
+In other words, online fsck will never be able to fix 100% of the
+inconsistencies that offline fsck can repair, and a complete run of online fsck
+may take longer.
+However, both of these limitations are acceptable tradeoffs to satisfy the
+different motivations of online fsck, which are to **minimize system downtime**
+and to **increase predictability of operation**.
+
+.. _scrubphases:
+
+Phases of Work
+--------------
+
+The userspace driver program ``xfs_scrub`` splits the work of checking and
+repairing an entire filesystem into seven phases.
+Each phase concentrates on checking specific types of scrub items and depends
+on the success of all previous phases.
+The seven phases are as follows:
+
+1. Collect geometry information about the mounted filesystem and computer,
+   discover the online fsck capabilities of the kernel, and open the
+   underlying storage devices.
+
+2. Check allocation group metadata, all realtime volume metadata, and all quota
+   files.
+   Each metadata structure is scheduled as a separate scrub item.
+   If corruption is found in the inode header or inode btree and ``xfs_scrub``
+   is permitted to perform repairs, then those scrub items are repaired to
+   prepare for phase 3.
+   Repairs are implemented by resubmitting the scrub item to the kernel with
+   the repair flag enabled; this is discussed in the next section.
+   Optimizations and all other repairs are deferred to phase 4.
+
+3. Check all metadata of every file in the filesystem.
+   Each metadata structure is also scheduled as a separate scrub item.
+   If repairs are needed, ``xfs_scrub`` is permitted to perform them, and no
+   problems were detected during phase 2, then those scrub items are repaired.
+   Optimizations and unsuccessful repairs are deferred to phase 4.
+
+4. All remaining repairs and scheduled optimizations are performed during this
+   phase, if the caller permits them.
+   Before starting repairs, the summary counters are checked and any necessary
+   repairs are performed so that subsequent repairs will not fail the resource
+   reservation step due to wildly incorrect summary counters.
+   Unsuccessful repairs are requeued as long as forward progress on repairs is
+   made somewhere in the filesystem.
+   Free space in the filesystem is trimmed at the end of phase 4 if the
+   filesystem is clean.
+
+5. By the start of this phase, all primary and secondary filesystem metadata
+   must be correct.
+   Summary counters such as the free space counts and quota resource counts
+   are checked and corrected.
+   Directory entry names and extended attribute names are checked for
+   suspicious entries such as control characters or confusing Unicode sequences
+   appearing in names.
+
+6. If the caller asks for a media scan, read all allocated and written data
+   file extents in the filesystem.
+   The ability to use hardware-assisted data file integrity checking is new
+   to online fsck; neither of the previous tools has this capability.
+   If media errors occur, they will be mapped to the owning files and reported.
+
+7. Re-check the summary counters and present the caller with a summary of
+   space usage and file counts.
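+
+Expressed in code, the dependency rule above reduces to a simple driver loop.
+The sketch below is only a simplification with hypothetical names; the real
+``xfs_scrub`` phase machinery is considerably more elaborate:
+
+.. code-block:: c
+
+   #include <stdbool.h>
+
+   /* Hypothetical model of the seven xfs_scrub phases described above. */
+   enum scrub_phase {
+       PHASE_SETUP = 1,       /* geometry, kernel capabilities, devices */
+       PHASE_AG_METADATA,     /* AG, realtime, and quota file metadata */
+       PHASE_FILE_METADATA,   /* all metadata of every file */
+       PHASE_REPAIR,          /* deferred repairs and optimizations */
+       PHASE_SUMMARY,         /* summary counters and suspicious names */
+       PHASE_MEDIA_SCAN,      /* optional read of all file data */
+       PHASE_RECHECK,         /* final summary for the caller */
+       PHASE_MAX,
+   };
+
+   static int run_phase(enum scrub_phase phase) { /* ... */ return 0; }
+
+   /* Each phase depends on the success of all previous phases. */
+   static int run_all_phases(bool want_media_scan)
+   {
+       enum scrub_phase phase;
+       int error;
+
+       for (phase = PHASE_SETUP; phase < PHASE_MAX; phase++) {
+           if (phase == PHASE_MEDIA_SCAN && !want_media_scan)
+               continue;
+           error = run_phase(phase);
+           if (error)
+               return error;
+       }
+       return 0;
+   }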
+
+Steps for Each Scrub Item
+-------------------------
+
+The kernel scrub code uses a three-step strategy for checking and repairing
+the one aspect of a metadata object represented by a scrub item:
+
+1. The scrub item of interest is checked for corruptions; opportunities for
+   optimization; and for values that are directly controlled by the system
+   administrator but look suspicious.
+   If the item is not corrupt or does not need optimization, resources are
+   released and the positive scan results are returned to userspace.
+   If the item is corrupt or could be optimized but the caller does not permit
+   this, resources are released and the negative scan results are returned to
+   userspace.
+   Otherwise, the kernel moves on to the second step.
+
+2. The repair function is called to rebuild the data structure.
+   Repair functions generally choose to rebuild a structure from other metadata
+   rather than try to salvage the existing structure.
+   If the repair fails, the scan results from the first step are returned to
+   userspace.
+   Otherwise, the kernel moves on to the third step.
+
+3. In the third step, the kernel runs the same checks over the new metadata
+   item to assess the efficacy of the repairs.
+   The results of the reassessment are returned to userspace.
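+
+In the simplest case, this three-step flow appears to userspace as at most two
+calls to the metadata scrub ioctl per scrub item: one call to check, and, if
+the caller enables the repair flag, a second call that repairs and then
+re-checks.  The sketch below assumes the ``XFS_IOC_SCRUB_METADATA`` ioctl and
+``struct xfs_scrub_metadata`` definitions exported through the xfsprogs
+development headers; the choice of scrub type and the error handling are
+illustrative only:
+
+.. code-block:: c
+
+   #include <stdbool.h>
+   #include <stdio.h>
+   #include <string.h>
+   #include <sys/ioctl.h>
+   #include <xfs/xfs.h>    /* pulls in the scrub ioctl definitions */
+
+   /*
+    * Check one AG's free space btree and ask for a repair if it is sick.
+    * @fd is any open file on the mounted filesystem, e.g. the mount point.
+    */
+   static int scrub_one_bnobt(int fd, unsigned int agno, bool allow_repair)
+   {
+       struct xfs_scrub_metadata sm;
+
+       memset(&sm, 0, sizeof(sm));
+       sm.sm_type = XFS_SCRUB_TYPE_BNOBT;
+       sm.sm_agno = agno;
+
+       /* Step 1: check only. */
+       if (ioctl(fd, XFS_IOC_SCRUB_METADATA, &sm))
+           return -1;
+       if (!(sm.sm_flags & (XFS_SCRUB_OFLAG_CORRUPT | XFS_SCRUB_OFLAG_PREEN)))
+           return 0;      /* clean; nothing more to do */
+       if (!allow_repair)
+           return 0;      /* report the negative result to the caller */
+
+       /*
+        * Steps 2 and 3: resubmit with the repair flag set.  The kernel
+        * rebuilds the structure and then re-runs the check.
+        */
+       sm.sm_flags = XFS_SCRUB_IFLAG_REPAIR;
+       if (ioctl(fd, XFS_IOC_SCRUB_METADATA, &sm))
+           return -1;
+       if (sm.sm_flags & XFS_SCRUB_OFLAG_CORRUPT)
+           fprintf(stderr, "agno %u still sick after repair\n", agno);
+       return 0;
+   }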
+
+Classification of Metadata
+--------------------------
+
+Each type of metadata object (and therefore each type of scrub item) is
+classified as follows:
+
+Primary Metadata
+````````````````
+
+Metadata structures in this category should be most familiar to filesystem
+users either because they are directly created by the user or because they
+index objects created by the user.
+Most filesystem objects fall into this class.
+Resource and lock acquisition for scrub code follows the same order as regular
+filesystem accesses.
+
+Primary metadata objects are the simplest for scrub to process.
+The principal filesystem object (either an allocation group or an inode) that
+owns the item being scrubbed is locked to guard against concurrent updates.
+The check function examines every record associated with the type for obvious
+errors and cross-references healthy records against other metadata to look for
+inconsistencies.
+Repairs for this class of scrub item are simple, since the repair function
+starts by holding all the resources acquired in the previous step.
+The repair function scans available metadata as needed to record all the
+observations needed to complete the structure.
+Next, it stages the observations in a new ondisk structure and commits it
+atomically to complete the repair.
+Finally, the storage from the old data structure is carefully reaped.
+
+Because ``xfs_scrub`` locks a primary object for the duration of the repair,
+this is effectively an offline repair operation performed on a subset of the
+filesystem.
+This minimizes the complexity of the repair code because it is not necessary to
+handle concurrent updates from other threads, nor is it necessary to access
+any other part of the filesystem.
+As a result, indexed structures can be rebuilt very quickly, and programs
+trying to access the damaged structure will be blocked until repairs complete.
+The only infrastructure needed by the repair code is the staging area for
+observations and a means to write new structures to disk.
+Despite these limitations, the advantage that online repair holds is clear:
+targeted work on individual shards of the filesystem avoids total loss of
+service.
+
+This mechanism is described in section 2.1 ("Off-Line Algorithm") of
+V. Srinivasan and M. J. Carey, `"Performance of On-Line Index Construction
+Algorithms" <https://dl.acm.org/doi/10.5555/645336.649870>`_,
+*Extending Database Technology*, pp. 293-309, 1992.
+
+Most primary metadata repair functions stage their intermediate results in an
+in-memory array prior to formatting the new ondisk structure, which is very
+similar to the list-based algorithm discussed in section 2.3 ("List-Based
+Algorithms") of Srinivasan.
+However, any data structure builder that maintains a resource lock for the
+duration of the repair is *always* an offline algorithm.
+
+Secondary Metadata
+``````````````````
+
+Metadata structures in this category reflect records found in primary metadata,
+but are only needed for online fsck or for reorganization of the filesystem.
+Resource and lock acquisition for scrub code does not follow the same order as
+regular filesystem accesses, and may involve full filesystem scans.
+
+Secondary metadata objects are difficult for scrub to process, because scrub
+attaches to the secondary object but needs to check primary metadata, which
+runs counter to the usual order of resource acquisition.
+Check functions can be limited in scope to reduce runtime.
+Repairs, however, require a full scan of primary metadata, which can take a
+long time to complete.
+Under these conditions, ``xfs_scrub`` cannot lock resources for the entire
+duration of the repair.
+
+Instead, repair functions set up an in-memory staging structure to store
+observations.
+Depending on the requirements of the specific repair function, the staging
+index can have the same format as the ondisk structure, or it can have a design
+specific to that repair function.
+The next step is to release all locks and start the filesystem scan.
+When the repair scanner needs to record an observation, the staging data are
+locked long enough to apply the update.
+Simultaneously, the repair function hooks relevant parts of the filesystem to
+apply updates to the staging data if the update pertains to an object that
+has already been scanned by the index builder.
+Once the scan is done, the owning object is re-locked, the live data is used to
+write a new ondisk structure, and the repairs are committed atomically.
+The hooks are disabled and the staging area is freed.
+Finally, the storage from the old data structure is carefully reaped.
+
+Introducing concurrency helps online repair avoid various locking problems, but
+comes at a high cost to code complexity.
+Live filesystem code has to be hooked so that the repair function can observe
+updates in progress.
+The staging area has to become a fully functional parallel structure so that
+updates can be merged from the hooks.
+Finally, the hook, the filesystem scan, and the inode locking model must be
+sufficiently well integrated that a hook event can decide if a given update
+should be applied to the staging structure.
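+
+The decision criterion can be sketched as follows; all names below are
+hypothetical and the details are greatly simplified compared to the case
+studies later in this document.  A live update is folded into the staging data
+only if the scan has already visited the object being updated, because updates
+to objects that have not yet been visited will be observed by the scan itself
+when it gets there:
+
+.. code-block:: c
+
+   #include <pthread.h>
+   #include <stdint.h>
+
+   /* Hypothetical scan state shared between the scanner and the hooks. */
+   struct example_scan {
+       pthread_mutex_t lock;
+       uint64_t        cursor_ino;    /* highest inode scanned so far */
+   };
+
+   /* Hypothetical live update passed to the hook by filesystem code. */
+   struct example_update {
+       uint64_t        ino;           /* inode being modified */
+       int64_t         delta;         /* change to the observed count */
+   };
+
+   /* Fold one observation into the in-memory staging data. */
+   static void example_stage_apply(uint64_t ino, int64_t delta)
+   {
+       /* ... update the staging structure ... */
+   }
+
+   /* Hook called from live filesystem code while the repair scan runs. */
+   static void example_live_update_hook(struct example_scan *scan,
+                                        const struct example_update *up)
+   {
+       pthread_mutex_lock(&scan->lock);
+       if (up->ino <= scan->cursor_ino)
+           example_stage_apply(up->ino, up->delta);
+       pthread_mutex_unlock(&scan->lock);
+   }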
+
+In theory, the scrub implementation could apply these same techniques for
+primary metadata, but doing so would make it massively more complex and less
+performant.
+Programs attempting to access the damaged structures are not blocked from
+operation, which may cause application failure or an unplanned filesystem
+shutdown.
+
+Inspiration for the secondary metadata repair strategy was drawn from section
+2.4 of Srinivasan above, and sections 2 ("NSF: Index Build Without Side-File")
+and 3.1.1 ("Duplicate Key Insert Problem") in C. Mohan, `"Algorithms for
+Creating Indexes for Very Large Tables Without Quiescing Updates"
+<https://dl.acm.org/doi/10.1145/130283.130337>`_, 1992.
+
+The sidecar index mentioned above bears some resemblance to the side file
+method mentioned in Srinivasan and Mohan.
+Their method consists of an index builder that extracts relevant record data to
+build the new structure as quickly as possible; and an auxiliary structure that
+captures all updates that would be committed to the index by other threads were
+the new index already online.
+After the index building scan finishes, the updates recorded in the side file
+are applied to the new index.
+To avoid conflicts between the index builder and other writer threads, the
+builder maintains a publicly visible cursor that tracks the progress of the
+scan through the record space.
+To avoid duplication of work between the side file and the index builder, side
+file updates are elided when the record ID for the update is greater than the
+cursor position within the record ID space.
+
+To minimize changes to the rest of the codebase, XFS online repair keeps the
+replacement index hidden until it's completely ready to go.
+In other words, there is no attempt to expose the keyspace of the new index
+while repair is running.
+The complexity of such an approach would be very high and perhaps more
+appropriate to building *new* indices.
+
+**Question**: Can the full scan and live update code used to facilitate a
+repair also be used to implement a comprehensive check?
+
+*Answer*: Probably, though this has not yet been studied.
+
+Summary Information
+```````````````````
+
+Metadata structures in this last category summarize the contents of primary
+metadata records.
+These are often used to speed up resource usage queries, and are many times
+smaller than the primary metadata which they represent.
+Check and repair both require full filesystem scans, but resource and lock
+acquisition follow the same paths as regular filesystem accesses.
+
+The superblock summary counters have special requirements due to the underlying
+implementation of the incore counters, and will be treated separately.
+Check and repair of the other types of summary counters (quota resource counts
+and file link counts) employ the same filesystem scanning and hooking
+techniques as outlined above, but because the underlying data are sets of
+integer counters, the staging data need not be a fully functional mirror of the
+ondisk structure.
+
+Inspiration for quota and file link count repair strategies were drawn from
+sections 2.12 ("Online Index Operations") through 2.14 ("Incremental View
+Maintenace") of G.  Graefe, `"Concurrent Queries and Updates in Summary Views
+and Their Indexes"
+<http://www.odbms.org/wp-content/uploads/2014/06/Increment-locks.pdf>`_, 2011.
+
+Since quotas are non-negative integer counts of resource usage, online
+quotacheck can use the incremental view deltas described in section 2.14 to
+track pending changes to the block and inode usage counts in each transaction,
+and commit those changes to a dquot side file when the transaction commits.
+Delta tracking is necessary for dquots because the index builder scans inodes,
+whereas the data structure being rebuilt is an index of dquots.
+Link count checking combines the view deltas and commit step into one because
+it sets attributes of the objects being scanned instead of writing them to a
+separate data structure.
+Each online fsck function will be discussed as case studies later in this
+document.
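+
+As an illustration of the incremental view delta idea, the sketch below models
+per-transaction counter deltas that accumulate while a transaction runs and are
+folded into a shadow ("side file") dquot only at commit time, so that the
+staging data never observes a partially applied transaction.  All names are
+hypothetical; the real online quotacheck appears in its own case study:
+
+.. code-block:: c
+
+   #include <stdint.h>
+
+   /* Shadow copy of one dquot's counters, maintained by the repair scan. */
+   struct example_shadow_dquot {
+       uint32_t        id;            /* user/group/project id */
+       int64_t         bcount;        /* observed block count */
+       int64_t         icount;        /* observed inode count */
+   };
+
+   /* Deltas accumulated by one running transaction against one dquot. */
+   struct example_dquot_delta {
+       uint32_t        id;
+       int64_t         bcount_delta;
+       int64_t         icount_delta;
+   };
+
+   /* Record a resource usage change made by a running transaction. */
+   static void example_trans_mod(struct example_dquot_delta *delta,
+                                 int64_t bdelta, int64_t idelta)
+   {
+       delta->bcount_delta += bdelta;
+       delta->icount_delta += idelta;
+   }
+
+   /* Commit hook: fold the whole delta into the shadow dquot at once. */
+   static void example_trans_commit(struct example_shadow_dquot *shadow,
+                                    const struct example_dquot_delta *delta)
+   {
+       shadow->bcount += delta->bcount_delta;
+       shadow->icount += delta->icount_delta;
+   }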
+
+Risk Management
+---------------
+
+During the development of online fsck, several risk factors were identified
+that may make the feature unsuitable for certain distributors and users.
+Steps can be taken to mitigate or eliminate those risks, though at a cost to
+functionality.
+
+- **Decreased performance**: Adding metadata indices to the filesystem
+  increases the time cost of persisting changes to disk, and the reverse space
+  mapping and directory parent pointers are no exception.
+  System administrators who require the maximum performance can disable the
+  reverse mapping features at format time, though this choice dramatically
+  reduces the ability of online fsck to find inconsistencies and repair them.
+
+- **Incorrect repairs**: As with all software, there might be defects in the
+  software that result in incorrect repairs being written to the filesystem.
+  Systematic fuzz testing (detailed in the next section) is employed by the
+  authors to find bugs early, but it might not catch everything.
+  The kernel build system provides Kconfig options (``CONFIG_XFS_ONLINE_SCRUB``
+  and ``CONFIG_XFS_ONLINE_REPAIR``) to enable distributors to choose not to
+  accept this risk.
+  The xfsprogs build system has a configure option (``--enable-scrub=no``) that
+  disables building of the ``xfs_scrub`` binary, though this is not a risk
+  mitigation if the kernel functionality remains enabled.
+
+- **Inability to repair**: Sometimes, a filesystem is too badly damaged to be
+  repairable.
+  If the keyspaces of several metadata indices overlap in some manner but a
+  coherent narrative cannot be formed from records collected, then the repair
+  fails.
+  To reduce the chance that a repair will fail with a dirty transaction and
+  render the filesystem unusable, the online repair functions have been
+  designed to stage and validate all new records before committing the new
+  structure.
+
+- **Misbehavior**: Online fsck requires many privileges -- raw IO to block
+  devices, opening files by handle, ignoring Unix discretionary access control,
+  and the ability to perform administrative changes.
+  Running this automatically in the background scares people, so the systemd
+  background service is configured to run with only the privileges required.
+  Obviously, this cannot address certain problems like the kernel crashing or
+  deadlocking, but it should be sufficient to prevent the scrub process from
+  escaping and reconfiguring the system.
+  The cron job does not have this protection.
+
+- **Fuzz Kiddiez**: There are many people now who seem to think that running
+  automated fuzz testing of ondisk artifacts to find mischievous behavior and
+  spraying exploit code onto the public mailing list for instant zero-day
+  disclosure is somehow of some social benefit.
+  In the view of this author, the benefit is realized only when the fuzz
+  operators help to **fix** the flaws, but this opinion apparently is not
+  widely shared among security "researchers".
+  The XFS maintainers' continuing ability to manage these events presents an
+  ongoing risk to the stability of the development process.
+  Automated testing should front-load some of the risk while the feature is
+  considered EXPERIMENTAL.
+
+Many of these risks are inherent to software programming.
+Despite this, it is hoped that this new functionality will prove useful in
+reducing unexpected downtime.



* [PATCH 03/14] xfs: document the testing plan for online fsck
  2022-12-30 22:10 ` [PATCHSET v24.0 00/14] xfs: design documentation for online fsck Darrick J. Wong
                     ` (3 preceding siblings ...)
  2022-12-30 22:10   ` [PATCH 04/14] xfs: document the user interface for online fsck Darrick J. Wong
@ 2022-12-30 22:10   ` Darrick J. Wong
  2023-01-18  0:03     ` Allison Henderson
  2022-12-30 22:10   ` [PATCH 05/14] xfs: document the filesystem metadata checking strategy Darrick J. Wong
                     ` (10 subsequent siblings)
  15 siblings, 1 reply; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:10 UTC (permalink / raw)
  To: djwong
  Cc: linux-xfs, willy, chandan.babu, allison.henderson, linux-fsdevel,
	hch, catherine.hoang, david

From: Darrick J. Wong <djwong@kernel.org>

Start the third chapter of the online fsck design documentation.  This
covers the testing plan to make sure that both online and offline fsck
can detect arbitrary problems and correct them without making things
worse.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 .../filesystems/xfs-online-fsck-design.rst         |  187 ++++++++++++++++++++
 1 file changed, 187 insertions(+)


diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst
index a03a7b9f0250..d630b6bdbe4a 100644
--- a/Documentation/filesystems/xfs-online-fsck-design.rst
+++ b/Documentation/filesystems/xfs-online-fsck-design.rst
@@ -563,3 +563,190 @@ functionality.
 Many of these risks are inherent to software programming.
 Despite this, it is hoped that this new functionality will prove useful in
 reducing unexpected downtime.
+
+3. Testing Plan
+===============
+
+As stated before, fsck tools have three main goals:
+
+1. Detect inconsistencies in the metadata;
+
+2. Eliminate those inconsistencies; and
+
+3. Minimize further loss of data.
+
+Demonstrations of correct operation are necessary to build users' confidence
+that the software behaves within expectations.
+Unfortunately, it was not really feasible to perform regular exhaustive testing
+of every aspect of a fsck tool until the introduction of low-cost virtual
+machines with high-IOPS storage.
+With ample hardware availability in mind, the testing strategy for the online
+fsck project involves differential analysis against the existing fsck tools and
+systematic testing of every attribute of every type of metadata object.
+Testing can be split into four major categories, as discussed below.
+
+Integrated Testing with fstests
+-------------------------------
+
+The primary goal of any free software QA effort is to make testing as
+inexpensive and widespread as possible to maximize the scaling advantages of
+the community.
+In other words, testing should maximize the breadth of filesystem configuration
+scenarios and hardware setups.
+This improves code quality by enabling the authors of online fsck to find and
+fix bugs early, and helps developers of new features to find integration
+issues earlier in their development effort.
+
+The Linux filesystem community shares a common QA testing suite,
+`fstests <https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/>`_, for
+functional and regression testing.
+Even before development work began on online fsck, fstests (when run on XFS)
+would run both the ``xfs_check`` and ``xfs_repair -n`` commands on the test and
+scratch filesystems between each test.
+This provides a level of assurance that the kernel and the fsck tools stay in
+alignment about what constitutes consistent metadata.
+During development of the online checking code, fstests was modified to run
+``xfs_scrub -n`` between each test to ensure that the new checking code
+produces the same results as the two existing fsck tools.
+
+To start development of online repair, fstests was modified to run
+``xfs_repair`` to rebuild the filesystem's metadata indices between tests.
+This ensures that offline repair does not crash, leave a corrupt filesystem
+after it exits, or trigger complaints from the online check.
+This also established a baseline for what can and cannot be repaired offline.
+To complete the first phase of development of online repair, fstests was
+modified to be able to run ``xfs_scrub`` in a "force rebuild" mode.
+This enables a comparison of the effectiveness of online repair against the
+existing offline repair tools.
+
+General Fuzz Testing of Metadata Blocks
+---------------------------------------
+
+XFS benefits greatly from having a very robust debugging tool, ``xfs_db``.
+
+Before development of online fsck even began, a set of fstests was created
+to test the rather common fault that entire metadata blocks get corrupted.
+This required the creation of fstests library code that can create a filesystem
+containing every possible type of metadata object.
+Next, individual test cases were created to create a test filesystem, identify
+a single block of a specific type of metadata object, trash it with the
+existing ``blocktrash`` command in ``xfs_db``, and test the reaction of a
+particular metadata validation strategy.
+
+This earlier test suite enabled XFS developers to test the ability of the
+in-kernel validation functions and the ability of the offline fsck tool to
+detect and eliminate the inconsistent metadata.
+This part of the test suite was extended to cover online fsck in exactly the
+same manner.
+
+In other words, for a given fstests filesystem configuration:
+
+* For each metadata object existing on the filesystem:
+
+  * Write garbage to it
+
+  * Test the reactions of:
+
+    1. The kernel verifiers to stop obviously bad metadata
+    2. Offline repair (``xfs_repair``) to detect and fix
+    3. Online repair (``xfs_scrub``) to detect and fix
+
+Targeted Fuzz Testing of Metadata Records
+-----------------------------------------
+
+A quick conversation with the other XFS developers revealed that the existing
+test infrastructure could be extended to provide a much more powerful
+facility: targeted fuzz testing of every metadata field of every metadata
+object in the filesystem.
+``xfs_db`` can modify every field of every metadata structure in every
+block in the filesystem to simulate the effects of memory corruption and
+software bugs.
+Given that fstests already contains the ability to create a filesystem
+containing every metadata format known to the filesystem, ``xfs_db`` can be
+used to perform exhaustive fuzz testing!
+
+For a given fstests filesystem configuration:
+
+* For each metadata object existing on the filesystem...
+
+  * For each record inside that metadata object...
+
+    * For each field inside that record...
+
+      * For each conceivable type of transformation that can be applied to a bit field...
+
+        1. Clear all bits
+        2. Set all bits
+        3. Toggle the most significant bit
+        4. Toggle the middle bit
+        5. Toggle the least significant bit
+        6. Add a small quantity
+        7. Subtract a small quantity
+        8. Randomize the contents
+
+        * ...test the reactions of:
+
+          1. The kernel verifiers to stop obviously bad metadata
+          2. Offline checking (``xfs_repair -n``)
+          3. Offline repair (``xfs_repair``)
+          4. Online checking (``xfs_scrub -n``)
+          5. Online repair (``xfs_scrub``)
+          6. Both repair tools (``xfs_scrub`` and then ``xfs_repair`` if online repair doesn't succeed)
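+
+For a single fixed-width field, the eight transformations listed above amount
+to something like the following sketch, shown here for a 64-bit field; the
+function and names are illustrative only, and ``xfs_db`` applies the same idea
+to fields of arbitrary bit width:
+
+.. code-block:: c
+
+   #include <stdint.h>
+   #include <stdlib.h>
+
+   enum fuzz_verb {
+       FUZZ_ZEROES,     /* clear all bits */
+       FUZZ_ONES,       /* set all bits */
+       FUZZ_FIRSTBIT,   /* toggle the most significant bit */
+       FUZZ_MIDDLEBIT,  /* toggle the middle bit */
+       FUZZ_LASTBIT,    /* toggle the least significant bit */
+       FUZZ_ADD,        /* add a small quantity */
+       FUZZ_SUB,        /* subtract a small quantity */
+       FUZZ_RANDOM,     /* randomize the contents */
+   };
+
+   static uint64_t fuzz_field(uint64_t value, unsigned int bits,
+                              enum fuzz_verb verb)
+   {
+       uint64_t mask = bits >= 64 ? ~0ULL : (1ULL << bits) - 1;
+
+       switch (verb) {
+       case FUZZ_ZEROES:    value = 0; break;
+       case FUZZ_ONES:      value = mask; break;
+       case FUZZ_FIRSTBIT:  value ^= 1ULL << (bits - 1); break;
+       case FUZZ_MIDDLEBIT: value ^= 1ULL << (bits / 2); break;
+       case FUZZ_LASTBIT:   value ^= 1; break;
+       case FUZZ_ADD:       value += 3; break;
+       case FUZZ_SUB:       value -= 3; break;
+       case FUZZ_RANDOM:    value = ((uint64_t)random() << 32) ^ random();
+                            break;
+       }
+       return value & mask;
+   }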
+
+This is quite the combinatoric explosion!
+
+Fortunately, having this much test coverage makes it easy for XFS developers to
+check the responses of XFS' fsck tools.
+Since the introduction of the fuzz testing framework, these tests have been
+used to discover incorrect repair code and missing functionality for entire
+classes of metadata objects in ``xfs_repair``.
+The enhanced testing was used to finalize the deprecation of ``xfs_check`` by
+confirming that ``xfs_repair`` could detect at least as many corruptions as
+the older tool.
+
+These tests have been very valuable for ``xfs_scrub`` in the same ways -- they
+allow the online fsck developers to compare online fsck against offline fsck,
+and they enable XFS developers to find deficiencies in the code base.
+
+Proposed patchsets include
+`general fuzzer improvements
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzzer-improvements>`_,
+`fuzzing baselines
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzz-baseline>`_,
+and `improvements in fuzz testing comprehensiveness
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=more-fuzz-testing>`_.
+
+Stress Testing
+--------------
+
+A unique requirement of online fsck is the ability to operate on a filesystem
+concurrently with regular workloads.
+Although it is of course impossible to run ``xfs_scrub`` with *zero* observable
+impact on the running system, the online repair code should never introduce
+inconsistencies into the filesystem metadata, and regular workloads should
+never notice resource starvation.
+To verify that these conditions are being met, fstests has been enhanced in
+the following ways:
+
+* For each scrub item type, create a test to exercise checking that item type
+  while running ``fsstress``.
+* For each scrub item type, create a test to exercise repairing that item type
+  while running ``fsstress``.
+* Race ``fsstress`` and ``xfs_scrub -n`` to ensure that checking the whole
+  filesystem doesn't cause problems.
+* Race ``fsstress`` and ``xfs_scrub`` in force-rebuild mode to ensure that
+  force-repairing the whole filesystem doesn't cause problems.
+* Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while
+  freezing and thawing the filesystem.
+* Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while
+  remounting the filesystem read-only and read-write.
+* The same, but running ``fsx`` instead of ``fsstress``.  (Not done yet?)
+
+Success is defined by the ability to run all of these tests without observing
+any unexpected filesystem shutdowns due to corrupted metadata, kernel hang
+check warnings, or any other sort of mischief.
+
+Proposed patchsets include `general stress testing
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=race-scrub-and-mount-state-changes>`_
+and the `evolution of existing per-function stress testing
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=refactor-scrub-stress>`_.



* [PATCH 04/14] xfs: document the user interface for online fsck
  2022-12-30 22:10 ` [PATCHSET v24.0 00/14] xfs: design documentation for online fsck Darrick J. Wong
                     ` (2 preceding siblings ...)
  2022-12-30 22:10   ` [PATCH 08/14] xfs: document btree bulk loading Darrick J. Wong
@ 2022-12-30 22:10   ` Darrick J. Wong
  2023-01-18  0:03     ` Allison Henderson
  2022-12-30 22:10   ` [PATCH 03/14] xfs: document the testing plan " Darrick J. Wong
                     ` (11 subsequent siblings)
  15 siblings, 1 reply; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:10 UTC (permalink / raw)
  To: djwong
  Cc: linux-xfs, willy, chandan.babu, allison.henderson, linux-fsdevel,
	hch, catherine.hoang, david

From: Darrick J. Wong <djwong@kernel.org>

Start the fourth chapter of the online fsck design documentation, which
discusses the user interface and the background scrubbing service.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 .../filesystems/xfs-online-fsck-design.rst         |  114 ++++++++++++++++++++
 1 file changed, 114 insertions(+)


diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst
index d630b6bdbe4a..42e82971e036 100644
--- a/Documentation/filesystems/xfs-online-fsck-design.rst
+++ b/Documentation/filesystems/xfs-online-fsck-design.rst
@@ -750,3 +750,117 @@ Proposed patchsets include `general stress testing
 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=race-scrub-and-mount-state-changes>`_
 and the `evolution of existing per-function stress testing
 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=refactor-scrub-stress>`_.
+
+4. User Interface
+=================
+
+The primary user of online fsck is the system administrator, just like offline
+repair.
+Online fsck presents two modes of operation to administrators:
+a foreground CLI process for online fsck on demand, and a background service
+that performs autonomous checking and repair.
+
+Checking on Demand
+------------------
+
+For administrators who want the absolute freshest information about the
+metadata in a filesystem, ``xfs_scrub`` can be run as a foreground process on
+a command line.
+The program checks every piece of metadata in the filesystem while the
+administrator waits for the results to be reported, just like the existing
+``xfs_repair`` tool.
+Both tools share a ``-n`` option to perform a read-only scan, and a ``-v``
+option to increase the verbosity of the information reported.
+
+A new feature of ``xfs_scrub`` is the ``-x`` option, which employs the error
+correction capabilities of the hardware to check data file contents.
+The media scan is not enabled by default because it may dramatically increase
+program runtime and consume a lot of bandwidth on older storage hardware.
+
+The output of a foreground invocation is captured in the system log.
+
+The ``xfs_scrub_all`` program walks the list of mounted filesystems and
+initiates ``xfs_scrub`` for each of them in parallel.
+It serializes scans for any filesystems that resolve to the same top level
+kernel block device to prevent resource overconsumption.
+
+Background Service
+------------------
+
+To reduce the workload of system administrators, the ``xfs_scrub`` package
+provides a suite of `systemd <https://systemd.io/>`_ timers and services that
+run online fsck automatically on weekends.
+The background service configures scrub to run with as little privilege as
+possible, the lowest CPU and IO priority, and in a CPU-constrained single
+threaded mode.
+It is hoped that this minimizes the amount of load generated on the system and
+avoids starving regular workloads.
+
+The output of the background service is also captured in the system log.
+If desired, reports of failures (either due to inconsistencies or mere runtime
+errors) can be emailed automatically by setting the ``EMAIL_ADDR`` environment
+variable in the following service files:
+
+* ``xfs_scrub_fail@.service``
+* ``xfs_scrub_media_fail@.service``
+* ``xfs_scrub_all_fail.service``
+
+The decision to enable the background scan is left to the system administrator.
+This can be done by enabling either of the following services:
+
+* ``xfs_scrub_all.timer`` on systemd systems
+* ``xfs_scrub_all.cron`` on non-systemd systems
+
+This automatic weekly scan is configured out of the box to perform an
+additional media scan of all file data once per month.
+This is less foolproof than, say, storing file data block checksums, but much
+more performant if application software provides its own integrity checking,
+redundancy can be provided elsewhere above the filesystem, or the storage
+device's integrity guarantees are deemed sufficient.
+
+The systemd unit file definitions have been subjected to a security audit
+(as of systemd 249) to ensure that the xfs_scrub processes have as little
+access to the rest of the system as possible.
+This was performed via ``systemd-analyze security``, after which privileges
+were restricted to the minimum required; sandboxing and system call filtering
+were set up to the maximal extent possible; and access to the filesystem tree
+was restricted to the minimum needed to start the program and access the
+filesystem being scanned.
+The service definition files restrict CPU usage to 80% of one CPU core, and
+apply as nice a priority to IO and CPU scheduling as possible.
+This measure was taken to minimize delays in the rest of the filesystem.
+No such hardening has been performed for the cron job.
+
+Proposed patchset:
+`Enabling the xfs_scrub background service
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-scan-service>`_.
+
+Health Reporting
+----------------
+
+XFS caches a summary of each filesystem's health status in memory.
+The information is updated whenever ``xfs_scrub`` is run, or whenever
+inconsistencies are detected in the filesystem metadata during regular
+operations.
+System administrators should use the ``health`` command of ``xfs_spaceman`` to
+download this information into a human-readable format.
+If problems have been observed, the administrator can schedule a reduced
+service window to run the online repair tool to correct the problem.
+Failing that, the administrator can decide to schedule a maintenance window to
+run the traditional offline repair tool to correct the problem.
+
+**Question**: Should the health reporting integrate with the new inotify fs
+error notification system?
+
+**Question**: Would it be helpful for sysadmins to have a daemon to listen for
+corruption notifications and initiate a repair?
+
+*Answer*: These questions remain unanswered, but should be a part of the
+conversation with early adopters and potential downstream users of XFS.
+
+Proposed patchsets include
+`wiring up health reports to correction returns
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=corruption-health-reports>`_
+and
+`preservation of sickness info during memory reclaim
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=indirect-health-reporting>`_.


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 05/14] xfs: document the filesystem metadata checking strategy
  2022-12-30 22:10 ` [PATCHSET v24.0 00/14] xfs: design documentation for online fsck Darrick J. Wong
                     ` (4 preceding siblings ...)
  2022-12-30 22:10   ` [PATCH 03/14] xfs: document the testing plan " Darrick J. Wong
@ 2022-12-30 22:10   ` Darrick J. Wong
  2023-01-21  1:38     ` Allison Henderson
  2022-12-30 22:10   ` [PATCH 07/14] xfs: document pageable kernel memory Darrick J. Wong
                     ` (9 subsequent siblings)
  15 siblings, 1 reply; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:10 UTC (permalink / raw)
  To: djwong
  Cc: linux-xfs, willy, chandan.babu, allison.henderson, linux-fsdevel,
	hch, catherine.hoang, david

From: Darrick J. Wong <djwong@kernel.org>

Begin the fifth chapter of the online fsck design documentation, where
we discuss the details of the data structures and algorithms used by the
kernel to examine filesystem metadata and cross-reference it around the
filesystem.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 .../filesystems/xfs-online-fsck-design.rst         |  579 ++++++++++++++++++++
 .../filesystems/xfs-self-describing-metadata.rst   |    1 
 2 files changed, 580 insertions(+)


diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst
index 42e82971e036..f45bf97fa9c4 100644
--- a/Documentation/filesystems/xfs-online-fsck-design.rst
+++ b/Documentation/filesystems/xfs-online-fsck-design.rst
@@ -864,3 +864,582 @@ Proposed patchsets include
 and
 `preservation of sickness info during memory reclaim
 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=indirect-health-reporting>`_.
+
+5. Kernel Algorithms and Data Structures
+========================================
+
+This section discusses the key algorithms and data structures of the kernel
+code that provide the ability to check and repair metadata while the system
+is running.
+The first chapters in this section reveal the pieces that provide the
+foundation for checking metadata.
+The remainder of this section presents the mechanisms through which XFS
+regenerates itself.
+
+Self Describing Metadata
+------------------------
+
+Starting with XFS version 5 in 2012, XFS updated the format of nearly every
+ondisk block header to record a magic number, a checksum, a universally
+"unique" identifier (UUID), an owner code, the ondisk address of the block,
+and a log sequence number.
+When loading a block buffer from disk, the magic number, UUID, owner, and
+ondisk address confirm that the retrieved block matches the specific owner of
+the current filesystem, and that the information contained in the block is
+supposed to be found at the ondisk address.
+The first three components enable checking tools to disregard alleged metadata
+that doesn't belong to the filesystem, and the fourth component enables the
+filesystem to detect lost writes.
+
+The logging code maintains the checksum and the log sequence number of the last
+transactional update.
+Checksums are useful for detecting torn writes and other mischief between the
+computer and its storage devices.
+Sequence number tracking enables log recovery to avoid applying out of date
+log updates to the filesystem.
+
+These two features improve overall runtime resiliency by providing a means for
+the filesystem to detect obvious corruption when reading metadata blocks from
+disk, but these buffer verifiers cannot provide any consistency checking
+between metadata structures.
+
+For more information, please see
+Documentation/filesystems/xfs-self-describing-metadata.rst.
+
+Reverse Mapping
+---------------
+
+The original design of XFS (circa 1993) is an improvement upon 1980s Unix
+filesystem design.
+In those days, storage density was expensive, CPU time was scarce, and
+excessive seek time could kill performance.
+For performance reasons, filesystem authors were reluctant to add redundancy to
+the filesystem, even at the cost of data integrity.
+Filesystem designers in the early 21st century chose different strategies to
+increase internal redundancy -- either storing nearly identical copies of
+metadata, or more space-efficient techniques such as erasure coding.
+Obvious corruptions are typically repaired by copying replicas or
+reconstructing from codes.
+
+For XFS, a different redundancy strategy was chosen to modernize the design:
+a secondary space usage index that maps allocated disk extents back to their
+owners.
+By adding a new index, the filesystem retains most of its ability to scale
+well to heavily threaded workloads involving large datasets, since the primary
+file metadata (the directory tree, the file block map, and the allocation
+groups) remains unchanged.
+Although the reverse-mapping feature increases overhead costs for space
+mapping activities just like any other system that improves redundancy, it
+has two critical advantages: first, the reverse index is key to enabling online
+fsck and other requested functionality such as filesystem reorganization,
+better media failure reporting, and shrinking.
+Second, the different ondisk storage format of the reverse mapping btree
+defeats device-level deduplication, because the filesystem requires real
+redundancy.
+
+A criticism of adding the secondary index is that it does nothing to improve
+the robustness of user data storage itself.
+This is a valid point, but adding a new index for file data block checksums
+increases write amplification and turns data overwrites into copy-writes, which
+age the filesystem prematurely.
+In keeping with thirty years of precedent, users who want file data integrity
+can supply as powerful a solution as they require.
+As for metadata, the complexity of adding a new secondary index of space usage
+is much less than adding volume management and storage device mirroring to XFS
+itself.
+Perfection of RAID and volume management are best left to existing layers in
+the kernel.
+
+The information captured in a reverse space mapping record is as follows:
+
+.. code-block:: c
+
+	struct xfs_rmap_irec {
+	    xfs_agblock_t    rm_startblock;   /* extent start block */
+	    xfs_extlen_t     rm_blockcount;   /* extent length */
+	    uint64_t         rm_owner;        /* extent owner */
+	    uint64_t         rm_offset;       /* offset within the owner */
+	    unsigned int     rm_flags;        /* state flags */
+	};
+
+The first two fields capture the location and size of the physical space,
+in units of filesystem blocks.
+The owner field tells scrub which metadata structure or file inode has been
+assigned this space.
+For space allocated to files, the offset field tells scrub where the space was
+mapped within the file fork.
+Finally, the flags field provides extra information about the space usage --
+is this an attribute fork extent?  A file mapping btree extent?  Or an
+unwritten data extent?
+
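+To make the flags field more concrete, the sketch below classifies a reverse
+mapping record by its flags.
+The flag names and bit values are invented for this example and are not the
+ondisk encoding:
+
+.. code-block:: c
+
+	/* Example flag bits; illustrative only, not the ondisk encoding. */
+	#define EX_RMAP_ATTR_FORK	(1U << 0)	/* attr fork extent */
+	#define EX_RMAP_BMBT_BLOCK	(1U << 1)	/* bmbt block */
+	#define EX_RMAP_UNWRITTEN	(1U << 2)	/* unwritten extent */
+
+	static const char *rmap_owner_usage(unsigned int rm_flags)
+	{
+		if (rm_flags & EX_RMAP_BMBT_BLOCK)
+			return "file mapping btree extent";
+		if (rm_flags & EX_RMAP_ATTR_FORK)
+			return "attribute fork extent";
+		if (rm_flags & EX_RMAP_UNWRITTEN)
+			return "unwritten data extent";
+		return "written data extent";
+	}
+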
+Online filesystem checking judges the consistency of each primary metadata
+record by comparing its information against all other space indices.
+The reverse mapping index plays a key role in the consistency checking process
+because it contains a centralized alternate copy of all space allocation
+information.
+Program runtime and ease of resource acquisition are the only real limits to
+what online checking can consult.
+For example, a file data extent mapping can be checked against:
+
+* The absence of an entry in the free space information.
+* The absence of an entry in the inode index.
+* The absence of an entry in the reference count data if the file is not
+  marked as having shared extents.
+* The correspondence of an entry in the reverse mapping information.
+
+A key observation here is that only the reverse mapping can provide a positive
+affirmation of correctness if the primary metadata is in doubt.
+The checking code for most primary metadata follows a path similar to the
+one outlined above.
+
+A second observation to make about this secondary index is that proving its
+consistency with the primary metadata is difficult.
+Demonstrating that a given reverse mapping record exactly corresponds to the
+primary space metadata involves a full scan of all primary space metadata,
+which is very time intensive.
+Scanning activity for online fsck can only use non-blocking lock acquisition
+primitives if the locking order is not the regular order as used by the rest of
+the filesystem.
+This means that forward progress during this part of a scan of the reverse
+mapping data cannot be guaranteed if system load is especially heavy.
+Therefore, it is not practical for online check to detect reverse mapping
+records that lack a counterpart in the primary metadata.
+Instead, scrub relies on rigorous cross-referencing during the primary space
+mapping structure checks.
+
+Reverse mappings also play a key role in reconstruction of primary metadata.
+The secondary information is general enough for online repair to synthesize a
+complete copy of any primary space management metadata by locking that
+resource, querying all reverse mapping indices looking for records matching
+the relevant resource, and transforming the mapping into an appropriate format.
+The details of how these records are staged, written to disk, and committed
+into the filesystem are covered in subsequent sections.
+
+Checking and Cross-Referencing
+------------------------------
+
+The first step of checking a metadata structure is to examine every record
+contained within the structure and its relationship with the rest of the
+system.
+XFS contains multiple layers of checking to try to prevent inconsistent
+metadata from wreaking havoc on the system.
+Each of these layers contributes information that helps the kernel to make
+the following decisions about the health of a metadata structure (a sketch of
+the reporting flags follows the list):
+
+- Is a part of this structure obviously corrupt (``XFS_SCRUB_OFLAG_CORRUPT``) ?
+- Is this structure inconsistent with the rest of the system
+  (``XFS_SCRUB_OFLAG_XCORRUPT``) ?
+- Is there so much damage around the filesystem that cross-referencing is not
+  possible (``XFS_SCRUB_OFLAG_XFAIL``) ?
+- Can the structure be optimized to improve performance or reduce the size of
+  metadata (``XFS_SCRUB_OFLAG_PREEN``) ?
+- Does the structure contain data that is not inconsistent but deserves review
+  by the system administrator (``XFS_SCRUB_OFLAG_WARNING``) ?
+
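+The outcome of a check is reported to userspace as a combination of the flags
+listed above.
+The flag names below come from the scrub ioctl interface, but the bit values
+and helper functions in this sketch are illustrative placeholders only:
+
+.. code-block:: c
+
+	#include <stdbool.h>
+
+	/*
+	 * Placeholder bit assignments for illustration only; the real
+	 * definitions live in the XFS userspace API header.
+	 */
+	#define XFS_SCRUB_OFLAG_CORRUPT		(1U << 0)
+	#define XFS_SCRUB_OFLAG_XCORRUPT	(1U << 1)
+	#define XFS_SCRUB_OFLAG_XFAIL		(1U << 2)
+	#define XFS_SCRUB_OFLAG_PREEN		(1U << 3)
+	#define XFS_SCRUB_OFLAG_WARNING		(1U << 4)
+
+	/* Does this metadata structure need to be repaired? */
+	static inline bool scrub_needs_repair(unsigned int oflags)
+	{
+		return oflags & (XFS_SCRUB_OFLAG_CORRUPT |
+				 XFS_SCRUB_OFLAG_XCORRUPT);
+	}
+
+	/* Is the structure fine but worth rebuilding for efficiency? */
+	static inline bool scrub_wants_optimization(unsigned int oflags)
+	{
+		return oflags & XFS_SCRUB_OFLAG_PREEN;
+	}
+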
+The following sections describe how the metadata scrubbing process works.
+
+Metadata Buffer Verification
+````````````````````````````
+
+The lowest layer of metadata protection in XFS are the metadata verifiers built
+into the buffer cache.
+These functions perform inexpensive internal consistency checking of the block
+itself, and answer these questions:
+
+- Does the block belong to this filesystem?
+
+- Does the block belong to the structure that asked for the read?
+  This assumes that metadata blocks only have one owner, which is always true
+  in XFS.
+
+- Is the type of data stored in the block within a reasonable range of what
+  scrub is expecting?
+
+- Does the physical location of the block match the location it was read from?
+
+- Does the block checksum match the data?
+
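+In rough C terms, a read verifier's checks amount to the following sketch.
+The structure and helper below are simplified stand-ins and are not the
+kernel's actual buffer verifier code:
+
+.. code-block:: c
+
+	#include <stdbool.h>
+	#include <stdint.h>
+	#include <string.h>
+
+	/* Fields common to self-describing metadata block headers. */
+	struct example_blk_hdr {
+		uint32_t	magic;		/* structure type code */
+		uint8_t		uuid[16];	/* filesystem uuid */
+		uint64_t	owner;		/* owning AG or inode */
+		uint64_t	blkno;		/* ondisk address */
+		uint32_t	crc;		/* checksum of the block */
+	};
+
+	static bool example_verify_read(const struct example_blk_hdr *hdr,
+					const struct example_blk_hdr *expect,
+					uint32_t computed_crc)
+	{
+		if (hdr->crc != computed_crc)		/* torn write? */
+			return false;
+		if (hdr->magic != expect->magic)	/* wrong structure? */
+			return false;
+		if (memcmp(hdr->uuid, expect->uuid, sizeof(hdr->uuid)))
+			return false;			/* other filesystem? */
+		if (hdr->owner != expect->owner)	/* wrong owner? */
+			return false;
+		if (hdr->blkno != expect->blkno)	/* misplaced write? */
+			return false;
+		return true;
+	}
+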
+The scope of the protections here is very limited -- verifiers can only
+establish that the filesystem code is reasonably free of gross corruption bugs
+and that the storage system is reasonably competent at retrieval.
+Corruption problems observed at runtime cause the generation of health reports,
+failed system calls, and in the extreme case, filesystem shutdowns if the
+corrupt metadata forces the cancellation of a dirty transaction.
+
+Every online fsck scrubbing function is expected to read every ondisk metadata
+block of a structure in the course of checking the structure.
+Corruption problems observed during a check are immediately reported to
+userspace as corruption; during a cross-reference, they are reported as a
+failure to cross-reference once the full examination is complete.
+Reads satisfied by a buffer already in cache (and hence already verified)
+bypass these checks.
+
+Internal Consistency Checks
+```````````````````````````
+
+The next higher level of metadata protection is the internal record
+verification code built into the filesystem.
+These checks are split between the buffer verifiers, the in-filesystem users of
+the buffer cache, and the scrub code itself, depending on the amount of higher
+level context required.
+The scope of checking is still internal to the block.
+For performance reasons, regular code may skip some of these checks unless
+debugging is enabled or a write is about to occur.
+Scrub functions, of course, must check all possible problems.
+Either way, these higher level checking functions answer these questions:
+
+- Does the type of data stored in the block match what scrub is expecting?
+
+- Does the block belong to the owning structure that asked for the read?
+
+- If the block contains records, do the records fit within the block?
+
+- If the block tracks internal free space information, is it consistent with
+  the record areas?
+
+- Are the records contained inside the block free of obvious corruptions?
+
+Record checks in this category are more rigorous and more time-intensive.
+For example, block pointers and inumbers are checked to ensure that they point
+within the dynamically allocated parts of an allocation group and within
+the filesystem.
+Names are checked for invalid characters, and flags are checked for invalid
+combinations.
+Other record attributes are checked for sensible values.
+Btree records spanning an interval of the btree keyspace are checked for
+correct order and lack of mergeability (except for file fork mappings).
+
+Validation of Userspace-Controlled Record Attributes
+````````````````````````````````````````````````````
+
+Various pieces of filesystem metadata are directly controlled by userspace.
+Because of this nature, validation work cannot be more precise than checking
+that a value is within the possible range.
+These fields include:
+
+- Superblock fields controlled by mount options
+- Filesystem labels
+- File timestamps
+- File permissions
+- File size
+- File flags
+- Names present in directory entries, extended attribute keys, and filesystem
+  labels
+- Extended attribute key namespaces
+- Extended attribute values
+- File data block contents
+- Quota limits
+- Quota timer expiration (if resource usage exceeds the soft limit)
+
+Cross-Referencing Space Metadata
+````````````````````````````````
+
+The next higher level of checking is cross-referencing records between metadata
+structures.
+For regular runtime code, the cost of these checks is considered to be
+prohibitively expensive, but as scrub is dedicated to rooting out
+inconsistencies, it must pursue all avenues of inquiry.
+The exact set of cross-referencing is highly dependent on the context of the
+data structure being checked.
+
+The XFS btree code has keyspace scanning functions that online fsck uses to
+cross reference one structure with another.
+Specifically, scrub can scan the key space of an index to determine if that
+keyspace is fully, sparsely, or not at all mapped to records.
+For the reverse mapping btree, it is possible to mask parts of the key for the
+purposes of performing a keyspace scan so that scrub can decide if the rmap
+btree contains records mapping a certain extent of physical space without the
+sparseness of the rest of the rmap keyspace getting in the way.
+
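+A keyspace scan therefore produces one of three outcomes, which could be
+expressed with an enumeration like the illustrative one below (the names are
+invented for this sketch):
+
+.. code-block:: c
+
+	/* Result of scanning a btree index for records covering a keyspace. */
+	enum example_keyspace_fill {
+		EXAMPLE_KEYSPACE_EMPTY,		/* no records at all */
+		EXAMPLE_KEYSPACE_SPARSE,	/* part of the keyspace mapped */
+		EXAMPLE_KEYSPACE_FULL,		/* entire keyspace mapped */
+	};
+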
+Btree blocks undergo the following checks before cross-referencing:
+
+- Does the type of data stored in the block match what scrub is expecting?
+
+- Does the block belong to the owning structure that asked for the read?
+
+- Do the records fit within the block?
+
+- Are the records contained inside the block free of obvious corruptions?
+
+- Are the name hashes in the correct order?
+
+- Do node pointers within the btree point to valid block addresses for the type
+  of btree?
+
+- Do child pointers point towards the leaves?
+
+- Do sibling pointers point across the same level?
+
+- For each node block record, does the record key accurately reflect the contents
+  of the child block?
+
+Space allocation records are cross-referenced as follows:
+
+1. Any space mentioned by any metadata structure is cross-referenced as
+   follows:
+
+   - Does the reverse mapping index list only the appropriate owner as the
+     owner of each block?
+
+   - Are none of the blocks claimed as free space?
+
+   - If these aren't file data blocks, are none of the blocks claimed as space
+     shared by different owners?
+
+2. Btree blocks are cross-referenced as follows:
+
+   - Everything in class 1 above.
+
+   - If there's a parent node block, do the keys listed for this block match the
+     keyspace of this block?
+
+   - Do the sibling pointers point to valid blocks?  Of the same level?
+
+   - Do the child pointers point to valid blocks?  Of the next level down?
+
+3. Free space btree records are cross-referenced as follows:
+
+   - Everything in class 1 and 2 above.
+
+   - Does the reverse mapping index list no owners of this space?
+
+   - Is this space not claimed by the inode index for inodes?
+
+   - Is it not mentioned by the reference count index?
+
+   - Is there a matching record in the other free space btree?
+
+4. Inode btree records are cross-referenced as follows:
+
+   - Everything in class 1 and 2 above.
+
+   - Is there a matching record in the free inode btree?
+
+   - Do cleared bits in the holemask correspond with inode clusters?
+
+   - Do set bits in the freemask correspond with inode records with zero link
+     count?
+
+5. Inode records are cross-referenced as follows:
+
+   - Everything in class 1.
+
+   - Do all the fields that summarize information about the file forks actually
+     match those forks?
+
+   - Does each inode with zero link count correspond to a record in the free
+     inode btree?
+
+6. File fork space mapping records are cross-referenced as follows:
+
+   - Everything in class 1 and 2 above.
+
+   - Is this space not mentioned by the inode btrees?
+
+   - If this is a CoW fork mapping, does it correspond to a CoW entry in the
+     reference count btree?
+
+7. Reference count records are cross-referenced as follows:
+
+   - Everything in class 1 and 2 above.
+
+   - Within the space subkeyspace of the rmap btree (that is to say, all
+     records mapped to a particular space extent and ignoring the owner info),
+     are there the same number of reverse mapping records for each block as the
+     reference count record claims?
+
+Proposed patchsets are the series to find gaps in
+`refcount btree
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-refcount-gaps>`_,
+`inode btree
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-inobt-gaps>`_, and
+`rmap btree
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-rmapbt-gaps>`_ records;
+to find
+`mergeable records
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-mergeable-records>`_;
+and to
+`improve cross referencing with rmap
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-strengthen-rmap-checking>`_
+before starting a repair.
+
+Checking Extended Attributes
+````````````````````````````
+
+Extended attributes implement a key-value store that enables fragments of data
+to be attached to any file.
+Both the kernel and userspace can access the keys and values, subject to
+namespace and privilege restrictions.
+Most typically these fragments are metadata about the file -- origins, security
+contexts, user-supplied labels, indexing information, etc.
+
+Names can be as long as 255 bytes and can exist in several different
+namespaces.
+Values can be as large as 64KB.
+A file's extended attributes are stored in blocks mapped by the attr fork.
+The mappings point to leaf blocks, remote value blocks, or dabtree blocks.
+Block 0 in the attribute fork is always the top of the structure, but otherwise
+each of the three types of blocks can be found at any offset in the attr fork.
+Leaf blocks contain attribute key records that point to the name and the value.
+Names are always stored elsewhere in the same leaf block.
+Values that are less than 3/4 the size of a filesystem block are also stored
+elsewhere in the same leaf block.
+Remote value blocks contain values that are too large to fit inside a leaf.
+If the leaf information exceeds a single filesystem block, a dabtree (also
+rooted at block 0) is created to map hashes of the attribute names to leaf
+blocks in the attr fork.
+
+Checking an extended attribute structure is not so straightforward due to the
+lack of separation between attr blocks and index blocks.
+Scrub must read each block mapped by the attr fork and ignore the non-leaf
+blocks:
+
+1. Walk the dabtree in the attr fork (if present) to ensure that there are no
+   irregularities in the blocks or dabtree mappings that do not point to
+   attr leaf blocks.
+
+2. Walk the blocks of the attr fork looking for leaf blocks.
+   For each entry inside a leaf:
+
+   a. Validate that the name does not contain invalid characters.
+
+   b. Read the attr value.
+      This performs a named lookup of the attr name to ensure the correctness
+      of the dabtree.
+      If the value is stored in a remote block, this also validates the
+      integrity of the remote value block.
+
+Checking and Cross-Referencing Directories
+``````````````````````````````````````````
+
+The filesystem directory tree is a directed acyclic graph structure, with files
+constituting the nodes, and directory entries (dirents) constituting the edges.
+Directories are a special type of file containing a set of mappings from a
+255-byte sequence (name) to an inumber.
+These are called directory entries, or dirents for short.
+Each directory file must have exactly one parent directory pointing to it.
+A root directory points to itself.
+Directory entries point to files of any type.
+Each non-directory file may have multiple directories pointing to it.
+
+In XFS, directories are implemented as a file containing up to three 32GB
+partitions.
+The first partition contains directory entry data blocks.
+Each data block contains variable-sized records associating a user-provided
+name with an inumber and, optionally, a file type.
+If the directory entry data grows beyond one block, the second partition (which
+exists as post-EOF extents) is populated with a block containing free space
+information and an index that maps hashes of the dirent names to directory data
+blocks in the first partition.
+This makes directory name lookups very fast.
+If this second partition grows beyond one block, the third partition is
+populated with a linear array of free space information for faster
+expansions.
+If the free space has been separated and the second partition grows again
+beyond one block, then a dabtree is used to map hashes of dirent names to
+directory data blocks.
+
+Checking a directory is pretty straightforward:
+
+1. Walk the dabtree in the second partition (if present) to ensure that there
+   are no irregularities in the blocks or dabtree mappings that do not point to
+   dirent blocks.
+
+2. Walk the blocks of the first partition looking for directory entries.
+   Each dirent is checked as follows:
+
+   a. Does the name contain no invalid characters?
+
+   b. Does the inumber correspond to an actual, allocated inode?
+
+   c. Does the child inode have a nonzero link count?
+
+   d. If a file type is included in the dirent, does it match the type of the
+      inode?
+
+   e. If the child is a subdirectory, does the child's dotdot pointer point
+      back to the parent?
+
+   f. If the directory has a second partition, perform a named lookup of the
+      dirent name to ensure the correctness of the dabtree.
+
+3. Walk the free space list in the third partition (if present) to ensure that
+   the free spaces it describes are really unused.
+
+Checking operations involving :ref:`parents <dirparent>` and
+:ref:`file link counts <nlinks>` are discussed in more detail in later
+sections.
+
+Checking Directory/Attribute Btrees
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+As stated in previous sections, the directory/attribute btree (dabtree) index
+maps user-provided names to improve lookup times by avoiding linear scans.
+Internally, it maps a 32-bit hash of the name to a block offset within the
+appropriate file fork.
+
+The internal structure of a dabtree closely resembles the btrees that record
+fixed-size metadata records -- each dabtree block contains a magic number, a
+checksum, sibling pointers, a UUID, a tree level, and a log sequence number.
+The format of leaf and node records is the same -- each entry points to the
+next level down in the hierarchy, with dabtree node records pointing to dabtree
+leaf blocks, and dabtree leaf records pointing to non-dabtree blocks elsewhere
+in the fork.
+
+Checking and cross-referencing the dabtree is very similar to what is done for
+space btrees:
+
+- Does the type of data stored in the block match what scrub is expecting?
+
+- Does the block belong to the owning structure that asked for the read?
+
+- Do the records fit within the block?
+
+- Are the records contained inside the block free of obvious corruptions?
+
+- Are the name hashes in the correct order?
+
+- Do node pointers within the dabtree point to valid fork offsets for dabtree
+  blocks?
+
+- Do leaf pointers within the dabtree point to valid fork offsets for directory
+  or attr leaf blocks?
+
+- Do child pointers point towards the leaves?
+
+- Do sibling pointers point across the same level?
+
+- For each dabtree node record, does the record key accurately reflect the
+  contents of the child dabtree block?
+
+- For each dabtree leaf record, does the record key accurately reflect the
+  contents of the directory or attr block?
+
+Cross-Referencing Summary Counters
+``````````````````````````````````
+
+XFS maintains three classes of summary counters: available resources, quota
+resource usage, and file link counts.
+
+In theory, the amount of available resources (data blocks, inodes, realtime
+extents) can be found by walking the entire filesystem.
+This would make for very slow reporting, so a transactional filesystem can
+maintain summaries of this information in the superblock.
+Cross-referencing these values against the filesystem metadata should be a
+simple matter of walking the free space and inode metadata in each AG and the
+realtime bitmap, but there are complications that will be discussed in
+:ref:`more detail <fscounters>` later.
+
+:ref:`Quota usage <quotacheck>` and :ref:`file link count <nlinks>`
+checking are sufficiently complicated to warrant separate sections.
+
+Post-Repair Reverification
+``````````````````````````
+
+After performing a repair, the checking code is run a second time to validate
+the new structure, and the results of the health assessment are recorded
+internally and returned to the calling process.
+This step is critical for enabling the system administrator to monitor the
+status of the filesystem and the progress of any repairs.
+For developers, it is a useful means to judge the efficacy of error detection
+and correction in the online and offline checking tools.
diff --git a/Documentation/filesystems/xfs-self-describing-metadata.rst b/Documentation/filesystems/xfs-self-describing-metadata.rst
index b79dbf36dc94..a10c4ae6955e 100644
--- a/Documentation/filesystems/xfs-self-describing-metadata.rst
+++ b/Documentation/filesystems/xfs-self-describing-metadata.rst
@@ -1,4 +1,5 @@
 .. SPDX-License-Identifier: GPL-2.0
+.. _xfs_self_describing_metadata:
 
 ============================
 XFS Self Describing Metadata


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 06/14] xfs: document how online fsck deals with eventual consistency
  2022-12-30 22:10 ` [PATCHSET v24.0 00/14] xfs: design documentation for online fsck Darrick J. Wong
                     ` (7 preceding siblings ...)
  2022-12-30 22:10   ` [PATCH 09/14] xfs: document online file metadata repair code Darrick J. Wong
@ 2022-12-30 22:10   ` Darrick J. Wong
  2023-01-05  9:08     ` Amir Goldstein
  2023-01-31  6:11     ` Allison Henderson
  2022-12-30 22:10   ` [PATCH 13/14] xfs: document the userspace fsck driver program Darrick J. Wong
                     ` (6 subsequent siblings)
  15 siblings, 2 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:10 UTC (permalink / raw)
  To: djwong
  Cc: linux-xfs, willy, chandan.babu, allison.henderson, linux-fsdevel,
	hch, catherine.hoang, david

From: Darrick J. Wong <djwong@kernel.org>

Writes to an XFS filesystem employ an eventual consistency update model
to break up complex multistep metadata updates into small chained
transactions.  This is generally good for performance and scalability
because XFS doesn't need to prepare for enormous transactions, but it
also means that online fsck must be careful not to attempt a fsck action
unless it can be shown that there are no other threads processing a
transaction chain.  This part of the design documentation covers the
thinking behind the consistency model and how scrub deals with it.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 .../filesystems/xfs-online-fsck-design.rst         |  303 ++++++++++++++++++++
 1 file changed, 303 insertions(+)


diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst
index f45bf97fa9c4..419eb54ee200 100644
--- a/Documentation/filesystems/xfs-online-fsck-design.rst
+++ b/Documentation/filesystems/xfs-online-fsck-design.rst
@@ -1443,3 +1443,306 @@ This step is critical for enabling system administrator to monitor the status
 of the filesystem and the progress of any repairs.
 For developers, it is a useful means to judge the efficacy of error detection
 and correction in the online and offline checking tools.
+
+Eventual Consistency vs. Online Fsck
+------------------------------------
+
+Midway through the development of online scrubbing, the fsstress tests
+uncovered a misinteraction between online fsck and compound transaction chains
+created by other writer threads that resulted in false reports of metadata
+inconsistency.
+The root cause of these reports is the eventual consistency model introduced by
+the expansion of deferred work items and compound transaction chains when
+reverse mapping and reflink were introduced.
+
+Originally, transaction chains were added to XFS to avoid deadlocks when
+unmapping space from files.
+Deadlock avoidance rules require that AGs only be locked in increasing order,
+which makes it impossible (say) to use a single transaction to free a space
+extent in AG 7 and then try to free a now superfluous block mapping btree block
+in AG 3.
+To avoid these kinds of deadlocks, XFS creates Extent Freeing Intent (EFI) log
+items to commit to freeing some space in one transaction while deferring the
+actual metadata updates to a fresh transaction.
+The transaction sequence looks like this:
+
+1. The first transaction contains a physical update to the file's block mapping
+   structures to remove the mapping from the btree blocks.
+   It then attaches to the in-memory transaction an action item to schedule
+   deferred freeing of space.
+   Concretely, each transaction maintains a list of ``struct
+   xfs_defer_pending`` objects, each of which maintains a list of ``struct
+   xfs_extent_free_item`` objects.
+   Returning to the example above, the action item tracks the freeing of both
+   the unmapped space from AG 7 and the block mapping btree (BMBT) block from
+   AG 3.
+   Deferred frees recorded in this manner are committed in the log by creating
+   an EFI log item from the ``struct xfs_extent_free_item`` object and
+   attaching the log item to the transaction.
+   When the log is persisted to disk, the EFI item is written into the ondisk
+   transaction record.
+   EFIs can list up to 16 extents to free, all sorted in AG order.
+
+2. The second transaction contains a physical update to the free space btrees
+   of AG 3 to release the former BMBT block and a second physical update to the
+   free space btrees of AG 7 to release the unmapped file space.
+   Observe that the physical updates are resequenced in the correct order
+   when possible.
+   Attached to the transaction is an extent free done (EFD) log item.
+   The EFD contains a pointer to the EFI logged in transaction #1 so that log
+   recovery can tell if the EFI needs to be replayed.
+
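+The relationship between the two log items in this chain can be pictured with
+the simplified structures below; these are illustrative stand-ins, not the
+ondisk log item formats:
+
+.. code-block:: c
+
+	#include <stdint.h>
+
+	/* Intent: logged in transaction #1, promises to free these extents. */
+	struct example_efi {
+		uint64_t	efi_id;		/* identifies this intent */
+		uint32_t	nextents;	/* up to 16, sorted in AG order */
+		struct {
+			uint64_t	startblock;
+			uint32_t	blockcount;
+		} extents[16];
+	};
+
+	/* Intent done: logged in transaction #2, points back at the EFI. */
+	struct example_efd {
+		uint64_t	efi_id;		/* matches example_efi.efi_id */
+		uint32_t	nextents;	/* extents actually freed */
+	};
+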
+If the system goes down after transaction #1 is written back to the filesystem
+but before #2 is committed, a scan of the filesystem metadata would show
+inconsistent filesystem metadata because there would not appear to be any owner
+of the unmapped space.
+Happily, log recovery corrects this inconsistency for us -- when recovery finds
+an intent log item but does not find a corresponding intent done item, it will
+reconstruct the incore state of the intent item and finish it.
+In the example above, the log must replay both frees described in the recovered
+EFI to complete the recovery phase.
+
+There are two subtleties to XFS' transaction chaining strategy to consider.
+The first is that log items must be added to a transaction in the correct order
+to prevent conflicts with principal objects that are not held by the
+transaction.
+In other words, all per-AG metadata updates for an unmapped block must be
+completed before the last update to free the extent, and extents should not
+be reallocated until that last update commits to the log.
+The second subtlety comes from the fact that AG header buffers are (usually)
+released between each transaction in a chain.
+This means that other threads can observe an AG in an intermediate state,
+but as long as the first subtlety is handled, this should not affect the
+correctness of filesystem operations.
+Unmounting the filesystem flushes all pending work to disk, which means that
+offline fsck never sees the temporary inconsistencies caused by deferred work
+item processing.
+In this manner, XFS employs a form of eventual consistency to avoid deadlocks
+and increase parallelism.
+
+During the design phase of the reverse mapping and reflink features, it was
+decided that it was impractical to cram all the reverse mapping updates for a
+single filesystem change into a single transaction because a single file
+mapping operation can explode into many small updates:
+
+* The block mapping update itself
+* A reverse mapping update for the block mapping update
+* Fixing the freelist
+* A reverse mapping update for the freelist fix
+
+* A shape change to the block mapping btree
+* A reverse mapping update for the btree update
+* Fixing the freelist (again)
+* A reverse mapping update for the freelist fix
+
+* An update to the reference counting information
+* A reverse mapping update for the refcount update
+* Fixing the freelist (a third time)
+* A reverse mapping update for the freelist fix
+
+* Freeing any space that was unmapped and not owned by any other file
+* Fixing the freelist (a fourth time)
+* A reverse mapping update for the freelist fix
+
+* Freeing the space used by the block mapping btree
+* Fixing the freelist (a fifth time)
+* A reverse mapping update for the freelist fix
+
+Free list fixups are not usually needed more than once per AG per transaction
+chain, but it is theoretically possible if space is very tight.
+For copy-on-write updates this is even worse, because this must be done once to
+remove the space from a staging area and again to map it into the file!
+
+To deal with this explosion in a calm manner, XFS expands its use of deferred
+work items to cover most reverse mapping updates and all refcount updates.
+This reduces the worst case size of transaction reservations by breaking the
+work into a long chain of small updates, which increases the degree of eventual
+consistency in the system.
+Again, this generally isn't a problem because XFS orders its deferred work
+items carefully to avoid resource reuse conflicts between unsuspecting threads.
+
+However, online fsck changes the rules -- remember that although physical
+updates to per-AG structures are coordinated by locking the buffers for AG
+headers, buffer locks are dropped between transactions.
+Once scrub acquires resources and takes locks for a data structure, it must do
+all the validation work without releasing the lock.
+If the main lock for a space btree is an AG header buffer lock, scrub may have
+interrupted another thread that is midway through finishing a chain.
+For example, if a thread performing a copy-on-write has completed a reverse
+mapping update but not the corresponding refcount update, the two AG btrees
+will appear inconsistent to scrub and an observation of corruption will be
+recorded.  This observation will not be correct.
+If a repair is attempted in this state, the results will be catastrophic!
+
+Several solutions to this problem were evaluated upon discovery of this flaw:
+
+1. Add a higher level lock to allocation groups and require writer threads to
+   acquire the higher level lock in AG order before making any changes.
+   This would be very difficult to implement in practice because it is
+   difficult to determine which locks need to be obtained, and in what order,
+   without simulating the entire operation.
+   Performing a dry run of a file operation to discover necessary locks would
+   make the filesystem very slow.
+
+2. Make the deferred work coordinator code aware of consecutive intent items
+   targeting the same AG and have it hold the AG header buffers locked across
+   the transaction roll between updates.
+   This would introduce a lot of complexity into the coordinator since it is
+   only loosely coupled with the actual deferred work items.
+   It would also fail to solve the problem because deferred work items can
+   generate new deferred subtasks, but all subtasks must be complete before
+   work can start on a new sibling task.
+
+3. Teach online fsck to walk all transactions waiting for whichever lock(s)
+   protect the data structure being scrubbed to look for pending operations.
+   The checking and repair operations must factor these pending operations into
+   the evaluations being performed.
+   This solution is a nonstarter because it is *extremely* invasive to the main
+   filesystem.
+
+4. Recognize that only online fsck has this requirement of total consistency
+   of AG metadata, and that online fsck should be relatively rare as compared
+   to filesystem change operations.
+   For each AG, maintain a count of intent items targeting that AG.
+   When online fsck wants to examine an AG, it should lock the AG header
+   buffers to quiesce all transaction chains that want to modify that AG, and
+   only proceed with the scrub if the count is zero.
+   In other words, scrub only proceeds if it can lock the AG header buffers and
+   there can't possibly be any intents in progress.
+   This may lead to fairness and starvation issues, but regular filesystem
+   updates take precedence over online fsck activity.
+
+Intent Drains
+`````````````
+
+The fourth solution is implemented in the current iteration of online fsck,
+with atomic_t providing the active intent counter.
+
+There are two key properties to the drain mechanism.
+First, the counter is incremented when a deferred work item is *queued* to a
+transaction, and it is decremented after the associated intent done log item is
+*committed* to another transaction.
+The second property is that deferred work can be added to a transaction without
+holding an AG header lock, but per-AG work items cannot be marked done without
+locking that AG header buffer to log the physical updates and the intent done
+log item.
+The first property enables scrub to yield to running transaction chains, which
+is an explicit deprioritization of online fsck to benefit file operations.
+The second property of the drain is key to the correct coordination of scrub,
+since scrub will always be able to decide if a conflict is possible.
+
+For regular filesystem code, the drain works as follows:
+
+1. Call the appropriate subsystem function to add a deferred work item to a
+   transaction.
+
+2. The function calls ``xfs_drain_bump`` to increase the counter.
+
+3. When the deferred item manager wants to finish the deferred work item, it
+   calls ``->finish_item`` to complete it.
+
+4. The ``->finish_item`` implementation logs some changes and calls
+   ``xfs_drain_drop`` to decrease the sloppy counter and wake up any threads
+   waiting on the drain.
+
+5. The subtransaction commits, which unlocks the resource associated with the
+   intent item.
+
+For scrub, the drain works as follows:
+
+1. Lock the resource(s) associated with the metadata being scrubbed.
+   For example, a scan of the refcount btree would lock the AGI and AGF header
+   buffers.
+
+2. If the counter is zero (``xfs_drain_busy`` returns false), there are no
+   chains in progress and the operation may proceed.
+
+3. Otherwise, release the resources grabbed in step 1.
+
+4. Wait for the intent counter to reach zero (``xfs_drain_intents``), then go
+   back to step 1 unless a signal has been caught.
+
+To avoid polling in step 4, the drain provides a waitqueue for scrub threads to
+be woken up whenever the intent count drops to zero.
+
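+A minimal sketch of such a drain, built from standard kernel primitives
+(``atomic_t`` and a wait queue), might look like the following; this is not
+the actual XFS implementation:
+
+.. code-block:: c
+
+	#include <linux/atomic.h>
+	#include <linux/wait.h>
+
+	struct example_drain {
+		atomic_t		count;	/* live intent items */
+		wait_queue_head_t	wait;	/* scrub sleeps here */
+	};
+
+	/* Called when a deferred work item is queued to a transaction. */
+	static void example_drain_bump(struct example_drain *d)
+	{
+		atomic_inc(&d->count);
+	}
+
+	/* Called after the intent done item has been committed. */
+	static void example_drain_drop(struct example_drain *d)
+	{
+		if (atomic_dec_and_test(&d->count))
+			wake_up(&d->wait);
+	}
+
+	/* Scrub waits here until no intent items remain, or a signal arrives. */
+	static int example_drain_wait(struct example_drain *d)
+	{
+		return wait_event_killable(d->wait,
+				atomic_read(&d->count) == 0);
+	}
+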
+The proposed patchset is the
+`scrub intent drain series
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-drain-intents>`_.
+
+.. _jump_labels:
+
+Static Keys (aka Jump Label Patching)
+`````````````````````````````````````
+
+Online fsck for XFS separates the regular filesystem from the checking and
+repair code as much as possible.
+However, there are a few parts of online fsck (such as the intent drains, and
+later, live update hooks) where it is useful for the online fsck code to know
+what's going on in the rest of the filesystem.
+Since it is not expected that online fsck will be constantly running in the
+background, it is very important to minimize the runtime overhead imposed by
+these hooks when online fsck is compiled into the kernel but not actively
+running on behalf of userspace.
+Taking locks in the hot path of a writer thread to access a data structure only
+to find that no further action is necessary is expensive -- on the author's
+computer, this has an overhead of 40-50ns per access.
+Fortunately, the kernel supports dynamic code patching, which enables XFS to
+replace a static branch to hook code with ``nop`` sleds when online fsck isn't
+running.
+This sled has an overhead of however long it takes the instruction decoder to
+skip past the sled, which seems to be on the order of less than 1ns and
+does not access memory outside of instruction fetching.
+
+When online fsck enables the static key, the sled is replaced with an
+unconditional branch to call the hook code.
+The switchover is quite expensive (~22000ns) but is paid entirely by the
+program that invoked online fsck, and can be amortized if multiple threads
+enter online fsck at the same time, or if multiple filesystems are being
+checked at the same time.
+Changing the branch direction requires taking the CPU hotplug lock, and since
+CPU initialization requires memory allocation, online fsck must be careful not
+to change a static key while holding any locks or resources that could be
+accessed in the memory reclaim paths.
+To minimize contention on the CPU hotplug lock, care should be taken not to
+enable or disable static keys unnecessarily.
+
+Because static keys are intended to minimize hook overhead for regular
+filesystem operations when xfs_scrub is not running, the intended usage
+patterns are as follows:
+
+- The hooked part of XFS should declare a static-scoped static key that
+  defaults to false.
+  The ``DEFINE_STATIC_KEY_FALSE`` macro takes care of this.
+  The static key itself should be declared as a ``static`` variable.
+
+- When deciding to invoke code that's only used by scrub, the regular
+  filesystem should call the ``static_branch_unlikely`` predicate to avoid the
+  scrub-only hook code if the static key is not enabled.
+
+- The regular filesystem should export helper functions that call
+  ``static_branch_inc`` to enable and ``static_branch_dec`` to disable the
+  static key.
+  Wrapper functions make it easy to compile out the relevant code if the kernel
+  distributor turns off online fsck at build time.
+
+- Scrub functions wanting to turn on scrub-only XFS functionality should call
+  the ``xchk_fshooks_enable`` from the setup function to enable a specific
+  hook.
+  This must be done before obtaining any resources that are used by memory
+  reclaim.
+  Callers had better be sure they really need the functionality gated by the
+  static key; the ``TRY_HARDER`` flag is useful here.
+
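+Putting those pieces together, a hooked call site and its enable/disable
+helpers might look like the sketch below; the key and hook names are invented
+for illustration:
+
+.. code-block:: c
+
+	#include <linux/jump_label.h>
+
+	static DEFINE_STATIC_KEY_FALSE(example_scrub_hook_key);
+
+	void example_call_scrub_hook(void);	/* hypothetical hook */
+
+	/* Hot path: nearly free when the key is disabled. */
+	static inline void example_maybe_call_hook(void)
+	{
+		if (static_branch_unlikely(&example_scrub_hook_key))
+			example_call_scrub_hook();	/* scrub-only work */
+	}
+
+	/* Called from scrub setup, before touching reclaim-visible resources. */
+	void example_scrub_hooks_enable(void)
+	{
+		static_branch_inc(&example_scrub_hook_key);
+	}
+
+	/* Called from scrub teardown. */
+	void example_scrub_hooks_disable(void)
+	{
+		static_branch_dec(&example_scrub_hook_key);
+	}
+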
+Online scrub has resource acquisition helpers (e.g. ``xchk_perag_lock``) to
+handle locking AGI and AGF buffers for all scrubber functions.
+If it detects a conflict between scrub and the running transactions, it will
+try to wait for intents to complete.
+If the caller of the helper has not enabled the static key, the helper will
+return -EDEADLOCK, which should result in the scrub being restarted with the
+``TRY_HARDER`` flag set.
+The scrub setup function should detect that flag, enable the static key, and
+try the scrub again.
+Scrub teardown disables all static keys obtained by ``xchk_fshooks_enable``.
+
+For more information, please see the kernel documentation of
+Documentation/staging/static-keys.rst.


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 07/14] xfs: document pageable kernel memory
  2022-12-30 22:10 ` [PATCHSET v24.0 00/14] xfs: design documentation for online fsck Darrick J. Wong
                     ` (5 preceding siblings ...)
  2022-12-30 22:10   ` [PATCH 05/14] xfs: document the filesystem metadata checking strategy Darrick J. Wong
@ 2022-12-30 22:10   ` Darrick J. Wong
  2023-02-02  7:14     ` Allison Henderson
  2022-12-30 22:10   ` [PATCH 09/14] xfs: document online file metadata repair code Darrick J. Wong
                     ` (8 subsequent siblings)
  15 siblings, 1 reply; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:10 UTC (permalink / raw)
  To: djwong
  Cc: linux-xfs, willy, chandan.babu, allison.henderson, linux-fsdevel,
	hch, catherine.hoang, david

From: Darrick J. Wong <djwong@kernel.org>

Add a discussion of pageable kernel memory, since online fsck needs
quite a bit more memory than most other parts of the filesystem to stage
records and other information.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 .../filesystems/xfs-online-fsck-design.rst         |  490 ++++++++++++++++++++
 1 file changed, 490 insertions(+)


diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst
index 419eb54ee200..9d7a2ef1d0dd 100644
--- a/Documentation/filesystems/xfs-online-fsck-design.rst
+++ b/Documentation/filesystems/xfs-online-fsck-design.rst
@@ -383,6 +383,8 @@ Algorithms") of Srinivasan.
 However, any data structure builder that maintains a resource lock for the
 duration of the repair is *always* an offline algorithm.
 
+.. _secondary_metadata:
+
 Secondary Metadata
 ``````````````````
 
@@ -1746,3 +1748,491 @@ Scrub teardown disables all static keys obtained by ``xchk_fshooks_enable``.
 
 For more information, please see the kernel documentation of
 Documentation/staging/static-keys.rst.
+
+.. _xfile:
+
+Pageable Kernel Memory
+----------------------
+
+Demonstrations of the first few prototypes of online repair revealed new
+technical requirements that were not originally identified.
+For the first demonstration, the code walked whatever filesystem
+metadata it needed to synthesize new records and inserted records into a new
+btree as it found them.
+This was subpar since any additional corruption or runtime errors encountered
+during the walk would shut down the filesystem.
+After remount, the blocks containing the half-rebuilt data structure would not
+be accessible until another repair was attempted.
+Solving the problem of half-rebuilt data structures will be discussed in the
+next section.
+
+For the second demonstration, the synthesized records were instead stored in
+kernel slab memory.
+Doing so enabled online repair to abort without writing to the filesystem if
+the metadata walk failed, which prevented online fsck from making things worse.
+However, even this approach needed further improvement.
+
+There are four reasons why traditional Linux kernel memory management isn't
+suitable for storing large datasets:
+
+1. Although it is tempting to allocate a contiguous block of memory to create a
+   C array, the kernel cannot be relied upon to allocate multiple physically
+   contiguous memory pages.
+
+2. While disparate physical pages can be virtually mapped together, installed
+   memory might still not be large enough to stage the entire record set in
+   memory while constructing a new btree.
+
+3. To overcome these two difficulties, the implementation was adjusted to use
+   doubly linked lists, but then every record object needed two 64-bit list
+   head pointers, which is a lot of overhead.
+
+4. Kernel memory is pinned, which can drive the system out of memory, leading
+   to OOM kills of unrelated processes.
+
+For the third iteration, attention swung back to the possibility of using
+byte-indexed array-like storage to reduce the overhead of in-memory records.
+At any given time, online repair does not need to keep the entire record set in
+memory, which means that individual records can be paged out.
+Creating new temporary files in the XFS filesystem to store intermediate data
+was explored and rejected for some types of repairs because a filesystem with
+compromised space and inode metadata should never be used to fix compromised
+space or inode metadata.
+However, the kernel already has a facility for byte-addressable and pageable
+storage: shmfs.
+In-kernel graphics drivers (most notably i915) take advantage of shmfs files
+to store intermediate data that doesn't need to be in memory at all times, so
+that usage precedent is already established.
+Hence, the ``xfile`` was born!
+
+xfile Access Models
+```````````````````
+
+A survey of the intended uses of xfiles suggested these use cases:
+
+1. Arrays of fixed-sized records (space management btrees, directory and
+   extended attribute entries)
+
+2. Sparse arrays of fixed-sized records (quotas and link counts)
+
+3. Large binary objects (BLOBs) of variable sizes (directory and extended
+   attribute names and values)
+
+4. Staging btrees in memory (reverse mapping btrees)
+
+5. Arbitrary contents (realtime space management)
+
+To support the first four use cases, high level data structures wrap the xfile
+to share functionality between online fsck functions.
+The rest of this section discusses the interfaces that the xfile presents to
+four of those five higher level data structures.
+The fifth use case is discussed in the :ref:`realtime summary <rtsummary>` case
+study.
+
+The most general storage interface supported by the xfile enables the reading
+and writing of arbitrary quantities of data at arbitrary offsets in the xfile.
+This capability is provided by ``xfile_pread`` and ``xfile_pwrite`` functions,
+which behave similarly to their userspace counterparts.
+XFS is very record-based, which suggests that the ability to load and store
+complete records is important.
+To support these cases, a pair of ``xfile_obj_load`` and ``xfile_obj_store``
+functions are provided to read and persist objects into an xfile.
+They are internally the same as pread and pwrite, except that they treat any
+error as an out of memory error.
+For online repair, squashing error conditions in this manner is an acceptable
+behavior because the only reaction is to abort the operation back to userspace.
+All five xfile use cases can be serviced by these four functions.
+
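+As an example of the record-based interface, the sketch below stages one
+fixed-size record at a computed offset.
+The ``xfile_obj_store`` prototype shown here is assumed for the example and
+may differ from the real interface:
+
+.. code-block:: c
+
+	#include <linux/types.h>
+
+	struct xfile;
+
+	/* Assumed prototype; the actual interface may differ. */
+	int xfile_obj_store(struct xfile *xf, const void *buf, size_t count,
+			    loff_t pos);
+
+	struct example_stage_rec {
+		u64	key;
+		u64	value;
+	};
+
+	/* Stage record number idx in the xfile; any error acts like -ENOMEM. */
+	static int example_stage_record(struct xfile *xf, u64 idx,
+					const struct example_stage_rec *rec)
+	{
+		return xfile_obj_store(xf, rec, sizeof(*rec),
+				idx * sizeof(*rec));
+	}
+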
+However, no discussion of file access idioms is complete without answering the
+question, "But what about mmap?"
+It would be *much* more convenient if kernel code could access pageable kernel
+memory with pointers, just like userspace code does with regular memory.
+Like any other filesystem that uses the page cache, reads and writes of xfile
+data lock the cache page and map it into the kernel address space for the
+duration of the operation.
+Unfortunately, shmfs can only write a file page to the swap device if the page
+is unmapped and unlocked, which means the xfile risks causing OOM problems
+unless it is careful not to pin too many pages.
+Therefore, the xfile steers most of its users towards programmatic access so
+that backing pages are not kept locked in memory for longer than is necessary.
+However, for callers performing quick linear scans of xfile data,
+``xfile_get_page`` and ``xfile_put_page`` functions are provided to pin a page
+in memory.
+So far, the only code that uses these functions is the xfarray :ref:`sorting
+<xfarray_sort>` algorithm.
+
+xfile Access Coordination
+`````````````````````````
+
+For security reasons, xfiles must be owned privately by the kernel.
+They are marked ``S_PRIVATE`` to prevent interference from the security system,
+must never be mapped into process file descriptor tables, and their pages must
+never be mapped into userspace processes.
+
+To avoid locking recursion issues with the VFS, all accesses to the shmfs file
+are performed by manipulating the page cache directly.
+xfile writes call the ``->write_begin`` and ``->write_end`` functions of the
+xfile's address space to grab writable pages, copy the caller's buffer into the
+page, and release the pages.
+xfile reads call ``shmem_read_mapping_page_gfp`` to grab pages directly before
+copying the contents into the caller's buffer.
+In other words, xfiles ignore the VFS read and write code paths to avoid
+having to create a dummy ``struct kiocb`` and to avoid taking inode and
+freeze locks.
+
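+For illustration, the read side might resemble the sketch below, which pins a
+shmfs page, copies data out of it, and releases it.
+This is a simplification for the example, not the actual xfile code:
+
+.. code-block:: c
+
+	#include <linux/err.h>
+	#include <linux/gfp.h>
+	#include <linux/highmem.h>
+	#include <linux/mm.h>
+	#include <linux/shmem_fs.h>
+	#include <linux/string.h>
+
+	/* Copy len (<= PAGE_SIZE) bytes out of one page of a shmfs mapping. */
+	static int example_xfile_read_page(struct address_space *mapping,
+					   pgoff_t index, void *buf, size_t len)
+	{
+		struct page	*page;
+		void		*kaddr;
+
+		page = shmem_read_mapping_page_gfp(mapping, index, GFP_KERNEL);
+		if (IS_ERR(page))
+			return PTR_ERR(page);
+
+		kaddr = kmap_local_page(page);	/* map into kernel space */
+		memcpy(buf, kaddr, len);
+		kunmap_local(kaddr);
+		put_page(page);			/* drop the page reference */
+		return 0;
+	}
+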
+If an xfile is shared between threads to stage repairs, the caller must provide
+its own locks to coordinate access.
+
+.. _xfarray:
+
+Arrays of Fixed-Sized Records
+`````````````````````````````
+
+In XFS, each type of indexed space metadata (free space, inodes, reference
+counts, file fork space, and reverse mappings) consists of a set of fixed-size
+records indexed with a classic B+ tree.
+Directories have a set of fixed-size dirent records that point to the names,
+and extended attributes have a set of fixed-size attribute keys that point to
+names and values.
+Quota counters and file link counters index records with numbers.
+During a repair, scrub needs to stage new records during the gathering step and
+retrieve them during the btree building step.
+
+Although this requirement can be satisfied by calling the read and write
+methods of the xfile directly, it is simpler for callers if a higher level
+abstraction takes care of computing array offsets, provides iterator
+functions, and deals with sparse records and sorting.
+The ``xfarray`` abstraction presents a linear array for fixed-size records atop
+the byte-accessible xfile.
+
+.. _xfarray_access_patterns:
+
+Array Access Patterns
+^^^^^^^^^^^^^^^^^^^^^
+
+Array access patterns in online fsck tend to fall into three categories.
+Iteration of records is assumed to be necessary for all cases and will be
+covered in the next section.
+
+The first type of caller handles records that are indexed by position.
+Gaps may exist between records, and a record may be updated multiple times
+during the collection step.
+In other words, these callers want a sparse linearly addressed table file.
+Typical use cases are quota records and file link count records.
+Access to array elements is performed programmatically via ``xfarray_load`` and
+``xfarray_store`` functions, which wrap the similarly-named xfile functions to
+provide loading and storing of array elements at arbitrary array indices.
+Gaps are defined to be null records, and null records are defined to be a
+sequence of all zero bytes.
+Null records are detected by calling ``xfarray_element_is_null``.
+They are created either by calling ``xfarray_unset`` to null out an existing
+record or by never storing anything to an array index.
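+
+A sketch of this first access pattern follows; the record type is hypothetical
+and the function signatures are assumed from the usage shown elsewhere in this
+document:
+
+.. code-block:: c
+
+	struct xrep_nlink_rec	rec = { .nlinks = 1 };
+	int			error;
+
+	/* Store a record in the slot for this inumber... */
+	error = xfarray_store(array, inumber, &rec);
+	if (error)
+	    return error;
+
+	/* ...load it back later... */
+	error = xfarray_load(array, inumber, &rec);
+	if (error)
+	    return error;
+
+	/* ...and null out the slot if the record is no longer needed. */
+	if (!xfarray_element_is_null(array, &rec))
+	    error = xfarray_unset(array, inumber);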
+
+The second type of caller handles records that are not indexed by position
+and do not require multiple updates to a record.
+The typical use case here is rebuilding space btrees and key/value btrees.
+These callers can add records to the array without caring about array indices
+via the ``xfarray_append`` function, which stores a record at the end of the
+array.
+For callers that require records to be presentable in a specific order (e.g.
+rebuilding btree data), the ``xfarray_sort`` function can sort the records;
+this function will be covered later.
+
+The third type of caller is a bag, which is useful for counting records.
+The typical use case here is constructing space extent reference counts from
+reverse mapping information.
+Records can be put in the bag in any order, they can be removed from the bag
+at any time, and uniqueness of records is left to callers.
+The ``xfarray_store_anywhere`` function is used to insert a record in any
+null record slot in the bag; and the ``xfarray_unset`` function removes a
+record from the bag.
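+
+A sketch of the bag usage follows; finding ``idx`` requires the iteration
+idioms described in the next section, and the signatures shown here are
+assumptions:
+
+.. code-block:: c
+
+	/* Toss a reverse mapping into the bag, reusing any null slot. */
+	error = xfarray_store_anywhere(bag, &rmap_rec);
+	if (error)
+	    return error;
+
+	/* Later, remove the record at the slot where iteration found it. */
+	error = xfarray_unset(bag, idx);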
+
+The proposed patchset is the
+`big in-memory array
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=big-array>`_.
+
+Iterating Array Elements
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+Most users of the xfarray require the ability to iterate the records stored in
+the array.
+Callers can probe every possible array index with the following:
+
+.. code-block:: c
+
+	xfarray_idx_t i;
+	foreach_xfarray_idx(array, i) {
+	    xfarray_load(array, i, &rec);
+
+	    /* do something with rec */
+	}
+
+All users of this idiom must be prepared to handle null records or must already
+know that there aren't any.
+
+For xfarray users that want to iterate a sparse array, the ``xfarray_iter``
+function ignores indices in the xfarray that have never been written to by
+calling ``xfile_seek_data`` (which internally uses ``SEEK_DATA``) to skip areas
+of the array that are not populated with memory pages.
+Once it finds a page, it will skip the zeroed areas of the page.
+
+.. code-block:: c
+
+	xfarray_idx_t i = XFARRAY_CURSOR_INIT;
+	while ((ret = xfarray_iter(array, &i, &rec)) == 1) {
+	    /* do something with rec */
+	}
+
+.. _xfarray_sort:
+
+Sorting Array Elements
+^^^^^^^^^^^^^^^^^^^^^^
+
+During the fourth demonstration of online repair, a community reviewer remarked
+that for performance reasons, online repair ought to load batches of records
+into btree record blocks instead of inserting records into a new btree one at a
+time.
+The btree insertion code in XFS is responsible for maintaining correct ordering
+of the records, so naturally the xfarray must also support sorting the record
+set prior to bulk loading.
+
+The sorting algorithm used in the xfarray is actually a combination of adaptive
+quicksort and a heapsort subalgorithm in the spirit of
+`Sedgewick <https://algs4.cs.princeton.edu/23quicksort/>`_ and
+`pdqsort <https://github.com/orlp/pdqsort>`_, with customizations for the Linux
+kernel.
+To sort records in a reasonably short amount of time, ``xfarray`` takes
+advantage of the binary subpartitioning offered by quicksort, but it also uses
+heapsort to hedge against performance collapse if the chosen quicksort pivots
+are poor.
+Both algorithms are (in general) O(n * lg(n)), but there is a wide performance
+gulf between the two implementations.
+
+The Linux kernel already contains a reasonably fast implementation of heapsort.
+It only operates on regular C arrays, which limits the scope of its usefulness.
+There are two key places where the xfarray uses it:
+
+* Sorting any record subset backed by a single xfile page.
+
+* Loading a small number of xfarray records from potentially disparate parts
+  of the xfarray into a memory buffer, and sorting the buffer.
+
+In other words, ``xfarray`` uses heapsort to constrain the nested recursion of
+quicksort, thereby mitigating quicksort's worst runtime behavior.
+
+Choosing a quicksort pivot is a tricky business.
+A good pivot splits the set to sort in half, leading to the divide and conquer
+behavior that is crucial to  O(n * lg(n)) performance.
+A poor pivot barely splits the subset at all, leading to O(n\ :sup:`2`)
+runtime.
+The xfarray sort routine tries to avoid picking a bad pivot by sampling nine
+records into a memory buffer and using the kernel heapsort to identify the
+median of the nine.
+
+Most modern quicksort implementations employ Tukey's "ninther" to select a
+pivot from a classic C array.
+Typical ninther implementations pick three unique triads of records, sort each
+of the triads, and then sort the middle values of the triads to determine the
+ninther value.
+As stated previously, however, xfile accesses are not entirely cheap.
+It turned out to be much more performant to read the nine elements into a
+memory buffer, run the kernel's in-memory heapsort on the buffer, and choose
+the 4th element of that buffer as the pivot.
+Tukey's ninthers are described in J. W. Tukey, `The ninther, a technique for
+low-effort robust (resistant) location in large samples`, in *Contributions to
+Survey Sampling and Applied Statistics*, edited by H. David, (Academic Press,
+1978), pp. 251–257.
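+
+In rough C, the pivot selection might look like the following, where ``lo``
+and ``hi`` bound the subset being partitioned and ``samples`` is a scratch
+buffer large enough for nine records; all of the names are illustrative:
+
+.. code-block:: c
+
+	xfarray_idx_t	step = (hi - lo) / 8;
+	int		i;
+
+	/* Load nine evenly spaced records into the scratch buffer. */
+	for (i = 0; i < 9; i++)
+	    xfarray_load(array, lo + (i * step), samples + (i * recsize));
+
+	/* Heapsort the nine samples and take the median as the pivot. */
+	sort(samples, 9, recsize, cmp_fn, NULL);
+	memcpy(pivot, samples + (4 * recsize), recsize);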
+
+The partitioning of quicksort is fairly textbook -- rearrange the record
+subset around the pivot, then set up the current and next stack frames to
+sort with the larger and the smaller halves of the pivot, respectively.
+This keeps the stack space requirements to log2(record count).
+
+As a final performance optimization, the hi and lo scanning phase of quicksort
+keeps examined xfile pages mapped in the kernel for as long as possible to
+reduce map/unmap cycles.
+Surprisingly, this reduces overall sort runtime by nearly half again after
+accounting for the application of heapsort directly onto xfile pages.
+
+Blob Storage
+````````````
+
+Extended attributes and directories add an additional requirement for staging
+records: arbitrary byte sequences of finite length.
+Each directory entry record needs to store the entry name,
+and each extended attribute needs to store both the attribute name and value.
+The names, keys, and values can consume a large amount of memory, so the
+``xfblob`` abstraction was created to simplify management of these blobs
+atop an xfile.
+
+Blob arrays provide ``xfblob_load`` and ``xfblob_store`` functions to retrieve
+and persist objects.
+The store function returns a magic cookie for every object that it persists.
+Later, callers provide this cookie to ``xfblob_load`` to recall the object.
+The ``xfblob_free`` function frees a specific blob, and the ``xfblob_truncate``
+function frees them all because compaction is not needed.
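+
+A sketch of how a directory repair might use these functions follows; the
+cookie type and function signatures are assumptions:
+
+.. code-block:: c
+
+	xfblob_cookie	cookie;
+	int		error;
+
+	/* Persist the entry name and remember where it went... */
+	error = xfblob_store(blobs, &cookie, name, namelen);
+	if (error)
+	    return error;
+	dirent_rec.name_cookie = cookie;
+
+	/* ...then recall the name when writing the temporary directory. */
+	error = xfblob_load(blobs, dirent_rec.name_cookie, name, namelen);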
+
+The details of repairing directories and extended attributes will be discussed
+in a subsequent section about atomic extent swapping.
+However, it should be noted that these repair functions only use blob storage
+to cache a small number of entries before adding them to a temporary ondisk
+file, which is why compaction is not required.
+
+The proposed patchset is at the start of the
+`extended attribute repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-xattrs>`_ series.
+
+.. _xfbtree:
+
+In-Memory B+Trees
+`````````````````
+
+The chapter about :ref:`secondary metadata<secondary_metadata>` mentioned that
+checking and repairing of secondary metadata commonly requires coordination
+between a live metadata scan of the filesystem and writer threads that are
+updating that metadata.
+Keeping the scan data up to date requires the ability to propagate
+metadata updates from the filesystem into the data being collected by the scan.
+This *can* be done by appending concurrent updates into a separate log file and
+applying them before writing the new metadata to disk, but this leads to
+unbounded memory consumption if the rest of the system is very busy.
+Another option is to skip the side-log and commit live updates from the
+filesystem directly into the scan data, which trades more overhead for a lower
+maximum memory requirement.
+In both cases, the data structure holding the scan results must support indexed
+access to perform well.
+
+Given that indexed lookups of scan data are required for both strategies, online
+fsck employs the second strategy of committing live updates directly into
+scan data.
+Because xfarrays are not indexed and do not enforce record ordering, they
+are not suitable for this task.
+Conveniently, however, XFS has a library to create and maintain ordered reverse
+mapping records: the existing rmap btree code!
+If only there was a means to create one in memory.
+
+Recall that the :ref:`xfile <xfile>` abstraction represents memory pages as a
+regular file, which means that the kernel can create byte or block addressable
+virtual address spaces at will.
+The XFS buffer cache specializes in abstracting IO to block-oriented  address
+spaces, which means that adaptation of the buffer cache to interface with
+xfiles enables reuse of the entire btree library.
+Btrees built atop an xfile are collectively known as ``xfbtrees``.
+The next few sections describe how they actually work.
+
+The proposed patchset is the
+`in-memory btree
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=in-memory-btrees>`_
+series.
+
+Using xfiles as a Buffer Cache Target
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Two modifications are necessary to support xfiles as a buffer cache target.
+The first is to make it possible for the ``struct xfs_buftarg`` structure to
+host the ``struct xfs_buf`` rhashtable, because normally those are held by a
+per-AG structure.
+The second change is to modify the buffer ``ioapply`` function to "read" cached
+pages from the xfile and "write" cached pages back to the xfile.
+Multiple access to individual buffers is controlled by the ``xfs_buf`` lock,
+since the xfile does not provide any locking on its own.
+With this adaptation in place, users of the xfile-backed buffer cache use
+exactly the same APIs as users of the disk-backed buffer cache.
+The separation between xfile and buffer cache implies higher memory usage since
+they do not share pages, but this property could some day enable transactional
+updates to an in-memory btree.
+Today, however, it simply eliminates the need for new code.
+
+Space Management with an xfbtree
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Space management for an xfile is very simple -- each btree block is one memory
+page in size.
+These blocks use the same header format as an on-disk btree, but the in-memory
+block verifiers ignore the checksums, assuming that xfile memory is no more
+corruption-prone than regular DRAM.
+Reusing existing code here is more important than absolute memory efficiency.
+
+The very first block of an xfile backing an xfbtree contains a header block.
+The header describes the owner, height, and the block number of the root
+xfbtree block.
+
+To allocate a btree block, use ``xfile_seek_data`` to find a gap in the file.
+If there are no gaps, create one by extending the length of the xfile.
+Preallocate space for the block with ``xfile_prealloc``, and hand back the
+location.
+To free an xfbtree block, use ``xfile_discard`` (which internally uses
+``FALLOC_FL_PUNCH_HOLE``) to remove the memory page from the xfile.
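+
+A sketch of the allocation and freeing paths follows; the hole-finding helper
+is hypothetical, and the ``xfile_prealloc`` and ``xfile_discard`` signatures
+are assumed:
+
+.. code-block:: c
+
+	/* Find a hole in the xfile, or extend the xfile by one block. */
+	pos = xfbtree_find_free_block(xfile);		/* hypothetical */
+
+	/* Back the new block with a memory page before handing it out. */
+	error = xfile_prealloc(xfile, pos, PAGE_SIZE);
+	if (error)
+	    return error;
+
+	/* Freeing a block later punches its page back out of the xfile. */
+	error = xfile_discard(xfile, pos, PAGE_SIZE);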
+
+Populating an xfbtree
+^^^^^^^^^^^^^^^^^^^^^
+
+An online fsck function that wants to create an xfbtree should proceed as
+follows; a condensed sketch appears after the list:
+
+1. Call ``xfile_create`` to create an xfile.
+
+2. Call ``xfs_alloc_memory_buftarg`` to create a buffer cache target structure
+   pointing to the xfile.
+
+3. Pass the buffer cache target, buffer ops, and other information to
+   ``xfbtree_create`` to write an initial tree header and root block to the
+   xfile.
+   Each btree type should define a wrapper that passes necessary arguments to
+   the creation function.
+   For example, rmap btrees define ``xfs_rmapbt_mem_create`` to take care of
+   all the necessary details for callers.
+   A ``struct xfbtree`` object will be returned.
+
+4. Pass the xfbtree object to the btree cursor creation function for the
+   btree type.
+   Following the example above, ``xfs_rmapbt_mem_cursor`` takes care of this
+   for callers.
+
+5. Pass the btree cursor to the regular btree functions to make queries against
+   and to update the in-memory btree.
+   For example, a btree cursor for an rmap xfbtree can be passed to the
+   ``xfs_rmap_*`` functions just like any other btree cursor.
+   See the :ref:`next section<xfbtree_commit>` for information on dealing with
+   xfbtree updates that are logged to a transaction.
+
+6. When finished, delete the btree cursor, destroy the xfbtree object, free the
+   buffer target, and destroy the xfile to release all resources.
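+
+Here is the condensed sketch promised above, using the rmap btree wrappers
+named in the list; the function signatures are approximations and error
+handling is elided:
+
+.. code-block:: c
+
+	error = xfile_create(mp, "rmap staging", &xfile);
+	error = xfs_alloc_memory_buftarg(mp, xfile, &btp);
+	error = xfs_rmapbt_mem_create(mp, agno, btp, &xfbt);
+
+	/* Query and update the staging btree with ordinary rmap calls. */
+	cur = xfs_rmapbt_mem_cursor(mp, tp, xfbt);
+	error = xfs_rmap_map_raw(cur, &rmap);
+	xfs_btree_del_cursor(cur, error);
+
+	/* Tear everything down when the repair is finished. */
+	xfbtree_destroy(xfbt);
+	xfs_free_buftarg(btp);
+	xfile_destroy(xfile);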
+
+.. _xfbtree_commit:
+
+Committing Logged xfbtree Buffers
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Although it is a clever hack to reuse the rmap btree code to handle the staging
+structure, the ephemeral nature of the in-memory btree block storage presents
+some challenges of its own.
+The XFS transaction manager must not commit buffer log items for buffers backed
+by an xfile because the log format does not understand updates for devices
+other than the data device.
+An ephemeral xfbtree probably will not exist by the time the AIL checkpoints
+log transactions back into the filesystem, and certainly won't exist during
+log recovery.
+For these reasons, any code updating an xfbtree in transaction context must
+remove the buffer log items from the transaction and write the updates into the
+backing xfile before committing or cancelling the transaction.
+
+The ``xfbtree_trans_commit`` and ``xfbtree_trans_cancel`` functions implement
+this functionality as follows:
+
+1. Find each buffer log item whose buffer targets the xfile.
+
+2. Record the dirty/ordered status of the log item.
+
+3. Detach the log item from the buffer.
+
+4. Queue the buffer to a special delwri list.
+
+5. Clear the transaction dirty flag if the only dirty log items were the ones
+   that were detached in step 3.
+
+6. Submit the delwri list to commit the changes to the xfile, if the updates
+   are being committed.
+
+After removing xfile logged buffers from the transaction in this manner, the
+transaction can be committed or cancelled.


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 08/14] xfs: document btree bulk loading
  2022-12-30 22:10 ` [PATCHSET v24.0 00/14] xfs: design documentation for online fsck Darrick J. Wong
  2022-12-30 22:10   ` [PATCH 02/14] xfs: document the general theory underlying online fsck design Darrick J. Wong
  2022-12-30 22:10   ` [PATCH 01/14] xfs: document the motivation for " Darrick J. Wong
@ 2022-12-30 22:10   ` Darrick J. Wong
  2023-02-09  5:47     ` Allison Henderson
  2022-12-30 22:10   ` [PATCH 04/14] xfs: document the user interface for online fsck Darrick J. Wong
                     ` (12 subsequent siblings)
  15 siblings, 1 reply; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:10 UTC (permalink / raw)
  To: djwong
  Cc: linux-xfs, willy, chandan.babu, allison.henderson, linux-fsdevel,
	hch, catherine.hoang, david

From: Darrick J. Wong <djwong@kernel.org>

Add a discussion of the btree bulk loading code, which makes it easy to
take an in-memory recordset and write it out to disk in an efficient
manner.  This also enables atomic switchover from the old to the new
structure with minimal potential for leaking the old blocks.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 .../filesystems/xfs-online-fsck-design.rst         |  632 ++++++++++++++++++++
 1 file changed, 632 insertions(+)


diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst
index 9d7a2ef1d0dd..eb61d867e55c 100644
--- a/Documentation/filesystems/xfs-online-fsck-design.rst
+++ b/Documentation/filesystems/xfs-online-fsck-design.rst
@@ -2236,3 +2236,635 @@ this functionality as follows:
 
 After removing xfile logged buffers from the transaction in this manner, the
 transaction can be committed or cancelled.
+
+Bulk Loading of Ondisk B+Trees
+------------------------------
+
+As mentioned previously, early iterations of online repair built new btree
+structures by creating a new btree and adding observations individually.
+Loading a btree one record at a time had a slight advantage of not requiring
+the incore records to be sorted prior to commit, but was very slow and leaked
+blocks if the system went down during a repair.
+Loading records one at a time also meant that repair could not control the
+loading factor of the blocks in the new btree.
+
+Fortunately, the venerable ``xfs_repair`` tool had a more efficient means for
+rebuilding a btree index from a collection of records -- bulk btree loading.
+This was implemented rather inefficiently code-wise, since ``xfs_repair``
+had separate copy-pasted implementations for each btree type.
+
+To prepare for online fsck, each of the four bulk loaders was studied, notes
+were taken, and the four were refactored into a single generic btree bulk
+loading mechanism.
+Those notes in turn have been refreshed and are presented below.
+
+Geometry Computation
+````````````````````
+
+The zeroth step of bulk loading is to assemble the entire record set that will
+be stored in the new btree, and sort the records.
+Next, call ``xfs_btree_bload_compute_geometry`` to compute the shape of the
+btree from the record set, the type of btree, and any load factor preferences.
+This information is required for resource reservation.
+
+First, the geometry computation computes the minimum and maximum records that
+will fit in a leaf block from the size of a btree block and the size of the
+block header.
+Roughly speaking, the maximum number of records is::
+
+        maxrecs = (block_size - header_size) / record_size
+
+The XFS design specifies that btree blocks should be merged when possible,
+which means the minimum number of records is half of maxrecs::
+
+        minrecs = maxrecs / 2
+
+The next variable to determine is the desired loading factor.
+This must be at least minrecs and no more than maxrecs.
+Choosing minrecs is undesirable because it wastes half the block.
+Choosing maxrecs is also undesirable because adding a single record to each
+newly rebuilt leaf block will cause a tree split, which causes a noticeable
+drop in performance immediately afterwards.
+The default loading factor was chosen to be 75% of maxrecs, which provides a
+reasonably compact structure without any immediate split penalties.
+If space is tight, the loading factor will be set to maxrecs to try to avoid
+running out of space::
+
+        leaf_load_factor = enough space ? (maxrecs + minrecs) / 2 : maxrecs
+
+Load factor is computed for btree node blocks using the combined size of the
+btree key and pointer as the record size::
+
+        maxrecs = (block_size - header_size) / (key_size + ptr_size)
+        minrecs = maxrecs / 2
+        node_load_factor = enough space ? (maxrecs + minrecs) / 2 : maxrecs
+
+Once that's done, the number of leaf blocks required to store the record set
+can be computed as::
+
+        leaf_blocks = ceil(record_count / leaf_load_factor)
+
+The number of node blocks needed to point to the next level down in the tree
+is computed as::
+
+        n_blocks = (n == 0 ? leaf_blocks : node_blocks[n])
+        node_blocks[n + 1] = ceil(n_blocks / node_load_factor)
+
+The entire computation is performed recursively until the current level only
+needs one block.
+The resulting geometry is as follows:
+
+- For AG-rooted btrees, this level is the root level, so the height of the new
+  tree is ``level + 1`` and the space needed is the summation of the number of
+  blocks on each level.
+
+- For inode-rooted btrees where the records in the top level do not fit in the
+  inode fork area, the height is ``level + 2``, the space needed is the
+  summation of the number of blocks on each level, and the inode fork points to
+  the root block.
+
+- For inode-rooted btrees where the records in the top level can be stored in
+  the inode fork area, then the root block can be stored in the inode, the
+  height is ``level + 1``, and the space needed is one less than the summation
+  of the number of blocks on each level.
+  This only becomes relevant when non-bmap btrees gain the ability to root in
+  an inode, which is a future patchset and only included here for completeness.
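+
+Stripped of the btree-specific details, the level-by-level computation
+described above could be sketched as follows; the variable names are
+illustrative only:
+
+.. code-block:: c
+
+	uint64_t	nr_this_level = record_count;
+	uint64_t	level_blocks;
+	uint64_t	total_blocks = 0;
+	unsigned int	load_factor = leaf_load_factor;
+	unsigned int	nr_levels = 0;
+
+	do {
+	    /* Round up: a partially filled block still needs a block. */
+	    level_blocks = (nr_this_level + load_factor - 1) / load_factor;
+	    total_blocks += level_blocks;
+	    nr_levels++;
+
+	    /* Each block at this level becomes one record in the next. */
+	    nr_this_level = level_blocks;
+	    load_factor = node_load_factor;
+	} while (level_blocks > 1);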
+
+.. _newbt:
+
+Reserving New B+Tree Blocks
+```````````````````````````
+
+Once repair knows the number of blocks needed for the new btree, it allocates
+those blocks using the free space information.
+Each reserved extent is tracked separately by the btree builder state data.
+To improve crash resilience, the reservation code also logs an Extent Freeing
+Intent (EFI) item in the same transaction as each space allocation and attaches
+its in-memory ``struct xfs_extent_free_item`` object to the space reservation.
+If the system goes down, log recovery will use the unfinished EFIs to free the
+unused space, leaving the filesystem unchanged.
+
+Each time the btree builder claims a block for the btree from a reserved
+extent, it updates the in-memory reservation to reflect the claimed space.
+Block reservation tries to allocate as much contiguous space as possible to
+reduce the number of EFIs in play.
+
+While repair is writing these new btree blocks, the EFIs created for the space
+reservations pin the tail of the ondisk log.
+It's possible that other parts of the system will remain busy and push the head
+of the log towards the pinned tail.
+To avoid livelocking the filesystem, the EFIs must not pin the tail of the log
+for too long.
+To alleviate this problem, the dynamic relogging capability of the deferred ops
+mechanism is reused here to commit a transaction at the log head containing an
+EFD for the old EFI and a new EFI.
+This enables the log to release the old EFI to keep the log moving forwards.
+
+EFIs have a role to play during the commit and reaping phases; please see the
+next section and the section about :ref:`reaping<reaping>` for more details.
+
+Proposed patchsets are the
+`bitmap rework
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-bitmap-rework>`_
+and the
+`preparation for bulk loading btrees
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-prep-for-bulk-loading>`_.
+
+
+Writing the New Tree
+````````````````````
+
+This part is pretty simple -- the btree builder (``xfs_btree_bulkload``) claims
+a block from the reserved list, writes the new btree block header, fills the
+rest of the block with records, and adds the new leaf block to a list of
+written blocks.
+Sibling pointers are set every time a new block is added to the level.
+When it finishes writing the record leaf blocks, it moves on to the node
+blocks.
+To fill a node block, it walks each block in the next level down in the tree
+to compute the relevant keys and write them into the parent node.
+When it reaches the root level, it is ready to commit the new btree!
+
+The first step to commit the new btree is to persist the btree blocks to disk
+synchronously.
+This is a little complicated because a new btree block could have been freed
+in the recent past, so the builder must use ``xfs_buf_delwri_queue_here`` to
+remove the (stale) buffer from the AIL list before it can write the new blocks
+to disk.
+Blocks are queued for IO using a delwri list and written in one large batch
+with ``xfs_buf_delwri_submit``.
+
+Once the new blocks have been persisted to disk, control returns to the
+individual repair function that called the bulk loader.
+The repair function must log the location of the new root in a transaction,
+clean up the space reservations that were made for the new btree, and reap the
+old metadata blocks:
+
+1. Commit the location of the new btree root.
+
+2. For each incore reservation:
+
+   a. Log Extent Freeing Done (EFD) items for all the space that was consumed
+      by the btree builder.  The new EFDs must point to the EFIs attached to
+      the reservation to prevent log recovery from freeing the new blocks.
+
+   b. For unclaimed portions of incore reservations, create a regular deferred
+      extent free work item to free the unused space later in the
+      transaction chain.
+
+   c. The EFDs and EFIs logged in steps 2a and 2b must not overrun the
+      reservation of the committing transaction.
+      If the btree loading code suspects this might be about to happen, it must
+      call ``xrep_defer_finish`` to clear out the deferred work and obtain a
+      fresh transaction.
+
+3. Clear out the deferred work a second time to finish the commit and clean
+   the repair transaction.
+
+The transaction rolling in steps 2c and 3 represents a weakness in the repair
+algorithm, because a log flush and a crash before the end of the reap step can
+result in space leaking.
+Online repair functions minimize the chances of this occurring by using very
+large transactions, each of which can accommodate many thousands of block freeing
+instructions.
+Repair moves on to reaping the old blocks, which will be presented in a
+subsequent :ref:`section<reaping>` after a few case studies of bulk loading.
+
+Case Study: Rebuilding the Inode Index
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The high level process to rebuild the inode index btree is:
+
+1. Walk the reverse mapping records to generate ``struct xfs_inobt_rec``
+   records from the inode chunk information and a bitmap of the old inode btree
+   blocks.
+
+2. Append the records to an xfarray in inode order.
+
+3. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number
+   of blocks needed for the inode btree.
+   If the free space inode btree is enabled, call it again to estimate the
+   geometry of the finobt.
+
+4. Allocate the number of blocks computed in the previous step.
+
+5. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and
+   generate the internal node blocks.
+   If the free space inode btree is enabled, call it again to load the finobt.
+
+6. Commit the location of the new btree root block(s) to the AGI.
+
+7. Reap the old btree blocks using the bitmap created in step 1.
+
+Details are as follows.
+
+The inode btree maps inumbers to the ondisk location of the associated
+inode records, which means that the inode btrees can be rebuilt from the
+reverse mapping information.
+Reverse mapping records with an owner of ``XFS_RMAP_OWN_INOBT`` mark the
+location of the old inode btree blocks.
+Each reverse mapping record with an owner of ``XFS_RMAP_OWN_INODES`` marks the
+location of at least one inode cluster buffer.
+A cluster is the smallest number of ondisk inodes that can be allocated or
+freed in a single transaction; it is never smaller than 1 fs block or 4 inodes.
+
+For the space represented by each inode cluster, ensure that there are no
+records in the free space btrees nor any records in the reference count btree.
+If there are, the space metadata inconsistencies are reason enough to abort the
+operation.
+Otherwise, read each cluster buffer to check that its contents appear to be
+ondisk inodes and to decide if the file is allocated
+(``xfs_dinode.i_mode != 0``) or free (``xfs_dinode.i_mode == 0``).
+Accumulate the results of successive inode cluster buffer reads until there is
+enough information to fill a single inode chunk record, which is 64 consecutive
+numbers in the inumber keyspace.
+If the chunk is sparse, the chunk record may include holes.
+
+Once the repair function accumulates one chunk's worth of data, it calls
+``xfarray_append`` to add the inode btree record to the xfarray.
+This xfarray is walked twice during the btree creation step -- once to populate
+the inode btree with all inode chunk records, and a second time to populate the
+free inode btree with records for chunks that have free non-sparse inodes.
+The number of records for the inode btree is the number of xfarray records,
+but the record count for the free inode btree has to be computed as inode chunk
+records are stored in the xfarray.
+
+The proposed patchset is the
+`AG btree repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
+series.
+
+Case Study: Rebuilding the Space Reference Counts
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The high level process to rebuild the reference count btree is:
+
+1. Walk the reverse mapping records to generate ``struct xfs_refcount_irec``
+   records for any space having more than one reverse mapping and add them to
+   the xfarray.
+   Any records owned by ``XFS_RMAP_OWN_COW`` are also added to the xfarray.
+   Use any records owned by ``XFS_RMAP_OWN_REFC`` to create a bitmap of old
+   refcount btree blocks.
+
+2. Sort the records in physical extent order, putting the CoW staging extents
+   at the end of the xfarray.
+
+3. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number
+   of blocks needed for the new tree.
+
+4. Allocate the number of blocks computed in the previous step.
+
+5. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and
+   generate the internal node blocks.
+
+6. Commit the location of new btree root block to the AGF.
+
+7. Reap the old btree blocks using the bitmap created in step 1.
+
+Details are as follows; the same algorithm is used by ``xfs_repair`` to
+generate refcount information from reverse mapping records.
+
+Reverse mapping records are used to rebuild the reference count information.
+Reference counts are required for correct operation of copy on write for shared
+file data.
+Imagine the reverse mapping entries as rectangles representing extents of
+physical blocks, and that the rectangles can be laid down to allow them to
+overlap each other.
+From the diagram below, it is apparent that a reference count record must start
+or end wherever the height of the stack changes.
+In other words, the record emission stimulus is level-triggered::
+
+                        █    ███
+              ██      █████ ████   ███        ██████
+        ██   ████     ███████████ ████     █████████
+        ████████████████████████████████ ███████████
+        ^ ^  ^^ ^^    ^ ^^ ^^^  ^^^^  ^ ^^ ^  ^     ^
+        2 1  23 21    3 43 234  2123  1 01 2  3     0
+
+The ondisk reference count btree does not store the refcount == 0 cases because
+the free space btree already records which blocks are free.
+Extents being used to stage copy-on-write operations should be the only records
+with refcount == 1.
+Single-owner file blocks aren't recorded in either the free space or the
+reference count btrees.
+
+Given the reverse mapping btree which orders records by physical block number,
+a starting physical block (``sp``), a bag-like data structure to hold mappings
+that cover ``sp``, and the next physical block where the level changes
+(``np``), reference count information is constructed from reverse mapping data
+as follows:
+
+While there are still unprocessed mappings in the reverse mapping btree:
+
+1. Set ``sp`` to the physical block of the next unprocessed reverse mapping
+   record.
+
+2. Add to the bag all the reverse mappings where ``rm_startblock`` == ``sp``.
+
+3. Set ``np`` to the physical block where the bag size will change.
+   This is the minimum of (``rm_startblock`` of the next unprocessed mapping)
+   and (``rm_startblock`` + ``rm_blockcount`` of each mapping in the bag).
+
+4. Record the bag size as ``old_bag_size``.
+
+5. While the bag isn't empty,
+
+   a. Remove from the bag all mappings where ``rm_startblock`` +
+      ``rm_blockcount`` == ``np``.
+
+   b. Add to the bag all reverse mappings where ``rm_startblock`` == ``np``.
+
+   c. If the bag size isn't ``old_bag_size``, store the refcount record
+      ``(sp, np - sp, old_bag_size)`` in the refcount xfarray.
+
+   d. If the bag is empty, break out of this inner loop.
+
+   e. Set ``old_bag_size`` to ``bag_size``.
+
+   f. Set ``sp`` = ``np``.
+
+   g. Set ``np`` to the physical block where the bag size will change.
+      Go to step 3 above.
+
+The bag-like structure in this case is a type 2 xfarray as discussed in the
+:ref:`xfarray access patterns<xfarray_access_patterns>` section.
+Reverse mappings are added to the bag using ``xfarray_store_anywhere`` and
+removed via ``xfarray_unset``.
+Bag members are examined through ``xfarray_iter`` loops.
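+
+The loop structure above can be sketched roughly as follows; every helper
+named here is hypothetical shorthand for the xfarray and rmap-walking
+operations just described, and error handling is elided:
+
+.. code-block:: c
+
+	while (have_unprocessed_rmaps(cur)) {
+	    sp = next_unprocessed_rmap(cur)->rm_startblock;	/* step 1 */
+	    add_rmaps_starting_at(bag, cur, sp);		/* step 2 */
+	    np = next_bag_size_change(bag, cur);		/* step 3 */
+	    old_bag_size = bag_size(bag);			/* step 4 */
+
+	    while (bag_size(bag) > 0) {				/* step 5 */
+	        remove_rmaps_ending_at(bag, np);
+	        add_rmaps_starting_at(bag, cur, np);
+
+	        if (bag_size(bag) != old_bag_size)
+	            stage_refcount_rec(refcounts, sp, np - sp,
+	                               old_bag_size);
+	        if (bag_size(bag) == 0)
+	            break;
+
+	        old_bag_size = bag_size(bag);
+	        sp = np;
+	        np = next_bag_size_change(bag, cur);
+	    }
+	}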
+
+The proposed patchset is the
+`AG btree repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
+series.
+
+Case Study: Rebuilding File Fork Mapping Indices
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The high level process to rebuild a data/attr fork mapping btree is:
+
+1. Walk the reverse mapping records to generate ``struct xfs_bmbt_rec``
+   records from the reverse mapping records for that inode and fork.
+   Append these records to an xfarray.
+   Compute the bitmap of the old bmap btree blocks from the ``BMBT_BLOCK``
+   records.
+
+2. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number
+   of blocks needed for the new tree.
+
+3. Sort the records in file offset order.
+
+4. If the extent records would fit in the inode fork immediate area, commit the
+   records to that immediate area and skip to step 8.
+
+5. Allocate the number of blocks computed in the previous step.
+
+6. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and
+   generate the internal node blocks.
+
+7. Commit the new btree root block to the inode fork immediate area.
+
+8. Reap the old btree blocks using the bitmap created in step 1.
+
+There are some complications here:
+First, it's possible to move the fork offset to adjust the sizes of the
+immediate areas if the data and attr forks are not both in BMBT format.
+Second, if there are sufficiently few fork mappings, it may be possible to use
+EXTENTS format instead of BMBT, which may require a conversion.
+Third, the incore extent map must be reloaded carefully to avoid disturbing
+any delayed allocation extents.
+
+The proposed patchset is the
+`file repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-inodes>`_
+series.
+
+.. _reaping:
+
+Reaping Old Metadata Blocks
+---------------------------
+
+Whenever online fsck builds a new data structure to replace one that is
+suspect, there is a question of how to find and dispose of the blocks that
+belonged to the old structure.
+The laziest method of course is not to deal with them at all, but this slowly
+leads to service degradations as space leaks out of the filesystem.
+Hopefully, someone will schedule a rebuild of the free space information to
+plug all those leaks.
+Offline repair rebuilds all space metadata after recording the usage of
+the files and directories that it decides not to clear, hence it can build new
+structures in the discovered free space and avoid the question of reaping.
+
+As part of a repair, online fsck relies heavily on the reverse mapping records
+to find space that is owned by the corresponding rmap owner yet truly free.
+Cross referencing rmap records with other rmap records is necessary because
+there may be other data structures that also think they own some of those
+blocks (e.g. crosslinked trees).
+Permitting the block allocator to hand them out again will not push the system
+towards consistency.
+
+For space metadata, the process of finding extents to dispose of generally
+follows this format:
+
+1. Create a bitmap of space used by data structures that must be preserved.
+   The space reservations used to create the new metadata can be used here if
+   the same rmap owner code is used to denote all of the objects being rebuilt.
+
+2. Survey the reverse mapping data to create a bitmap of space owned by the
+   same ``XFS_RMAP_OWN_*`` number for the metadata that is being preserved.
+
+3. Use the bitmap disunion operator to subtract (1) from (2).
+   The remaining set bits represent candidate extents that could be freed.
+   The process moves on to step 4 below.
+
+Repairs for file-based metadata such as extended attributes, directories,
+symbolic links, quota files and realtime bitmaps are performed by building a
+new structure attached to a temporary file and swapping the forks.
+Afterward, the mappings in the old file fork are the candidate blocks for
+disposal.
+
+The process for disposing of old extents is as follows; a sketch appears
+after the list:
+
+4. For each candidate extent, count the number of reverse mapping records for
+   the first block in that extent that do not have the same rmap owner for the
+   data structure being repaired.
+
+   - If zero, the block has a single owner and can be freed.
+
+   - If not, the block is part of a crosslinked structure and must not be
+     freed.
+
+5. Starting with the next block in the extent, figure out how many more blocks
+   have the same zero/nonzero other owner status as that first block.
+
+6. If the region is crosslinked, delete the reverse mapping entry for the
+   structure being repaired and move on to the next region.
+
+7. If the region is to be freed, mark any corresponding buffers in the buffer
+   cache as stale to prevent log writeback.
+
+8. Free the region and move on.
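+
+A sketch of the whole procedure follows, compressing the extent-length scan of
+step 5 into a per-extent test; every helper named here is hypothetical:
+
+.. code-block:: c
+
+	/* Steps 1-3: compute the candidate extents. */
+	record_preserved_space(sc, &keep_bitmap);
+	record_rmaps_for_owner(sc, oinfo, &owner_bitmap);
+	bitmap_disunion(&owner_bitmap, &keep_bitmap);	/* owner &= ~keep */
+
+	/* Steps 4-8: dispose of each candidate extent. */
+	for_each_bitmap_extent(&owner_bitmap, fsbno, len) {
+	    if (count_other_rmap_owners(sc, fsbno) > 0) {
+	        /* Crosslinked: just forget the old owner was here. */
+	        remove_rmap_for_owner(sc, oinfo, fsbno, len);
+	        continue;
+	    }
+	    /* Sole owner: invalidate any buffers and free the space. */
+	    invalidate_buffers(sc, fsbno, len);
+	    free_extent(sc, fsbno, len, oinfo);
+	}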
+
+However, there is one complication to this procedure.
+Transactions are of finite size, so the reaping process must be careful to roll
+the transactions to avoid overruns.
+Overruns come from two sources:
+
+a. EFIs logged on behalf of space that is no longer occupied
+
+b. Log items for buffer invalidations
+
+This is also a window in which a crash during the reaping process can leak
+blocks.
+As stated earlier, online repair functions use very large transactions to
+minimize the chances of this occurring.
+
+The proposed patchset is the
+`preparation for bulk loading btrees
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-prep-for-bulk-loading>`_
+series.
+
+Case Study: Reaping After a Regular Btree Repair
+````````````````````````````````````````````````
+
+Old reference count and inode btrees are the easiest to reap because they have
+rmap records with special owner codes: ``XFS_RMAP_OWN_REFC`` for the refcount
+btree, and ``XFS_RMAP_OWN_INOBT`` for the inode and free inode btrees.
+Creating a list of extents to reap the old btree blocks is quite simple,
+conceptually:
+
+1. Lock the relevant AGI/AGF header buffers to prevent allocation and frees.
+
+2. For each reverse mapping record with an rmap owner corresponding to the
+   metadata structure being rebuilt, set the corresponding range in a bitmap.
+
+3. Walk the current data structures that have the same rmap owner.
+   For each block visited, clear that range in the above bitmap.
+
+4. Each set bit in the bitmap represents a block that could be a block from the
+   old data structures and hence is a candidate for reaping.
+   In other words, ``(rmap_records_owned_by & ~blocks_reachable_by_walk)``
+   are the blocks that might be freeable.
+
+If it is possible to maintain the AGF lock throughout the repair (which is the
+common case), then step 2 can be performed at the same time as the reverse
+mapping record walk that creates the records for the new btree.
+
+Case Study: Rebuilding the Free Space Indices
+`````````````````````````````````````````````
+
+The high level process to rebuild the free space indices is:
+
+1. Walk the reverse mapping records to generate ``struct xfs_alloc_rec_incore``
+   records from the gaps in the reverse mapping btree.
+
+2. Append the records to an xfarray.
+
+3. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number
+   of blocks needed for each new tree.
+
+4. Allocate the number of blocks computed in the previous step from the free
+   space information collected.
+
+5. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and
+   generate the internal node blocks for the free space by block index.
+   Call it again for the free space by length index.
+
+6. Commit the locations of the new btree root blocks to the AGF.
+
+7. Reap the old btree blocks by looking for space that is not recorded by the
+   reverse mapping btree, the new free space btrees, or the AGFL.
+
+Repairing the free space btrees has three key complications over a regular
+btree repair:
+
+First, free space is not explicitly tracked in the reverse mapping records.
+Hence, the new free space records must be inferred from gaps in the physical
+space component of the keyspace of the reverse mapping btree.
+
+Second, free space repairs cannot use the common btree reservation code because
+new blocks are reserved out of the free space btrees.
+This is impossible when repairing the free space btrees themselves.
+However, repair holds the AGF buffer lock for the duration of the free space
+index reconstruction, so it can use the collected free space information to
+supply the blocks for the new free space btrees.
+It is not necessary to back each reserved extent with an EFI because the new
+free space btrees are constructed in what the ondisk filesystem thinks is
+unowned space.
+However, if reserving blocks for the new btrees from the collected free space
+information changes the number of free space records, repair must re-estimate
+the new free space btree geometry with the new record count until the
+reservation is sufficient.
+As part of committing the new btrees, repair must ensure that reverse mappings
+are created for the reserved blocks and that unused reserved blocks are
+inserted into the free space btrees.
+Deferred rmap and freeing operations are used to ensure that this transition
+is atomic, similar to the other btree repair functions.
+
+Third, finding the blocks to reap after the repair is not overly
+straightforward.
+Blocks for the free space btrees and the reverse mapping btrees are supplied by
+the AGFL.
+Blocks put onto the AGFL have reverse mapping records with the owner
+``XFS_RMAP_OWN_AG``.
+This ownership is retained when blocks move from the AGFL into the free space
+btrees or the reverse mapping btrees.
+When repair walks reverse mapping records to synthesize free space records, it
+creates a bitmap (``ag_owner_bitmap``) of all the space claimed by
+``XFS_RMAP_OWN_AG`` records.
+The repair context maintains a second bitmap corresponding to the rmap btree
+blocks and the AGFL blocks (``rmap_agfl_bitmap``).
+When the walk is complete, the bitmap disunion operation ``(ag_owner_bitmap &
+~rmap_agfl_bitmap)`` computes the extents that are used by the old free space
+btrees.
+These blocks can then be reaped using the methods outlined above.
+
+The proposed patchset is the
+`AG btree repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
+series.
+
+.. _rmap_reap:
+
+Case Study: Reaping After Repairing Reverse Mapping Btrees
+``````````````````````````````````````````````````````````
+
+Old reverse mapping btrees are less difficult to reap after a repair.
+As mentioned in the previous section, blocks on the AGFL, the two free space
+btree blocks, and the reverse mapping btree blocks all have reverse mapping
+records with ``XFS_RMAP_OWN_AG`` as the owner.
+The full process of gathering reverse mapping records and building a new btree
+are described in the case study of
+:ref:`live rebuilds of rmap data <rmap_repair>`, but a crucial point from that
+discussion is that the new rmap btree will not contain any records for the old
+rmap btree, nor will the old btree blocks be tracked in the free space btrees.
+The list of candidate reaping blocks is computed by setting the bits
+corresponding to the gaps in the new rmap btree records, and then clearing the
+bits corresponding to extents in the free space btrees and the current AGFL
+blocks.
+The result ``(new_rmapbt_gaps & ~(agfl | bnobt_records))`` are reaped using the
+methods outlined above.
+
+The rest of the process of rebuilding the reverse mapping btree is discussed
+in a separate :ref:`case study<rmap_repair>`.
+
+The proposed patchset is the
+`AG btree repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
+series.
+
+Case Study: Rebuilding the AGFL
+```````````````````````````````
+
+The allocation group free block list (AGFL) is repaired as follows:
+
+1. Create a bitmap for all the space that the reverse mapping data claims is
+   owned by ``XFS_RMAP_OWN_AG``.
+
+2. Subtract the space used by the two free space btrees and the rmap btree.
+
+3. Subtract any space that the reverse mapping data claims is owned by any
+   other owner, to avoid re-adding crosslinked blocks to the AGFL.
+
+4. Once the AGFL is full, reap any blocks leftover.
+
+5. The next operation to fix the freelist will right-size the list.


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 09/14] xfs: document online file metadata repair code
  2022-12-30 22:10 ` [PATCHSET v24.0 00/14] xfs: design documentation for online fsck Darrick J. Wong
                     ` (6 preceding siblings ...)
  2022-12-30 22:10   ` [PATCH 07/14] xfs: document pageable kernel memory Darrick J. Wong
@ 2022-12-30 22:10   ` Darrick J. Wong
  2022-12-30 22:10   ` [PATCH 06/14] xfs: document how online fsck deals with eventual consistency Darrick J. Wong
                     ` (7 subsequent siblings)
  15 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:10 UTC (permalink / raw)
  To: djwong
  Cc: linux-xfs, willy, chandan.babu, allison.henderson, linux-fsdevel,
	hch, catherine.hoang, david

From: Darrick J. Wong <djwong@kernel.org>

Add to the fifth chapter of the online fsck design documentation, where
we discuss the details of the data structures and algorithms used by the
kernel to repair file metadata.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 .../filesystems/xfs-online-fsck-design.rst         |  150 ++++++++++++++++++++
 1 file changed, 150 insertions(+)


diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst
index eb61d867e55c..a658da8fe4ae 100644
--- a/Documentation/filesystems/xfs-online-fsck-design.rst
+++ b/Documentation/filesystems/xfs-online-fsck-design.rst
@@ -2868,3 +2868,153 @@ The allocation group free block list (AGFL) is repaired as follows:
 4. Once the AGFL is full, reap any blocks leftover.
 
 5. The next operation to fix the freelist will right-size the list.
+
+Inode Record Repairs
+--------------------
+
+Inode records must be handled carefully, because they have both ondisk records
+("dinodes") and an in-memory ("cached") representation.
+There is a very high potential for cache coherency issues if online fsck is not
+careful to access the ondisk metadata *only* when the ondisk metadata is so
+badly damaged that the filesystem cannot load the in-memory representation.
+When online fsck wants to open a damaged file for scrubbing, it must use
+specialized resource acquisition functions that return either the in-memory
+representation *or* a lock on whichever object is necessary to prevent any
+update to the ondisk location.
+
+The only repairs that should be made to the ondisk inode buffers are whatever
+is necessary to get the in-core structure loaded.
+This means fixing whatever is caught by the inode cluster buffer and inode fork
+verifiers, and retrying the ``iget`` operation.
+If the second ``iget`` fails, the repair has failed.
+
+Once the in-memory representation is loaded, repair can lock the inode and can
+subject it to comprehensive checks, repairs, and optimizations.
+Most inode attributes are easy to check and constrain, or are user-controlled
+arbitrary bit patterns; these are both easy to fix.
+Dealing with the data and attr fork extent counts and the file block counts is
+more complicated, because computing the correct value requires traversing the
+forks, or if that fails, leaving the fields invalid and waiting for the fork
+fsck functions to run.
+
+The proposed patchset is the
+`inode
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-inodes>`_
+repair series.
+
+Quota Record Repairs
+--------------------
+
+Similar to inodes, quota records ("dquots") also have both ondisk records and
+an in-memory representation, and hence are subject to the same cache coherency
+issues.
+Somewhat confusingly, both are known as dquots in the XFS codebase.
+
+The only repairs that should be made to the ondisk quota record buffers are
+whatever is necessary to get the in-core structure loaded.
+Once the in-memory representation is loaded, the only attributes needing
+checking are obviously bad limits and timer values.
+
+Quota usage counters are checked, repaired, and discussed separately in the
+section about :ref:`live quotacheck <quotacheck>`.
+
+The proposed patchset is the
+`quota
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quota>`_
+repair series.
+
+.. _fscounters:
+
+Freezing to Fix Summary Counters
+--------------------------------
+
+Filesystem summary counters track availability of filesystem resources such
+as free blocks, free inodes, and allocated inodes.
+This information could be compiled by walking the free space and inode indexes,
+but this is a slow process, so XFS maintains a copy in the ondisk superblock
+that should reflect the ondisk metadata, at least when the filesystem has been
+unmounted cleanly.
+For performance reasons, XFS also maintains incore copies of those counters,
+which are key to enabling resource reservations for active transactions.
+Writer threads reserve the worst-case quantities of resources from the
+incore counter and give back whatever they don't use at commit time.
+It is therefore only necessary to serialize on the superblock when the
+superblock is being committed to disk.
+
+The lazy superblock counter feature introduced in XFS v5 took this even further
+by training log recovery to recompute the summary counters from the AG headers,
+which eliminated the need for most transactions even to touch the superblock.
+The only time XFS commits the summary counters is at filesystem unmount.
+To reduce contention even further, the incore counter is implemented as a
+percpu counter, which means that each CPU is allocated a batch of blocks from a
+global incore counter and can satisfy small allocations from the local batch.
+
+The high-performance nature of the summary counters makes it difficult for
+online fsck to check them, since there is no way to quiesce a percpu counter
+while the system is running.
+Although online fsck can read the filesystem metadata to compute the correct
+values of the summary counters, there's no way to hold the value of a percpu
+counter stable, so it's quite possible that the counter will be out of date by
+the time the walk is complete.
+Earlier versions of online scrub would return to userspace with an incomplete
+scan flag, but this is not a satisfying outcome for a system administrator.
+For repairs, the in-memory counters must be stabilized while walking the
+filesystem metadata to get an accurate reading and install it in the percpu
+counter.
+
+To satisfy this requirement, online fsck must prevent other programs in the
+system from initiating new writes to the filesystem, it must disable background
+garbage collection threads, and it must wait for existing writer programs to
+exit the kernel.
+Once that has been established, scrub can walk the AG free space indexes, the
+inode btrees, and the realtime bitmap to compute the correct value of all
+four summary counters.
+This is very similar to a filesystem freeze.
+
+The initial implementation used the actual VFS filesystem freeze mechanism to
+quiesce filesystem activity.
+With the filesystem frozen, it is possible to resolve the counter values with
+exact precision, but there are many problems with calling the VFS methods
+directly:
+
+- Other programs can unfreeze the filesystem without our knowledge.
+  This leads to incorrect scan results and incorrect repairs.
+
+- Adding an extra lock to prevent others from thawing the filesystem required
+  the addition of a ``->freeze_super`` function to wrap ``freeze_fs()``.
+  This in turn caused other subtle problems because it turns out that the VFS
+  ``freeze_super`` and ``thaw_super`` functions can drop the last reference to
+  the VFS superblock, and any subsequent access becomes a UAF bug!
+  This can happen if the filesystem is unmounted while the underlying block
+  device has frozen the filesystem.
+  This problem could be solved by grabbing extra references to the superblock,
+  but it felt suboptimal given the other inadequacies of this approach:
+
+- The log need not be quiesced to check the summary counters, but a VFS freeze
+  initiates one anyway.
+  This adds unnecessary runtime to live fscounter fsck operations.
+
+- Quiescing the log means that XFS flushes the (possibly incorrect) counters to
+  disk as part of cleaning the log.
+
+- A bug in the VFS meant that freeze could complete even when sync_filesystem
+  fails to flush the filesystem and returns an error.
+  This bug was fixed in Linux 5.17.
+
+The author established that the only component of online fsck that requires the
+ability to freeze the filesystem is the fscounter scrubber, so the code for
+this could be localized to that source file.
+fscounter freeze behaves the same as the VFS freeze method, except:
+
+- The final freeze state is set one higher than ``SB_FREEZE_COMPLETE`` to
+  prevent other threads from thawing the filesystem.
+
+- It does not quiesce the log.
+
+With this code in place, it is now possible to pause the filesystem for just
+long enough to check and correct the summary counters.
+
+The proposed patchset is the
+`summary counter cleanup
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-fscounters>`_
+series.


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 10/14] xfs: document full filesystem scans for online fsck
  2022-12-30 22:10 ` [PATCHSET v24.0 00/14] xfs: design documentation for online fsck Darrick J. Wong
                     ` (11 preceding siblings ...)
  2022-12-30 22:10   ` [PATCH 11/14] xfs: document metadata file repair Darrick J. Wong
@ 2022-12-30 22:10   ` Darrick J. Wong
  2023-02-16 15:47     ` Allison Henderson
  2022-12-30 22:10   ` [PATCH 12/14] xfs: document directory tree repairs Darrick J. Wong
                     ` (2 subsequent siblings)
  15 siblings, 1 reply; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:10 UTC (permalink / raw)
  To: djwong
  Cc: linux-xfs, willy, chandan.babu, allison.henderson, linux-fsdevel,
	hch, catherine.hoang, david

From: Darrick J. Wong <djwong@kernel.org>

Certain parts of the online fsck code need to scan every file in the
entire filesystem.  It is not acceptable to block the entire filesystem
while this happens, which means that we need to be clever in allowing
scans to coordinate with ongoing filesystem updates.  We also need to
hook the filesystem so that regular updates propagate to the staging
records.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 .../filesystems/xfs-online-fsck-design.rst         |  677 ++++++++++++++++++++
 1 file changed, 677 insertions(+)


diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst
index a658da8fe4ae..c0f08a773f08 100644
--- a/Documentation/filesystems/xfs-online-fsck-design.rst
+++ b/Documentation/filesystems/xfs-online-fsck-design.rst
@@ -3018,3 +3018,680 @@ The proposed patchset is the
 `summary counter cleanup
 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-fscounters>`_
 series.
+
+Full Filesystem Scans
+---------------------
+
+Certain types of metadata can only be checked by walking every file in the
+entire filesystem to record observations and comparing the observations against
+what's recorded on disk.
+Like every other type of online repair, repairs are made by writing those
+observations to disk in a replacement structure and committing it atomically.
+However, it is not practical to shut down the entire filesystem to examine
+hundreds of billions of files because the downtime would be excessive.
+Therefore, online fsck must build the infrastructure to manage a live scan of
+all the files in the filesystem.
+There are two questions that need to be solved to perform a live walk:
+
+- How does scrub manage the scan while it is collecting data?
+
+- How does the scan keep abreast of changes being made to the system by other
+  threads?
+
+.. _iscan:
+
+Coordinated Inode Scans
+```````````````````````
+
+In the original Unix filesystems of the 1970s, each directory entry contained
+an index number (*inumber*) which was used as an index into an ondisk array
+(*itable*) of fixed-size records (*inodes*) describing a file's attributes and
+its data block mapping.
+This system is described by J. Lions, `"inode (5659)"
+<http://www.lemis.com/grog/Documentation/Lions/>`_ in *Lions' Commentary on
+UNIX, 6th Edition*, (Dept. of Computer Science, the University of New South
+Wales, November 1977), pp. 18-2; and later by D. Ritchie and K. Thompson,
+`"Implementation of the File System"
+<https://archive.org/details/bstj57-6-1905/page/n8/mode/1up>`_, from *The UNIX
+Time-Sharing System*, (The Bell System Technical Journal, July 1978), pp.
+1913-4.
+
+XFS retains most of this design, except now inumbers are search keys over all
+the space in the data section of the filesystem.
+They form a continuous keyspace that can be expressed as a 64-bit integer,
+though the inodes themselves are sparsely distributed within the keyspace.
+Scans proceed in a linear fashion across the inumber keyspace, starting from
+``0x0`` and ending at ``0xFFFFFFFFFFFFFFFF``.
+Naturally, a scan through a keyspace requires a scan cursor object to track the
+scan progress.
+Because this keyspace is sparse, this cursor contains two parts.
+The first part of this scan cursor object tracks the inode that will be
+examined next; call this the examination cursor.
+Somewhat less obviously, the scan cursor object must also track which parts of
+the keyspace have already been visited, which is critical for deciding if a
+concurrent filesystem update needs to be incorporated into the scan data.
+Call this the visited inode cursor.
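+
+A sketch of such a scan cursor object follows; the field names are
+illustrative, and the real structure would carry additional state:
+
+.. code-block:: c
+
+	/* Sketch of a coordinated inode scan cursor. */
+	struct xchk_iscan {
+		/* Lock to coordinate access to both cursors. */
+		struct mutex	lock;
+
+		/* Examination cursor: the next inode to be examined. */
+		xfs_ino_t	cursor_ino;
+
+		/*
+		 * Visited inode cursor: every inumber at or below this
+		 * point in the keyspace has already been visited.
+		 */
+		xfs_ino_t	visited_ino;
+	};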
+
+Advancing the scan cursor is a multi-step process encapsulated in
+``xchk_iscan_iter``:
+
+1. Lock the AGI buffer of the AG containing the inode pointed to by the visited
+   inode cursor.
+   This guarantees that inodes in this AG cannot be allocated or freed while
+   advancing the cursor.
+
+2. Use the per-AG inode btree to look up the next inumber after the one that
+   was just visited, since it may not be keyspace adjacent.
+
+3. If there are no more inodes left in this AG:
+
+   a. Move the examination cursor to the point of the inumber keyspace that
+      corresponds to the start of the next AG.
+
+   b. Adjust the visited inode cursor to indicate that it has "visited" the
+      last possible inode in the current AG's inode keyspace.
+      XFS inumbers are segmented, so the cursor needs to be marked as having
+      visited the entire keyspace up to just before the start of the next AG's
+      inode keyspace.
+
+   c. Unlock the AGI and return to step 1 if there are unexamined AGs in the
+      filesystem.
+
+   d. If there are no more AGs to examine, set both cursors to the end of the
+      inumber keyspace.
+      The scan is now complete.
+
+4. Otherwise, there is at least one more inode to scan in this AG:
+
+   a. Move the examination cursor ahead to the next inode marked as allocated
+      by the inode btree.
+
+   b. Adjust the visited inode cursor to point to the inode just prior to where
+      the examination cursor is now.
+      Because the scanner holds the AGI buffer lock, no inodes could have been
+      created in the part of the inode keyspace that the visited inode cursor
+      just advanced.
+
+5. Get the incore inode for the inumber of the examination cursor.
+   By maintaining the AGI buffer lock until this point, the scanner knows that
+   it was safe to advance the examination cursor across the entire keyspace,
+   and that it has stabilized this next inode so that it cannot disappear from
+   the filesystem until the scan releases the incore inode.
+
+6. Drop the AGI lock and return the incore inode to the caller.
+
+Online fsck functions scan all files in the filesystem as follows:
+
+1. Start a scan by calling ``xchk_iscan_start``.
+
+2. Advance the scan cursor (``xchk_iscan_iter``) to get the next inode.
+   If one is provided:
+
+   a. Lock the inode to prevent updates during the scan.
+
+   b. Scan the inode.
+
+   c. While still holding the inode lock, adjust the visited inode cursor
+      (``xchk_iscan_mark_visited``) to point to this inode.
+
+   d. Unlock and release the inode.
+
+3. Call ``xchk_iscan_finish`` to complete the scan.
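+
+Expressed in code, a scrub function driving such a scan might look like the
+following sketch.
+The signatures and return conventions shown here are assumptions, and
+``xchk_example_check_inode`` stands in for whatever checking work the caller
+performs:
+
+.. code-block:: c
+
+	/* Sketch of a full filesystem scan; not the actual implementation. */
+	int
+	xchk_example_full_scan(struct xfs_scrub *sc)
+	{
+		struct xchk_iscan	iscan;
+		struct xfs_inode	*ip;
+		int			error;
+
+		xchk_iscan_start(sc, &iscan);
+
+		/* Assume a return value of 1 means an inode was provided. */
+		while ((error = xchk_iscan_iter(&iscan, &ip)) == 1) {
+			/* Take whichever locks the check requires. */
+			xfs_ilock(ip, XFS_ILOCK_SHARED);
+
+			error = xchk_example_check_inode(sc, ip);
+
+			/* Mark the inode visited while it is still locked. */
+			xchk_iscan_mark_visited(&iscan, ip);
+			xfs_iunlock(ip, XFS_ILOCK_SHARED);
+			xchk_irele(sc, ip);
+			if (error)
+				break;
+		}
+
+		xchk_iscan_finish(&iscan);
+		return error;
+	}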
+
+There are subtleties with the inode cache that complicate grabbing the incore
+inode for the caller.
+First, it is an absolute requirement that the inode metadata be consistent
+enough to load it into the inode cache.
+Second, if the incore inode is stuck in some intermediate state, the scan
+coordinator must release the AGI and push the main filesystem to get the inode
+back into a loadable state.
+
+The proposed patches are the
+`inode scanner
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iscan>`_
+series.
+
+Inode Management
+````````````````
+
+In regular filesystem code, references to allocated XFS incore inodes are
+always obtained (``xfs_iget``) outside of transaction context because the
+creation of the incore context for an existing file does not require metadata
+updates.
+However, it is important to note that references to incore inodes obtained as
+part of file creation must be performed in transaction context because the
+filesystem must ensure the atomicity of the ondisk inode btree index updates
+and the initialization of the actual ondisk inode.
+
+References to incore inodes are always released (``xfs_irele``) outside of
+transaction context because there are a handful of activities that might
+require ondisk updates:
+
+- The VFS may decide to kick off writeback as part of a ``DONTCACHE`` inode
+  release.
+
+- Speculative preallocations need to be unreserved.
+
+- An unlinked file may have lost its last reference, in which case the entire
+  file must be inactivated, which involves releasing all of its resources in
+  the ondisk metadata and freeing the inode.
+
+These activities are collectively called inode inactivation.
+Inactivation has two parts -- the VFS part, which initiates writeback on all
+dirty file pages, and the XFS part, which cleans up XFS-specific information
+and frees the inode if it was unlinked.
+If the inode is unlinked (or unconnected after a file handle operation), the
+kernel drops the inode into the inactivation machinery immediately.
+
+During normal operation, resource acquisition for an update follows this order
+to avoid deadlocks:
+
+1. Inode reference (``iget``).
+
+2. Filesystem freeze protection, if repairing (``mnt_want_write_file``).
+
+3. Inode ``IOLOCK`` (VFS ``i_rwsem``) lock to control file IO.
+
+4. Inode ``MMAPLOCK`` (page cache ``invalidate_lock``) lock for operations that
+   can update page cache mappings.
+
+5. Log feature enablement.
+
+6. Transaction log space grant.
+
+7. Space on the data and realtime devices for the transaction.
+
+8. Incore dquot references, if a file is being repaired.
+   Note that they are not locked, merely acquired.
+
+9. Inode ``ILOCK`` for file metadata updates.
+
+10. AG header buffer locks / Realtime metadata inode ILOCK.
+
+11. Realtime metadata buffer locks, if applicable.
+
+12. Extent mapping btree blocks, if applicable.
+
+Resources are often released in the reverse order, though this is not required.
+However, online fsck differs from regular XFS operations because it may examine
+an object that normally is acquired in a later stage of the locking order, and
+then decide to cross-reference the object with an object that is acquired
+earlier in the order.
+The next few sections detail the specific ways in which online fsck takes care
+to avoid deadlocks.
+
+iget and irele During a Scrub
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+An inode scan performed on behalf of a scrub operation runs in transaction
+context, and possibly with resources already locked and bound to it.
+This isn't much of a problem for ``iget`` since it can operate in the context
+of an existing transaction, as long as all of the bound resources are acquired
+before the inode reference in the regular filesystem.
+
+When the VFS ``iput`` function is given a linked inode with no other
+references, it normally puts the inode on an LRU list in the hope that it can
+save time if another process re-opens the file before the system runs out
+of memory and frees it.
+Filesystem callers can short-circuit the LRU process by setting a ``DONTCACHE``
+flag on the inode to cause the kernel to try to drop the inode into the
+inactivation machinery immediately.
+
+In the past, inactivation was always done from the process that dropped the
+inode, which was a problem for scrub because scrub may already hold a
+transaction, and XFS does not support nesting transactions.
+On the other hand, if there is no scrub transaction, it is desirable to drop
+otherwise unused inodes immediately to avoid polluting caches.
+To capture these nuances, the online fsck code has a separate ``xchk_irele``
+function to set or clear the ``DONTCACHE`` flag to get the required release
+behavior.
+
+Proposed patchsets include fixing
+`scrub iget usage
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iget-fixes>`_ and
+`dir iget usage
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-dir-iget-fixes>`_.
+
+Locking Inodes
+^^^^^^^^^^^^^^
+
+In regular filesystem code, the VFS and XFS will acquire multiple IOLOCK locks
+in a well-known order: parent → child when updating the directory tree, and
+``struct inode`` address order otherwise.
+For regular files, the MMAPLOCK can be acquired after the IOLOCK to stop page
+faults.
+If two MMAPLOCKs must be acquired, they are acquired in ``struct
+address_space`` order.
+Due to the structure of existing filesystem code, IOLOCKs and MMAPLOCKs must be
+acquired before transactions are allocated.
+If two ILOCKs must be acquired, they are acquired in inumber order.
+
+Inode lock acquisition must be done carefully during a coordinated inode scan.
+Online fsck cannot abide these conventions, because for a directory tree
+scanner, the scrub process holds the IOLOCK of the file being scanned and it
+needs to take the IOLOCK of the file at the other end of the directory link.
+If the directory tree is corrupt because it contains a cycle, ``xfs_scrub``
+cannot use the regular inode locking functions without the risk of becoming
+trapped in an ABBA deadlock.
+
+Solving both of these problems is straightforward -- any time online fsck
+needs to take a second lock of the same class, it uses trylock to avoid an ABBA
+deadlock.
+If the trylock fails, scrub drops all inode locks and uses trylock loops to
+(re)acquire all necessary resources.
+Trylock loops enable scrub to check for pending fatal signals, which is how
+scrub avoids deadlocking the filesystem or becoming an unresponsive process.
+However, trylock loops mean that online fsck must be prepared to measure the
+resource being scrubbed before and after the lock cycle to detect changes and
+react accordingly.
+
+.. _dirparent:
+
+Case Study: Finding a Directory Parent
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Consider the directory parent pointer repair code as an example.
+Online fsck must verify that the dotdot dirent of a directory points up to a
+parent directory, and that the parent directory contains exactly one dirent
+pointing down to the child directory.
+Fully validating this relationship (and repairing it if possible) requires a
+walk of every directory on the filesystem while holding the child locked, and
+while updates to the directory tree are being made.
+The coordinated inode scan provides a way to walk the filesystem without the
+possibility of missing an inode.
+The child directory is kept locked to prevent updates to the dotdot dirent, but
+if the scanner fails to lock a parent, it can drop and relock both the child
+and the prospective parent.
+If the dotdot entry changes while the directory is unlocked, then a move or
+rename operation must have changed the child's parentage, and the scan can
+exit early.
+
+The proposed patchset is the
+`directory repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-dirs>`_
+series.
+
+.. _fshooks:
+
+Filesystem Hooks
+`````````````````
+
+The second piece of support that online fsck functions need during a full
+filesystem scan is the ability to stay informed about updates being made by
+other threads in the filesystem, since comparisons against the past are useless
+in a dynamic environment.
+Two pieces of Linux kernel infrastructure enable online fsck to monitor regular
+filesystem operations: filesystem hooks and :ref:`static keys<jump_labels>`.
+
+Filesystem hooks convey information about an ongoing filesystem operation to
+a downstream consumer.
+In this case, the downstream consumer is always an online fsck function.
+Because multiple fsck functions can run in parallel, online fsck uses the Linux
+notifier call chain facility to dispatch updates to any number of interested
+fsck processes.
+Call chains are a dynamic list, which means that they can be configured at
+run time.
+Because these hooks are private to the XFS module, the information passed along
+contains exactly what the checking function needs to update its observations.
+
+The current implementation of XFS hooks uses SRCU notifier chains to reduce the
+impact to highly threaded workloads.
+Regular blocking notifier chains use an rwsem and seem to have a much lower
+overhead for single-threaded applications.
+However, it may turn out that the combination of blocking chains and static
+keys is more performant; more study is needed here.
+
+The following pieces are necessary to hook a certain point in the filesystem:
+
+- A ``struct xfs_hooks`` object must be embedded in a convenient place such as
+  a well-known incore filesystem object.
+
+- Each hook must define an action code and a structure containing more context
+  about the action.
+
+- Hook providers should provide appropriate wrapper functions and structs
+  around the ``xfs_hooks`` and ``xfs_hook`` objects to take advantage of type
+  checking to ensure correct usage.
+
+- A callsite in the regular filesystem code must be chosen to call
+  ``xfs_hooks_call`` with the action code and data structure.
+  This place should be adjacent to (and not earlier than) the place where
+  the filesystem update is committed to the transaction.
+  In general, when the filesystem calls a hook chain, it should be able to
+  handle sleeping and should not be vulnerable to memory reclaim or locking
+  recursion.
+  However, the exact requirements are very dependent on the context of the hook
+  caller and the callee.
+
+- The online fsck function should define a structure to hold scan data, a lock
+  to coordinate access to the scan data, and a ``struct xfs_hook`` object.
+  The scanner function and the regular filesystem code must acquire resources
+  in the same order; see the next section for details.
+
+- The online fsck code must contain a C function to catch the hook action code
+  and data structure.
+  If the object being updated has already been visited by the scan, then the
+  hook information must be applied to the scan data.
+
+- Prior to unlocking inodes to start the scan, online fsck must call
+  ``xfs_hooks_setup`` to initialize the ``struct xfs_hook``, and
+  ``xfs_hooks_add`` to enable the hook.
+
+- Online fsck must call ``xfs_hooks_del`` to disable the hook once the scan is
+  complete.
+
+The number of hooks should be kept to a minimum to reduce complexity.
+Static keys are used to reduce the overhead of filesystem hooks to nearly
+zero when online fsck is not running.
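+
+As a sketch of how these pieces might fit together for a hypothetical hook
+into directory entry updates: the payload structure, the hook site wrapper,
+the ``m_dirent_hooks`` field, and the action code below are all invented for
+illustration, and the ``xfs_hooks_call`` signature is an assumption:
+
+.. code-block:: c
+
+	/* Illustrative context passed from the hook site to online fsck. */
+	struct xfs_dirent_update_params {
+		struct xfs_inode	*dp;	/* directory being modified */
+		struct xfs_inode	*ip;	/* child named by the dirent */
+		int			delta;	/* +1 for create, -1 for remove */
+	};
+
+	/*
+	 * Hypothetical hook site wrapper, called from the regular filesystem
+	 * code adjacent to the point where the directory update has been
+	 * committed to the transaction.
+	 */
+	void
+	xfs_dirent_update_hook(struct xfs_inode *dp, struct xfs_inode *ip,
+			       int delta)
+	{
+		struct xfs_dirent_update_params	p = {
+			.dp	= dp,
+			.ip	= ip,
+			.delta	= delta,
+		};
+
+		xfs_hooks_call(&dp->i_mount->m_dirent_hooks,
+				XFS_DIRENT_UPDATE, &p);
+	}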
+
+.. _liveupdate:
+
+Live Updates During a Scan
+``````````````````````````
+
+The code paths of the online fsck scanning code and the :ref:`hooked<fshooks>`
+filesystem code look like this::
+
+            other program
+                  ↓
+            inode lock ←────────────────────┐
+                  ↓                         │
+            AG header lock                  │
+                  ↓                         │
+            filesystem function             │
+                  ↓                         │
+            notifier call chain             │    same
+                  ↓                         ├─── inode
+            scrub hook function             │    lock
+                  ↓                         │
+            scan data mutex ←──┐    same    │
+                  ↓            ├─── scan    │
+            update scan data   │    lock    │
+                  ↑            │            │
+            scan data mutex ←──┘            │
+                  ↑                         │
+            inode lock ←────────────────────┘
+                  ↑
+            scrub function
+                  ↑
+            inode scanner
+                  ↑
+            xfs_scrub
+
+These rules must be followed to ensure correct interactions between the
+checking code and the code making an update to the filesystem:
+
+- Prior to invoking the notifier call chain, the filesystem function being
+  hooked must acquire the same lock that the scrub scanning function acquires
+  to scan the inode.
+
+- The scanning function and the scrub hook function must coordinate access to
+  the scan data by acquiring a lock on the scan data.
+
+- Scrub hook functions must not add the live update information to the scan
+  observations unless the inode being updated has already been scanned.
+  The scan coordinator has a helper predicate (``xchk_iscan_want_live_update``)
+  for this.
+
+- Scrub hook functions must not change the caller's state, including the
+  transaction that it is running.
+  They must not acquire any resources that might conflict with the filesystem
+  function being hooked.
+
+- The hook function can abort the inode scan to avoid breaking the other rules.
+
+The inode scan APIs are pretty simple:
+
+- ``xchk_iscan_start`` starts a scan
+
+- ``xchk_iscan_iter`` grabs a reference to the next inode in the scan or
+  returns zero if there is nothing left to scan
+
+- ``xchk_iscan_want_live_update`` decides if an inode has already been
+  visited in the scan.
+  This is critical for hook functions to decide if they need to update the
+  in-memory scan information.
+
+- ``xchk_iscan_mark_visited`` marks an inode as having been visited in the
+  scan
+
+- ``xchk_iscan_finish`` finishes the scan
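+
+Putting the hook and scan APIs together, a scrub hook function might consume
+a live update along these lines.
+This sketch reuses the hypothetical dirent payload from the hook example
+earlier; the private data layout, the ``hook.nb`` embedding, and the helper
+``xchk_example_adjust_nlink`` are all assumptions:
+
+.. code-block:: c
+
+	/* Hypothetical private scan data for a link count scrubber. */
+	struct xchk_example_nlinks {
+		struct xchk_iscan	iscan;	/* coordinated inode scan */
+		struct mutex		lock;	/* protects the shadow data */
+		struct xfs_hook		hook;	/* live update hook */
+		/* ...shadow link count storage... */
+	};
+
+	/* Scrub hook function registered on the notifier chain. */
+	static int
+	xchk_example_nlinks_hook(struct notifier_block *nb,
+				 unsigned long action, void *data)
+	{
+		struct xfs_dirent_update_params	*p = data;
+		struct xchk_example_nlinks	*xnc;
+
+		/* Assume struct xfs_hook embeds a notifier_block named nb. */
+		xnc = container_of(nb, struct xchk_example_nlinks, hook.nb);
+
+		/* Ignore updates to files that the scan has not reached. */
+		if (!xchk_iscan_want_live_update(&xnc->iscan, p->dp->i_ino))
+			return NOTIFY_DONE;
+
+		/* Fold the update into the shadow observations. */
+		mutex_lock(&xnc->lock);
+		xchk_example_adjust_nlink(xnc, p->ip->i_ino, p->delta);
+		mutex_unlock(&xnc->lock);
+
+		return NOTIFY_DONE;
+	}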
+
+The proposed patches are at the start of the
+`online quotacheck
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quota>`_
+series.
+
+.. _quotacheck:
+
+Case Study: Quota Counter Checking
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+It is useful to compare the mount time quotacheck code to the online repair
+quotacheck code.
+Mount time quotacheck does not have to contend with concurrent operations, so
+it does the following:
+
+1. Make sure the ondisk dquots are in good enough shape that all the incore
+   dquots will actually load, and zero the resource usage counters in the
+   ondisk buffer.
+
+2. Walk every inode in the filesystem.
+   Add each file's resource usage to the incore dquot.
+
+3. Walk each incore dquot.
+   If the incore dquot is not being flushed, add the ondisk buffer backing the
+   incore dquot to a delayed write (delwri) list.
+
+4. Write the buffer list to disk.
+
+Like most online fsck functions, online quotacheck can't write to regular
+filesystem objects until the newly collected metadata reflect all filesystem
+state.
+Therefore, online quotacheck records file resource usage to a shadow dquot
+index implemented with a sparse ``xfarray``, and only writes to the real dquots
+once the scan is complete.
+Handling transactional updates is tricky because quota resource usage updates
+are handled in phases to minimize contention on dquots:
+
+1. The inodes involved are joined and locked to a transaction.
+
+2. For each dquot attached to the file:
+
+   a. The dquot is locked.
+
+   b. A quota reservation is added to the dquot's resource usage.
+      The reservation is recorded in the transaction.
+
+   c. The dquot is unlocked.
+
+3. Changes in actual quota usage are tracked in the transaction.
+
+4. At transaction commit time, each dquot is examined again:
+
+   a. The dquot is locked again.
+
+   b. Quota usage changes are logged and unused reservation is given back to
+      the dquot.
+
+   c. The dquot is unlocked.
+
+For online quotacheck, hooks are placed in steps 2 and 4.
+The step 2 hook creates a shadow version of the transaction dquot context
+(``dqtrx``) that operates in a similar manner to the regular code.
+The step 4 hook commits the shadow ``dqtrx`` changes to the shadow dquots.
+Notice that both hooks are called with the inode locked, which is how the
+live update coordinates with the inode scanner.
+
+The quotacheck scan looks like this:
+
+1. Set up a coordinated inode scan.
+
+2. For each inode returned by the inode scan iterator:
+
+   a. Grab and lock the inode.
+
+   b. Determine that inode's resource usage (data blocks, inode counts,
+      realtime blocks) and add that to the shadow dquots for the user, group,
+      and project ids associated with the inode.
+
+   c. Unlock and release the inode.
+
+3. For each dquot in the system:
+
+   a. Grab and lock the dquot.
+
+   b. Check the dquot against the shadow dquots created by the scan and updated
+      by the live hooks.
+
+Live updates are key to being able to walk every quota record without
+needing to hold any locks for a long duration.
+If repairs are desired, the real and shadow dquots are locked and their
+resource counts are set to the values in the shadow dquot.
+
+The proposed patchset is the
+`online quotacheck
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quota>`_
+series.
+
+.. _nlinks:
+
+Case Study: File Link Count Checking
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+File link count checking also uses live update hooks.
+The coordinated inode scanner is used to visit all directories on the
+filesystem, and per-file link count records are stored in a sparse ``xfarray``
+indexed by inumber.
+During the scanning phase, each entry in a directory generates observation
+data as follows:
+
+1. If the entry is a dotdot (``'..'``) entry of the root directory, the
+   directory's parent link count is bumped because the root directory's dotdot
+   entry is self-referential.
+
+2. If the entry is a dotdot entry of a subdirectory, the parent's backref
+   count is bumped.
+
+3. If the entry is neither a dot nor a dotdot entry, the target file's parent
+   count is bumped.
+
+4. If the target is a subdirectory, the parent's child link count is bumped.
+
+A crucial point to understand about how the link count inode scanner interacts
+with the live update hooks is that the scan cursor tracks which *parent*
+directories have been scanned.
+In other words, the live updates ignore any update about ``A → B`` when A has
+not been scanned, even if B has been scanned.
+Furthermore, a subdirectory A with a dotdot entry pointing back to B is
+accounted as a backref counter in the shadow data for A, since child dotdot
+entries affect the parent's link count.
+Live update hooks are carefully placed in all parts of the filesystem that
+create, change, or remove directory entries, since those operations involve
+bumplink and droplink.
+
+For any file, the correct link count is the number of parents plus the number
+of child subdirectories.
+Non-directories never have children of any kind.
+The backref information is used to detect inconsistencies in the number of
+links pointing to child subdirectories and the number of dotdot entries
+pointing back.
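+
+A sketch of the per-file shadow record stored in the ``xfarray`` might look
+like the following; the structure and field names are assumptions based on
+the description above:
+
+.. code-block:: c
+
+	/* Illustrative shadow link count data, indexed by inumber. */
+	struct xchk_example_nlink {
+		/* Dirents pointing to this file, found by scan and hooks. */
+		uint32_t	parents;
+
+		/* Dotdot entries pointing back to this directory. */
+		uint32_t	backrefs;
+
+		/* Child subdirectories of this directory. */
+		uint32_t	children;
+	};
+
+	/* The correct link count is parents plus child subdirectories. */
+	static inline uint32_t
+	xchk_example_expected_nlink(const struct xchk_example_nlink *live)
+	{
+		return live->parents + live->children;
+	}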
+
+After the scan completes, the link count of each file can be checked by locking
+both the inode and the shadow data, and comparing the link counts.
+A second coordinated inode scan cursor is used for comparisons.
+Live updates are key to being able to walk every inode without needing to hold
+any locks between inodes.
+If repairs are desired, the inode's link count is set to the value in the
+shadow information.
+If no parents are found, the file must be :ref:`reparented <orphanage>` to the
+orphanage to prevent the file from being lost forever.
+
+The proposed patchset is the
+`file link count repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-nlinks>`_
+series.
+
+.. _rmap_repair:
+
+Case Study: Rebuilding Reverse Mapping Records
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Most repair functions follow the same pattern: lock filesystem resources,
+walk the surviving ondisk metadata looking for replacement metadata records,
+and use an :ref:`in-memory array <xfarray>` to store the gathered observations.
+The primary advantage of this approach is the simplicity and modularity of the
+repair code -- code and data are entirely contained within the scrub module,
+do not require hooks in the main filesystem, and are usually the most efficient
+in memory use.
+A secondary advantage of this repair approach is atomicity -- once the kernel
+decides a structure is corrupt, no other threads can access the metadata until
+the kernel finishes repairing and revalidating the metadata.
+
+For repairs going on within a shard of the filesystem, these advantages
+outweigh the delays inherent in locking the shard while repairing parts of the
+shard.
+Unfortunately, repairs to the reverse mapping btree cannot use the "standard"
+btree repair strategy because it must scan every space mapping of every fork of
+every file in the filesystem, and the filesystem cannot stop.
+Therefore, rmap repair foregoes atomicity between scrub and repair.
+It combines a :ref:`coordinated inode scanner <iscan>`, :ref:`live update hooks
+<liveupdate>`, and an :ref:`in-memory rmap btree <xfbtree>` to complete the
+scan for reverse mapping records.
+
+1. Set up an xfbtree to stage rmap records.
+
+2. While holding the locks on the AGI and AGF buffers acquired during the
+   scrub, generate reverse mappings for all AG metadata: inodes, btrees, CoW
+   staging extents, and the internal log.
+
+3. Set up an inode scanner.
+
+4. Hook into rmap updates for the AG being repaired so that the live scan data
+   can receive updates to the rmap btree from the rest of the filesystem during
+   the file scan.
+
+5. For each space mapping found in either fork of each file scanned,
+   decide if the mapping matches the AG of interest.
+   If so:
+
+   a. Create a btree cursor for the in-memory btree.
+
+   b. Use the rmap code to add the record to the in-memory btree.
+
+   c. Use the :ref:`special commit function <xfbtree_commit>` to write the
+      xfbtree changes to the xfile.
+
+6. For each live update received via the hook, decide if the owner has already
+   been scanned.
+   If so, apply the live update into the scan data:
+
+   a. Create a btree cursor for the in-memory btree.
+
+   b. Replay the operation into the in-memory btree.
+
+   c. Use the :ref:`special commit function <xfbtree_commit>` to write the
+      xfbtree changes to the xfile.
+      This is performed with an empty transaction to avoid changing the
+      caller's state.
+
+7. When the inode scan finishes, create a new scrub transaction and relock the
+   two AG headers.
+
+8. Compute the new btree geometry using the number of rmap records in the
+   shadow btree, like all other btree rebuilding functions.
+
+9. Allocate the number of blocks computed in the previous step.
+
+10. Perform the usual btree bulk loading and commit to install the new rmap
+    btree.
+
+11. Reap the old rmap btree blocks as discussed in the case study about how
+    to :ref:`reap after rmap btree repair <rmap_reap>`.
+
+12. Free the xfbtree now that it is not needed.
+
+The proposed patchset is the
+`rmap repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-rmap-btree>`_
+series.


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 11/14] xfs: document metadata file repair
  2022-12-30 22:10 ` [PATCHSET v24.0 00/14] xfs: design documentation for online fsck Darrick J. Wong
                     ` (10 preceding siblings ...)
  2022-12-30 22:10   ` [PATCH 14/14] xfs: document future directions of online fsck Darrick J. Wong
@ 2022-12-30 22:10   ` Darrick J. Wong
  2023-02-25  7:33     ` Allison Henderson
  2022-12-30 22:10   ` [PATCH 10/14] xfs: document full filesystem scans for online fsck Darrick J. Wong
                     ` (3 subsequent siblings)
  15 siblings, 1 reply; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:10 UTC (permalink / raw)
  To: djwong
  Cc: linux-xfs, willy, chandan.babu, allison.henderson, linux-fsdevel,
	hch, catherine.hoang, david

From: Darrick J. Wong <djwong@kernel.org>

File-based metadata (such as xattrs and directories) can be extremely
large.  To reduce the memory requirements and maximize code reuse, it is
very convenient to create a temporary file, use the regular dir/attr
code to store salvaged information, and then atomically swap the extents
between the file being repaired and the temporary file.  Record the high
level concepts behind how temporary files and atomic content swapping
should work, and then present some case studies of what the actual
repair functions do.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 .../filesystems/xfs-online-fsck-design.rst         |  574 ++++++++++++++++++++
 1 file changed, 574 insertions(+)


diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst
index c0f08a773f08..e32506acb66f 100644
--- a/Documentation/filesystems/xfs-online-fsck-design.rst
+++ b/Documentation/filesystems/xfs-online-fsck-design.rst
@@ -3252,6 +3252,8 @@ Proposed patchsets include fixing
 `dir iget usage
 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-dir-iget-fixes>`_.
 
+.. _ilocking:
+
 Locking Inodes
 ^^^^^^^^^^^^^^
 
@@ -3695,3 +3697,575 @@ The proposed patchset is the
 `rmap repair
 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-rmap-btree>`_
 series.
+
+Staging Repairs with Temporary Files on Disk
+--------------------------------------------
+
+XFS stores a substantial amount of metadata in file forks: directories,
+extended attributes, symbolic link targets, free space bitmaps and summary
+information for the realtime volume, and quota records.
+File forks map 64-bit logical file fork space extents to physical storage space
+extents, similar to how a memory management unit maps 64-bit virtual addresses
+to physical memory addresses.
+Therefore, file-based tree structures (such as directories and extended
+attributes) use blocks mapped in the file fork offset address space that point
+to other blocks mapped within that same address space, and file-based linear
+structures (such as bitmaps and quota records) compute array element offsets in
+the file fork offset address space.
+
+In the initial iteration of file metadata repair, the damaged metadata blocks
+would be scanned for salvageable data; the extents in the file fork would be
+reaped; and then a new structure would be built in its place.
+This strategy did not survive the introduction of the atomic repair requirement
+expressed earlier in this document.
+The second iteration explored building a second structure at a high offset
+in the fork from the salvage data, reaping the old extents, and using a
+``COLLAPSE_RANGE`` operation to slide the new extents into place.
+This had many drawbacks:
+
+- Array structures are linearly addressed, and the regular filesystem codebase
+  does not have the concept of a linear offset that could be applied to the
+  record offset computation to build an alternate copy.
+
+- Extended attributes are allowed to use the entire attr fork offset address
+  space.
+
+- Even if repair could build an alternate copy of a data structure in a
+  different part of the fork address space, the atomic repair commit
+  requirement means that online repair would have to be able to perform a log
+  assisted ``COLLAPSE_RANGE`` operation to ensure that the old structure was
+  completely replaced.
+
+- A crash after construction of the secondary tree but before the range
+  collapse would leave unreachable blocks in the file fork.
+  This would likely confuse things further.
+
+- Reaping blocks after a repair is not a simple operation, and initiating a
+  reap operation from a restarted range collapse operation during log recovery
+  is daunting.
+
+- Directory entry blocks and quota records record the file fork offset in the
+  header area of each block.
+  An atomic range collapse operation would have to rewrite this part of each
+  block header.
+  Rewriting a single field in block headers is not a huge problem, but it's
+  something to be aware of.
+
+- Each block in a directory or extended attributes btree index contains sibling
+  and child block pointers.
+  Were the atomic commit to use a range collapse operation, each block would
+  have to be rewritten very carefully to preserve the graph structure.
+  Doing this as part of a range collapse means rewriting a large number of
+  blocks repeatedly, which is not conducive to quick repairs.
+
+The third iteration of the design for file metadata repair went for a totally
+new strategy -- create a temporary file in the XFS filesystem, write a new
+structure at the correct offsets into the temporary file, and atomically swap
+the fork mappings (and hence the fork contents) to commit the repair.
+Once the repair is complete, the old fork can be reaped as necessary; if the
+system goes down during the reap, the iunlink code will delete the blocks
+during log recovery.
+
+**Note**: All space usage and inode indices in the filesystem *must* be
+consistent to use a temporary file safely!
+This dependency is the reason why online repair can only use pageable kernel
+memory to stage ondisk space usage information.
+
+Swapping extents with a temporary file still requires a rewrite of the owner
+field of the block headers, but this is *much* simpler than moving tree blocks
+individually.
+Furthermore, the buffer verifiers do not verify owner fields (since they are
+not aware of the inode that owns the block), which makes reaping of old file
+blocks much simpler.
+Extent swapping requires that AG space metadata and the file fork metadata of
+the file being repaired are all consistent with respect to each other, but
+that's already a requirement for correct operation of files in general.
+There is, however, a slight downside -- if the system crashes during the reap
+phase and the fork extents are crosslinked, the iunlink processing will fail
+because freeing space will find the extra reverse mappings and abort.
+
+Temporary files created for repair are similar to ``O_TMPFILE`` files created
+by userspace.
+They are not linked into a directory and the entire file will be reaped when
+the last reference to the file is lost.
+The key differences are that these files must have no access permission outside
+the kernel at all, they must be specially marked to prevent them from being
+opened by handle, and they must never be linked into the directory tree.
+
+Using a Temporary File
+``````````````````````
+
+Online repair code should use the ``xrep_tempfile_create`` function to create a
+temporary file inside the filesystem.
+This allocates an inode, marks the in-core inode private, and attaches it to
+the scrub context.
+These files are hidden from userspace, may not be added to the directory tree,
+and must be kept private.
+
+Temporary files only use two inode locks: the IOLOCK and the ILOCK.
+The MMAPLOCK is not needed here, because there must not be page faults from
+userspace for data fork blocks.
+The usage patterns of these two locks are the same as for any other XFS file --
+access to file data is controlled via the IOLOCK, and access to file metadata
+is controlled via the ILOCK.
+Locking helpers are provided so that the temporary file and its lock state can
+be cleaned up by the scrub context.
+To comply with the nested locking strategy laid out in the :ref:`inode
+locking<ilocking>` section, it is recommended that scrub functions use the
+``xrep_tempfile_ilock*_nowait`` lock helpers.
+
+Data can be written to a temporary file by two means:
+
+1. ``xrep_tempfile_copyin`` can be used to set the contents of a regular
+   temporary file from an xfile.
+
+2. The regular directory, symbolic link, and extended attribute functions can
+   be used to write to the temporary file.
+
+Once a good copy of a data file has been constructed in a temporary file, it
+must be conveyed to the file being repaired, which is the topic of the next
+section.
+
+The proposed patches are in the
+`realtime summary repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-rtsummary>`_
+series.
+
+Atomic Extent Swapping
+----------------------
+
+Once repair builds a temporary file with a new data structure written into
+it, it must commit the new changes into the existing file.
+It is not possible to swap the inumbers of two files, so instead the new
+metadata must replace the old.
+This suggests the need for the ability to swap extents, but the existing extent
+swapping code used by the file defragmenting tool ``xfs_fsr`` is not sufficient
+for online repair because:
+
+a. When the reverse-mapping btree is enabled, the swap code must keep the
+   reverse mapping information up to date with every exchange of mappings.
+   Therefore, it can only exchange one mapping per transaction, and each
+   transaction is independent.
+
+b. Reverse-mapping is critical for the operation of online fsck, so the old
+   defragmentation code (which swapped entire extent forks in a single
+   operation) is not useful here.
+
+c. Defragmentation is assumed to occur between two files with identical
+   contents.
+   For this use case, an incomplete exchange will not result in a user-visible
+   change in file contents, even if the operation is interrupted.
+
+d. Online repair needs to swap the contents of two files that are by definition
+   *not* identical.
+   For directory and xattr repairs, the user-visible contents might be the
+   same, but the contents of individual blocks may be very different.
+
+e. Old blocks in the file may be cross-linked with another structure and must
+   not reappear if the system goes down mid-repair.
+
+These problems are overcome by creating a new deferred operation and a new type
+of log intent item to track the progress of an operation to exchange two file
+ranges.
+The new deferred operation type chains together the same transactions used by
+the reverse-mapping extent swap code.
+The new log item records the progress of the exchange to ensure that once an
+exchange begins, it will always run to completion, even if there are
+interruptions.
+
+The proposed patchset is the
+`atomic extent swap
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=atomic-file-updates>`_
+series.
+
+Using Log-Incompatible Feature Flags
+````````````````````````````````````
+
+Starting with XFS v5, the superblock contains a ``sb_features_log_incompat``
+field to indicate that the log contains records that might not be readable by
+all kernels that could mount this filesystem.
+In short, log incompat features protect the log contents against kernels that
+will not understand the contents.
+Unlike the other superblock feature bits, log incompat bits are ephemeral
+because an empty (clean) log does not need protection.
+The log cleans itself after its contents have been committed into the
+filesystem, either as part of an unmount or because the system is otherwise
+idle.
+Because upper level code can be working on a transaction at the same time that
+the log cleans itself, it is necessary for upper level code to communicate to
+the log when it is going to use a log incompatible feature.
+
+The log coordinates access to incompatible features through the use of one
+``struct rw_semaphore`` for each feature.
+The log cleaning code tries to take this rwsem in exclusive mode to clear the
+bit; if the lock attempt fails, the feature bit remains set.
+Filesystem code signals its intention to use a log incompat feature in a
+transaction by calling ``xlog_use_incompat_feat``, which takes the rwsem in
+shared mode.
+The code supporting a log incompat feature should create wrapper functions to
+obtain the log feature and call ``xfs_add_incompat_log_feature`` to set the
+feature bits in the primary superblock.
+The superblock update is performed transactionally, so the wrapper to obtain
+log assistance must be called just prior to the creation of the transaction
+that uses the functionality.
+For a file operation, this step must happen after taking the IOLOCK and the
+MMAPLOCK, but before allocating the transaction.
+When the transaction is complete, the ``xlog_drop_incompat_feat`` function
+is called to release the feature.
+The feature bit will not be cleared from the superblock until the log becomes
+clean.
+
+Log-assisted extended attribute updates and atomic extent swaps both use log
+incompat features and provide convenience wrappers around the functionality.
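+
+A sketch of such a wrapper follows; the ``xlog_*`` helpers and
+``xfs_add_incompat_log_feature`` are named in the description above, but
+their exact signatures and the feature flag shown here are assumptions:
+
+.. code-block:: c
+
+	/* Sketch of obtaining log assistance before allocating a transaction. */
+	int
+	xfs_example_use_log_feature(struct xfs_mount *mp)
+	{
+		int	error;
+
+		/* Take the per-feature rwsem in shared mode. */
+		xlog_use_incompat_feat(mp->m_log);
+
+		/* Transactionally set the bit in the primary superblock. */
+		error = xfs_add_incompat_log_feature(mp,
+				XFS_SB_FEAT_INCOMPAT_LOG_EXAMPLE);
+		if (error)
+			xlog_drop_incompat_feat(mp->m_log);
+		return error;
+	}
+
+The caller would allocate and commit its transaction after this function
+succeeds, and then call ``xlog_drop_incompat_feat`` to release the feature.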
+
+Mechanics of an Atomic Extent Swap
+``````````````````````````````````
+
+Swapping entire file forks is a complex task.
+The goal is to exchange all file fork mappings between two file fork offset
+ranges.
+There are likely to be many extent mappings in each fork, and the edges of
+the mappings aren't necessarily aligned.
+Furthermore, there may be other updates that need to happen after the swap,
+such as exchanging file sizes, inode flags, or conversion of fork data to local
+format.
+This is roughly the format of the new deferred extent swap work item:
+
+.. code-block:: c
+
+	struct xfs_swapext_intent {
+	    /* Inodes participating in the operation. */
+	    struct xfs_inode    *sxi_ip1;
+	    struct xfs_inode    *sxi_ip2;
+
+	    /* File offset range information. */
+	    xfs_fileoff_t       sxi_startoff1;
+	    xfs_fileoff_t       sxi_startoff2;
+	    xfs_filblks_t       sxi_blockcount;
+
+	    /* Set these file sizes after the operation, unless negative. */
+	    xfs_fsize_t         sxi_isize1;
+	    xfs_fsize_t         sxi_isize2;
+
+	    /* XFS_SWAP_EXT_* log operation flags */
+	    uint64_t            sxi_flags;
+	};
+
+The new log intent item contains enough information to track two logical fork
+offset ranges: ``(inode1, startoff1, blockcount)`` and ``(inode2, startoff2,
+blockcount)``.
+Each step of a swap operation exchanges the largest file range mapping possible
+from one file to the other.
+After each step in the swap operation, the two startoff fields are incremented
+and the blockcount field is decremented to reflect the progress made.
+The flags field captures behavioral parameters such as swapping the attr fork
+instead of the data fork and other work to be done after the extent swap.
+The two isize fields are used to swap the file size at the end of the operation
+if the file data fork is the target of the swap operation.
+
+When the extent swap is initiated, the sequence of operations is as follows:
+
+1. Create a deferred work item for the extent swap.
+   At the start, it should contain the entirety of the file ranges to be
+   swapped.
+
+2. Call ``xfs_defer_finish`` to start processing of the exchange.
+   This will log an extent swap intent item to the transaction for the deferred
+   extent swap work item.
+
+3. Until ``sxi_blockcount`` of the deferred extent swap work item is zero,
+
+   a. Read the block maps of both file ranges starting at ``sxi_startoff1`` and
+      ``sxi_startoff2``, respectively, and compute the longest extent that can
+      be swapped in a single step.
+      This is the minimum of the two ``br_blockcount`` values in the mappings.
+      Keep advancing through the file forks until at least one of the mappings
+      contains written blocks.
+      Mutual holes, unwritten extents, and extent mappings to the same physical
+      space are not exchanged.
+
+      For the next few steps, this document will refer to the mapping that came
+      from file 1 as "map1", and the mapping that came from file 2 as "map2".
+
+   b. Create a deferred block mapping update to unmap map1 from file 1.
+
+   c. Create a deferred block mapping update to unmap map2 from file 2.
+
+   d. Create a deferred block mapping update to map map1 into file 2.
+
+   e. Create a deferred block mapping update to map map2 into file 1.
+
+   f. Log the block, quota, and extent count updates for both files.
+
+   g. Extend the ondisk size of either file if necessary.
+
+   h. Log an extent swap done log item for the extent swap intent log item
+      that was read at the start of step 3.
+
+   i. Compute the amount of file range that has just been covered.
+      This quantity is ``(map1.br_startoff + map1.br_blockcount -
+      sxi_startoff1)``, because step 3a could have skipped holes.
+
+   j. Increase the starting offsets of ``sxi_startoff1`` and ``sxi_startoff2``
+      by the number of blocks computed in the previous step, and decrease
+      ``sxi_blockcount`` by the same quantity.
+      This advances the cursor.
+
+   k. Log a new extent swap intent log item reflecting the advanced state of
+      the work item.
+
+   l. Return the proper error code (EAGAIN) to the deferred operation manager
+      to inform it that there is more work to be done.
+      The operation manager completes the deferred work in steps 3b-3e before
+      moving back to the start of step 3.
+
+4. Perform any post-processing.
+   This will be discussed in more detail in subsequent sections.
+
+If the filesystem goes down in the middle of an operation, log recovery will
+find the most recent unfinished extent swap log intent item and restart from
+there.
+This is how extent swapping guarantees that an outside observer will either see
+the old broken structure or the new one, and never a mishmash of both.
+
+Extent Swapping with Regular User Files
+```````````````````````````````````````
+
+As mentioned earlier, XFS has long had the ability to swap extents between
+files, which is used almost exclusively by ``xfs_fsr`` to defragment files.
+The earliest form of this was the fork swap mechanism, where the entire
+contents of data forks could be exchanged between two files by exchanging the
+raw bytes in each inode fork's immediate area.
+When XFS v5 came along with self-describing metadata, this old mechanism grew
+some log support to continue rewriting the owner fields of BMBT blocks during
+log recovery.
+When the reverse mapping btree was later added to XFS, the only way to maintain
+the consistency of the fork mappings with the reverse mapping index was to
+develop an iterative mechanism that used deferred bmap and rmap operations to
+swap mappings one at a time.
+This mechanism is identical to steps 2-3 from the procedure above except for
+the new tracking items, because the atomic extent swap mechanism is an
+iteration of an existing mechanism and not something totally novel.
+For the narrow case of file defragmentation, the file contents must be
+identical, so the recovery guarantees are not much of a gain.
+
+Atomic extent swapping is much more flexible than the existing swapext
+implementations because it can guarantee that the caller never sees a mix of
+old and new contents even after a crash, and it can operate on two arbitrary
+file fork ranges.
+The extra flexibility enables several new use cases:
+
+- **Atomic commit of file writes**: A userspace process opens a file that it
+  wants to update.
+  Next, it opens a temporary file and calls the file clone operation to reflink
+  the first file's contents into the temporary file.
+  Writes to the original file should instead be written to the temporary file.
+  Finally, the process calls the atomic extent swap system call
+  (``FIEXCHANGE_RANGE``) to exchange the file contents, thereby committing all
+  of the updates to the original file, or none of them.
+
+- **Transactional file updates**: The same mechanism as above, but the caller
+  only wants the commit to occur if the original file's contents have not
+  changed.
+  To make this happen, the calling process snapshots the file modification and
+  change timestamps of the original file before reflinking its data to the
+  temporary file.
+  When the program is ready to commit the changes, it passes the timestamps
+  into the kernel as arguments to the atomic extent swap system call.
+  The kernel only commits the changes if the provided timestamps match the
+  original file.
+
+- **Emulation of atomic block device writes**: Export a block device with a
+  logical sector size matching the filesystem block size to force all writes
+  to be aligned to the filesystem block size.
+  Stage all writes to a temporary file, and when that is complete, call the
+  atomic extent swap system call with a flag to indicate that holes in the
+  temporary file should be ignored.
+  This emulates an atomic device write in software, and can support arbitrary
+  scattered writes.
+
+Preparation for Extent Swapping
+```````````````````````````````
+
+There are a few things that need to be taken care of before initiating an
+atomic extent swap operation.
+First, regular files require the page cache to be flushed to disk before the
+operation begins, and directio writes to be quiesced.
+Like any filesystem operation, extent swapping must determine the maximum
+amount of disk space and quota that can be consumed on behalf of both files in
+the operation, and reserve that quantity of resources to avoid an unrecoverable
+out of space failure once it starts dirtying metadata.
+The preparation step scans the ranges of both files to estimate:
+
+- Data device blocks needed to handle the repeated updates to the fork
+  mappings.
+- Change in data and realtime block counts for both files.
+- Increase in quota usage for both files, if the two files do not share the
+  same set of quota ids.
+- The number of extent mappings that will be added to each file.
+- Whether or not there are partially written realtime extents.
+  User programs must never be able to access a realtime file extent that maps
+  to different extents on the realtime volume, which could happen if the
+  operation fails to run to completion.
+
+The need for precise estimation increases the run time of the swap operation,
+but it is very important to maintain correct accounting.
+The filesystem must not run completely out of free space, nor can the extent
+swap ever add more extent mappings to a fork than it can support.
+Regular users are required to abide by the quota limits, though metadata
+repairs
+may exceed quota to resolve inconsistent metadata elsewhere.
+
+Special Features for Swapping Metadata File Extents
+```````````````````````````````````````````````````
+
+Extended attributes, symbolic links, and directories can set the fork format to
+"local" and treat the fork as a literal area for data storage.
+Metadata repairs must take extra steps to support these cases:
+
+- If both forks are in local format and the fork areas are large enough, the
+  swap is performed by copying the incore fork contents, logging both forks,
+  and committing.
+  The atomic extent swap mechanism is not necessary, since this can be done
+  with a single transaction.
+
+- If both forks map blocks, then the regular atomic extent swap is used.
+
+- Otherwise, only one fork is in local format.
+  The contents of the local format fork are converted to a block to perform the
+  swap.
+  The conversion to block format must be done in the same transaction that
+  logs the initial extent swap intent log item.
+  The regular atomic extent swap is used to exchange the mappings.
+  Special flags are set on the swap operation so that the transaction can be
+  rolled one more time to convert the second file's fork back to local format
+  if possible.
+
+Extended attributes and directories stamp the owning inode into every block,
+but the buffer verifiers do not actually check the inode number!
+Although there is no verification, it is still important to maintain
+referential integrity, so prior to performing the extent swap, online repair
+walks every block in the new data structure to update the owner field and flush
+the buffer to disk.
+
+After a successful swap operation, the repair operation must reap the old fork
+blocks by processing each fork mapping through the standard :ref:`file extent
+reaping <reaping>` mechanism that is done post-repair.
+If the filesystem should go down during the reap part of the repair, the
+iunlink processing at the end of recovery will free both the temporary file and
+whatever blocks were not reaped.
+However, this iunlink processing omits the cross-link detection of online
+repair, and is not completely foolproof.
+
+Swapping Temporary File Extents
+```````````````````````````````
+
+To repair a metadata file, online repair proceeds as follows:
+
+1. Create a temporary repair file.
+
+2. Use the staging data to write out new contents into the temporary repair
+   file.
+   The same fork must be written to as is being repaired.
+
+3. Commit the scrub transaction, since the swap estimation step must be
+   completed before transaction reservations are made.
+
+4. Call ``xrep_tempswap_trans_alloc`` to allocate a new scrub transaction with
+   the appropriate resource reservations, locks, and fill out a ``struct
+   xfs_swapext_req`` with the details of the swap operation.
+
+5. Call ``xrep_tempswap_contents`` to swap the contents.
+
+6. Commit the transaction to complete the repair.
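+
+In code form, steps 4 through 6 might look like the following sketch; the
+``xrep_tempswap_*`` helpers are named in the list above, but their signatures
+are assumptions, and error paths that cancel the transaction are omitted for
+brevity:
+
+.. code-block:: c
+
+	/* Sketch of committing repaired contents from the temporary file. */
+	STATIC int
+	xrep_example_commit_tempfile(struct xfs_scrub *sc)
+	{
+		struct xfs_swapext_req	req;
+		int			error;
+
+		/* Reserve resources, relock the files, fill out the request. */
+		error = xrep_tempswap_trans_alloc(sc, XFS_DATA_FORK, &req);
+		if (error)
+			return error;
+
+		/* Exchange the fork mappings of the two files. */
+		error = xrep_tempswap_contents(sc, &req);
+		if (error)
+			return error;
+
+		/* Commit the swap to complete the repair. */
+		return xfs_trans_commit(sc->tp);
+	}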
+
+.. _rtsummary:
+
+Case Study: Repairing the Realtime Summary File
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+In the "realtime" section of an XFS filesystem, free space is tracked via a
+bitmap, similar to Unix FFS.
+Each bit in the bitmap represents one realtime extent, which is a multiple of
+the filesystem block size between 4KiB and 1GiB in size.
+The realtime summary file indexes the number of free extents of a given size to
+the offset of the block within the realtime free space bitmap where those free
+extents begin.
+In other words, the summary file helps the allocator find free extents by
+length, similar to what the free space by count (cntbt) btree does for the data
+section.
+
+The summary file itself is a flat file (with no block headers or checksums!)
+partitioned into ``log2(total rt extents)`` sections containing enough 32-bit
+counters to match the number of blocks in the rt bitmap.
+Each counter records the number of free extents that start in that bitmap block
+and can satisfy a power-of-two allocation request.
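+
+Locating the counter for a given free extent is a simple computation, similar
+to the kernel's ``XFS_SUMOFFS`` macro; step 2a of the checking procedure below
+performs it for every free extent found in the bitmap:
+
+.. code-block:: c
+
+  /*
+   * Illustrative sketch: find the summary counter covering a free extent
+   * whose length falls in the 2^log2_len size class and which starts in
+   * realtime bitmap block bbno.
+   */
+  static inline unsigned int
+  rtsummary_offset(struct xfs_mount *mp, unsigned int log2_len,
+          xfs_fileoff_t bbno)
+  {
+          return log2_len * mp->m_sb.sb_rbmblocks + bbno;
+  }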
+
+To check the summary file against the bitmap:
+
+1. Take the ILOCK of both the realtime bitmap and summary files.
+
+2. For each free space extent recorded in the bitmap:
+
+   a. Compute the position in the summary file that contains a counter that
+      represents this free extent.
+
+   b. Read the counter from the xfile.
+
+   c. Increment it, and write it back to the xfile.
+
+3. Compare the contents of the xfile against the ondisk file.
+
+To repair the summary file, write the xfile contents into the temporary file
+and use atomic extent swap to commit the new contents.
+The temporary file is then reaped.
+
+The proposed patchset is the
+`realtime summary repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-rtsummary>`_
+series.
+
+Case Study: Salvaging Extended Attributes
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+In XFS, extended attributes are implemented as a namespaced name-value store.
+Values are limited in size to 64KiB, but there is no limit on the number of
+names.
+The attribute fork is unpartitioned, which means that the root of the attribute
+structure is always in logical block zero, but attribute leaf blocks, dabtree
+index blocks, and remote value blocks are intermixed.
+Attribute leaf blocks contain variable-sized records that associate
+user-provided names with the user-provided values.
+Values larger than a block are allocated separate extents and written there.
+If the leaf information expands beyond a single block, a directory/attribute
+btree (``dabtree``) is created to map hashes of attribute names to entries
+for fast lookup.
+
+Salvaging extended attributes is done as follows:
+
+1. Walk the attr fork mappings of the file being repaired to find the attribute
+   leaf blocks.
+   When one is found,
+
+   a. Walk the attr leaf block to find candidate keys.
+      When one is found,
+
+      1. Check the name for problems, and ignore the name if there are any.
+
+      2. Retrieve the value.
+         If that succeeds, add the name and value to the staging xfarray and
+         xfblob.
+
+2. If the memory usage of the xfarray and xfblob exceeds a certain amount of
+   memory or there are no more attr fork blocks to examine, unlock the file and
+   add the staged extended attributes to the temporary file.
+
+3. Use atomic extent swapping to exchange the new and old extended attribute
+   structures.
+   The old attribute blocks are now attached to the temporary file.
+
+4. Reap the temporary file.
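+
+The memory-pressure handling in step 2 might look like the following sketch;
+every helper name and the flush threshold here are illustrative assumptions:
+
+.. code-block:: c
+
+  /* Illustrative names only; not the actual repair code. */
+  struct xfarray;
+  struct xfblob;
+
+  int salvage_attr_leaf(struct xfs_scrub *sc, struct xfarray *names,
+          struct xfblob *values);
+  unsigned long staged_bytes(struct xfarray *names, struct xfblob *values);
+  int flush_staged_attrs(struct xfs_scrub *sc, struct xfarray *names,
+          struct xfblob *values);
+  void unlock_repair_file(struct xfs_scrub *sc);
+  void relock_repair_file(struct xfs_scrub *sc);
+
+  #define MAX_STAGED_BYTES        (32U << 20)     /* assumed threshold */
+
+  static int
+  stash_or_flush(struct xfs_scrub *sc, struct xfarray *names,
+          struct xfblob *values)
+  {
+          int     error;
+
+          /* Step 1a: salvage candidate keys from one attr leaf block. */
+          error = salvage_attr_leaf(sc, names, values);
+          if (error)
+                  return error;
+
+          if (staged_bytes(names, values) < MAX_STAGED_BYTES)
+                  return 0;
+
+          /* Step 2: too much staged data; drop the lock and flush. */
+          unlock_repair_file(sc);
+          error = flush_staged_attrs(sc, names, values);
+          relock_repair_file(sc);
+          return error;
+  }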
+
+The proposed patchset is the
+`extended attribute repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-xattrs>`_
+series.


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 12/14] xfs: document directory tree repairs
  2022-12-30 22:10 ` [PATCHSET v24.0 00/14] xfs: design documentation for online fsck Darrick J. Wong
                     ` (12 preceding siblings ...)
  2022-12-30 22:10   ` [PATCH 10/14] xfs: document full filesystem scans for online fsck Darrick J. Wong
@ 2022-12-30 22:10   ` Darrick J. Wong
  2023-01-14  2:32     ` [PATCH v24.2 " Darrick J. Wong
  2023-02-03  2:12     ` [PATCH v24.3 " Darrick J. Wong
  2023-03-07  1:30   ` [PATCHSET v24.3 00/14] xfs: design documentation for online fsck Darrick J. Wong
  2023-03-07  1:30   ` Darrick J. Wong
  15 siblings, 2 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:10 UTC (permalink / raw)
  To: djwong
  Cc: linux-xfs, willy, chandan.babu, allison.henderson, linux-fsdevel,
	hch, catherine.hoang, david

From: Darrick J. Wong <djwong@kernel.org>

Directory tree repairs are the least complete part of online fsck, due
to the lack of directory parent pointers.  However, even without that
feature, we can still make some corrections to the directory tree -- we
can salvage as many directory entries as we can from a damaged
directory, and we can reattach orphaned inodes to the lost+found, just
as xfs_repair does now.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 .../filesystems/xfs-online-fsck-design.rst         |  236 ++++++++++++++++++++
 1 file changed, 236 insertions(+)


diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst
index e32506acb66f..2e20314f1831 100644
--- a/Documentation/filesystems/xfs-online-fsck-design.rst
+++ b/Documentation/filesystems/xfs-online-fsck-design.rst
@@ -4269,3 +4269,239 @@ The proposed patchset is the
 `extended attribute repair
 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-xattrs>`_
 series.
+
+Fixing Directories
+------------------
+
+Fixing directories is difficult with currently available filesystem features.
+The offline repair tool scans all inodes to find files with nonzero link count,
+and then it scans all directories to establish parentage of those linked files.
+Damaged files and directories are zapped, and files with no parent are
+moved to the ``/lost+found`` directory.
+It does not try to salvage anything.
+
+The best that online repair can do at this time is to read directory data
+blocks and salvage any dirents that look plausible, correct link counts, and
+move orphans back into the directory tree.
+The salvage process is discussed in the case study at the end of this section.
+The second component to fixing the directory tree online is the :ref:`file link
+count fsck <nlinks>`, since it can scan the entire filesystem to make sure that
+files can neither be deleted while there are still parents nor forgotten after
+all parents sever their links to the child.
+The third part is discussed at the :ref:`end of this section<orphanage>`.
+However, there may be a solution to these deficiencies soon!
+
+Parent Pointers
+```````````````
+
+The lack of secondary directory metadata hinders directory tree reconstruction
+in much the same way that the historic lack of reverse space mapping
+information once hindered reconstruction of filesystem space metadata.
+Specifically, the lack of redundant metadata makes it nearly impossible to
+construct a true replacement for a damaged directory; the best repair can do is
+to salvage the dirents and use the file link count repair function to move
+orphaned files to the lost and found.
+The proposed parent pointer feature, however, will make total directory
+reconstruction possible.
+
+Directory parent pointers were first proposed as an XFS feature more than a
+decade ago by SGI.
+In that implementation, each link from a parent directory to a child file was
+augmented by an extended attribute in the child that could be used to identify
+the parent directory.
+Unfortunately, this early implementation had several major shortcomings:
+
+1. The XFS codebase of the late 2000s did not have the infrastructure to
+   enforce strong referential integrity in the directory tree, which is a fancy
+   way to say that it could not guarantee that a change in a forward link would
+   always be followed up by a corresponding change to the reverse links.
+
+2. Referential integrity was not integrated into either offline repair tool.
+   Checking had to be done online without taking any kernel or inode locks to
+   coordinate access.
+   It is not clear if this actually worked properly.
+
+3. The extended attribute did not record the name of the directory entry in the
+   parent, so the first parent pointer implementation cannot be used to
+   reconnect the directory tree.
+
+4. Extended attribute forks only support 65,536 extents, which means that
+   parent pointer attribute creation is likely to fail at some point before the
+   maximum file link count is achieved.
+
+In the second implementation (currently being developed by Allison Henderson
+and Chandan Babu), the extended attribute code will be enhanced to use log
+intent items to guarantee that an extended attribute update can always be
+completed by log recovery.
+The maximum extent counts of both the data and attribute forks have been
+raised to allow for creation of as many parent pointers as possible.
+The parent pointer data will also include the entry name and location within
+the parent.
+In other words, child files will store parent pointer mappings of the form
+``(parent_ino, parent_gen, dirent_pos) → (dirent_name)`` in their extended
+attribute data.
+With that in place, XFS can guarantee strong referential integrity of directory
+tree operations -- forward links will always be complemented with reverse
+links.
+
+When the parent pointer feature lands, the directory checking process can be
+strengthened to ensure that the target of each dirent also contains a parent
+pointer pointing back to the dirent.
+The quality of directory repairs will improve because online fsck will be able
+to reconstruct a directory in its entirety instead of skipping unsalvageable
+areas.
+This process is imagined to involve a :ref:`coordinated inode scan <iscan>` and
+a :ref:`directory entry live update hook <liveupdate>`:
+Scan every file in the entire filesystem, and every time the scan encounters a
+file with a parent pointer to the directory that is being reconstructed, record
+this entry in the temporary directory.
+When the scan is complete, atomically swap the contents of the temporary
+directory and the directory being repaired.
+This code has not yet been constructed, so there is not yet a case study laying
+out exactly how this process works.
+
+Parent pointers themselves can be checked by scanning each pointer and
+verifying that the target of the pointer is a directory and that it contains a
+dirent that corresponds to the information recorded in the parent pointer.
+Reconstruction of the parent pointer information will work similarly to
+directory reconstruction -- scan the filesystem, record the dirents pointing to
+the file being repaired, and rebuild that part of the xattr namespace.
+
+**Question**: How will repair ensure that the ``dirent_pos`` fields match in
+the reconstructed directory?
+
+*Answer*: The field could be designated advisory, since the other three values
+are sufficient to find the entry in the parent.
+However, this makes indexed key lookup impossible while repairs are ongoing.
+A second option would be to allow creating directory entries at specified
+offsets, which solves the referential integrity problem but runs the risk that
+dirent creation will fail due to conflicts with the free space in the
+directory.
+A third option would be to resolve these conflicts by appending the directory
+entry and amending the xattr code to support updating an xattr key and
+reindexing the dabtree, though this would have to be performed with the parent
+directory still locked.
+A fourth option would be to remove the parent pointer entry and re-add it
+atomically.
+
+Case Study: Salvaging Directories
+`````````````````````````````````
+
+Unlike extended attributes, directory blocks are all the same size, so
+salvaging directories is straightforward:
+
+1. Find the parent of the directory.
+   If the dotdot entry is readable, try to confirm that the alleged
+   parent has a child entry pointing back to the directory being repaired.
+   Otherwise, walk the filesystem to find it.
+
+2. Walk the first partition of the data fork of the directory to find the
+   directory entry data blocks.
+   When one is found,
+
+   a. Walk the directory data block to find candidate entries.
+      When an entry is found:
+
+      i. Check the name for problems, and ignore the name if there are any.
+
+      ii. Retrieve the inumber and grab the inode.
+          If that succeeds, add the name, inode number, and file type to the
+          staging xfarray and xfblob.
+
+3. If the memory usage of the xfarray and xfblob exceeds a certain amount of
+   memory or there are no more directory data blocks to examine, unlock the
+   directory and add the staged dirents into the temporary directory.
+   Truncate the staging files.
+
+4. Use atomic extent swapping to exchange the new and old directory structures.
+   The old directory blocks are now attached to the temporary file.
+
+5. Reap the temporary file.
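+
+The fixed-size portion of each salvaged dirent (step 2 above) can be staged in
+an xfarray record while the variable-length name goes into the xfblob.
+A possible staging record, purely for illustration:
+
+.. code-block:: c
+
+  /* Illustrative staging record for one salvaged directory entry. */
+  typedef uint64_t xfblob_cookie;         /* assumed xfblob handle type */
+
+  struct xrep_dirent_stage {
+          xfblob_cookie   name_cookie;    /* name bytes live in the xfblob */
+          xfs_ino_t       ino;            /* inumber from the old dirent */
+          uint8_t         namelen;        /* length of the salvaged name */
+          uint8_t         ftype;          /* file type (XFS_DIR3_FT_*) */
+  };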
+
+**Question**: Should repair invalidate dentries when rebuilding a directory?
+
+**Question**: Can the dentry cache know about a directory entry that cannot be
+salvaged?
+
+In theory, the dentry cache should be a subset of the directory entries on disk
+because there's no way to load a dentry without having something to read in the
+directory.
+However, it is possible for a coherency problem to be introduced if the ondisk
+structures become corrupt *after* the cache loads.
+In theory it is necessary to scan all dentry cache entries for a directory to
+ensure that one of the following apply:
+
+1. The cached dentry reflects an ondisk dirent in the new directory.
+
+2. The cached dentry no longer has a corresponding ondisk dirent in the new
+   directory and the dentry can be purged from the cache.
+
+3. The cached dentry no longer has an ondisk dirent but the dentry cannot be
+   purged.
+   This is bad.
+
+Unfortunately, the dentry cache does not have a means to walk all the dentries
+with a particular directory as a parent.
+This makes detecting situations #2 and #3 impossible, and remains an
+interesting question for research.
+
+The proposed patchset is the
+`directory repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-dirs>`_
+series.
+
+.. _orphanage:
+
+The Orphanage
+-------------
+
+Filesystems present files as a directed, and hopefully acyclic, graph.
+In other words, a tree.
+The root of the filesystem is a directory, and each entry in a directory points
+downwards either to more subdirectories or to non-directory files.
+Unfortunately, a disruption in the directory graph pointers results in a
+disconnected graph, which makes files impossible to access via regular path
+resolution.
+The directory parent pointer online scrub code can detect a dotdot entry
+pointing to a parent directory that doesn't have a link back to the child
+directory, and the file link count checker can detect a file that isn't pointed
+to by any directory in the filesystem.
+If the file in question has a positive link count, the file is an orphan.
+
+When orphans are found, they should be reconnected to the directory tree.
+Offline fsck solves the problem by creating a directory ``/lost+found`` to
+serve as an orphanage, and linking orphan files into the orphanage by using the
+inumber as the name.
+Reparenting a file to the orphanage does not reset any of its permissions or
+ACLs.
+
+This process is more involved in the kernel than it is in userspace.
+The directory and file link count repair setup functions must use the regular
+VFS mechanisms to create the orphanage directory with all the necessary
+security attributes and dentry cache entries, just like a regular directory
+tree modification.
+
+Orphaned files are adopted by the orphanage as follows:
+
+1. Call ``xrep_orphanage_try_create`` at the start of the scrub setup function
+   to try to ensure that the lost and found directory actually exists.
+   This also attaches the orphanage directory to the scrub context.
+
+2. If the decision is made to reconnect a file, take the IOLOCK of both the
+   orphanage and the file being reattached.
+   The ``xrep_orphanage_iolock_two`` function follows the inode locking
+   strategy discussed earlier.
+
+3. Call ``xrep_orphanage_compute_blkres`` and ``xrep_orphanage_compute_name``
+   to compute the new name in the orphanage and the block reservation required.
+
+4. Use ``xrep_orphanage_adoption_prep`` to reserve resources to the repair
+   transaction.
+
+5. Call ``xrep_orphanage_adopt`` to reparent the orphaned file into the lost
+   and found, and update the kernel dentry cache.
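+
+Strung together, the adoption sequence might look like this sketch; the
+function names come from the steps above, but the signatures and the
+``xrep_adoption`` context structure are illustrative assumptions:
+
+.. code-block:: c
+
+  /* Illustrative sketch of the adoption sequence. */
+  static int
+  xrep_adopt_orphan(struct xfs_scrub *sc, struct xrep_adoption *adopt)
+  {
+          int     error;
+
+          /* Step 1 (xrep_orphanage_try_create) already ran during setup. */
+
+          /* Step 2: lock the orphanage and the file being reattached. */
+          error = xrep_orphanage_iolock_two(sc);
+          if (error)
+                  return error;
+
+          /* Step 3: compute the new name and the block reservation. */
+          error = xrep_orphanage_compute_name(sc, adopt);
+          if (error)
+                  return error;
+          error = xrep_orphanage_compute_blkres(sc, adopt);
+          if (error)
+                  return error;
+
+          /* Step 4: attach resources to the repair transaction. */
+          error = xrep_orphanage_adoption_prep(sc, adopt);
+          if (error)
+                  return error;
+
+          /* Step 5: reparent the file and update the dentry cache. */
+          return xrep_orphanage_adopt(sc, adopt);
+  }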
+
+The proposed patches are in the
+`orphanage adoption
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-orphanage>`_
+series.


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 13/14] xfs: document the userspace fsck driver program
  2022-12-30 22:10 ` [PATCHSET v24.0 00/14] xfs: design documentation for online fsck Darrick J. Wong
                     ` (8 preceding siblings ...)
  2022-12-30 22:10   ` [PATCH 06/14] xfs: document how online fsck deals with eventual consistency Darrick J. Wong
@ 2022-12-30 22:10   ` Darrick J. Wong
  2023-03-01  5:36     ` Allison Henderson
  2022-12-30 22:10   ` [PATCH 14/14] xfs: document future directions of online fsck Darrick J. Wong
                     ` (5 subsequent siblings)
  15 siblings, 1 reply; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:10 UTC (permalink / raw)
  To: djwong
  Cc: linux-xfs, willy, chandan.babu, allison.henderson, linux-fsdevel,
	hch, catherine.hoang, david

From: Darrick J. Wong <djwong@kernel.org>

Add the sixth chapter of the online fsck design documentation, where
we discuss the details of the data structures and algorithms used by the
driver program xfs_scrub.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 .../filesystems/xfs-online-fsck-design.rst         |  313 ++++++++++++++++++++
 1 file changed, 313 insertions(+)


diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst
index 2e20314f1831..05b9411fac7f 100644
--- a/Documentation/filesystems/xfs-online-fsck-design.rst
+++ b/Documentation/filesystems/xfs-online-fsck-design.rst
@@ -300,6 +300,9 @@ The seven phases are as follows:
 7. Re-check the summary counters and present the caller with a summary of
    space usage and file counts.
 
+This allocation of responsibilities will be :ref:`revisited <scrubcheck>`
+later in this document.
+
 Steps for Each Scrub Item
 -------------------------
 
@@ -4505,3 +4508,313 @@ The proposed patches are in the
 `orphanage adoption
 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-orphanage>`_
 series.
+
+6. Userspace Algorithms and Data Structures
+===========================================
+
+This section discusses the key algorithms and data structures of the userspace
+program, ``xfs_scrub``, that provide the ability to drive metadata checks and
+repairs in the kernel, verify file data, and look for other potential problems.
+
+.. _scrubcheck:
+
+Checking Metadata
+-----------------
+
+Recall the :ref:`phases of fsck work<scrubphases>` outlined earlier.
+That structure follows naturally from the data dependencies designed into the
+filesystem from its beginnings in 1993.
+In XFS, there are several groups of metadata dependencies:
+
+a. Filesystem summary counts depend on consistency within the inode indices,
+   the allocation group space btrees, and the realtime volume space
+   information.
+
+b. Quota resource counts depend on consistency within the quota file data
+   forks, inode indices, inode records, and the forks of every file on the
+   system.
+
+c. The naming hierarchy depends on consistency within the directory and
+   extended attribute structures.
+   This includes file link counts.
+
+d. Directories, extended attributes, and file data depend on consistency within
+   the file forks that map directory and extended attribute data to physical
+   storage media.
+
+e. The file forks depend on consistency within inode records and the space
+   metadata indices of the allocation groups and the realtime volume.
+   This includes quota and realtime metadata files.
+
+f. Inode records depend on consistency within the inode metadata indices.
+
+g. Realtime space metadata depend on the inode records and data forks of the
+   realtime metadata inodes.
+
+h. The allocation group metadata indices (free space, inodes, reference count,
+   and reverse mapping btrees) depend on consistency within the AG headers and
+   between all the AG metadata btrees.
+
+i. ``xfs_scrub`` depends on the filesystem being mounted and kernel support
+   for online fsck functionality.
+
+Therefore, a metadata dependency graph is a convenient way to schedule checking
+operations in the ``xfs_scrub`` program:
+
+- Phase 1 checks that the provided path maps to an XFS filesystem and detects
+  the kernel's scrubbing abilities, which validates group (i).
+
+- Phase 2 scrubs groups (g) and (h) in parallel using a threaded workqueue.
+
+- Phase 3 checks groups (f), (e), and (d), in that order.
+  These groups are all file metadata, which means that inodes are scanned in
+  parallel.
+
+- Phase 4 repairs everything in groups (i) through (d) so that phases 5 and 6
+  may run reliably.
+
+- Phase 5 starts by checking groups (b) and (c) in parallel before moving on
+  to checking names.
+
+- Phase 6 depends on groups (i) through (b) to find file data blocks to verify,
+  to read them, and to report which blocks of which files are affected.
+
+- Phase 7 checks group (a), having validated everything else.
+
+Notice that the data dependencies between groups are enforced by the structure
+of the program flow.
+
+Parallel Inode Scans
+--------------------
+
+An XFS filesystem can easily contain hundreds of millions of inodes.
+Given that XFS targets installations with large high-performance storage,
+it is desirable to scrub inodes in parallel to minimize runtime, particularly
+if the program has been invoked manually from a command line.
+This requires careful scheduling to keep the threads as evenly loaded as
+possible.
+
+Early iterations of the ``xfs_scrub`` inode scanner naïvely created a single
+workqueue and scheduled a single workqueue item per AG.
+Each workqueue item walked the inode btree (with ``XFS_IOC_INUMBERS``) to find
+inode chunks and then called bulkstat (``XFS_IOC_BULKSTAT``) to gather enough
+information to construct file handles.
+The file handle was then passed to a function to generate scrub items for each
+metadata object of each inode.
+This simple algorithm leads to thread balancing problems in phase 3 if the
+filesystem contains one AG with a few large sparse files and the rest of the
+AGs contain many smaller files.
+The inode scan dispatch function was not sufficiently granular; it should have
+been dispatching at the level of individual inodes, or, to constrain memory
+consumption, inode btree records.
+
+Thanks to Dave Chinner, bounded workqueues in userspace enable ``xfs_scrub`` to
+avoid this problem with ease by adding a second workqueue.
+Just like before, the first workqueue is seeded with one workqueue item per AG,
+and it uses INUMBERS to find inode btree chunks.
+The second workqueue, however, is configured with an upper bound on the number
+of items that can be waiting to be run.
+Each inode btree chunk found by the first workqueue's workers is queued to the
+second workqueue, and it is this second workqueue that queries BULKSTAT,
+creates a file handle, and passes it to a function to generate scrub items for
+each metadata object of each inode.
+If the second workqueue is too full, the workqueue add function blocks the
+first workqueue's workers until the backlog eases.
+This doesn't completely solve the balancing problem, but reduces it enough to
+move on to more pressing issues.
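+
+In outline, the two-level scan looks something like the sketch below.
+The names and signatures here are illustrative stand-ins; the real code uses
+the xfsprogs workqueue API:
+
+.. code-block:: c
+
+  #include <stdint.h>
+  #include <stddef.h>
+
+  struct workqueue;
+  struct ichunk;                          /* one inobt record's inodes */
+
+  /* Illustrative stand-ins for the real xfs_scrub helpers. */
+  extern struct workqueue bulkstat_wq;    /* bounded: add blocks when full */
+  int workqueue_add(struct workqueue *wq, void (*fn)(void *), void *arg);
+  struct ichunk *next_inumbers_chunk(uint32_t agno);
+  void bulkstat_ichunk(void *arg);        /* BULKSTAT + emit scrub items */
+
+  /* First workqueue: one of these workers runs per AG. */
+  void
+  scan_ag_worker(uint32_t agno)
+  {
+          struct ichunk   *ic;
+
+          while ((ic = next_inumbers_chunk(agno)) != NULL) {
+                  /*
+                   * Blocks when the bounded queue is full, throttling the
+                   * INUMBERS walk and bounding memory consumption.
+                   */
+                  workqueue_add(&bulkstat_wq, bulkstat_ichunk, ic);
+          }
+  }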
+
+The proposed patchsets are the scrub
+`performance tweaks
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-performance-tweaks>`_
+and the
+`inode scan rebalance
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-iscan-rebalance>`_
+series.
+
+.. _scrubrepair:
+
+Scheduling Repairs
+------------------
+
+During phase 2, corruptions and inconsistencies reported in any AGI header or
+inode btree are repaired immediately, because phase 3 relies on proper
+functioning of the inode indices to find inodes to scan.
+Failed repairs are rescheduled to phase 4.
+Problems reported in any other space metadata are deferred to phase 4.
+Optimization opportunities are always deferred to phase 4, no matter their
+origin.
+
+During phase 3, corruptions and inconsistencies reported in any part of a
+file's metadata are repaired immediately if all space metadata were validated
+during phase 2.
+Repairs that fail or cannot be repaired immediately are scheduled for phase 4.
+
+In the original design of ``xfs_scrub``, it was thought that repairs would be
+so infrequent that the ``struct xfs_scrub_metadata`` objects used to
+communicate with the kernel could also be used as the primary object to
+schedule repairs.
+With recent increases in the number of optimizations possible for a given
+filesystem object, it became much more memory-efficient to track all eligible
+repairs for a given filesystem object with a single repair item.
+Each repair item represents a single lockable object -- AGs, metadata files,
+individual inodes, or a class of summary information.
+
+Phase 4 is responsible for scheduling a lot of repair work in as quick a
+manner as is practical.
+The :ref:`data dependencies <scrubcheck>` outlined earlier still apply, which
+means that ``xfs_scrub`` must try to complete the repair work scheduled by
+phase 2 before trying repair work scheduled by phase 3.
+The repair process is as follows:
+
+1. Start a round of repair with a workqueue and enough workers to keep the CPUs
+   as busy as the user desires.
+
+   a. For each repair item queued by phase 2,
+
+      i.   Ask the kernel to repair everything listed in the repair item for a
+           given filesystem object.
+
+      ii.  Make a note if the kernel made any progress in reducing the number
+           of repairs needed for this object.
+
+      iii. If the object no longer requires repairs, revalidate all metadata
+           associated with this object.
+           If the revalidation succeeds, drop the repair item.
+           If not, requeue the item for more repairs.
+
+   b. If any repairs were made, jump back to 1a to retry all the phase 2 items.
+
+   c. For each repair item queued by phase 3,
+
+      i.   Ask the kernel to repair everything listed in the repair item for a
+           given filesystem object.
+
+      ii.  Make a note if the kernel made any progress in reducing the number
+           of repairs needed for this object.
+
+      iii. If the object no longer requires repairs, revalidate all metadata
+           associated with this object.
+           If the revalidation succeeds, drop the repair item.
+           If not, requeue the item for more repairs.
+
+   d. If any repairs were made, jump back to 1c to retry all the phase 3 items.
+
+2. If step 1 made any repair progress of any kind, jump back to step 1 to start
+   another round of repair.
+
+3. If there are items left to repair, run them all serially one more time.
+   Complain if the repairs were not successful, since this is the last chance
+   to repair anything.
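+
+The retry logic above condenses to a short loop; the helpers named here
+(``repair_list_pass``, ``final_serial_pass``) are illustrative, not the actual
+``xfs_scrub`` functions:
+
+.. code-block:: c
+
+  #include <stdbool.h>
+
+  struct repair_list;
+
+  /* One pass over a list; returns true if any repair made progress. */
+  bool repair_list_pass(struct repair_list *list);
+  void final_serial_pass(struct repair_list *p2, struct repair_list *p3);
+
+  static void
+  phase4_repair(struct repair_list *phase2, struct repair_list *phase3)
+  {
+          bool    progress;
+
+          do {
+                  progress = false;
+
+                  /* Steps 1a-1b: retry phase 2 items until they stall. */
+                  while (repair_list_pass(phase2))
+                          progress = true;
+
+                  /* Steps 1c-1d: then work through the phase 3 items. */
+                  while (repair_list_pass(phase3))
+                          progress = true;
+
+                  /* Step 2: any progress at all earns another round. */
+          } while (progress);
+
+          /* Step 3: last chance -- run leftovers serially and complain. */
+          final_serial_pass(phase2, phase3);
+  }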
+
+Corruptions and inconsistencies encountered during phases 5 and 7 are repaired
+immediately.
+Corrupt file data blocks reported by phase 6 cannot be recovered by the
+filesystem.
+
+The proposed patchsets are the
+`repair warning improvements
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-better-repair-warnings>`_,
+refactoring of the
+`repair data dependency
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-repair-data-deps>`_
+and
+`object tracking
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-object-tracking>`_,
+and the
+`repair scheduling
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-repair-scheduling>`_
+improvement series.
+
+Checking Names for Confusable Unicode Sequences
+-----------------------------------------------
+
+If ``xfs_scrub`` succeeds in validating the filesystem metadata by the end of
+phase 4, it moves on to phase 5, which checks for suspicious looking names in
+the filesystem.
+These names consist of the filesystem label, names in directory entries, and
+the names of extended attributes.
+Like most Unix filesystems, XFS imposes the sparest of constraints on the
+contents of a name -- slashes and null bytes are not allowed in directory
+entries; and null bytes are not allowed in extended attributes and the
+filesystem label.
+Directory entries and attribute keys store the length of the name explicitly
+ondisk, which means that nulls are not name terminators.
+For this section, the term "naming domain" refers to any place where names are
+presented together -- all the names in a directory, or all the attributes of a
+file.
+
+Although the Unix naming constraints are very permissive, the reality of most
+modern-day Linux systems is that programs work with Unicode character code
+points to support international languages.
+These programs typically encode those code points in UTF-8 when interfacing
+with the C library because the kernel expects null-terminated names.
+In the common case, therefore, names found in an XFS filesystem are actually
+UTF-8 encoded Unicode data.
+
+To maximize its expressiveness, the Unicode standard defines separate code
+points for various characters that render similarly or identically in writing
+systems around the world.
+For example, the character "Cyrillic Small Letter A" U+0430 "а" often renders
+identically to "Latin Small Letter A" U+0061 "a".
+
+The standard also permits characters to be constructed in multiple ways --
+either by using a defined code point, or by combining one code point with
+various combining marks.
+For example, the character "Angstrom Sign U+212B "Å" can also be expressed
+as "Latin Capital Letter A" U+0041 "A" followed by "Combining Ring Above"
+U+030A "◌̊".
+Both sequences render identically.
+
+Like the standards that preceded it, Unicode also defines various control
+characters to alter the presentation of text.
+For example, the character "Right-to-Left Override" U+202E can trick some
+programs into rendering "moo\\xe2\\x80\\xaegnp.txt" as "mootxt.png".
+A second category of rendering problems involves whitespace characters.
+If the character "Zero Width Space" U+200B is encountered in a file name, the
+name will render identically to a name that does not have the zero width
+space.
+
+If two names within a naming domain have different byte sequences but render
+identically, a user may be confused by it.
+The kernel, in its indifference to upper level encoding schemes, permits this.
+Most filesystem drivers persist the byte sequence names that are given to them
+by the VFS.
+
+Techniques for detecting confusable names are explained in great detail in
+sections 4 and 5 of the
+`Unicode Security Mechanisms <https://unicode.org/reports/tr39/>`_
+document.
+``xfs_scrub``, when it detects UTF-8 encoding in use on a system, uses the
+Unicode normalization form NFD in conjunction with the confusable name
+detection component of
+`libicu <https://github.com/unicode-org/icu>`_
+to identify names within a directory or within a file's extended attributes
+could be confused for each other.
+Names are also checked for control characters, non-rendering characters, and
+mixing of bidirectional characters.
+All of these potential issues are reported to the system administrator during
+phase 5.
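+
+Conceptually, the check reduces each name to a normalized "skeleton" and looks
+for skeleton collisions within a naming domain.
+The ``name_skeleton`` helper below is a hypothetical wrapper around the libicu
+normalization and confusable-detection calls:
+
+.. code-block:: c
+
+  #include <stdbool.h>
+  #include <stdlib.h>
+  #include <string.h>
+
+  /* Hypothetical wrapper around libicu's spoof checker. */
+  char *name_skeleton(const char *utf8_name);
+
+  static bool
+  names_confusable(const char *name1, const char *name2)
+  {
+          char    *skel1 = name_skeleton(name1);
+          char    *skel2 = name_skeleton(name2);
+          bool    confusable;
+
+          /* Different byte sequences, identical rendering. */
+          confusable = strcmp(name1, name2) != 0 &&
+                       strcmp(skel1, skel2) == 0;
+
+          free(skel1);
+          free(skel2);
+          return confusable;
+  }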
+
+Media Verification of File Data Extents
+---------------------------------------
+
+The system administrator can elect to initiate a media scan of all file data
+blocks.
+This scan runs as phase 6, after validation of all filesystem metadata except
+for the summary counters.
+The scan starts by calling ``FS_IOC_GETFSMAP`` to scan the filesystem space map
+to find areas that are allocated to file data fork extents.
+Gaps between data fork extents that are smaller than 64k are treated as if
+they were data fork extents to reduce the command setup overhead.
+When the space map scan accumulates a region larger than 32MB, a media
+verification request is sent to the disk as a directio read of the raw block
+device.
+
+If the verification read fails, ``xfs_scrub`` retries with single-block reads
+to narrow the failure down to the specific region of the media and records it.
+When it has finished issuing verification requests, it again uses the space
+mapping ioctl to map the recorded media errors back to metadata structures
+and report what has been lost.
+For media errors in blocks owned by files, the lack of parent pointers means
+that the entire filesystem must be walked to report the file paths and offsets
+corresponding to the media error.
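+
+A simplified sketch of the scan loop follows; ``next_fsmap_rec`` and
+``issue_verify`` are hypothetical wrappers around ``FS_IOC_GETFSMAP`` and the
+directio reads, and the thresholds come from the text above:
+
+.. code-block:: c
+
+  #include <stdbool.h>
+  #include <stdint.h>
+
+  #define GAP_MERGE_THRESHOLD     (64ULL << 10)   /* 64k */
+  #define VERIFY_BATCH_SIZE       (32ULL << 20)   /* 32MB */
+
+  struct extent { uint64_t start; uint64_t len; };
+
+  bool next_fsmap_rec(struct extent *out);        /* file data extents */
+  void issue_verify(uint64_t start, uint64_t len);        /* directio read */
+
+  static void
+  scan_media(void)
+  {
+          struct extent   rec;
+          uint64_t        start = 0, len = 0;
+
+          while (next_fsmap_rec(&rec)) {
+                  if (len && rec.start - (start + len) <= GAP_MERGE_THRESHOLD) {
+                          /* Small gap: absorb it into the current region. */
+                          len = rec.start + rec.len - start;
+                  } else {
+                          if (len)
+                                  issue_verify(start, len);
+                          start = rec.start;
+                          len = rec.len;
+                  }
+                  if (len >= VERIFY_BATCH_SIZE) {
+                          issue_verify(start, len);
+                          start += len;
+                          len = 0;
+                  }
+          }
+          if (len)
+                  issue_verify(start, len);
+  }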


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 14/14] xfs: document future directions of online fsck
  2022-12-30 22:10 ` [PATCHSET v24.0 00/14] xfs: design documentation for online fsck Darrick J. Wong
                     ` (9 preceding siblings ...)
  2022-12-30 22:10   ` [PATCH 13/14] xfs: document the userspace fsck driver program Darrick J. Wong
@ 2022-12-30 22:10   ` Darrick J. Wong
  2023-03-01  5:37     ` Allison Henderson
  2022-12-30 22:10   ` [PATCH 11/14] xfs: document metadata file repair Darrick J. Wong
                     ` (4 subsequent siblings)
  15 siblings, 1 reply; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:10 UTC (permalink / raw)
  To: djwong
  Cc: linux-xfs, willy, chandan.babu, allison.henderson, linux-fsdevel,
	hch, catherine.hoang, david

From: Darrick J. Wong <djwong@kernel.org>

Add the seventh and final chapter of the online fsck documentation,
where we talk about future functionality that can tie in with the
functionality provided by the online fsck patchset.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 .../filesystems/xfs-online-fsck-design.rst         |  155 ++++++++++++++++++++
 1 file changed, 155 insertions(+)


diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst
index 05b9411fac7f..41291edb02b9 100644
--- a/Documentation/filesystems/xfs-online-fsck-design.rst
+++ b/Documentation/filesystems/xfs-online-fsck-design.rst
@@ -4067,6 +4067,8 @@ The extra flexibility enables several new use cases:
   (``FIEXCHANGE_RANGE``) to exchange the file contents, thereby committing all
   of the updates to the original file, or none of them.
 
+.. _swapext_if_unchanged:
+
 - **Transactional file updates**: The same mechanism as above, but the caller
   only wants the commit to occur if the original file's contents have not
   changed.
@@ -4818,3 +4820,156 @@ and report what has been lost.
 For media errors in blocks owned by files, the lack of parent pointers means
 that the entire filesystem must be walked to report the file paths and offsets
 corresponding to the media error.
+
+7. Conclusion and Future Work
+=============================
+
+It is hoped that the reader has followed the designs laid out in this document
+and now has some familiarity with how XFS performs online
+rebuilding of its metadata indices, and how filesystem users can interact with
+that functionality.
+Although the scope of this work is daunting, it is hoped that this guide will
+make it easier for code readers to understand what has been built, for whom it
+has been built, and why.
+Please feel free to contact the XFS mailing list with questions.
+
+FIEXCHANGE_RANGE
+----------------
+
+As discussed earlier, a second frontend to the atomic extent swap mechanism is
+a new ioctl call that userspace programs can use to commit updates to files
+atomically.
+This frontend has been out for review for several years now, though the
+necessary refinements to online repair and lack of customer demand mean that
+the proposal has not been pushed very hard.
+
+Vectorized Scrub
+----------------
+
+As it turns out, the :ref:`refactoring <scrubrepair>` of repair items mentioned
+earlier was a catalyst for enabling a vectorized scrub system call.
+Since 2018, the cost of making a kernel call has increased considerably on some
+systems to mitigate the effects of speculative execution attacks.
+This incentivizes program authors to make as few system calls as possible to
+reduce the number of times an execution path crosses a security boundary.
+
+With vectorized scrub, userspace pushes to the kernel the identity of a
+filesystem object, a list of scrub types to run against that object, and a
+simple representation of the data dependencies between the selected scrub
+types.
+The kernel executes as much of the caller's plan as it can until it hits a
+dependency that cannot be satisfied due to a corruption, and tells userspace
+how much was accomplished.
+It is hoped that ``io_uring`` will pick up enough of this functionality that
+online fsck can use that instead of adding a separate vectored scrub system
+call to XFS.
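+
+As a purely hypothetical illustration -- the interface is still under review
+and may end up looking quite different -- a vectored request might carry
+per-subcommand records like these:
+
+.. code-block:: c
+
+  #include <linux/types.h>
+
+  /* Hypothetical illustration only; see the patchsets linked below. */
+  struct scrub_vec {
+          __u32   sv_type;        /* XFS_SCRUB_TYPE_* of this subcommand */
+          __u32   sv_flags;       /* XFS_SCRUB_IFLAG_* for this subcommand */
+          __s32   sv_ret;         /* out: result of this subcommand */
+          __u32   sv_barrier;     /* stop if prior results set these flags */
+  };
+
+  struct scrub_vec_head {
+          __u64   svh_ino;        /* inode number, if applicable */
+          __u32   svh_gen;        /* inode generation, if applicable */
+          __u32   svh_agno;       /* AG number, if applicable */
+          __u32   svh_nr;         /* number of elements in svh_vecs */
+          struct scrub_vec        svh_vecs[];     /* the caller's plan */
+  };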
+
+The relevant patchsets are the
+`kernel vectorized scrub
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=vectorized-scrub>`_
+and
+`userspace vectorized scrub
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=vectorized-scrub>`_
+series.
+
+Quality of Service Targets for Scrub
+------------------------------------
+
+One serious shortcoming of the online fsck code is that the amount of time that
+it can spend in the kernel holding resource locks is basically unbounded.
+Userspace is allowed to send a fatal signal to the process which will cause
+``xfs_scrub`` to exit when it reaches a good stopping point, but there's no way
+for userspace to provide a time budget to the kernel.
+Given that the scrub codebase has helpers to detect fatal signals, it shouldn't
+be too much work to allow userspace to specify a timeout for a scrub/repair
+operation and abort the operation if it exceeds budget.
+However, most repair functions have the property that once they begin to touch
+ondisk metadata, the operation cannot be cancelled cleanly, after which a QoS
+timeout is no longer useful.
+
+Defragmenting Free Space
+------------------------
+
+Over the years, many XFS users have requested the creation of a program to
+clear a portion of the physical storage underlying a filesystem so that it
+becomes a contiguous chunk of free space.
+Call this free space defragmenter ``clearspace`` for short.
+
+The first piece the ``clearspace`` program needs is the ability to read the
+reverse mapping index from userspace.
+This already exists in the form of the ``FS_IOC_GETFSMAP`` ioctl.
+The second piece it needs is a new fallocate mode
+(``FALLOC_FL_MAP_FREE_SPACE``) that allocates the free space in a region and
+maps it to a file.
+Call this file the "space collector" file.
+The third piece is the ability to force an online repair.
+
+To clear all the metadata out of a portion of physical storage, clearspace
+uses the new fallocate map-freespace call to map any free space in that region
+to the space collector file.
+Next, clearspace finds all metadata blocks in that region by way of
+``GETFSMAP`` and issues forced repair requests on the data structure.
+This often results in the metadata being rebuilt somewhere that is not being
+cleared.
+After each relocation, clearspace calls the "map free space" function again to
+collect any newly freed space in the region being cleared.
+
+To clear all the file data out of a portion of the physical storage, clearspace
+uses the FSMAP information to find relevant file data blocks.
+Having identified a good target, it uses the ``FICLONERANGE`` call on that part
+of the file to try to share the physical space with a dummy file.
+Cloning the extent means that the original owners cannot overwrite the
+contents; any changes will be written somewhere else via copy-on-write.
+Clearspace makes its own copy of the frozen extent in an area that is not being
+cleared, and uses ``FIDEDUPERANGE`` (or the :ref:`atomic extent swap
+<swapext_if_unchanged>` feature) to change the target file's data extent
+mapping away from the area being cleared.
+When all other mappings have been moved, clearspace reflinks the space into the
+space collector file so that it becomes unavailable.
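+
+A high-level sketch of the clearing loop follows; ``map_free_space`` wraps the
+proposed fallocate mode described above, and the other helpers are
+hypothetical:
+
+.. code-block:: c
+
+  #include <stdbool.h>
+  #include <stdint.h>
+
+  struct region { uint64_t start; uint64_t len; };
+
+  /* Hypothetical wrappers around the interfaces described above. */
+  int  map_free_space(int collector_fd, const struct region *r);
+  bool region_has_metadata(const struct region *r);       /* via GETFSMAP */
+  void force_repair_metadata(const struct region *r);     /* rebuild elsewhere */
+  void relocate_file_data(const struct region *r, int dummy_fd);
+
+  static void
+  clear_region(int collector_fd, int dummy_fd, const struct region *r)
+  {
+          /* Grab whatever is already free so nothing new lands here. */
+          map_free_space(collector_fd, r);
+
+          /* Evict metadata by forcing repairs, which rebuild elsewhere. */
+          while (region_has_metadata(r)) {
+                  force_repair_metadata(r);
+                  map_free_space(collector_fd, r);
+          }
+
+          /* Freeze and move file data, then absorb the freed blocks. */
+          relocate_file_data(r, dummy_fd);
+          map_free_space(collector_fd, r);
+  }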
+
+There are further optimizations that could apply to the above algorithm.
+To clear a piece of physical storage that has a high sharing factor, it is
+strongly desirable to retain this sharing factor.
+In fact, these extents should be moved first to maximize sharing factor after
+the operation completes.
+To make this work smoothly, clearspace needs a new ioctl
+(``FS_IOC_GETREFCOUNTS``) to report reference count information to userspace.
+With the refcount information exposed, clearspace can quickly find the longest,
+most shared data extents in the filesystem, and target them first.
+
+**Question**: How might the filesystem move inode chunks?
+
+*Answer*: Dave Chinner has a prototype that creates a new file with the old
+contents and then locklessly runs around the filesystem updating directory
+entries.
+The operation cannot complete if the filesystem goes down.
+That problem isn't totally insurmountable: create an inode remapping table
+hidden behind a jump label, and a log item that tracks the kernel walking the
+filesystem to update directory entries.
+The trouble is, the kernel can't do anything about open files, since it cannot
+revoke them.
+
+**Question**: Can static keys be used to add a revoke bailout return to
+*every* code path coming in from userspace?
+
+*Answer*: In principle, yes.
+This would eliminate the overhead of the check until a revocation happens.
+It's not clear what we do to a revoked file after all the callers are finished
+with it, however.
+
+The relevant patchsets are the
+`kernel freespace defrag
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=defrag-freespace>`_
+and
+`userspace freespace defrag
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=defrag-freespace>`_
+series.
+
+Shrinking Filesystems
+---------------------
+
+Removing the end of the filesystem ought to be a simple matter of evacuating
+the data and metadata at the end of the filesystem, and handing the freed space
+to the shrink code.
+That requires an evacuation of the space at the end of the filesystem, which
+is a use of free space defragmentation!


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCHSET v24.0 0/8] xfs: variable and naming cleanups for intent items
  2022-12-30 21:13 [NYE DELUGE 1/4] xfs: all pending online scrub improvements Darrick J. Wong
  2022-12-30 22:10 ` [PATCHSET v24.0 00/14] xfs: design documentation for online fsck Darrick J. Wong
@ 2022-12-30 22:10 ` Darrick J. Wong
  2022-12-30 22:10   ` [PATCH 1/8] xfs: pass the xfs_bmbt_irec directly through the log intent code Darrick J. Wong
                     ` (7 more replies)
  2022-12-30 22:11 ` [PATCHSET v24.0 0/5] xfs: make intent items take a perag reference Darrick J. Wong
                   ` (20 subsequent siblings)
  22 siblings, 8 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:10 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

Hi all,

Before we start changing how the deferred intent items work, let us
first cut down on the repeated boxing and unboxing of intent item
parameters, and tidy up the variable naming to be consistent between
types.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=intents-naming-cleanups

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=intents-naming-cleanups
---
 fs/xfs/libxfs/xfs_alloc.c    |   32 +++++----
 fs/xfs/libxfs/xfs_bmap.c     |   32 ++++-----
 fs/xfs/libxfs/xfs_bmap.h     |    5 -
 fs/xfs/libxfs/xfs_refcount.c |   96 +++++++++++++---------------
 fs/xfs/libxfs/xfs_refcount.h |    4 -
 fs/xfs/libxfs/xfs_rmap.c     |   52 +++++++--------
 fs/xfs/libxfs/xfs_rmap.h     |    6 +-
 fs/xfs/xfs_bmap_item.c       |  137 +++++++++++++++++------------------------
 fs/xfs/xfs_extfree_item.c    |   99 +++++++++++++++--------------
 fs/xfs/xfs_refcount_item.c   |  110 +++++++++++++++------------------
 fs/xfs/xfs_rmap_item.c       |  142 ++++++++++++++++++++----------------------
 fs/xfs/xfs_trace.h           |   15 +---
 12 files changed, 335 insertions(+), 395 deletions(-)


^ permalink raw reply	[flat|nested] 220+ messages in thread

* [PATCH 1/8] xfs: pass the xfs_bmbt_irec directly through the log intent code
  2022-12-30 22:10 ` [PATCHSET v24.0 0/8] xfs: variable and naming cleanups for intent items Darrick J. Wong
@ 2022-12-30 22:10   ` Darrick J. Wong
  2022-12-30 22:10   ` [PATCH 2/8] xfs: fix confusing variable names in xfs_bmap_item.c Darrick J. Wong
                     ` (6 subsequent siblings)
  7 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:10 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Instead of repeatedly boxing and unboxing the incore extent mapping
structure as it passes through the BUI code, pass the pointer directly
through.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_bmap.c |   32 +++++++++---------
 fs/xfs/libxfs/xfs_bmap.h |    5 +--
 fs/xfs/xfs_bmap_item.c   |   81 +++++++++++++++++-----------------------------
 3 files changed, 46 insertions(+), 72 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 0d56a8d862e8..c8c65387136c 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -6146,39 +6146,37 @@ xfs_bmap_unmap_extent(
 int
 xfs_bmap_finish_one(
 	struct xfs_trans		*tp,
-	struct xfs_inode		*ip,
-	enum xfs_bmap_intent_type	type,
-	int				whichfork,
-	xfs_fileoff_t			startoff,
-	xfs_fsblock_t			startblock,
-	xfs_filblks_t			*blockcount,
-	xfs_exntst_t			state)
+	struct xfs_bmap_intent		*bi)
 {
+	struct xfs_bmbt_irec		*bmap = &bi->bi_bmap;
 	int				error = 0;
 
 	ASSERT(tp->t_firstblock == NULLFSBLOCK);
 
 	trace_xfs_bmap_deferred(tp->t_mountp,
-			XFS_FSB_TO_AGNO(tp->t_mountp, startblock), type,
-			XFS_FSB_TO_AGBNO(tp->t_mountp, startblock),
-			ip->i_ino, whichfork, startoff, *blockcount, state);
+			XFS_FSB_TO_AGNO(tp->t_mountp, bmap->br_startblock),
+			bi->bi_type,
+			XFS_FSB_TO_AGBNO(tp->t_mountp, bmap->br_startblock),
+			bi->bi_owner->i_ino, bi->bi_whichfork,
+			bmap->br_startoff, bmap->br_blockcount,
+			bmap->br_state);
 
-	if (WARN_ON_ONCE(whichfork != XFS_DATA_FORK))
+	if (WARN_ON_ONCE(bi->bi_whichfork != XFS_DATA_FORK))
 		return -EFSCORRUPTED;
 
 	if (XFS_TEST_ERROR(false, tp->t_mountp,
 			XFS_ERRTAG_BMAP_FINISH_ONE))
 		return -EIO;
 
-	switch (type) {
+	switch (bi->bi_type) {
 	case XFS_BMAP_MAP:
-		error = xfs_bmapi_remap(tp, ip, startoff, *blockcount,
-				startblock, 0);
-		*blockcount = 0;
+		error = xfs_bmapi_remap(tp, bi->bi_owner, bmap->br_startoff,
+				bmap->br_blockcount, bmap->br_startblock, 0);
+		bmap->br_blockcount = 0;
 		break;
 	case XFS_BMAP_UNMAP:
-		error = __xfs_bunmapi(tp, ip, startoff, blockcount,
-				XFS_BMAPI_REMAP, 1);
+		error = __xfs_bunmapi(tp, bi->bi_owner, bmap->br_startoff,
+				&bmap->br_blockcount, XFS_BMAPI_REMAP, 1);
 		break;
 	default:
 		ASSERT(0);
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index 16db95b11589..01c2df35c3e3 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -234,10 +234,7 @@ struct xfs_bmap_intent {
 	struct xfs_bmbt_irec			bi_bmap;
 };
 
-int	xfs_bmap_finish_one(struct xfs_trans *tp, struct xfs_inode *ip,
-		enum xfs_bmap_intent_type type, int whichfork,
-		xfs_fileoff_t startoff, xfs_fsblock_t startblock,
-		xfs_filblks_t *blockcount, xfs_exntst_t state);
+int	xfs_bmap_finish_one(struct xfs_trans *tp, struct xfs_bmap_intent *bi);
 void	xfs_bmap_map_extent(struct xfs_trans *tp, struct xfs_inode *ip,
 		struct xfs_bmbt_irec *imap);
 void	xfs_bmap_unmap_extent(struct xfs_trans *tp, struct xfs_inode *ip,
diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
index 41323da523d1..13aa5359c02f 100644
--- a/fs/xfs/xfs_bmap_item.c
+++ b/fs/xfs/xfs_bmap_item.c
@@ -246,18 +246,11 @@ static int
 xfs_trans_log_finish_bmap_update(
 	struct xfs_trans		*tp,
 	struct xfs_bud_log_item		*budp,
-	enum xfs_bmap_intent_type	type,
-	struct xfs_inode		*ip,
-	int				whichfork,
-	xfs_fileoff_t			startoff,
-	xfs_fsblock_t			startblock,
-	xfs_filblks_t			*blockcount,
-	xfs_exntst_t			state)
+	struct xfs_bmap_intent		*bi)
 {
 	int				error;
 
-	error = xfs_bmap_finish_one(tp, ip, type, whichfork, startoff,
-			startblock, blockcount, state);
+	error = xfs_bmap_finish_one(tp, bi);
 
 	/*
 	 * Mark the transaction dirty, even on error. This ensures the
@@ -378,25 +371,17 @@ xfs_bmap_update_finish_item(
 	struct list_head		*item,
 	struct xfs_btree_cur		**state)
 {
-	struct xfs_bmap_intent		*bmap;
-	xfs_filblks_t			count;
+	struct xfs_bmap_intent		*bi;
 	int				error;
 
-	bmap = container_of(item, struct xfs_bmap_intent, bi_list);
-	count = bmap->bi_bmap.br_blockcount;
-	error = xfs_trans_log_finish_bmap_update(tp, BUD_ITEM(done),
-			bmap->bi_type,
-			bmap->bi_owner, bmap->bi_whichfork,
-			bmap->bi_bmap.br_startoff,
-			bmap->bi_bmap.br_startblock,
-			&count,
-			bmap->bi_bmap.br_state);
-	if (!error && count > 0) {
-		ASSERT(bmap->bi_type == XFS_BMAP_UNMAP);
-		bmap->bi_bmap.br_blockcount = count;
+	bi = container_of(item, struct xfs_bmap_intent, bi_list);
+
+	error = xfs_trans_log_finish_bmap_update(tp, BUD_ITEM(done), bi);
+	if (!error && bi->bi_bmap.br_blockcount > 0) {
+		ASSERT(bi->bi_type == XFS_BMAP_UNMAP);
 		return -EAGAIN;
 	}
-	kmem_cache_free(xfs_bmap_intent_cache, bmap);
+	kmem_cache_free(xfs_bmap_intent_cache, bi);
 	return error;
 }
 
@@ -471,17 +456,13 @@ xfs_bui_item_recover(
 	struct xfs_log_item		*lip,
 	struct list_head		*capture_list)
 {
-	struct xfs_bmbt_irec		irec;
+	struct xfs_bmap_intent		fake = { };
 	struct xfs_bui_log_item		*buip = BUI_ITEM(lip);
 	struct xfs_trans		*tp;
 	struct xfs_inode		*ip = NULL;
 	struct xfs_mount		*mp = lip->li_log->l_mp;
-	struct xfs_map_extent		*bmap;
+	struct xfs_map_extent		*map;
 	struct xfs_bud_log_item		*budp;
-	xfs_filblks_t			count;
-	xfs_exntst_t			state;
-	unsigned int			bui_type;
-	int				whichfork;
 	int				iext_delta;
 	int				error = 0;
 
@@ -491,14 +472,12 @@ xfs_bui_item_recover(
 		return -EFSCORRUPTED;
 	}
 
-	bmap = &buip->bui_format.bui_extents[0];
-	state = (bmap->me_flags & XFS_BMAP_EXTENT_UNWRITTEN) ?
-			XFS_EXT_UNWRITTEN : XFS_EXT_NORM;
-	whichfork = (bmap->me_flags & XFS_BMAP_EXTENT_ATTR_FORK) ?
+	map = &buip->bui_format.bui_extents[0];
+	fake.bi_whichfork = (map->me_flags & XFS_BMAP_EXTENT_ATTR_FORK) ?
 			XFS_ATTR_FORK : XFS_DATA_FORK;
-	bui_type = bmap->me_flags & XFS_BMAP_EXTENT_TYPE_MASK;
+	fake.bi_type = map->me_flags & XFS_BMAP_EXTENT_TYPE_MASK;
 
-	error = xlog_recover_iget(mp, bmap->me_owner, &ip);
+	error = xlog_recover_iget(mp, map->me_owner, &ip);
 	if (error)
 		return error;
 
@@ -512,34 +491,34 @@ xfs_bui_item_recover(
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
 	xfs_trans_ijoin(tp, ip, 0);
 
-	if (bui_type == XFS_BMAP_MAP)
+	if (fake.bi_type == XFS_BMAP_MAP)
 		iext_delta = XFS_IEXT_ADD_NOSPLIT_CNT;
 	else
 		iext_delta = XFS_IEXT_PUNCH_HOLE_CNT;
 
-	error = xfs_iext_count_may_overflow(ip, whichfork, iext_delta);
+	error = xfs_iext_count_may_overflow(ip, fake.bi_whichfork, iext_delta);
 	if (error == -EFBIG)
 		error = xfs_iext_count_upgrade(tp, ip, iext_delta);
 	if (error)
 		goto err_cancel;
 
-	count = bmap->me_len;
-	error = xfs_trans_log_finish_bmap_update(tp, budp, bui_type, ip,
-			whichfork, bmap->me_startoff, bmap->me_startblock,
-			&count, state);
+	fake.bi_owner = ip;
+	fake.bi_bmap.br_startblock = map->me_startblock;
+	fake.bi_bmap.br_startoff = map->me_startoff;
+	fake.bi_bmap.br_blockcount = map->me_len;
+	fake.bi_bmap.br_state = (map->me_flags & XFS_BMAP_EXTENT_UNWRITTEN) ?
+			XFS_EXT_UNWRITTEN : XFS_EXT_NORM;
+
+	error = xfs_trans_log_finish_bmap_update(tp, budp, &fake);
 	if (error == -EFSCORRUPTED)
-		XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bmap,
-				sizeof(*bmap));
+		XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, map,
+				sizeof(*map));
 	if (error)
 		goto err_cancel;
 
-	if (count > 0) {
-		ASSERT(bui_type == XFS_BMAP_UNMAP);
-		irec.br_startblock = bmap->me_startblock;
-		irec.br_blockcount = count;
-		irec.br_startoff = bmap->me_startoff;
-		irec.br_state = state;
-		xfs_bmap_unmap_extent(tp, ip, &irec);
+	if (fake.bi_bmap.br_blockcount > 0) {
+		ASSERT(fake.bi_type == XFS_BMAP_UNMAP);
+		xfs_bmap_unmap_extent(tp, ip, &fake.bi_bmap);
 	}
 
 	/*


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 2/8] xfs: fix confusing variable names in xfs_bmap_item.c
  2022-12-30 22:10 ` [PATCHSET v24.0 0/8] xfs: variable and naming cleanups for intent items Darrick J. Wong
  2022-12-30 22:10   ` [PATCH 1/8] xfs: pass the xfs_bmbt_irec directly through the log intent code Darrick J. Wong
@ 2022-12-30 22:10   ` Darrick J. Wong
  2022-12-30 22:10   ` [PATCH 8/8] xfs: fix confusing variable names in xfs_refcount_item.c Darrick J. Wong
                     ` (5 subsequent siblings)
  7 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:10 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Variable names in this code module are inconsistent and confusing.
xfs_map_extent structures describe file mappings, so rename those variables
"map".  xfs_bmap_intent structures describe block mapping intents, so rename
those variables "bi".

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_bmap_item.c |   56 ++++++++++++++++++++++++------------------------
 1 file changed, 28 insertions(+), 28 deletions(-)


diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
index 13aa5359c02f..6e2f0013380a 100644
--- a/fs/xfs/xfs_bmap_item.c
+++ b/fs/xfs/xfs_bmap_item.c
@@ -283,24 +283,24 @@ xfs_bmap_update_diff_items(
 /* Set the map extent flags for this mapping. */
 static void
 xfs_trans_set_bmap_flags(
-	struct xfs_map_extent		*bmap,
+	struct xfs_map_extent		*map,
 	enum xfs_bmap_intent_type	type,
 	int				whichfork,
 	xfs_exntst_t			state)
 {
-	bmap->me_flags = 0;
+	map->me_flags = 0;
 	switch (type) {
 	case XFS_BMAP_MAP:
 	case XFS_BMAP_UNMAP:
-		bmap->me_flags = type;
+		map->me_flags = type;
 		break;
 	default:
 		ASSERT(0);
 	}
 	if (state == XFS_EXT_UNWRITTEN)
-		bmap->me_flags |= XFS_BMAP_EXTENT_UNWRITTEN;
+		map->me_flags |= XFS_BMAP_EXTENT_UNWRITTEN;
 	if (whichfork == XFS_ATTR_FORK)
-		bmap->me_flags |= XFS_BMAP_EXTENT_ATTR_FORK;
+		map->me_flags |= XFS_BMAP_EXTENT_ATTR_FORK;
 }
 
 /* Log bmap updates in the intent item. */
@@ -308,7 +308,7 @@ STATIC void
 xfs_bmap_update_log_item(
 	struct xfs_trans		*tp,
 	struct xfs_bui_log_item		*buip,
-	struct xfs_bmap_intent		*bmap)
+	struct xfs_bmap_intent		*bi)
 {
 	uint				next_extent;
 	struct xfs_map_extent		*map;
@@ -324,12 +324,12 @@ xfs_bmap_update_log_item(
 	next_extent = atomic_inc_return(&buip->bui_next_extent) - 1;
 	ASSERT(next_extent < buip->bui_format.bui_nextents);
 	map = &buip->bui_format.bui_extents[next_extent];
-	map->me_owner = bmap->bi_owner->i_ino;
-	map->me_startblock = bmap->bi_bmap.br_startblock;
-	map->me_startoff = bmap->bi_bmap.br_startoff;
-	map->me_len = bmap->bi_bmap.br_blockcount;
-	xfs_trans_set_bmap_flags(map, bmap->bi_type, bmap->bi_whichfork,
-			bmap->bi_bmap.br_state);
+	map->me_owner = bi->bi_owner->i_ino;
+	map->me_startblock = bi->bi_bmap.br_startblock;
+	map->me_startoff = bi->bi_bmap.br_startoff;
+	map->me_len = bi->bi_bmap.br_blockcount;
+	xfs_trans_set_bmap_flags(map, bi->bi_type, bi->bi_whichfork,
+			bi->bi_bmap.br_state);
 }
 
 static struct xfs_log_item *
@@ -341,15 +341,15 @@ xfs_bmap_update_create_intent(
 {
 	struct xfs_mount		*mp = tp->t_mountp;
 	struct xfs_bui_log_item		*buip = xfs_bui_init(mp);
-	struct xfs_bmap_intent		*bmap;
+	struct xfs_bmap_intent		*bi;
 
 	ASSERT(count == XFS_BUI_MAX_FAST_EXTENTS);
 
 	xfs_trans_add_item(tp, &buip->bui_item);
 	if (sort)
 		list_sort(mp, items, xfs_bmap_update_diff_items);
-	list_for_each_entry(bmap, items, bi_list)
-		xfs_bmap_update_log_item(tp, buip, bmap);
+	list_for_each_entry(bi, items, bi_list)
+		xfs_bmap_update_log_item(tp, buip, bi);
 	return &buip->bui_item;
 }
 
@@ -398,10 +398,10 @@ STATIC void
 xfs_bmap_update_cancel_item(
 	struct list_head		*item)
 {
-	struct xfs_bmap_intent		*bmap;
+	struct xfs_bmap_intent		*bi;
 
-	bmap = container_of(item, struct xfs_bmap_intent, bi_list);
-	kmem_cache_free(xfs_bmap_intent_cache, bmap);
+	bi = container_of(item, struct xfs_bmap_intent, bi_list);
+	kmem_cache_free(xfs_bmap_intent_cache, bi);
 }
 
 const struct xfs_defer_op_type xfs_bmap_update_defer_type = {
@@ -419,18 +419,18 @@ xfs_bui_validate(
 	struct xfs_mount		*mp,
 	struct xfs_bui_log_item		*buip)
 {
-	struct xfs_map_extent		*bmap;
+	struct xfs_map_extent		*map;
 
 	/* Only one mapping operation per BUI... */
 	if (buip->bui_format.bui_nextents != XFS_BUI_MAX_FAST_EXTENTS)
 		return false;
 
-	bmap = &buip->bui_format.bui_extents[0];
+	map = &buip->bui_format.bui_extents[0];
 
-	if (bmap->me_flags & ~XFS_BMAP_EXTENT_FLAGS)
+	if (map->me_flags & ~XFS_BMAP_EXTENT_FLAGS)
 		return false;
 
-	switch (bmap->me_flags & XFS_BMAP_EXTENT_TYPE_MASK) {
+	switch (map->me_flags & XFS_BMAP_EXTENT_TYPE_MASK) {
 	case XFS_BMAP_MAP:
 	case XFS_BMAP_UNMAP:
 		break;
@@ -438,13 +438,13 @@ xfs_bui_validate(
 		return false;
 	}
 
-	if (!xfs_verify_ino(mp, bmap->me_owner))
+	if (!xfs_verify_ino(mp, map->me_owner))
 		return false;
 
-	if (!xfs_verify_fileext(mp, bmap->me_startoff, bmap->me_len))
+	if (!xfs_verify_fileext(mp, map->me_startoff, map->me_len))
 		return false;
 
-	return xfs_verify_fsbext(mp, bmap->me_startblock, bmap->me_len);
+	return xfs_verify_fsbext(mp, map->me_startblock, map->me_len);
 }
 
 /*
@@ -558,18 +558,18 @@ xfs_bui_item_relog(
 {
 	struct xfs_bud_log_item		*budp;
 	struct xfs_bui_log_item		*buip;
-	struct xfs_map_extent		*extp;
+	struct xfs_map_extent		*map;
 	unsigned int			count;
 
 	count = BUI_ITEM(intent)->bui_format.bui_nextents;
-	extp = BUI_ITEM(intent)->bui_format.bui_extents;
+	map = BUI_ITEM(intent)->bui_format.bui_extents;
 
 	tp->t_flags |= XFS_TRANS_DIRTY;
 	budp = xfs_trans_get_bud(tp, BUI_ITEM(intent));
 	set_bit(XFS_LI_DIRTY, &budp->bud_item.li_flags);
 
 	buip = xfs_bui_init(tp->t_mountp);
-	memcpy(buip->bui_format.bui_extents, extp, count * sizeof(*extp));
+	memcpy(buip->bui_format.bui_extents, map, count * sizeof(*map));
 	atomic_set(&buip->bui_next_extent, count);
 	xfs_trans_add_item(tp, &buip->bui_item);
 	set_bit(XFS_LI_DIRTY, &buip->bui_item.li_flags);


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 3/8] xfs: pass xfs_extent_free_item directly through the log intent code
  2022-12-30 22:10 ` [PATCHSET v24.0 0/8] xfs: variable and naming cleanups for intent items Darrick J. Wong
                     ` (2 preceding siblings ...)
  2022-12-30 22:10   ` [PATCH 8/8] xfs: fix confusing variable names in xfs_refcount_item.c Darrick J. Wong
@ 2022-12-30 22:10   ` Darrick J. Wong
  2022-12-30 22:10   ` [PATCH 5/8] xfs: pass rmap space mapping " Darrick J. Wong
                     ` (3 subsequent siblings)
  7 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:10 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Pass the incore xfs_extent_free_item through the EFI logging code
instead of repeatedly boxing and unboxing parameters.
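
Distilled from the diff below, the shape of the change is: the finish
handler used to unpack the incore item into half a dozen scalar
arguments, and now hands the whole structure down.  This is only a
sketch with simplified stand-in types (the _old/_new suffixes are
invented here to show both signatures side by side), not the kernel
declarations themselves:

    /* sketch only; simplified stand-ins for the real XFS types */
    #include <stdbool.h>
    #include <stdint.h>

    struct xfs_trans;
    struct xfs_efd_log_item;
    struct xfs_owner_info;

    struct xfs_extent_free_item {
            uint64_t        xefi_startblock;
            uint32_t        xefi_blockcount;
            uint64_t        xefi_owner;
            unsigned int    xefi_flags;
    };

    /* before: every field crosses the call boundary on its own */
    int xfs_trans_free_extent_old(struct xfs_trans *tp,
                    struct xfs_efd_log_item *efdp, uint64_t start_block,
                    uint32_t ext_len, const struct xfs_owner_info *oinfo,
                    bool skip_discard);

    /* after: the deferred work item itself is handed down */
    int xfs_trans_free_extent_new(struct xfs_trans *tp,
                    struct xfs_efd_log_item *efdp,
                    struct xfs_extent_free_item *free);

The owner info and the skip-discard flag are then rebuilt inside the
callee from xefi_owner and xefi_flags, so the EFD logging and the error
paths see exactly the data that the defer machinery queued.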

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_extfree_item.c |   55 +++++++++++++++++++++++++--------------------
 1 file changed, 30 insertions(+), 25 deletions(-)


diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
index d5130d1fcfae..618d2f9ff535 100644
--- a/fs/xfs/xfs_extfree_item.c
+++ b/fs/xfs/xfs_extfree_item.c
@@ -345,23 +345,30 @@ static int
 xfs_trans_free_extent(
 	struct xfs_trans		*tp,
 	struct xfs_efd_log_item		*efdp,
-	xfs_fsblock_t			start_block,
-	xfs_extlen_t			ext_len,
-	const struct xfs_owner_info	*oinfo,
-	bool				skip_discard)
+	struct xfs_extent_free_item	*free)
 {
+	struct xfs_owner_info		oinfo = { };
 	struct xfs_mount		*mp = tp->t_mountp;
 	struct xfs_extent		*extp;
 	uint				next_extent;
-	xfs_agnumber_t			agno = XFS_FSB_TO_AGNO(mp, start_block);
+	xfs_agnumber_t			agno = XFS_FSB_TO_AGNO(mp,
+							free->xefi_startblock);
 	xfs_agblock_t			agbno = XFS_FSB_TO_AGBNO(mp,
-								start_block);
+							free->xefi_startblock);
 	int				error;
 
-	trace_xfs_bmap_free_deferred(tp->t_mountp, agno, 0, agbno, ext_len);
+	oinfo.oi_owner = free->xefi_owner;
+	if (free->xefi_flags & XFS_EFI_ATTR_FORK)
+		oinfo.oi_flags |= XFS_OWNER_INFO_ATTR_FORK;
+	if (free->xefi_flags & XFS_EFI_BMBT_BLOCK)
+		oinfo.oi_flags |= XFS_OWNER_INFO_BMBT_BLOCK;
 
-	error = __xfs_free_extent(tp, start_block, ext_len,
-				  oinfo, XFS_AG_RESV_NONE, skip_discard);
+	trace_xfs_bmap_free_deferred(tp->t_mountp, agno, 0, agbno,
+			free->xefi_blockcount);
+
+	error = __xfs_free_extent(tp, free->xefi_startblock,
+			free->xefi_blockcount, &oinfo, XFS_AG_RESV_NONE,
+			free->xefi_flags & XFS_EFI_SKIP_DISCARD);
 	/*
 	 * Mark the transaction dirty, even on error. This ensures the
 	 * transaction is aborted, which:
@@ -375,8 +382,8 @@ xfs_trans_free_extent(
 	next_extent = efdp->efd_next_extent;
 	ASSERT(next_extent < efdp->efd_format.efd_nextents);
 	extp = &(efdp->efd_format.efd_extents[next_extent]);
-	extp->ext_start = start_block;
-	extp->ext_len = ext_len;
+	extp->ext_start = free->xefi_startblock;
+	extp->ext_len = free->xefi_blockcount;
 	efdp->efd_next_extent++;
 
 	return error;
@@ -463,20 +470,12 @@ xfs_extent_free_finish_item(
 	struct list_head		*item,
 	struct xfs_btree_cur		**state)
 {
-	struct xfs_owner_info		oinfo = { };
 	struct xfs_extent_free_item	*free;
 	int				error;
 
 	free = container_of(item, struct xfs_extent_free_item, xefi_list);
-	oinfo.oi_owner = free->xefi_owner;
-	if (free->xefi_flags & XFS_EFI_ATTR_FORK)
-		oinfo.oi_flags |= XFS_OWNER_INFO_ATTR_FORK;
-	if (free->xefi_flags & XFS_EFI_BMBT_BLOCK)
-		oinfo.oi_flags |= XFS_OWNER_INFO_BMBT_BLOCK;
-	error = xfs_trans_free_extent(tp, EFD_ITEM(done),
-			free->xefi_startblock,
-			free->xefi_blockcount,
-			&oinfo, free->xefi_flags & XFS_EFI_SKIP_DISCARD);
+
+	error = xfs_trans_free_extent(tp, EFD_ITEM(done), free);
 	kmem_cache_free(xfs_extfree_item_cache, free);
 	return error;
 }
@@ -599,7 +598,6 @@ xfs_efi_item_recover(
 	struct xfs_mount		*mp = lip->li_log->l_mp;
 	struct xfs_efd_log_item		*efdp;
 	struct xfs_trans		*tp;
-	struct xfs_extent		*extp;
 	int				i;
 	int				error = 0;
 
@@ -624,10 +622,17 @@ xfs_efi_item_recover(
 	efdp = xfs_trans_get_efd(tp, efip, efip->efi_format.efi_nextents);
 
 	for (i = 0; i < efip->efi_format.efi_nextents; i++) {
+		struct xfs_extent_free_item	fake = {
+			.xefi_owner		= XFS_RMAP_OWN_UNKNOWN,
+		};
+		struct xfs_extent		*extp;
+
 		extp = &efip->efi_format.efi_extents[i];
-		error = xfs_trans_free_extent(tp, efdp, extp->ext_start,
-					      extp->ext_len,
-					      &XFS_RMAP_OINFO_ANY_OWNER, false);
+
+		fake.xefi_startblock = extp->ext_start;
+		fake.xefi_blockcount = extp->ext_len;
+
+		error = xfs_trans_free_extent(tp, efdp, &fake);
 		if (error == -EFSCORRUPTED)
 			XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp,
 					extp, sizeof(*extp));


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 4/8] xfs: fix confusing xfs_extent_item variable names
  2022-12-30 22:10 ` [PATCHSET v24.0 0/8] xfs: variable and naming cleanups for intent items Darrick J. Wong
                     ` (4 preceding siblings ...)
  2022-12-30 22:10   ` [PATCH 5/8] xfs: pass rmap space mapping " Darrick J. Wong
@ 2022-12-30 22:10   ` Darrick J. Wong
  2022-12-30 22:10   ` [PATCH 6/8] xfs: fix confusing variable names in xfs_rmap_item.c Darrick J. Wong
  2022-12-30 22:10   ` [PATCH 7/8] xfs: pass refcount intent directly through the log intent code Darrick J. Wong
  7 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:10 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Change the name of all pointers to xfs_extent_free_item structures to
"xefi" to make the naming consistent, and because the current names
("new" and "free") mean other things in C.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_alloc.c |   32 ++++++++++-----------
 fs/xfs/xfs_extfree_item.c |   70 +++++++++++++++++++++++----------------------
 2 files changed, 51 insertions(+), 51 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 989cf341779b..f8ff81c3de76 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -2472,20 +2472,20 @@ xfs_defer_agfl_block(
 	struct xfs_owner_info		*oinfo)
 {
 	struct xfs_mount		*mp = tp->t_mountp;
-	struct xfs_extent_free_item	*new;		/* new element */
+	struct xfs_extent_free_item	*xefi;
 
 	ASSERT(xfs_extfree_item_cache != NULL);
 	ASSERT(oinfo != NULL);
 
-	new = kmem_cache_zalloc(xfs_extfree_item_cache,
+	xefi = kmem_cache_zalloc(xfs_extfree_item_cache,
 			       GFP_KERNEL | __GFP_NOFAIL);
-	new->xefi_startblock = XFS_AGB_TO_FSB(mp, agno, agbno);
-	new->xefi_blockcount = 1;
-	new->xefi_owner = oinfo->oi_owner;
+	xefi->xefi_startblock = XFS_AGB_TO_FSB(mp, agno, agbno);
+	xefi->xefi_blockcount = 1;
+	xefi->xefi_owner = oinfo->oi_owner;
 
 	trace_xfs_agfl_free_defer(mp, agno, 0, agbno, 1);
 
-	xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_AGFL_FREE, &new->xefi_list);
+	xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_AGFL_FREE, &xefi->xefi_list);
 }
 
 /*
@@ -2500,7 +2500,7 @@ __xfs_free_extent_later(
 	const struct xfs_owner_info	*oinfo,
 	bool				skip_discard)
 {
-	struct xfs_extent_free_item	*new;		/* new element */
+	struct xfs_extent_free_item	*xefi;
 #ifdef DEBUG
 	struct xfs_mount		*mp = tp->t_mountp;
 	xfs_agnumber_t			agno;
@@ -2519,27 +2519,27 @@ __xfs_free_extent_later(
 #endif
 	ASSERT(xfs_extfree_item_cache != NULL);
 
-	new = kmem_cache_zalloc(xfs_extfree_item_cache,
+	xefi = kmem_cache_zalloc(xfs_extfree_item_cache,
 			       GFP_KERNEL | __GFP_NOFAIL);
-	new->xefi_startblock = bno;
-	new->xefi_blockcount = (xfs_extlen_t)len;
+	xefi->xefi_startblock = bno;
+	xefi->xefi_blockcount = (xfs_extlen_t)len;
 	if (skip_discard)
-		new->xefi_flags |= XFS_EFI_SKIP_DISCARD;
+		xefi->xefi_flags |= XFS_EFI_SKIP_DISCARD;
 	if (oinfo) {
 		ASSERT(oinfo->oi_offset == 0);
 
 		if (oinfo->oi_flags & XFS_OWNER_INFO_ATTR_FORK)
-			new->xefi_flags |= XFS_EFI_ATTR_FORK;
+			xefi->xefi_flags |= XFS_EFI_ATTR_FORK;
 		if (oinfo->oi_flags & XFS_OWNER_INFO_BMBT_BLOCK)
-			new->xefi_flags |= XFS_EFI_BMBT_BLOCK;
-		new->xefi_owner = oinfo->oi_owner;
+			xefi->xefi_flags |= XFS_EFI_BMBT_BLOCK;
+		xefi->xefi_owner = oinfo->oi_owner;
 	} else {
-		new->xefi_owner = XFS_RMAP_OWN_NULL;
+		xefi->xefi_owner = XFS_RMAP_OWN_NULL;
 	}
 	trace_xfs_bmap_free_defer(tp->t_mountp,
 			XFS_FSB_TO_AGNO(tp->t_mountp, bno), 0,
 			XFS_FSB_TO_AGBNO(tp->t_mountp, bno), len);
-	xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_FREE, &new->xefi_list);
+	xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_FREE, &xefi->xefi_list);
 }
 
 #ifdef DEBUG
diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
index 618d2f9ff535..011b50469301 100644
--- a/fs/xfs/xfs_extfree_item.c
+++ b/fs/xfs/xfs_extfree_item.c
@@ -345,30 +345,30 @@ static int
 xfs_trans_free_extent(
 	struct xfs_trans		*tp,
 	struct xfs_efd_log_item		*efdp,
-	struct xfs_extent_free_item	*free)
+	struct xfs_extent_free_item	*xefi)
 {
 	struct xfs_owner_info		oinfo = { };
 	struct xfs_mount		*mp = tp->t_mountp;
 	struct xfs_extent		*extp;
 	uint				next_extent;
 	xfs_agnumber_t			agno = XFS_FSB_TO_AGNO(mp,
-							free->xefi_startblock);
+							xefi->xefi_startblock);
 	xfs_agblock_t			agbno = XFS_FSB_TO_AGBNO(mp,
-							free->xefi_startblock);
+							xefi->xefi_startblock);
 	int				error;
 
-	oinfo.oi_owner = free->xefi_owner;
-	if (free->xefi_flags & XFS_EFI_ATTR_FORK)
+	oinfo.oi_owner = xefi->xefi_owner;
+	if (xefi->xefi_flags & XFS_EFI_ATTR_FORK)
 		oinfo.oi_flags |= XFS_OWNER_INFO_ATTR_FORK;
-	if (free->xefi_flags & XFS_EFI_BMBT_BLOCK)
+	if (xefi->xefi_flags & XFS_EFI_BMBT_BLOCK)
 		oinfo.oi_flags |= XFS_OWNER_INFO_BMBT_BLOCK;
 
 	trace_xfs_bmap_free_deferred(tp->t_mountp, agno, 0, agbno,
-			free->xefi_blockcount);
+			xefi->xefi_blockcount);
 
-	error = __xfs_free_extent(tp, free->xefi_startblock,
-			free->xefi_blockcount, &oinfo, XFS_AG_RESV_NONE,
-			free->xefi_flags & XFS_EFI_SKIP_DISCARD);
+	error = __xfs_free_extent(tp, xefi->xefi_startblock,
+			xefi->xefi_blockcount, &oinfo, XFS_AG_RESV_NONE,
+			xefi->xefi_flags & XFS_EFI_SKIP_DISCARD);
 	/*
 	 * Mark the transaction dirty, even on error. This ensures the
 	 * transaction is aborted, which:
@@ -382,8 +382,8 @@ xfs_trans_free_extent(
 	next_extent = efdp->efd_next_extent;
 	ASSERT(next_extent < efdp->efd_format.efd_nextents);
 	extp = &(efdp->efd_format.efd_extents[next_extent]);
-	extp->ext_start = free->xefi_startblock;
-	extp->ext_len = free->xefi_blockcount;
+	extp->ext_start = xefi->xefi_startblock;
+	extp->ext_len = xefi->xefi_blockcount;
 	efdp->efd_next_extent++;
 
 	return error;
@@ -411,7 +411,7 @@ STATIC void
 xfs_extent_free_log_item(
 	struct xfs_trans		*tp,
 	struct xfs_efi_log_item		*efip,
-	struct xfs_extent_free_item	*free)
+	struct xfs_extent_free_item	*xefi)
 {
 	uint				next_extent;
 	struct xfs_extent		*extp;
@@ -427,8 +427,8 @@ xfs_extent_free_log_item(
 	next_extent = atomic_inc_return(&efip->efi_next_extent) - 1;
 	ASSERT(next_extent < efip->efi_format.efi_nextents);
 	extp = &efip->efi_format.efi_extents[next_extent];
-	extp->ext_start = free->xefi_startblock;
-	extp->ext_len = free->xefi_blockcount;
+	extp->ext_start = xefi->xefi_startblock;
+	extp->ext_len = xefi->xefi_blockcount;
 }
 
 static struct xfs_log_item *
@@ -440,15 +440,15 @@ xfs_extent_free_create_intent(
 {
 	struct xfs_mount		*mp = tp->t_mountp;
 	struct xfs_efi_log_item		*efip = xfs_efi_init(mp, count);
-	struct xfs_extent_free_item	*free;
+	struct xfs_extent_free_item	*xefi;
 
 	ASSERT(count > 0);
 
 	xfs_trans_add_item(tp, &efip->efi_item);
 	if (sort)
 		list_sort(mp, items, xfs_extent_free_diff_items);
-	list_for_each_entry(free, items, xefi_list)
-		xfs_extent_free_log_item(tp, efip, free);
+	list_for_each_entry(xefi, items, xefi_list)
+		xfs_extent_free_log_item(tp, efip, xefi);
 	return &efip->efi_item;
 }
 
@@ -470,13 +470,13 @@ xfs_extent_free_finish_item(
 	struct list_head		*item,
 	struct xfs_btree_cur		**state)
 {
-	struct xfs_extent_free_item	*free;
+	struct xfs_extent_free_item	*xefi;
 	int				error;
 
-	free = container_of(item, struct xfs_extent_free_item, xefi_list);
+	xefi = container_of(item, struct xfs_extent_free_item, xefi_list);
 
-	error = xfs_trans_free_extent(tp, EFD_ITEM(done), free);
-	kmem_cache_free(xfs_extfree_item_cache, free);
+	error = xfs_trans_free_extent(tp, EFD_ITEM(done), xefi);
+	kmem_cache_free(xfs_extfree_item_cache, xefi);
 	return error;
 }
 
@@ -493,10 +493,10 @@ STATIC void
 xfs_extent_free_cancel_item(
 	struct list_head		*item)
 {
-	struct xfs_extent_free_item	*free;
+	struct xfs_extent_free_item	*xefi;
 
-	free = container_of(item, struct xfs_extent_free_item, xefi_list);
-	kmem_cache_free(xfs_extfree_item_cache, free);
+	xefi = container_of(item, struct xfs_extent_free_item, xefi_list);
+	kmem_cache_free(xfs_extfree_item_cache, xefi);
 }
 
 const struct xfs_defer_op_type xfs_extent_free_defer_type = {
@@ -522,7 +522,7 @@ xfs_agfl_free_finish_item(
 	struct xfs_owner_info		oinfo = { };
 	struct xfs_mount		*mp = tp->t_mountp;
 	struct xfs_efd_log_item		*efdp = EFD_ITEM(done);
-	struct xfs_extent_free_item	*free;
+	struct xfs_extent_free_item	*xefi;
 	struct xfs_extent		*extp;
 	struct xfs_buf			*agbp;
 	int				error;
@@ -531,13 +531,13 @@ xfs_agfl_free_finish_item(
 	uint				next_extent;
 	struct xfs_perag		*pag;
 
-	free = container_of(item, struct xfs_extent_free_item, xefi_list);
-	ASSERT(free->xefi_blockcount == 1);
-	agno = XFS_FSB_TO_AGNO(mp, free->xefi_startblock);
-	agbno = XFS_FSB_TO_AGBNO(mp, free->xefi_startblock);
-	oinfo.oi_owner = free->xefi_owner;
+	xefi = container_of(item, struct xfs_extent_free_item, xefi_list);
+	ASSERT(xefi->xefi_blockcount == 1);
+	agno = XFS_FSB_TO_AGNO(mp, xefi->xefi_startblock);
+	agbno = XFS_FSB_TO_AGBNO(mp, xefi->xefi_startblock);
+	oinfo.oi_owner = xefi->xefi_owner;
 
-	trace_xfs_agfl_free_deferred(mp, agno, 0, agbno, free->xefi_blockcount);
+	trace_xfs_agfl_free_deferred(mp, agno, 0, agbno, xefi->xefi_blockcount);
 
 	pag = xfs_perag_get(mp, agno);
 	error = xfs_alloc_read_agf(pag, tp, 0, &agbp);
@@ -558,11 +558,11 @@ xfs_agfl_free_finish_item(
 	next_extent = efdp->efd_next_extent;
 	ASSERT(next_extent < efdp->efd_format.efd_nextents);
 	extp = &(efdp->efd_format.efd_extents[next_extent]);
-	extp->ext_start = free->xefi_startblock;
-	extp->ext_len = free->xefi_blockcount;
+	extp->ext_start = xefi->xefi_startblock;
+	extp->ext_len = xefi->xefi_blockcount;
 	efdp->efd_next_extent++;
 
-	kmem_cache_free(xfs_extfree_item_cache, free);
+	kmem_cache_free(xfs_extfree_item_cache, xefi);
 	return error;
 }
 


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 5/8] xfs: pass rmap space mapping directly through the log intent code
  2022-12-30 22:10 ` [PATCHSET v24.0 0/8] xfs: variable and naming cleanups for intent items Darrick J. Wong
                     ` (3 preceding siblings ...)
  2022-12-30 22:10   ` [PATCH 3/8] xfs: pass xfs_extent_free_item directly through the log intent code Darrick J. Wong
@ 2022-12-30 22:10   ` Darrick J. Wong
  2022-12-30 22:10   ` [PATCH 4/8] xfs: fix confusing xfs_extent_item variable names Darrick J. Wong
                     ` (2 subsequent siblings)
  7 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:10 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Pass the incore rmap space mapping through the RUI logging code instead
of repeatedly boxing and unboxing parameters.
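
This is the same pattern as the extent-free patch earlier in this
series; the wrinkle here is log recovery, which no longer has an incore
intent on hand and therefore builds a throwaway one on the stack from
the logged extent before calling the common finish path.  A minimal
sketch of that pattern follows, with simplified stand-in types and a
hypothetical helper name (the real code does this inline in
xfs_rui_item_recover):

    /* sketch only; simplified stand-ins for the kernel structures */
    #include <stdint.h>

    struct xfs_map_extent {             /* as recorded in the log */
            uint64_t        me_owner;
            uint64_t        me_startblock;
            uint64_t        me_startoff;
            uint32_t        me_len;
            uint32_t        me_flags;
    };

    struct xfs_bmbt_irec {
            uint64_t        br_startoff;
            uint64_t        br_startblock;
            uint64_t        br_blockcount;
            int             br_state;
    };

    struct xfs_rmap_intent {            /* the incore work item */
            int                     ri_type;
            int                     ri_whichfork;
            uint64_t                ri_owner;
            struct xfs_bmbt_irec    ri_bmap;
    };

    /* recovery rebuilds an incore intent from the logged extent and
     * then finishes it exactly as the runtime path would */
    static void rui_rebuild_intent(const struct xfs_map_extent *map,
                    struct xfs_rmap_intent *fake)
    {
            fake->ri_owner = map->me_owner;
            fake->ri_bmap.br_startblock = map->me_startblock;
            fake->ri_bmap.br_startoff = map->me_startoff;
            fake->ri_bmap.br_blockcount = map->me_len;
            /* type/fork/unwritten decoding from me_flags elided */
    }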

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_rmap.c |   52 ++++++++++++++++++-------------------
 fs/xfs/libxfs/xfs_rmap.h |    6 +---
 fs/xfs/xfs_rmap_item.c   |   65 +++++++++++++++++++++-------------------------
 3 files changed, 56 insertions(+), 67 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c
index b56aca1e7c66..df720041cd3d 100644
--- a/fs/xfs/libxfs/xfs_rmap.c
+++ b/fs/xfs/libxfs/xfs_rmap.c
@@ -2390,13 +2390,7 @@ xfs_rmap_finish_one_cleanup(
 int
 xfs_rmap_finish_one(
 	struct xfs_trans		*tp,
-	enum xfs_rmap_intent_type	type,
-	uint64_t			owner,
-	int				whichfork,
-	xfs_fileoff_t			startoff,
-	xfs_fsblock_t			startblock,
-	xfs_filblks_t			blockcount,
-	xfs_exntst_t			state,
+	struct xfs_rmap_intent		*ri,
 	struct xfs_btree_cur		**pcur)
 {
 	struct xfs_mount		*mp = tp->t_mountp;
@@ -2408,11 +2402,13 @@ xfs_rmap_finish_one(
 	xfs_agblock_t			bno;
 	bool				unwritten;
 
-	pag = xfs_perag_get(mp, XFS_FSB_TO_AGNO(mp, startblock));
-	bno = XFS_FSB_TO_AGBNO(mp, startblock);
+	pag = xfs_perag_get(mp, XFS_FSB_TO_AGNO(mp, ri->ri_bmap.br_startblock));
+	bno = XFS_FSB_TO_AGBNO(mp, ri->ri_bmap.br_startblock);
 
-	trace_xfs_rmap_deferred(mp, pag->pag_agno, type, bno, owner, whichfork,
-			startoff, blockcount, state);
+	trace_xfs_rmap_deferred(mp, pag->pag_agno, ri->ri_type, bno,
+			ri->ri_owner, ri->ri_whichfork,
+			ri->ri_bmap.br_startoff, ri->ri_bmap.br_blockcount,
+			ri->ri_bmap.br_state);
 
 	if (XFS_TEST_ERROR(false, mp, XFS_ERRTAG_RMAP_FINISH_ONE)) {
 		error = -EIO;
@@ -2448,36 +2444,38 @@ xfs_rmap_finish_one(
 	}
 	*pcur = rcur;
 
-	xfs_rmap_ino_owner(&oinfo, owner, whichfork, startoff);
-	unwritten = state == XFS_EXT_UNWRITTEN;
-	bno = XFS_FSB_TO_AGBNO(rcur->bc_mp, startblock);
+	xfs_rmap_ino_owner(&oinfo, ri->ri_owner, ri->ri_whichfork,
+			ri->ri_bmap.br_startoff);
+	unwritten = ri->ri_bmap.br_state == XFS_EXT_UNWRITTEN;
+	bno = XFS_FSB_TO_AGBNO(rcur->bc_mp, ri->ri_bmap.br_startblock);
 
-	switch (type) {
+	switch (ri->ri_type) {
 	case XFS_RMAP_ALLOC:
 	case XFS_RMAP_MAP:
-		error = xfs_rmap_map(rcur, bno, blockcount, unwritten, &oinfo);
+		error = xfs_rmap_map(rcur, bno, ri->ri_bmap.br_blockcount,
+				unwritten, &oinfo);
 		break;
 	case XFS_RMAP_MAP_SHARED:
-		error = xfs_rmap_map_shared(rcur, bno, blockcount, unwritten,
-				&oinfo);
+		error = xfs_rmap_map_shared(rcur, bno,
+				ri->ri_bmap.br_blockcount, unwritten, &oinfo);
 		break;
 	case XFS_RMAP_FREE:
 	case XFS_RMAP_UNMAP:
-		error = xfs_rmap_unmap(rcur, bno, blockcount, unwritten,
-				&oinfo);
+		error = xfs_rmap_unmap(rcur, bno, ri->ri_bmap.br_blockcount,
+				unwritten, &oinfo);
 		break;
 	case XFS_RMAP_UNMAP_SHARED:
-		error = xfs_rmap_unmap_shared(rcur, bno, blockcount, unwritten,
-				&oinfo);
+		error = xfs_rmap_unmap_shared(rcur, bno,
+				ri->ri_bmap.br_blockcount, unwritten, &oinfo);
 		break;
 	case XFS_RMAP_CONVERT:
-		error = xfs_rmap_convert(rcur, bno, blockcount, !unwritten,
-				&oinfo);
-		break;
-	case XFS_RMAP_CONVERT_SHARED:
-		error = xfs_rmap_convert_shared(rcur, bno, blockcount,
+		error = xfs_rmap_convert(rcur, bno, ri->ri_bmap.br_blockcount,
 				!unwritten, &oinfo);
 		break;
+	case XFS_RMAP_CONVERT_SHARED:
+		error = xfs_rmap_convert_shared(rcur, bno,
+				ri->ri_bmap.br_blockcount, !unwritten, &oinfo);
+		break;
 	default:
 		ASSERT(0);
 		error = -EFSCORRUPTED;
diff --git a/fs/xfs/libxfs/xfs_rmap.h b/fs/xfs/libxfs/xfs_rmap.h
index 54741a591a17..2dac88cea28d 100644
--- a/fs/xfs/libxfs/xfs_rmap.h
+++ b/fs/xfs/libxfs/xfs_rmap.h
@@ -179,10 +179,8 @@ void xfs_rmap_free_extent(struct xfs_trans *tp, xfs_agnumber_t agno,
 
 void xfs_rmap_finish_one_cleanup(struct xfs_trans *tp,
 		struct xfs_btree_cur *rcur, int error);
-int xfs_rmap_finish_one(struct xfs_trans *tp, enum xfs_rmap_intent_type type,
-		uint64_t owner, int whichfork, xfs_fileoff_t startoff,
-		xfs_fsblock_t startblock, xfs_filblks_t blockcount,
-		xfs_exntst_t state, struct xfs_btree_cur **pcur);
+int xfs_rmap_finish_one(struct xfs_trans *tp, struct xfs_rmap_intent *ri,
+		struct xfs_btree_cur **pcur);
 
 int xfs_rmap_lookup_le_range(struct xfs_btree_cur *cur, xfs_agblock_t bno,
 		uint64_t owner, uint64_t offset, unsigned int flags,
diff --git a/fs/xfs/xfs_rmap_item.c b/fs/xfs/xfs_rmap_item.c
index 534504ede1a3..e46d040a9fc5 100644
--- a/fs/xfs/xfs_rmap_item.c
+++ b/fs/xfs/xfs_rmap_item.c
@@ -293,19 +293,12 @@ static int
 xfs_trans_log_finish_rmap_update(
 	struct xfs_trans		*tp,
 	struct xfs_rud_log_item		*rudp,
-	enum xfs_rmap_intent_type	type,
-	uint64_t			owner,
-	int				whichfork,
-	xfs_fileoff_t			startoff,
-	xfs_fsblock_t			startblock,
-	xfs_filblks_t			blockcount,
-	xfs_exntst_t			state,
+	struct xfs_rmap_intent		*ri,
 	struct xfs_btree_cur		**pcur)
 {
 	int				error;
 
-	error = xfs_rmap_finish_one(tp, type, owner, whichfork, startoff,
-			startblock, blockcount, state, pcur);
+	error = xfs_rmap_finish_one(tp, ri, pcur);
 
 	/*
 	 * Mark the transaction dirty, even on error. This ensures the
@@ -409,10 +402,7 @@ xfs_rmap_update_finish_item(
 	int				error;
 
 	rmap = container_of(item, struct xfs_rmap_intent, ri_list);
-	error = xfs_trans_log_finish_rmap_update(tp, RUD_ITEM(done),
-			rmap->ri_type, rmap->ri_owner, rmap->ri_whichfork,
-			rmap->ri_bmap.br_startoff, rmap->ri_bmap.br_startblock,
-			rmap->ri_bmap.br_blockcount, rmap->ri_bmap.br_state,
+	error = xfs_trans_log_finish_rmap_update(tp, RUD_ITEM(done), rmap,
 			state);
 	kmem_cache_free(xfs_rmap_intent_cache, rmap);
 	return error;
@@ -493,15 +483,11 @@ xfs_rui_item_recover(
 	struct list_head		*capture_list)
 {
 	struct xfs_rui_log_item		*ruip = RUI_ITEM(lip);
-	struct xfs_map_extent		*rmap;
 	struct xfs_rud_log_item		*rudp;
 	struct xfs_trans		*tp;
 	struct xfs_btree_cur		*rcur = NULL;
 	struct xfs_mount		*mp = lip->li_log->l_mp;
-	enum xfs_rmap_intent_type	type;
-	xfs_exntst_t			state;
 	int				i;
-	int				whichfork;
 	int				error = 0;
 
 	/*
@@ -526,35 +512,34 @@ xfs_rui_item_recover(
 	rudp = xfs_trans_get_rud(tp, ruip);
 
 	for (i = 0; i < ruip->rui_format.rui_nextents; i++) {
-		rmap = &ruip->rui_format.rui_extents[i];
-		state = (rmap->me_flags & XFS_RMAP_EXTENT_UNWRITTEN) ?
-				XFS_EXT_UNWRITTEN : XFS_EXT_NORM;
-		whichfork = (rmap->me_flags & XFS_RMAP_EXTENT_ATTR_FORK) ?
-				XFS_ATTR_FORK : XFS_DATA_FORK;
-		switch (rmap->me_flags & XFS_RMAP_EXTENT_TYPE_MASK) {
+		struct xfs_rmap_intent	fake = { };
+		struct xfs_map_extent	*map;
+
+		map = &ruip->rui_format.rui_extents[i];
+		switch (map->me_flags & XFS_RMAP_EXTENT_TYPE_MASK) {
 		case XFS_RMAP_EXTENT_MAP:
-			type = XFS_RMAP_MAP;
+			fake.ri_type = XFS_RMAP_MAP;
 			break;
 		case XFS_RMAP_EXTENT_MAP_SHARED:
-			type = XFS_RMAP_MAP_SHARED;
+			fake.ri_type = XFS_RMAP_MAP_SHARED;
 			break;
 		case XFS_RMAP_EXTENT_UNMAP:
-			type = XFS_RMAP_UNMAP;
+			fake.ri_type = XFS_RMAP_UNMAP;
 			break;
 		case XFS_RMAP_EXTENT_UNMAP_SHARED:
-			type = XFS_RMAP_UNMAP_SHARED;
+			fake.ri_type = XFS_RMAP_UNMAP_SHARED;
 			break;
 		case XFS_RMAP_EXTENT_CONVERT:
-			type = XFS_RMAP_CONVERT;
+			fake.ri_type = XFS_RMAP_CONVERT;
 			break;
 		case XFS_RMAP_EXTENT_CONVERT_SHARED:
-			type = XFS_RMAP_CONVERT_SHARED;
+			fake.ri_type = XFS_RMAP_CONVERT_SHARED;
 			break;
 		case XFS_RMAP_EXTENT_ALLOC:
-			type = XFS_RMAP_ALLOC;
+			fake.ri_type = XFS_RMAP_ALLOC;
 			break;
 		case XFS_RMAP_EXTENT_FREE:
-			type = XFS_RMAP_FREE;
+			fake.ri_type = XFS_RMAP_FREE;
 			break;
 		default:
 			XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp,
@@ -563,13 +548,21 @@ xfs_rui_item_recover(
 			error = -EFSCORRUPTED;
 			goto abort_error;
 		}
-		error = xfs_trans_log_finish_rmap_update(tp, rudp, type,
-				rmap->me_owner, whichfork,
-				rmap->me_startoff, rmap->me_startblock,
-				rmap->me_len, state, &rcur);
+
+		fake.ri_owner = map->me_owner;
+		fake.ri_whichfork = (map->me_flags & XFS_RMAP_EXTENT_ATTR_FORK) ?
+				XFS_ATTR_FORK : XFS_DATA_FORK;
+		fake.ri_bmap.br_startblock = map->me_startblock;
+		fake.ri_bmap.br_startoff = map->me_startoff;
+		fake.ri_bmap.br_blockcount = map->me_len;
+		fake.ri_bmap.br_state = (map->me_flags & XFS_RMAP_EXTENT_UNWRITTEN) ?
+				XFS_EXT_UNWRITTEN : XFS_EXT_NORM;
+
+		error = xfs_trans_log_finish_rmap_update(tp, rudp, &fake,
+				&rcur);
 		if (error == -EFSCORRUPTED)
 			XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp,
-					rmap, sizeof(*rmap));
+					map, sizeof(*map));
 		if (error)
 			goto abort_error;
 


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 6/8] xfs: fix confusing variable names in xfs_rmap_item.c
  2022-12-30 22:10 ` [PATCHSET v24.0 0/8] xfs: variable and naming cleanups for intent items Darrick J. Wong
                     ` (5 preceding siblings ...)
  2022-12-30 22:10   ` [PATCH 4/8] xfs: fix confusing xfs_extent_item variable names Darrick J. Wong
@ 2022-12-30 22:10   ` Darrick J. Wong
  2022-12-30 22:10   ` [PATCH 7/8] xfs: pass refcount intent directly through the log intent code Darrick J. Wong
  7 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:10 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Variable names in this code module are inconsistent and confusing.
xfs_map_extent structures describe the mappings logged in an RUI, so
rename those variables "map".  xfs_rmap_intent structures describe
reverse mapping intents, so rename those variables "ri".

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_rmap_item.c |   79 ++++++++++++++++++++++++------------------------
 1 file changed, 40 insertions(+), 39 deletions(-)


diff --git a/fs/xfs/xfs_rmap_item.c b/fs/xfs/xfs_rmap_item.c
index e46d040a9fc5..a1619d67015f 100644
--- a/fs/xfs/xfs_rmap_item.c
+++ b/fs/xfs/xfs_rmap_item.c
@@ -244,40 +244,40 @@ xfs_trans_get_rud(
 /* Set the map extent flags for this reverse mapping. */
 static void
 xfs_trans_set_rmap_flags(
-	struct xfs_map_extent		*rmap,
+	struct xfs_map_extent		*map,
 	enum xfs_rmap_intent_type	type,
 	int				whichfork,
 	xfs_exntst_t			state)
 {
-	rmap->me_flags = 0;
+	map->me_flags = 0;
 	if (state == XFS_EXT_UNWRITTEN)
-		rmap->me_flags |= XFS_RMAP_EXTENT_UNWRITTEN;
+		map->me_flags |= XFS_RMAP_EXTENT_UNWRITTEN;
 	if (whichfork == XFS_ATTR_FORK)
-		rmap->me_flags |= XFS_RMAP_EXTENT_ATTR_FORK;
+		map->me_flags |= XFS_RMAP_EXTENT_ATTR_FORK;
 	switch (type) {
 	case XFS_RMAP_MAP:
-		rmap->me_flags |= XFS_RMAP_EXTENT_MAP;
+		map->me_flags |= XFS_RMAP_EXTENT_MAP;
 		break;
 	case XFS_RMAP_MAP_SHARED:
-		rmap->me_flags |= XFS_RMAP_EXTENT_MAP_SHARED;
+		map->me_flags |= XFS_RMAP_EXTENT_MAP_SHARED;
 		break;
 	case XFS_RMAP_UNMAP:
-		rmap->me_flags |= XFS_RMAP_EXTENT_UNMAP;
+		map->me_flags |= XFS_RMAP_EXTENT_UNMAP;
 		break;
 	case XFS_RMAP_UNMAP_SHARED:
-		rmap->me_flags |= XFS_RMAP_EXTENT_UNMAP_SHARED;
+		map->me_flags |= XFS_RMAP_EXTENT_UNMAP_SHARED;
 		break;
 	case XFS_RMAP_CONVERT:
-		rmap->me_flags |= XFS_RMAP_EXTENT_CONVERT;
+		map->me_flags |= XFS_RMAP_EXTENT_CONVERT;
 		break;
 	case XFS_RMAP_CONVERT_SHARED:
-		rmap->me_flags |= XFS_RMAP_EXTENT_CONVERT_SHARED;
+		map->me_flags |= XFS_RMAP_EXTENT_CONVERT_SHARED;
 		break;
 	case XFS_RMAP_ALLOC:
-		rmap->me_flags |= XFS_RMAP_EXTENT_ALLOC;
+		map->me_flags |= XFS_RMAP_EXTENT_ALLOC;
 		break;
 	case XFS_RMAP_FREE:
-		rmap->me_flags |= XFS_RMAP_EXTENT_FREE;
+		map->me_flags |= XFS_RMAP_EXTENT_FREE;
 		break;
 	default:
 		ASSERT(0);
@@ -335,7 +335,7 @@ STATIC void
 xfs_rmap_update_log_item(
 	struct xfs_trans		*tp,
 	struct xfs_rui_log_item		*ruip,
-	struct xfs_rmap_intent		*rmap)
+	struct xfs_rmap_intent		*ri)
 {
 	uint				next_extent;
 	struct xfs_map_extent		*map;
@@ -351,12 +351,12 @@ xfs_rmap_update_log_item(
 	next_extent = atomic_inc_return(&ruip->rui_next_extent) - 1;
 	ASSERT(next_extent < ruip->rui_format.rui_nextents);
 	map = &ruip->rui_format.rui_extents[next_extent];
-	map->me_owner = rmap->ri_owner;
-	map->me_startblock = rmap->ri_bmap.br_startblock;
-	map->me_startoff = rmap->ri_bmap.br_startoff;
-	map->me_len = rmap->ri_bmap.br_blockcount;
-	xfs_trans_set_rmap_flags(map, rmap->ri_type, rmap->ri_whichfork,
-			rmap->ri_bmap.br_state);
+	map->me_owner = ri->ri_owner;
+	map->me_startblock = ri->ri_bmap.br_startblock;
+	map->me_startoff = ri->ri_bmap.br_startoff;
+	map->me_len = ri->ri_bmap.br_blockcount;
+	xfs_trans_set_rmap_flags(map, ri->ri_type, ri->ri_whichfork,
+			ri->ri_bmap.br_state);
 }
 
 static struct xfs_log_item *
@@ -368,15 +368,15 @@ xfs_rmap_update_create_intent(
 {
 	struct xfs_mount		*mp = tp->t_mountp;
 	struct xfs_rui_log_item		*ruip = xfs_rui_init(mp, count);
-	struct xfs_rmap_intent		*rmap;
+	struct xfs_rmap_intent		*ri;
 
 	ASSERT(count > 0);
 
 	xfs_trans_add_item(tp, &ruip->rui_item);
 	if (sort)
 		list_sort(mp, items, xfs_rmap_update_diff_items);
-	list_for_each_entry(rmap, items, ri_list)
-		xfs_rmap_update_log_item(tp, ruip, rmap);
+	list_for_each_entry(ri, items, ri_list)
+		xfs_rmap_update_log_item(tp, ruip, ri);
 	return &ruip->rui_item;
 }
 
@@ -398,13 +398,14 @@ xfs_rmap_update_finish_item(
 	struct list_head		*item,
 	struct xfs_btree_cur		**state)
 {
-	struct xfs_rmap_intent		*rmap;
+	struct xfs_rmap_intent		*ri;
 	int				error;
 
-	rmap = container_of(item, struct xfs_rmap_intent, ri_list);
-	error = xfs_trans_log_finish_rmap_update(tp, RUD_ITEM(done), rmap,
+	ri = container_of(item, struct xfs_rmap_intent, ri_list);
+
+	error = xfs_trans_log_finish_rmap_update(tp, RUD_ITEM(done), ri,
 			state);
-	kmem_cache_free(xfs_rmap_intent_cache, rmap);
+	kmem_cache_free(xfs_rmap_intent_cache, ri);
 	return error;
 }
 
@@ -421,10 +422,10 @@ STATIC void
 xfs_rmap_update_cancel_item(
 	struct list_head		*item)
 {
-	struct xfs_rmap_intent		*rmap;
+	struct xfs_rmap_intent		*ri;
 
-	rmap = container_of(item, struct xfs_rmap_intent, ri_list);
-	kmem_cache_free(xfs_rmap_intent_cache, rmap);
+	ri = container_of(item, struct xfs_rmap_intent, ri_list);
+	kmem_cache_free(xfs_rmap_intent_cache, ri);
 }
 
 const struct xfs_defer_op_type xfs_rmap_update_defer_type = {
@@ -441,15 +442,15 @@ const struct xfs_defer_op_type xfs_rmap_update_defer_type = {
 static inline bool
 xfs_rui_validate_map(
 	struct xfs_mount		*mp,
-	struct xfs_map_extent		*rmap)
+	struct xfs_map_extent		*map)
 {
 	if (!xfs_has_rmapbt(mp))
 		return false;
 
-	if (rmap->me_flags & ~XFS_RMAP_EXTENT_FLAGS)
+	if (map->me_flags & ~XFS_RMAP_EXTENT_FLAGS)
 		return false;
 
-	switch (rmap->me_flags & XFS_RMAP_EXTENT_TYPE_MASK) {
+	switch (map->me_flags & XFS_RMAP_EXTENT_TYPE_MASK) {
 	case XFS_RMAP_EXTENT_MAP:
 	case XFS_RMAP_EXTENT_MAP_SHARED:
 	case XFS_RMAP_EXTENT_UNMAP:
@@ -463,14 +464,14 @@ xfs_rui_validate_map(
 		return false;
 	}
 
-	if (!XFS_RMAP_NON_INODE_OWNER(rmap->me_owner) &&
-	    !xfs_verify_ino(mp, rmap->me_owner))
+	if (!XFS_RMAP_NON_INODE_OWNER(map->me_owner) &&
+	    !xfs_verify_ino(mp, map->me_owner))
 		return false;
 
-	if (!xfs_verify_fileext(mp, rmap->me_startoff, rmap->me_len))
+	if (!xfs_verify_fileext(mp, map->me_startoff, map->me_len))
 		return false;
 
-	return xfs_verify_fsbext(mp, rmap->me_startblock, rmap->me_len);
+	return xfs_verify_fsbext(mp, map->me_startblock, map->me_len);
 }
 
 /*
@@ -593,18 +594,18 @@ xfs_rui_item_relog(
 {
 	struct xfs_rud_log_item		*rudp;
 	struct xfs_rui_log_item		*ruip;
-	struct xfs_map_extent		*extp;
+	struct xfs_map_extent		*map;
 	unsigned int			count;
 
 	count = RUI_ITEM(intent)->rui_format.rui_nextents;
-	extp = RUI_ITEM(intent)->rui_format.rui_extents;
+	map = RUI_ITEM(intent)->rui_format.rui_extents;
 
 	tp->t_flags |= XFS_TRANS_DIRTY;
 	rudp = xfs_trans_get_rud(tp, RUI_ITEM(intent));
 	set_bit(XFS_LI_DIRTY, &rudp->rud_item.li_flags);
 
 	ruip = xfs_rui_init(tp->t_mountp, count);
-	memcpy(ruip->rui_format.rui_extents, extp, count * sizeof(*extp));
+	memcpy(ruip->rui_format.rui_extents, map, count * sizeof(*map));
 	atomic_set(&ruip->rui_next_extent, count);
 	xfs_trans_add_item(tp, &ruip->rui_item);
 	set_bit(XFS_LI_DIRTY, &ruip->rui_item.li_flags);


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 7/8] xfs: pass refcount intent directly through the log intent code
  2022-12-30 22:10 ` [PATCHSET v24.0 0/8] xfs: variable and naming cleanups for intent items Darrick J. Wong
                     ` (6 preceding siblings ...)
  2022-12-30 22:10   ` [PATCH 6/8] xfs: fix confusing variable names in xfs_rmap_item.c Darrick J. Wong
@ 2022-12-30 22:10   ` Darrick J. Wong
  7 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:10 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Pass the incore refcount intent through the CUI logging code instead of
repeatedly boxing and unboxing parameters.
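
Passing the intent by reference also simplifies the leftover-work
bookkeeping: instead of returning a new start block and length through
out parameters, the finish code updates the intent in place and the
caller requeues it when anything remains.  A rough sketch of that
control flow, with simplified names standing in for the real helpers:

    /* sketch only; simplified stand-ins, not the kernel code */
    #include <errno.h>
    #include <stdint.h>

    struct refcount_intent {
            int             type;
            uint64_t        startblock;     /* advanced in place */
            uint32_t        blockcount;     /* shrinks as work completes */
    };

    /* stand-in for xfs_refcount_finish_one(): does as much of the range
     * as the transaction reservation allows and leaves the unfinished
     * remainder in the intent itself */
    int finish_one(struct refcount_intent *ri);

    static int finish_item(struct refcount_intent *ri)
    {
            int error = finish_one(ri);

            if (error)
                    return error;
            if (ri->blockcount > 0)
                    return -EAGAIN;         /* requeue what's left */
            return 0;                       /* done; caller frees it */
    }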

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_refcount.c |   96 +++++++++++++++++++-----------------------
 fs/xfs/libxfs/xfs_refcount.h |    4 --
 fs/xfs/xfs_refcount_item.c   |   62 +++++++++++----------------
 fs/xfs/xfs_trace.h           |   15 ++-----
 4 files changed, 74 insertions(+), 103 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
index 6f7ed9288fe4..bcf46aa0d08b 100644
--- a/fs/xfs/libxfs/xfs_refcount.c
+++ b/fs/xfs/libxfs/xfs_refcount.c
@@ -1213,37 +1213,33 @@ xfs_refcount_adjust_extents(
 STATIC int
 xfs_refcount_adjust(
 	struct xfs_btree_cur	*cur,
-	xfs_agblock_t		agbno,
-	xfs_extlen_t		aglen,
-	xfs_agblock_t		*new_agbno,
-	xfs_extlen_t		*new_aglen,
+	xfs_agblock_t		*agbno,
+	xfs_extlen_t		*aglen,
 	enum xfs_refc_adjust_op	adj)
 {
 	bool			shape_changed;
 	int			shape_changes = 0;
 	int			error;
 
-	*new_agbno = agbno;
-	*new_aglen = aglen;
 	if (adj == XFS_REFCOUNT_ADJUST_INCREASE)
-		trace_xfs_refcount_increase(cur->bc_mp, cur->bc_ag.pag->pag_agno,
-				agbno, aglen);
+		trace_xfs_refcount_increase(cur->bc_mp,
+				cur->bc_ag.pag->pag_agno, *agbno, *aglen);
 	else
-		trace_xfs_refcount_decrease(cur->bc_mp, cur->bc_ag.pag->pag_agno,
-				agbno, aglen);
+		trace_xfs_refcount_decrease(cur->bc_mp,
+				cur->bc_ag.pag->pag_agno, *agbno, *aglen);
 
 	/*
 	 * Ensure that no rcextents cross the boundary of the adjustment range.
 	 */
 	error = xfs_refcount_split_extent(cur, XFS_REFC_DOMAIN_SHARED,
-			agbno, &shape_changed);
+			*agbno, &shape_changed);
 	if (error)
 		goto out_error;
 	if (shape_changed)
 		shape_changes++;
 
 	error = xfs_refcount_split_extent(cur, XFS_REFC_DOMAIN_SHARED,
-			agbno + aglen, &shape_changed);
+			*agbno + *aglen, &shape_changed);
 	if (error)
 		goto out_error;
 	if (shape_changed)
@@ -1253,7 +1249,7 @@ xfs_refcount_adjust(
 	 * Try to merge with the left or right extents of the range.
 	 */
 	error = xfs_refcount_merge_extents(cur, XFS_REFC_DOMAIN_SHARED,
-			new_agbno, new_aglen, adj, &shape_changed);
+			agbno, aglen, adj, &shape_changed);
 	if (error)
 		goto out_error;
 	if (shape_changed)
@@ -1262,7 +1258,7 @@ xfs_refcount_adjust(
 		cur->bc_ag.refc.shape_changes++;
 
 	/* Now that we've taken care of the ends, adjust the middle extents */
-	error = xfs_refcount_adjust_extents(cur, new_agbno, new_aglen, adj);
+	error = xfs_refcount_adjust_extents(cur, agbno, aglen, adj);
 	if (error)
 		goto out_error;
 
@@ -1298,21 +1294,20 @@ xfs_refcount_finish_one_cleanup(
 static inline int
 xfs_refcount_continue_op(
 	struct xfs_btree_cur		*cur,
-	xfs_fsblock_t			startblock,
-	xfs_agblock_t			new_agbno,
-	xfs_extlen_t			new_len,
-	xfs_fsblock_t			*new_fsbno)
+	struct xfs_refcount_intent	*ri,
+	xfs_agblock_t			new_agbno)
 {
 	struct xfs_mount		*mp = cur->bc_mp;
 	struct xfs_perag		*pag = cur->bc_ag.pag;
 
-	if (XFS_IS_CORRUPT(mp, !xfs_verify_agbext(pag, new_agbno, new_len)))
+	if (XFS_IS_CORRUPT(mp, !xfs_verify_agbext(pag, new_agbno,
+					ri->ri_blockcount)))
 		return -EFSCORRUPTED;
 
-	*new_fsbno = XFS_AGB_TO_FSB(mp, pag->pag_agno, new_agbno);
+	ri->ri_startblock = XFS_AGB_TO_FSB(mp, pag->pag_agno, new_agbno);
 
-	ASSERT(xfs_verify_fsbext(mp, *new_fsbno, new_len));
-	ASSERT(pag->pag_agno == XFS_FSB_TO_AGNO(mp, *new_fsbno));
+	ASSERT(xfs_verify_fsbext(mp, ri->ri_startblock, ri->ri_blockcount));
+	ASSERT(pag->pag_agno == XFS_FSB_TO_AGNO(mp, ri->ri_startblock));
 
 	return 0;
 }
@@ -1327,11 +1322,7 @@ xfs_refcount_continue_op(
 int
 xfs_refcount_finish_one(
 	struct xfs_trans		*tp,
-	enum xfs_refcount_intent_type	type,
-	xfs_fsblock_t			startblock,
-	xfs_extlen_t			blockcount,
-	xfs_fsblock_t			*new_fsb,
-	xfs_extlen_t			*new_len,
+	struct xfs_refcount_intent	*ri,
 	struct xfs_btree_cur		**pcur)
 {
 	struct xfs_mount		*mp = tp->t_mountp;
@@ -1339,17 +1330,16 @@ xfs_refcount_finish_one(
 	struct xfs_buf			*agbp = NULL;
 	int				error = 0;
 	xfs_agblock_t			bno;
-	xfs_agblock_t			new_agbno;
 	unsigned long			nr_ops = 0;
 	int				shape_changes = 0;
 	struct xfs_perag		*pag;
 
-	pag = xfs_perag_get(mp, XFS_FSB_TO_AGNO(mp, startblock));
-	bno = XFS_FSB_TO_AGBNO(mp, startblock);
+	pag = xfs_perag_get(mp, XFS_FSB_TO_AGNO(mp, ri->ri_startblock));
+	bno = XFS_FSB_TO_AGBNO(mp, ri->ri_startblock);
 
-	trace_xfs_refcount_deferred(mp, XFS_FSB_TO_AGNO(mp, startblock),
-			type, XFS_FSB_TO_AGBNO(mp, startblock),
-			blockcount);
+	trace_xfs_refcount_deferred(mp, XFS_FSB_TO_AGNO(mp, ri->ri_startblock),
+			ri->ri_type, XFS_FSB_TO_AGBNO(mp, ri->ri_startblock),
+			ri->ri_blockcount);
 
 	if (XFS_TEST_ERROR(false, mp, XFS_ERRTAG_REFCOUNT_FINISH_ONE)) {
 		error = -EIO;
@@ -1380,42 +1370,42 @@ xfs_refcount_finish_one(
 	}
 	*pcur = rcur;
 
-	switch (type) {
+	switch (ri->ri_type) {
 	case XFS_REFCOUNT_INCREASE:
-		error = xfs_refcount_adjust(rcur, bno, blockcount, &new_agbno,
-				new_len, XFS_REFCOUNT_ADJUST_INCREASE);
+		error = xfs_refcount_adjust(rcur, &bno, &ri->ri_blockcount,
+				XFS_REFCOUNT_ADJUST_INCREASE);
 		if (error)
 			goto out_drop;
-		if (*new_len > 0)
-			error = xfs_refcount_continue_op(rcur, startblock,
-					new_agbno, *new_len, new_fsb);
+		if (ri->ri_blockcount > 0)
+			error = xfs_refcount_continue_op(rcur, ri, bno);
 		break;
 	case XFS_REFCOUNT_DECREASE:
-		error = xfs_refcount_adjust(rcur, bno, blockcount, &new_agbno,
-				new_len, XFS_REFCOUNT_ADJUST_DECREASE);
+		error = xfs_refcount_adjust(rcur, &bno, &ri->ri_blockcount,
+				XFS_REFCOUNT_ADJUST_DECREASE);
 		if (error)
 			goto out_drop;
-		if (*new_len > 0)
-			error = xfs_refcount_continue_op(rcur, startblock,
-					new_agbno, *new_len, new_fsb);
+		if (ri->ri_blockcount > 0)
+			error = xfs_refcount_continue_op(rcur, ri, bno);
 		break;
 	case XFS_REFCOUNT_ALLOC_COW:
-		*new_fsb = startblock + blockcount;
-		*new_len = 0;
-		error = __xfs_refcount_cow_alloc(rcur, bno, blockcount);
+		error = __xfs_refcount_cow_alloc(rcur, bno, ri->ri_blockcount);
+		if (error)
+			goto out_drop;
+		ri->ri_blockcount = 0;
 		break;
 	case XFS_REFCOUNT_FREE_COW:
-		*new_fsb = startblock + blockcount;
-		*new_len = 0;
-		error = __xfs_refcount_cow_free(rcur, bno, blockcount);
+		error = __xfs_refcount_cow_free(rcur, bno, ri->ri_blockcount);
+		if (error)
+			goto out_drop;
+		ri->ri_blockcount = 0;
 		break;
 	default:
 		ASSERT(0);
 		error = -EFSCORRUPTED;
 	}
-	if (!error && *new_len > 0)
-		trace_xfs_refcount_finish_one_leftover(mp, pag->pag_agno, type,
-				bno, blockcount, new_agbno, *new_len);
+	if (!error && ri->ri_blockcount > 0)
+		trace_xfs_refcount_finish_one_leftover(mp, pag->pag_agno,
+				ri->ri_type, bno, ri->ri_blockcount);
 out_drop:
 	xfs_perag_put(pag);
 	return error;
diff --git a/fs/xfs/libxfs/xfs_refcount.h b/fs/xfs/libxfs/xfs_refcount.h
index 452f30556f5a..c633477ce3ce 100644
--- a/fs/xfs/libxfs/xfs_refcount.h
+++ b/fs/xfs/libxfs/xfs_refcount.h
@@ -75,9 +75,7 @@ void xfs_refcount_decrease_extent(struct xfs_trans *tp,
 extern void xfs_refcount_finish_one_cleanup(struct xfs_trans *tp,
 		struct xfs_btree_cur *rcur, int error);
 extern int xfs_refcount_finish_one(struct xfs_trans *tp,
-		enum xfs_refcount_intent_type type, xfs_fsblock_t startblock,
-		xfs_extlen_t blockcount, xfs_fsblock_t *new_fsb,
-		xfs_extlen_t *new_len, struct xfs_btree_cur **pcur);
+		struct xfs_refcount_intent *ri, struct xfs_btree_cur **pcur);
 
 extern int xfs_refcount_find_shared(struct xfs_btree_cur *cur,
 		xfs_agblock_t agbno, xfs_extlen_t aglen, xfs_agblock_t *fbno,
diff --git a/fs/xfs/xfs_refcount_item.c b/fs/xfs/xfs_refcount_item.c
index 858e3e9eb4a8..ff4d5087ba00 100644
--- a/fs/xfs/xfs_refcount_item.c
+++ b/fs/xfs/xfs_refcount_item.c
@@ -252,17 +252,12 @@ static int
 xfs_trans_log_finish_refcount_update(
 	struct xfs_trans		*tp,
 	struct xfs_cud_log_item		*cudp,
-	enum xfs_refcount_intent_type	type,
-	xfs_fsblock_t			startblock,
-	xfs_extlen_t			blockcount,
-	xfs_fsblock_t			*new_fsb,
-	xfs_extlen_t			*new_len,
+	struct xfs_refcount_intent	*ri,
 	struct xfs_btree_cur		**pcur)
 {
 	int				error;
 
-	error = xfs_refcount_finish_one(tp, type, startblock,
-			blockcount, new_fsb, new_len, pcur);
+	error = xfs_refcount_finish_one(tp, ri, pcur);
 
 	/*
 	 * Mark the transaction dirty, even on error. This ensures the
@@ -378,25 +373,20 @@ xfs_refcount_update_finish_item(
 	struct list_head		*item,
 	struct xfs_btree_cur		**state)
 {
-	struct xfs_refcount_intent	*refc;
-	xfs_fsblock_t			new_fsb;
-	xfs_extlen_t			new_aglen;
+	struct xfs_refcount_intent	*ri;
 	int				error;
 
-	refc = container_of(item, struct xfs_refcount_intent, ri_list);
-	error = xfs_trans_log_finish_refcount_update(tp, CUD_ITEM(done),
-			refc->ri_type, refc->ri_startblock, refc->ri_blockcount,
-			&new_fsb, &new_aglen, state);
+	ri = container_of(item, struct xfs_refcount_intent, ri_list);
+	error = xfs_trans_log_finish_refcount_update(tp, CUD_ITEM(done), ri,
+			state);
 
 	/* Did we run out of reservation?  Requeue what we didn't finish. */
-	if (!error && new_aglen > 0) {
-		ASSERT(refc->ri_type == XFS_REFCOUNT_INCREASE ||
-		       refc->ri_type == XFS_REFCOUNT_DECREASE);
-		refc->ri_startblock = new_fsb;
-		refc->ri_blockcount = new_aglen;
+	if (!error && ri->ri_blockcount > 0) {
+		ASSERT(ri->ri_type == XFS_REFCOUNT_INCREASE ||
+		       ri->ri_type == XFS_REFCOUNT_DECREASE);
 		return -EAGAIN;
 	}
-	kmem_cache_free(xfs_refcount_intent_cache, refc);
+	kmem_cache_free(xfs_refcount_intent_cache, ri);
 	return error;
 }
 
@@ -463,18 +453,13 @@ xfs_cui_item_recover(
 	struct xfs_log_item		*lip,
 	struct list_head		*capture_list)
 {
-	struct xfs_bmbt_irec		irec;
 	struct xfs_cui_log_item		*cuip = CUI_ITEM(lip);
-	struct xfs_phys_extent		*refc;
 	struct xfs_cud_log_item		*cudp;
 	struct xfs_trans		*tp;
 	struct xfs_btree_cur		*rcur = NULL;
 	struct xfs_mount		*mp = lip->li_log->l_mp;
-	xfs_fsblock_t			new_fsb;
-	xfs_extlen_t			new_len;
 	unsigned int			refc_type;
 	bool				requeue_only = false;
-	enum xfs_refcount_intent_type	type;
 	int				i;
 	int				error = 0;
 
@@ -513,6 +498,9 @@ xfs_cui_item_recover(
 	cudp = xfs_trans_get_cud(tp, cuip);
 
 	for (i = 0; i < cuip->cui_format.cui_nextents; i++) {
+		struct xfs_refcount_intent	fake = { };
+		struct xfs_phys_extent		*refc;
+
 		refc = &cuip->cui_format.cui_extents[i];
 		refc_type = refc->pe_flags & XFS_REFCOUNT_EXTENT_TYPE_MASK;
 		switch (refc_type) {
@@ -520,7 +508,7 @@ xfs_cui_item_recover(
 		case XFS_REFCOUNT_DECREASE:
 		case XFS_REFCOUNT_ALLOC_COW:
 		case XFS_REFCOUNT_FREE_COW:
-			type = refc_type;
+			fake.ri_type = refc_type;
 			break;
 		default:
 			XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp,
@@ -529,13 +517,12 @@ xfs_cui_item_recover(
 			error = -EFSCORRUPTED;
 			goto abort_error;
 		}
-		if (requeue_only) {
-			new_fsb = refc->pe_startblock;
-			new_len = refc->pe_len;
-		} else
+
+		fake.ri_startblock = refc->pe_startblock;
+		fake.ri_blockcount = refc->pe_len;
+		if (!requeue_only)
 			error = xfs_trans_log_finish_refcount_update(tp, cudp,
-				type, refc->pe_startblock, refc->pe_len,
-				&new_fsb, &new_len, &rcur);
+					&fake, &rcur);
 		if (error == -EFSCORRUPTED)
 			XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp,
 					&cuip->cui_format,
@@ -544,10 +531,13 @@ xfs_cui_item_recover(
 			goto abort_error;
 
 		/* Requeue what we didn't finish. */
-		if (new_len > 0) {
-			irec.br_startblock = new_fsb;
-			irec.br_blockcount = new_len;
-			switch (type) {
+		if (fake.ri_blockcount > 0) {
+			struct xfs_bmbt_irec	irec = {
+				.br_startblock	= fake.ri_startblock,
+				.br_blockcount	= fake.ri_blockcount,
+			};
+
+			switch (fake.ri_type) {
 			case XFS_REFCOUNT_INCREASE:
 				xfs_refcount_increase_extent(tp, &irec);
 				break;
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 421d1e504ac4..6b0e9ae7c513 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -3207,17 +3207,14 @@ DEFINE_REFCOUNT_DEFERRED_EVENT(xfs_refcount_deferred);
 
 TRACE_EVENT(xfs_refcount_finish_one_leftover,
 	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
-		 int type, xfs_agblock_t agbno, xfs_extlen_t len,
-		 xfs_agblock_t new_agbno, xfs_extlen_t new_len),
-	TP_ARGS(mp, agno, type, agbno, len, new_agbno, new_len),
+		 int type, xfs_agblock_t agbno, xfs_extlen_t len),
+	TP_ARGS(mp, agno, type, agbno, len),
 	TP_STRUCT__entry(
 		__field(dev_t, dev)
 		__field(xfs_agnumber_t, agno)
 		__field(int, type)
 		__field(xfs_agblock_t, agbno)
 		__field(xfs_extlen_t, len)
-		__field(xfs_agblock_t, new_agbno)
-		__field(xfs_extlen_t, new_len)
 	),
 	TP_fast_assign(
 		__entry->dev = mp->m_super->s_dev;
@@ -3225,17 +3222,13 @@ TRACE_EVENT(xfs_refcount_finish_one_leftover,
 		__entry->type = type;
 		__entry->agbno = agbno;
 		__entry->len = len;
-		__entry->new_agbno = new_agbno;
-		__entry->new_len = new_len;
 	),
-	TP_printk("dev %d:%d type %d agno 0x%x agbno 0x%x fsbcount 0x%x new_agbno 0x%x new_fsbcount 0x%x",
+	TP_printk("dev %d:%d type %d agno 0x%x agbno 0x%x fsbcount 0x%x",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __entry->type,
 		  __entry->agno,
 		  __entry->agbno,
-		  __entry->len,
-		  __entry->new_agbno,
-		  __entry->new_len)
+		  __entry->len)
 );
 
 /* simple inode-based error/%ip tracepoint class */


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 8/8] xfs: fix confusing variable names in xfs_refcount_item.c
  2022-12-30 22:10 ` [PATCHSET v24.0 0/8] xfs: variable and naming cleanups for intent items Darrick J. Wong
  2022-12-30 22:10   ` [PATCH 1/8] xfs: pass the xfs_bmbt_irec directly through the log intent code Darrick J. Wong
  2022-12-30 22:10   ` [PATCH 2/8] xfs: fix confusing variable names in xfs_bmap_item.c Darrick J. Wong
@ 2022-12-30 22:10   ` Darrick J. Wong
  2022-12-30 22:10   ` [PATCH 3/8] xfs: pass xfs_extent_free_item directly through the log intent code Darrick J. Wong
                     ` (4 subsequent siblings)
  7 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:10 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Variable names in this code module are inconsistent and confusing.
xfs_phys_extent structures describe physical mappings, so rename those
variables "pmap".  xfs_refcount_intent structures describe refcount
intents, so rename those variables "ri".

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_refcount_item.c |   54 ++++++++++++++++++++++----------------------
 1 file changed, 27 insertions(+), 27 deletions(-)


diff --git a/fs/xfs/xfs_refcount_item.c b/fs/xfs/xfs_refcount_item.c
index ff4d5087ba00..48d771a76add 100644
--- a/fs/xfs/xfs_refcount_item.c
+++ b/fs/xfs/xfs_refcount_item.c
@@ -292,16 +292,16 @@ xfs_refcount_update_diff_items(
 /* Set the phys extent flags for this reverse mapping. */
 static void
 xfs_trans_set_refcount_flags(
-	struct xfs_phys_extent		*refc,
+	struct xfs_phys_extent		*pmap,
 	enum xfs_refcount_intent_type	type)
 {
-	refc->pe_flags = 0;
+	pmap->pe_flags = 0;
 	switch (type) {
 	case XFS_REFCOUNT_INCREASE:
 	case XFS_REFCOUNT_DECREASE:
 	case XFS_REFCOUNT_ALLOC_COW:
 	case XFS_REFCOUNT_FREE_COW:
-		refc->pe_flags |= type;
+		pmap->pe_flags |= type;
 		break;
 	default:
 		ASSERT(0);
@@ -313,10 +313,10 @@ STATIC void
 xfs_refcount_update_log_item(
 	struct xfs_trans		*tp,
 	struct xfs_cui_log_item		*cuip,
-	struct xfs_refcount_intent	*refc)
+	struct xfs_refcount_intent	*ri)
 {
 	uint				next_extent;
-	struct xfs_phys_extent		*ext;
+	struct xfs_phys_extent		*pmap;
 
 	tp->t_flags |= XFS_TRANS_DIRTY;
 	set_bit(XFS_LI_DIRTY, &cuip->cui_item.li_flags);
@@ -328,10 +328,10 @@ xfs_refcount_update_log_item(
 	 */
 	next_extent = atomic_inc_return(&cuip->cui_next_extent) - 1;
 	ASSERT(next_extent < cuip->cui_format.cui_nextents);
-	ext = &cuip->cui_format.cui_extents[next_extent];
-	ext->pe_startblock = refc->ri_startblock;
-	ext->pe_len = refc->ri_blockcount;
-	xfs_trans_set_refcount_flags(ext, refc->ri_type);
+	pmap = &cuip->cui_format.cui_extents[next_extent];
+	pmap->pe_startblock = ri->ri_startblock;
+	pmap->pe_len = ri->ri_blockcount;
+	xfs_trans_set_refcount_flags(pmap, ri->ri_type);
 }
 
 static struct xfs_log_item *
@@ -343,15 +343,15 @@ xfs_refcount_update_create_intent(
 {
 	struct xfs_mount		*mp = tp->t_mountp;
 	struct xfs_cui_log_item		*cuip = xfs_cui_init(mp, count);
-	struct xfs_refcount_intent	*refc;
+	struct xfs_refcount_intent	*ri;
 
 	ASSERT(count > 0);
 
 	xfs_trans_add_item(tp, &cuip->cui_item);
 	if (sort)
 		list_sort(mp, items, xfs_refcount_update_diff_items);
-	list_for_each_entry(refc, items, ri_list)
-		xfs_refcount_update_log_item(tp, cuip, refc);
+	list_for_each_entry(ri, items, ri_list)
+		xfs_refcount_update_log_item(tp, cuip, ri);
 	return &cuip->cui_item;
 }
 
@@ -403,10 +403,10 @@ STATIC void
 xfs_refcount_update_cancel_item(
 	struct list_head		*item)
 {
-	struct xfs_refcount_intent	*refc;
+	struct xfs_refcount_intent	*ri;
 
-	refc = container_of(item, struct xfs_refcount_intent, ri_list);
-	kmem_cache_free(xfs_refcount_intent_cache, refc);
+	ri = container_of(item, struct xfs_refcount_intent, ri_list);
+	kmem_cache_free(xfs_refcount_intent_cache, ri);
 }
 
 const struct xfs_defer_op_type xfs_refcount_update_defer_type = {
@@ -423,15 +423,15 @@ const struct xfs_defer_op_type xfs_refcount_update_defer_type = {
 static inline bool
 xfs_cui_validate_phys(
 	struct xfs_mount		*mp,
-	struct xfs_phys_extent		*refc)
+	struct xfs_phys_extent		*pmap)
 {
 	if (!xfs_has_reflink(mp))
 		return false;
 
-	if (refc->pe_flags & ~XFS_REFCOUNT_EXTENT_FLAGS)
+	if (pmap->pe_flags & ~XFS_REFCOUNT_EXTENT_FLAGS)
 		return false;
 
-	switch (refc->pe_flags & XFS_REFCOUNT_EXTENT_TYPE_MASK) {
+	switch (pmap->pe_flags & XFS_REFCOUNT_EXTENT_TYPE_MASK) {
 	case XFS_REFCOUNT_INCREASE:
 	case XFS_REFCOUNT_DECREASE:
 	case XFS_REFCOUNT_ALLOC_COW:
@@ -441,7 +441,7 @@ xfs_cui_validate_phys(
 		return false;
 	}
 
-	return xfs_verify_fsbext(mp, refc->pe_startblock, refc->pe_len);
+	return xfs_verify_fsbext(mp, pmap->pe_startblock, pmap->pe_len);
 }
 
 /*
@@ -499,10 +499,10 @@ xfs_cui_item_recover(
 
 	for (i = 0; i < cuip->cui_format.cui_nextents; i++) {
 		struct xfs_refcount_intent	fake = { };
-		struct xfs_phys_extent		*refc;
+		struct xfs_phys_extent		*pmap;
 
-		refc = &cuip->cui_format.cui_extents[i];
-		refc_type = refc->pe_flags & XFS_REFCOUNT_EXTENT_TYPE_MASK;
+		pmap = &cuip->cui_format.cui_extents[i];
+		refc_type = pmap->pe_flags & XFS_REFCOUNT_EXTENT_TYPE_MASK;
 		switch (refc_type) {
 		case XFS_REFCOUNT_INCREASE:
 		case XFS_REFCOUNT_DECREASE:
@@ -518,8 +518,8 @@ xfs_cui_item_recover(
 			goto abort_error;
 		}
 
-		fake.ri_startblock = refc->pe_startblock;
-		fake.ri_blockcount = refc->pe_len;
+		fake.ri_startblock = pmap->pe_startblock;
+		fake.ri_blockcount = pmap->pe_len;
 		if (!requeue_only)
 			error = xfs_trans_log_finish_refcount_update(tp, cudp,
 					&fake, &rcur);
@@ -586,18 +586,18 @@ xfs_cui_item_relog(
 {
 	struct xfs_cud_log_item		*cudp;
 	struct xfs_cui_log_item		*cuip;
-	struct xfs_phys_extent		*extp;
+	struct xfs_phys_extent		*pmap;
 	unsigned int			count;
 
 	count = CUI_ITEM(intent)->cui_format.cui_nextents;
-	extp = CUI_ITEM(intent)->cui_format.cui_extents;
+	pmap = CUI_ITEM(intent)->cui_format.cui_extents;
 
 	tp->t_flags |= XFS_TRANS_DIRTY;
 	cudp = xfs_trans_get_cud(tp, CUI_ITEM(intent));
 	set_bit(XFS_LI_DIRTY, &cudp->cud_item.li_flags);
 
 	cuip = xfs_cui_init(tp->t_mountp, count);
-	memcpy(cuip->cui_format.cui_extents, extp, count * sizeof(*extp));
+	memcpy(cuip->cui_format.cui_extents, pmap, count * sizeof(*pmap));
 	atomic_set(&cuip->cui_next_extent, count);
 	xfs_trans_add_item(tp, &cuip->cui_item);
 	set_bit(XFS_LI_DIRTY, &cuip->cui_item.li_flags);


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCHSET v24.0 0/5] xfs: make intent items take a perag reference
  2022-12-30 21:13 [NYE DELUGE 1/4] xfs: all pending online scrub improvements Darrick J. Wong
  2022-12-30 22:10 ` [PATCHSET v24.0 00/14] xfs: design documentation for online fsck Darrick J. Wong
  2022-12-30 22:10 ` [PATCHSET v24.0 0/8] xfs: variable and naming cleanups for intent items Darrick J. Wong
@ 2022-12-30 22:11 ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 1/5] xfs: give xfs_bmap_intent its own " Darrick J. Wong
                     ` (4 more replies)
  2022-12-30 22:11 ` [PATCHSET v24.0 0/1] xfs: pass perag references around when possible Darrick J. Wong
                   ` (19 subsequent siblings)
  22 siblings, 5 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

Hi all,

Now that we've cleaned up some code warts in the deferred work item
processing code, let's make intent items take an active perag reference
from their creation until they are finally freed by the defer ops
machinery.  This change facilitates the scrub drain in the next patchset
and will make it easier for the future AG removal code to detect a busy
AG in need of quiescing.
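
Concretely, each intent type grows a get/put helper pair and a perag
pointer in its incore structure, held from the moment the deferred work
is queued until the intent is finished or cancelled.  The sketch below
uses the bmap intent from the first patch as the model (prototypes
only; in the actual patches the put side is a static helper inside the
.c file rather than an exported function):

    /* lifecycle sketch; see patch 1/5 for the real bmap version */
    struct xfs_mount;
    struct xfs_perag;

    struct xfs_bmap_intent {
            struct xfs_perag        *bi_pag;    /* held create-to-free */
            /* ... mapping details elided ... */
    };

    /* at creation (__xfs_bmap_add): pin the AG we intend to modify */
    void xfs_bmap_update_get_group(struct xfs_mount *mp,
                    struct xfs_bmap_intent *bi);

    /* at finish or cancel: drop the reference before freeing the intent */
    void xfs_bmap_update_put_group(struct xfs_bmap_intent *bi);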

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=intents-perag-refs

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=intents-perag-refs
---
 fs/xfs/libxfs/xfs_ag.c             |    6 +---
 fs/xfs/libxfs/xfs_alloc.c          |   22 +++++++--------
 fs/xfs/libxfs/xfs_alloc.h          |   12 ++++++--
 fs/xfs/libxfs/xfs_bmap.c           |    1 +
 fs/xfs/libxfs/xfs_bmap.h           |    4 +++
 fs/xfs/libxfs/xfs_ialloc_btree.c   |    7 +++--
 fs/xfs/libxfs/xfs_refcount.c       |   33 +++++++++-------------
 fs/xfs/libxfs/xfs_refcount.h       |    4 +++
 fs/xfs/libxfs/xfs_refcount_btree.c |    5 ++-
 fs/xfs/libxfs/xfs_rmap.c           |   29 +++++++------------
 fs/xfs/libxfs/xfs_rmap.h           |    4 +++
 fs/xfs/scrub/repair.c              |    3 +-
 fs/xfs/xfs_bmap_item.c             |   29 +++++++++++++++++++
 fs/xfs/xfs_extfree_item.c          |   54 +++++++++++++++++++++++++-----------
 fs/xfs/xfs_refcount_item.c         |   36 +++++++++++++++++++++---
 fs/xfs/xfs_rmap_item.c             |   32 +++++++++++++++++++--
 16 files changed, 196 insertions(+), 85 deletions(-)


^ permalink raw reply	[flat|nested] 220+ messages in thread

* [PATCH 1/5] xfs: give xfs_bmap_intent its own perag reference
  2022-12-30 22:11 ` [PATCHSET v24.0 0/5] xfs: make intent items take a perag reference Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 5/5] xfs: give xfs_refcount_intent " Darrick J. Wong
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Give the xfs_bmap_intent an active reference to the perag structure
data.  This reference will be used to enable scrub intent draining
functionality in subsequent patches.  Later, shrink will use these
active references to know if an AG is quiesced or not.

The reason why we take an active ref for a file mapping operation is
simple: we're committing to some sort of action involving space in an
AG, so we want to indicate our active interest in that AG.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_bmap.c |    1 +
 fs/xfs/libxfs/xfs_bmap.h |    4 ++++
 fs/xfs/xfs_bmap_item.c   |   29 ++++++++++++++++++++++++++++-
 3 files changed, 33 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index c8c65387136c..45dfa5a56154 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -6109,6 +6109,7 @@ __xfs_bmap_add(
 	bi->bi_whichfork = whichfork;
 	bi->bi_bmap = *bmap;
 
+	xfs_bmap_update_get_group(tp->t_mountp, bi);
 	xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_BMAP, &bi->bi_list);
 	return 0;
 }
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index 01c2df35c3e3..0cd86781fcd5 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -231,9 +231,13 @@ struct xfs_bmap_intent {
 	enum xfs_bmap_intent_type		bi_type;
 	int					bi_whichfork;
 	struct xfs_inode			*bi_owner;
+	struct xfs_perag			*bi_pag;
 	struct xfs_bmbt_irec			bi_bmap;
 };
 
+void xfs_bmap_update_get_group(struct xfs_mount *mp,
+		struct xfs_bmap_intent *bi);
+
 int	xfs_bmap_finish_one(struct xfs_trans *tp, struct xfs_bmap_intent *bi);
 void	xfs_bmap_map_extent(struct xfs_trans *tp, struct xfs_inode *ip,
 		struct xfs_bmbt_irec *imap);
diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
index 6e2f0013380a..32ccd4bb9f46 100644
--- a/fs/xfs/xfs_bmap_item.c
+++ b/fs/xfs/xfs_bmap_item.c
@@ -24,6 +24,7 @@
 #include "xfs_error.h"
 #include "xfs_log_priv.h"
 #include "xfs_log_recover.h"
+#include "xfs_ag.h"
 
 struct kmem_cache	*xfs_bui_cache;
 struct kmem_cache	*xfs_bud_cache;
@@ -363,6 +364,26 @@ xfs_bmap_update_create_done(
 	return &xfs_trans_get_bud(tp, BUI_ITEM(intent))->bud_item;
 }
 
+/* Take an active ref to the AG containing the space we're mapping. */
+void
+xfs_bmap_update_get_group(
+	struct xfs_mount	*mp,
+	struct xfs_bmap_intent	*bi)
+{
+	xfs_agnumber_t		agno;
+
+	agno = XFS_FSB_TO_AGNO(mp, bi->bi_bmap.br_startblock);
+	bi->bi_pag = xfs_perag_get(mp, agno);
+}
+
+/* Release an active AG ref after finishing mapping work. */
+static inline void
+xfs_bmap_update_put_group(
+	struct xfs_bmap_intent	*bi)
+{
+	xfs_perag_put(bi->bi_pag);
+}
+
 /* Process a deferred rmap update. */
 STATIC int
 xfs_bmap_update_finish_item(
@@ -381,6 +402,8 @@ xfs_bmap_update_finish_item(
 		ASSERT(bi->bi_type == XFS_BMAP_UNMAP);
 		return -EAGAIN;
 	}
+
+	xfs_bmap_update_put_group(bi);
 	kmem_cache_free(xfs_bmap_intent_cache, bi);
 	return error;
 }
@@ -393,7 +416,7 @@ xfs_bmap_update_abort_intent(
 	xfs_bui_release(BUI_ITEM(intent));
 }
 
-/* Cancel a deferred rmap update. */
+/* Cancel a deferred bmap update. */
 STATIC void
 xfs_bmap_update_cancel_item(
 	struct list_head		*item)
@@ -401,6 +424,8 @@ xfs_bmap_update_cancel_item(
 	struct xfs_bmap_intent		*bi;
 
 	bi = container_of(item, struct xfs_bmap_intent, bi_list);
+
+	xfs_bmap_update_put_group(bi);
 	kmem_cache_free(xfs_bmap_intent_cache, bi);
 }
 
@@ -509,10 +534,12 @@ xfs_bui_item_recover(
 	fake.bi_bmap.br_state = (map->me_flags & XFS_BMAP_EXTENT_UNWRITTEN) ?
 			XFS_EXT_UNWRITTEN : XFS_EXT_NORM;
 
+	xfs_bmap_update_get_group(mp, &fake);
 	error = xfs_trans_log_finish_bmap_update(tp, budp, &fake);
 	if (error == -EFSCORRUPTED)
 		XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, map,
 				sizeof(*map));
+	xfs_bmap_update_put_group(&fake);
 	if (error)
 		goto err_cancel;
 


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 2/5] xfs: pass per-ag references to xfs_free_extent
  2022-12-30 22:11 ` [PATCHSET v24.0 0/5] xfs: make intent items take a perag reference Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 1/5] xfs: give xfs_bmap_intent its own " Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 5/5] xfs: give xfs_refcount_intent " Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 3/5] xfs: give xfs_extfree_intent its own perag reference Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 4/5] xfs: give xfs_rmap_intent " Darrick J. Wong
  4 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Pass a reference to the per-AG structure to xfs_free_extent.  Most
callers already have one, so we can eliminate unnecessary lookups.  The
one exception to this is the EFI code, which the next patch will fix.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_ag.c             |    6 ++----
 fs/xfs/libxfs/xfs_alloc.c          |   15 +++++----------
 fs/xfs/libxfs/xfs_alloc.h          |    8 +++++---
 fs/xfs/libxfs/xfs_ialloc_btree.c   |    7 +++++--
 fs/xfs/libxfs/xfs_refcount_btree.c |    5 +++--
 fs/xfs/scrub/repair.c              |    3 ++-
 fs/xfs/xfs_extfree_item.c          |    8 ++++++--
 7 files changed, 28 insertions(+), 24 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_ag.c b/fs/xfs/libxfs/xfs_ag.c
index bb0c700afe3c..8de4143a5899 100644
--- a/fs/xfs/libxfs/xfs_ag.c
+++ b/fs/xfs/libxfs/xfs_ag.c
@@ -982,10 +982,8 @@ xfs_ag_extend_space(
 	if (error)
 		return error;
 
-	error = xfs_free_extent(tp, XFS_AGB_TO_FSB(pag->pag_mount, pag->pag_agno,
-					be32_to_cpu(agf->agf_length) - len),
-				len, &XFS_RMAP_OINFO_SKIP_UPDATE,
-				XFS_AG_RESV_NONE);
+	error = xfs_free_extent(tp, pag, be32_to_cpu(agf->agf_length) - len,
+			len, &XFS_RMAP_OINFO_SKIP_UPDATE, XFS_AG_RESV_NONE);
 	if (error)
 		return error;
 
diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index f8ff81c3de76..79790c9e7de4 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -3381,7 +3381,8 @@ xfs_free_extent_fix_freelist(
 int
 __xfs_free_extent(
 	struct xfs_trans		*tp,
-	xfs_fsblock_t			bno,
+	struct xfs_perag		*pag,
+	xfs_agblock_t			agbno,
 	xfs_extlen_t			len,
 	const struct xfs_owner_info	*oinfo,
 	enum xfs_ag_resv_type		type,
@@ -3389,12 +3390,9 @@ __xfs_free_extent(
 {
 	struct xfs_mount		*mp = tp->t_mountp;
 	struct xfs_buf			*agbp;
-	xfs_agnumber_t			agno = XFS_FSB_TO_AGNO(mp, bno);
-	xfs_agblock_t			agbno = XFS_FSB_TO_AGBNO(mp, bno);
 	struct xfs_agf			*agf;
 	int				error;
 	unsigned int			busy_flags = 0;
-	struct xfs_perag		*pag;
 
 	ASSERT(len != 0);
 	ASSERT(type != XFS_AG_RESV_AGFL);
@@ -3403,10 +3401,9 @@ __xfs_free_extent(
 			XFS_ERRTAG_FREE_EXTENT))
 		return -EIO;
 
-	pag = xfs_perag_get(mp, agno);
 	error = xfs_free_extent_fix_freelist(tp, pag, &agbp);
 	if (error)
-		goto err;
+		return error;
 	agf = agbp->b_addr;
 
 	if (XFS_IS_CORRUPT(mp, agbno >= mp->m_sb.sb_agblocks)) {
@@ -3420,20 +3417,18 @@ __xfs_free_extent(
 		goto err_release;
 	}
 
-	error = xfs_free_ag_extent(tp, agbp, agno, agbno, len, oinfo, type);
+	error = xfs_free_ag_extent(tp, agbp, pag->pag_agno, agbno, len, oinfo,
+			type);
 	if (error)
 		goto err_release;
 
 	if (skip_discard)
 		busy_flags |= XFS_EXTENT_BUSY_SKIP_DISCARD;
 	xfs_extent_busy_insert(tp, pag, agbno, len, busy_flags);
-	xfs_perag_put(pag);
 	return 0;
 
 err_release:
 	xfs_trans_brelse(tp, agbp);
-err:
-	xfs_perag_put(pag);
 	return error;
 }
 
diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
index 2c3f762dfb58..5074aed6dfad 100644
--- a/fs/xfs/libxfs/xfs_alloc.h
+++ b/fs/xfs/libxfs/xfs_alloc.h
@@ -130,7 +130,8 @@ xfs_alloc_vextent(
 int				/* error */
 __xfs_free_extent(
 	struct xfs_trans	*tp,	/* transaction pointer */
-	xfs_fsblock_t		bno,	/* starting block number of extent */
+	struct xfs_perag	*pag,
+	xfs_agblock_t		agbno,
 	xfs_extlen_t		len,	/* length of extent */
 	const struct xfs_owner_info	*oinfo,	/* extent owner */
 	enum xfs_ag_resv_type	type,	/* block reservation type */
@@ -139,12 +140,13 @@ __xfs_free_extent(
 static inline int
 xfs_free_extent(
 	struct xfs_trans	*tp,
-	xfs_fsblock_t		bno,
+	struct xfs_perag	*pag,
+	xfs_agblock_t		agbno,
 	xfs_extlen_t		len,
 	const struct xfs_owner_info	*oinfo,
 	enum xfs_ag_resv_type	type)
 {
-	return __xfs_free_extent(tp, bno, len, oinfo, type, false);
+	return __xfs_free_extent(tp, pag, agbno, len, oinfo, type, false);
 }
 
 int				/* error */
diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
index 8c83e265770c..2dbe553d87fb 100644
--- a/fs/xfs/libxfs/xfs_ialloc_btree.c
+++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
@@ -156,9 +156,12 @@ __xfs_inobt_free_block(
 	struct xfs_buf		*bp,
 	enum xfs_ag_resv_type	resv)
 {
+	xfs_fsblock_t		fsbno;
+
 	xfs_inobt_mod_blockcount(cur, -1);
-	return xfs_free_extent(cur->bc_tp,
-			XFS_DADDR_TO_FSB(cur->bc_mp, xfs_buf_daddr(bp)), 1,
+	fsbno = XFS_DADDR_TO_FSB(cur->bc_mp, xfs_buf_daddr(bp));
+	return xfs_free_extent(cur->bc_tp, cur->bc_ag.pag,
+			XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno), 1,
 			&XFS_RMAP_OINFO_INOBT, resv);
 }
 
diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c
index e1f789866683..3d8e62da2ccc 100644
--- a/fs/xfs/libxfs/xfs_refcount_btree.c
+++ b/fs/xfs/libxfs/xfs_refcount_btree.c
@@ -112,8 +112,9 @@ xfs_refcountbt_free_block(
 			XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno), 1);
 	be32_add_cpu(&agf->agf_refcount_blocks, -1);
 	xfs_alloc_log_agf(cur->bc_tp, agbp, XFS_AGF_REFCOUNT_BLOCKS);
-	error = xfs_free_extent(cur->bc_tp, fsbno, 1, &XFS_RMAP_OINFO_REFC,
-			XFS_AG_RESV_METADATA);
+	error = xfs_free_extent(cur->bc_tp, cur->bc_ag.pag,
+			XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno), 1,
+			&XFS_RMAP_OINFO_REFC, XFS_AG_RESV_METADATA);
 	if (error)
 		return error;
 
diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
index 4b92f9253ccd..a0b85bdd4c5a 100644
--- a/fs/xfs/scrub/repair.c
+++ b/fs/xfs/scrub/repair.c
@@ -599,7 +599,8 @@ xrep_reap_block(
 	else if (resv == XFS_AG_RESV_AGFL)
 		error = xrep_put_freelist(sc, agbno);
 	else
-		error = xfs_free_extent(sc->tp, fsbno, 1, oinfo, resv);
+		error = xfs_free_extent(sc->tp, sc->sa.pag, agbno, 1, oinfo,
+				resv);
 	if (agf_bp != sc->sa.agf_bp)
 		xfs_trans_brelse(sc->tp, agf_bp);
 	if (error)
diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
index 011b50469301..c1aae07467c9 100644
--- a/fs/xfs/xfs_extfree_item.c
+++ b/fs/xfs/xfs_extfree_item.c
@@ -350,6 +350,7 @@ xfs_trans_free_extent(
 	struct xfs_owner_info		oinfo = { };
 	struct xfs_mount		*mp = tp->t_mountp;
 	struct xfs_extent		*extp;
+	struct xfs_perag		*pag;
 	uint				next_extent;
 	xfs_agnumber_t			agno = XFS_FSB_TO_AGNO(mp,
 							xefi->xefi_startblock);
@@ -366,9 +367,12 @@ xfs_trans_free_extent(
 	trace_xfs_bmap_free_deferred(tp->t_mountp, agno, 0, agbno,
 			xefi->xefi_blockcount);
 
-	error = __xfs_free_extent(tp, xefi->xefi_startblock,
-			xefi->xefi_blockcount, &oinfo, XFS_AG_RESV_NONE,
+	pag = xfs_perag_get(mp, agno);
+	error = __xfs_free_extent(tp, pag, agbno, xefi->xefi_blockcount,
+			&oinfo, XFS_AG_RESV_NONE,
 			xefi->xefi_flags & XFS_EFI_SKIP_DISCARD);
+	xfs_perag_put(pag);
+
 	/*
 	 * Mark the transaction dirty, even on error. This ensures the
 	 * transaction is aborted, which:


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 3/5] xfs: give xfs_extfree_intent its own perag reference
  2022-12-30 22:11 ` [PATCHSET v24.0 0/5] xfs: make intent items take a perag reference Darrick J. Wong
                     ` (2 preceding siblings ...)
  2022-12-30 22:11   ` [PATCH 2/5] xfs: pass per-ag references to xfs_free_extent Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 4/5] xfs: give xfs_rmap_intent " Darrick J. Wong
  4 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Give the xfs_extfree_intent an active reference to the perag structure
data.  This reference will be used to enable scrub intent draining
functionality in subsequent patches.  Later, shrink will use these
active references to know if an AG is quiesced or not.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_alloc.c |    7 ++++-
 fs/xfs/libxfs/xfs_alloc.h |    4 +++
 fs/xfs/xfs_extfree_item.c |   58 +++++++++++++++++++++++++++++----------------
 3 files changed, 47 insertions(+), 22 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 79790c9e7de4..199f22ddc379 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -2485,6 +2485,7 @@ xfs_defer_agfl_block(
 
 	trace_xfs_agfl_free_defer(mp, agno, 0, agbno, 1);
 
+	xfs_extent_free_get_group(mp, xefi);
 	xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_AGFL_FREE, &xefi->xefi_list);
 }
 
@@ -2501,8 +2502,8 @@ __xfs_free_extent_later(
 	bool				skip_discard)
 {
 	struct xfs_extent_free_item	*xefi;
-#ifdef DEBUG
 	struct xfs_mount		*mp = tp->t_mountp;
+#ifdef DEBUG
 	xfs_agnumber_t			agno;
 	xfs_agblock_t			agbno;
 
@@ -2536,9 +2537,11 @@ __xfs_free_extent_later(
 	} else {
 		xefi->xefi_owner = XFS_RMAP_OWN_NULL;
 	}
-	trace_xfs_bmap_free_defer(tp->t_mountp,
+	trace_xfs_bmap_free_defer(mp,
 			XFS_FSB_TO_AGNO(tp->t_mountp, bno), 0,
 			XFS_FSB_TO_AGBNO(tp->t_mountp, bno), len);
+
+	xfs_extent_free_get_group(mp, xefi);
 	xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_FREE, &xefi->xefi_list);
 }
 
diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
index 5074aed6dfad..f84f3966e849 100644
--- a/fs/xfs/libxfs/xfs_alloc.h
+++ b/fs/xfs/libxfs/xfs_alloc.h
@@ -226,9 +226,13 @@ struct xfs_extent_free_item {
 	uint64_t		xefi_owner;
 	xfs_fsblock_t		xefi_startblock;/* starting fs block number */
 	xfs_extlen_t		xefi_blockcount;/* number of blocks in extent */
+	struct xfs_perag	*xefi_pag;
 	unsigned int		xefi_flags;
 };
 
+void xfs_extent_free_get_group(struct xfs_mount *mp,
+		struct xfs_extent_free_item *xefi);
+
 #define XFS_EFI_SKIP_DISCARD	(1U << 0) /* don't issue discard */
 #define XFS_EFI_ATTR_FORK	(1U << 1) /* freeing attr fork block */
 #define XFS_EFI_BMBT_BLOCK	(1U << 2) /* freeing bmap btree block */
diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
index c1aae07467c9..8db9d9abb54a 100644
--- a/fs/xfs/xfs_extfree_item.c
+++ b/fs/xfs/xfs_extfree_item.c
@@ -350,10 +350,7 @@ xfs_trans_free_extent(
 	struct xfs_owner_info		oinfo = { };
 	struct xfs_mount		*mp = tp->t_mountp;
 	struct xfs_extent		*extp;
-	struct xfs_perag		*pag;
 	uint				next_extent;
-	xfs_agnumber_t			agno = XFS_FSB_TO_AGNO(mp,
-							xefi->xefi_startblock);
 	xfs_agblock_t			agbno = XFS_FSB_TO_AGBNO(mp,
 							xefi->xefi_startblock);
 	int				error;
@@ -364,14 +361,12 @@ xfs_trans_free_extent(
 	if (xefi->xefi_flags & XFS_EFI_BMBT_BLOCK)
 		oinfo.oi_flags |= XFS_OWNER_INFO_BMBT_BLOCK;
 
-	trace_xfs_bmap_free_deferred(tp->t_mountp, agno, 0, agbno,
-			xefi->xefi_blockcount);
+	trace_xfs_bmap_free_deferred(tp->t_mountp, xefi->xefi_pag->pag_agno, 0,
+			agbno, xefi->xefi_blockcount);
 
-	pag = xfs_perag_get(mp, agno);
-	error = __xfs_free_extent(tp, pag, agbno, xefi->xefi_blockcount,
-			&oinfo, XFS_AG_RESV_NONE,
+	error = __xfs_free_extent(tp, xefi->xefi_pag, agbno,
+			xefi->xefi_blockcount, &oinfo, XFS_AG_RESV_NONE,
 			xefi->xefi_flags & XFS_EFI_SKIP_DISCARD);
-	xfs_perag_put(pag);
 
 	/*
 	 * Mark the transaction dirty, even on error. This ensures the
@@ -400,14 +395,13 @@ xfs_extent_free_diff_items(
 	const struct list_head		*a,
 	const struct list_head		*b)
 {
-	struct xfs_mount		*mp = priv;
 	struct xfs_extent_free_item	*ra;
 	struct xfs_extent_free_item	*rb;
 
 	ra = container_of(a, struct xfs_extent_free_item, xefi_list);
 	rb = container_of(b, struct xfs_extent_free_item, xefi_list);
-	return  XFS_FSB_TO_AGNO(mp, ra->xefi_startblock) -
-		XFS_FSB_TO_AGNO(mp, rb->xefi_startblock);
+
+	return ra->xefi_pag->pag_agno - rb->xefi_pag->pag_agno;
 }
 
 /* Log a free extent to the intent item. */
@@ -466,6 +460,26 @@ xfs_extent_free_create_done(
 	return &xfs_trans_get_efd(tp, EFI_ITEM(intent), count)->efd_item;
 }
 
+/* Take an active ref to the AG containing the space we're freeing. */
+void
+xfs_extent_free_get_group(
+	struct xfs_mount		*mp,
+	struct xfs_extent_free_item	*xefi)
+{
+	xfs_agnumber_t			agno;
+
+	agno = XFS_FSB_TO_AGNO(mp, xefi->xefi_startblock);
+	xefi->xefi_pag = xfs_perag_get(mp, agno);
+}
+
+/* Release an active AG ref after some freeing work. */
+static inline void
+xfs_extent_free_put_group(
+	struct xfs_extent_free_item	*xefi)
+{
+	xfs_perag_put(xefi->xefi_pag);
+}
+
 /* Process a free extent. */
 STATIC int
 xfs_extent_free_finish_item(
@@ -480,6 +494,8 @@ xfs_extent_free_finish_item(
 	xefi = container_of(item, struct xfs_extent_free_item, xefi_list);
 
 	error = xfs_trans_free_extent(tp, EFD_ITEM(done), xefi);
+
+	xfs_extent_free_put_group(xefi);
 	kmem_cache_free(xfs_extfree_item_cache, xefi);
 	return error;
 }
@@ -500,6 +516,8 @@ xfs_extent_free_cancel_item(
 	struct xfs_extent_free_item	*xefi;
 
 	xefi = container_of(item, struct xfs_extent_free_item, xefi_list);
+
+	xfs_extent_free_put_group(xefi);
 	kmem_cache_free(xfs_extfree_item_cache, xefi);
 }
 
@@ -530,24 +548,21 @@ xfs_agfl_free_finish_item(
 	struct xfs_extent		*extp;
 	struct xfs_buf			*agbp;
 	int				error;
-	xfs_agnumber_t			agno;
 	xfs_agblock_t			agbno;
 	uint				next_extent;
-	struct xfs_perag		*pag;
 
 	xefi = container_of(item, struct xfs_extent_free_item, xefi_list);
 	ASSERT(xefi->xefi_blockcount == 1);
-	agno = XFS_FSB_TO_AGNO(mp, xefi->xefi_startblock);
 	agbno = XFS_FSB_TO_AGBNO(mp, xefi->xefi_startblock);
 	oinfo.oi_owner = xefi->xefi_owner;
 
-	trace_xfs_agfl_free_deferred(mp, agno, 0, agbno, xefi->xefi_blockcount);
+	trace_xfs_agfl_free_deferred(mp, xefi->xefi_pag->pag_agno, 0, agbno,
+			xefi->xefi_blockcount);
 
-	pag = xfs_perag_get(mp, agno);
-	error = xfs_alloc_read_agf(pag, tp, 0, &agbp);
+	error = xfs_alloc_read_agf(xefi->xefi_pag, tp, 0, &agbp);
 	if (!error)
-		error = xfs_free_agfl_block(tp, agno, agbno, agbp, &oinfo);
-	xfs_perag_put(pag);
+		error = xfs_free_agfl_block(tp, xefi->xefi_pag->pag_agno,
+				agbno, agbp, &oinfo);
 
 	/*
 	 * Mark the transaction dirty, even on error. This ensures the
@@ -566,6 +581,7 @@ xfs_agfl_free_finish_item(
 	extp->ext_len = xefi->xefi_blockcount;
 	efdp->efd_next_extent++;
 
+	xfs_extent_free_put_group(xefi);
 	kmem_cache_free(xfs_extfree_item_cache, xefi);
 	return error;
 }
@@ -636,7 +652,9 @@ xfs_efi_item_recover(
 		fake.xefi_startblock = extp->ext_start;
 		fake.xefi_blockcount = extp->ext_len;
 
+		xfs_extent_free_get_group(mp, &fake);
 		error = xfs_trans_free_extent(tp, efdp, &fake);
+		xfs_extent_free_put_group(&fake);
 		if (error == -EFSCORRUPTED)
 			XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp,
 					extp, sizeof(*extp));


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 4/5] xfs: give xfs_rmap_intent its own perag reference
  2022-12-30 22:11 ` [PATCHSET v24.0 0/5] xfs: make intent items take a perag reference Darrick J. Wong
                     ` (3 preceding siblings ...)
  2022-12-30 22:11   ` [PATCH 3/5] xfs: give xfs_extfree_intent its own perag reference Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  4 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Give the xfs_rmap_intent an active reference to the perag structure
data.  This reference will be used to enable scrub intent draining
functionality in subsequent patches.  Later, shrink will use these
active references to know if an AG is quiesced or not.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_rmap.c |   29 +++++++++++------------------
 fs/xfs/libxfs/xfs_rmap.h |    4 ++++
 fs/xfs/xfs_rmap_item.c   |   32 +++++++++++++++++++++++++++++---
 3 files changed, 44 insertions(+), 21 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c
index df720041cd3d..c2624d11f041 100644
--- a/fs/xfs/libxfs/xfs_rmap.c
+++ b/fs/xfs/libxfs/xfs_rmap.c
@@ -2394,7 +2394,6 @@ xfs_rmap_finish_one(
 	struct xfs_btree_cur		**pcur)
 {
 	struct xfs_mount		*mp = tp->t_mountp;
-	struct xfs_perag		*pag;
 	struct xfs_btree_cur		*rcur;
 	struct xfs_buf			*agbp = NULL;
 	int				error = 0;
@@ -2402,26 +2401,22 @@ xfs_rmap_finish_one(
 	xfs_agblock_t			bno;
 	bool				unwritten;
 
-	pag = xfs_perag_get(mp, XFS_FSB_TO_AGNO(mp, ri->ri_bmap.br_startblock));
 	bno = XFS_FSB_TO_AGBNO(mp, ri->ri_bmap.br_startblock);
 
-	trace_xfs_rmap_deferred(mp, pag->pag_agno, ri->ri_type, bno,
+	trace_xfs_rmap_deferred(mp, ri->ri_pag->pag_agno, ri->ri_type, bno,
 			ri->ri_owner, ri->ri_whichfork,
 			ri->ri_bmap.br_startoff, ri->ri_bmap.br_blockcount,
 			ri->ri_bmap.br_state);
 
-	if (XFS_TEST_ERROR(false, mp, XFS_ERRTAG_RMAP_FINISH_ONE)) {
-		error = -EIO;
-		goto out_drop;
-	}
-
+	if (XFS_TEST_ERROR(false, mp, XFS_ERRTAG_RMAP_FINISH_ONE))
+		return -EIO;
 
 	/*
 	 * If we haven't gotten a cursor or the cursor AG doesn't match
 	 * the startblock, get one now.
 	 */
 	rcur = *pcur;
-	if (rcur != NULL && rcur->bc_ag.pag != pag) {
+	if (rcur != NULL && rcur->bc_ag.pag != ri->ri_pag) {
 		xfs_rmap_finish_one_cleanup(tp, rcur, 0);
 		rcur = NULL;
 		*pcur = NULL;
@@ -2432,15 +2427,13 @@ xfs_rmap_finish_one(
 		 * rmapbt, because a shape change could cause us to
 		 * allocate blocks.
 		 */
-		error = xfs_free_extent_fix_freelist(tp, pag, &agbp);
+		error = xfs_free_extent_fix_freelist(tp, ri->ri_pag, &agbp);
 		if (error)
-			goto out_drop;
-		if (XFS_IS_CORRUPT(tp->t_mountp, !agbp)) {
-			error = -EFSCORRUPTED;
-			goto out_drop;
-		}
+			return error;
+		if (XFS_IS_CORRUPT(tp->t_mountp, !agbp))
+			return -EFSCORRUPTED;
 
-		rcur = xfs_rmapbt_init_cursor(mp, tp, agbp, pag);
+		rcur = xfs_rmapbt_init_cursor(mp, tp, agbp, ri->ri_pag);
 	}
 	*pcur = rcur;
 
@@ -2480,8 +2473,7 @@ xfs_rmap_finish_one(
 		ASSERT(0);
 		error = -EFSCORRUPTED;
 	}
-out_drop:
-	xfs_perag_put(pag);
+
 	return error;
 }
 
@@ -2526,6 +2518,7 @@ __xfs_rmap_add(
 	ri->ri_whichfork = whichfork;
 	ri->ri_bmap = *bmap;
 
+	xfs_rmap_update_get_group(tp->t_mountp, ri);
 	xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_RMAP, &ri->ri_list);
 }
 
diff --git a/fs/xfs/libxfs/xfs_rmap.h b/fs/xfs/libxfs/xfs_rmap.h
index 2dac88cea28d..1472ae570a8a 100644
--- a/fs/xfs/libxfs/xfs_rmap.h
+++ b/fs/xfs/libxfs/xfs_rmap.h
@@ -162,8 +162,12 @@ struct xfs_rmap_intent {
 	int					ri_whichfork;
 	uint64_t				ri_owner;
 	struct xfs_bmbt_irec			ri_bmap;
+	struct xfs_perag			*ri_pag;
 };
 
+void xfs_rmap_update_get_group(struct xfs_mount *mp,
+		struct xfs_rmap_intent *ri);
+
 /* functions for updating the rmapbt based on bmbt map/unmap operations */
 void xfs_rmap_map_extent(struct xfs_trans *tp, struct xfs_inode *ip,
 		int whichfork, struct xfs_bmbt_irec *imap);
diff --git a/fs/xfs/xfs_rmap_item.c b/fs/xfs/xfs_rmap_item.c
index a1619d67015f..10b971d24b5f 100644
--- a/fs/xfs/xfs_rmap_item.c
+++ b/fs/xfs/xfs_rmap_item.c
@@ -20,6 +20,7 @@
 #include "xfs_error.h"
 #include "xfs_log_priv.h"
 #include "xfs_log_recover.h"
+#include "xfs_ag.h"
 
 struct kmem_cache	*xfs_rui_cache;
 struct kmem_cache	*xfs_rud_cache;
@@ -320,14 +321,13 @@ xfs_rmap_update_diff_items(
 	const struct list_head		*a,
 	const struct list_head		*b)
 {
-	struct xfs_mount		*mp = priv;
 	struct xfs_rmap_intent		*ra;
 	struct xfs_rmap_intent		*rb;
 
 	ra = container_of(a, struct xfs_rmap_intent, ri_list);
 	rb = container_of(b, struct xfs_rmap_intent, ri_list);
-	return  XFS_FSB_TO_AGNO(mp, ra->ri_bmap.br_startblock) -
-		XFS_FSB_TO_AGNO(mp, rb->ri_bmap.br_startblock);
+
+	return ra->ri_pag->pag_agno - rb->ri_pag->pag_agno;
 }
 
 /* Log rmap updates in the intent item. */
@@ -390,6 +390,26 @@ xfs_rmap_update_create_done(
 	return &xfs_trans_get_rud(tp, RUI_ITEM(intent))->rud_item;
 }
 
+/* Take an active ref to the AG containing the space we're rmapping. */
+void
+xfs_rmap_update_get_group(
+	struct xfs_mount	*mp,
+	struct xfs_rmap_intent	*ri)
+{
+	xfs_agnumber_t		agno;
+
+	agno = XFS_FSB_TO_AGNO(mp, ri->ri_bmap.br_startblock);
+	ri->ri_pag = xfs_perag_get(mp, agno);
+}
+
+/* Release an active AG ref after finishing rmapping work. */
+static inline void
+xfs_rmap_update_put_group(
+	struct xfs_rmap_intent	*ri)
+{
+	xfs_perag_put(ri->ri_pag);
+}
+
 /* Process a deferred rmap update. */
 STATIC int
 xfs_rmap_update_finish_item(
@@ -405,6 +425,8 @@ xfs_rmap_update_finish_item(
 
 	error = xfs_trans_log_finish_rmap_update(tp, RUD_ITEM(done), ri,
 			state);
+
+	xfs_rmap_update_put_group(ri);
 	kmem_cache_free(xfs_rmap_intent_cache, ri);
 	return error;
 }
@@ -425,6 +447,8 @@ xfs_rmap_update_cancel_item(
 	struct xfs_rmap_intent		*ri;
 
 	ri = container_of(item, struct xfs_rmap_intent, ri_list);
+
+	xfs_rmap_update_put_group(ri);
 	kmem_cache_free(xfs_rmap_intent_cache, ri);
 }
 
@@ -559,11 +583,13 @@ xfs_rui_item_recover(
 		fake.ri_bmap.br_state = (map->me_flags & XFS_RMAP_EXTENT_UNWRITTEN) ?
 				XFS_EXT_UNWRITTEN : XFS_EXT_NORM;
 
+		xfs_rmap_update_get_group(mp, &fake);
 		error = xfs_trans_log_finish_rmap_update(tp, rudp, &fake,
 				&rcur);
 		if (error == -EFSCORRUPTED)
 			XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp,
 					map, sizeof(*map));
+		xfs_rmap_update_put_group(&fake);
 		if (error)
 			goto abort_error;
 


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 5/5] xfs: give xfs_refcount_intent its own perag reference
  2022-12-30 22:11 ` [PATCHSET v24.0 0/5] xfs: make intent items take a perag reference Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 1/5] xfs: give xfs_bmap_intent its own " Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 2/5] xfs: pass per-ag references to xfs_free_extent Darrick J. Wong
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Give the xfs_refcount_intent an active reference to the perag structure
data.  This reference will be used to enable scrub intent draining
functionality in subsequent patches.  Later, shrink will use these
active references to know if an AG is quiesced or not.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_refcount.c |   33 ++++++++++++++-------------------
 fs/xfs/libxfs/xfs_refcount.h |    4 ++++
 fs/xfs/xfs_refcount_item.c   |   36 ++++++++++++++++++++++++++++++++----
 3 files changed, 50 insertions(+), 23 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
index bcf46aa0d08b..6dc968618e66 100644
--- a/fs/xfs/libxfs/xfs_refcount.c
+++ b/fs/xfs/libxfs/xfs_refcount.c
@@ -1332,26 +1332,22 @@ xfs_refcount_finish_one(
 	xfs_agblock_t			bno;
 	unsigned long			nr_ops = 0;
 	int				shape_changes = 0;
-	struct xfs_perag		*pag;
 
-	pag = xfs_perag_get(mp, XFS_FSB_TO_AGNO(mp, ri->ri_startblock));
 	bno = XFS_FSB_TO_AGBNO(mp, ri->ri_startblock);
 
 	trace_xfs_refcount_deferred(mp, XFS_FSB_TO_AGNO(mp, ri->ri_startblock),
 			ri->ri_type, XFS_FSB_TO_AGBNO(mp, ri->ri_startblock),
 			ri->ri_blockcount);
 
-	if (XFS_TEST_ERROR(false, mp, XFS_ERRTAG_REFCOUNT_FINISH_ONE)) {
-		error = -EIO;
-		goto out_drop;
-	}
+	if (XFS_TEST_ERROR(false, mp, XFS_ERRTAG_REFCOUNT_FINISH_ONE))
+		return -EIO;
 
 	/*
 	 * If we haven't gotten a cursor or the cursor AG doesn't match
 	 * the startblock, get one now.
 	 */
 	rcur = *pcur;
-	if (rcur != NULL && rcur->bc_ag.pag != pag) {
+	if (rcur != NULL && rcur->bc_ag.pag != ri->ri_pag) {
 		nr_ops = rcur->bc_ag.refc.nr_ops;
 		shape_changes = rcur->bc_ag.refc.shape_changes;
 		xfs_refcount_finish_one_cleanup(tp, rcur, 0);
@@ -1359,12 +1355,12 @@ xfs_refcount_finish_one(
 		*pcur = NULL;
 	}
 	if (rcur == NULL) {
-		error = xfs_alloc_read_agf(pag, tp, XFS_ALLOC_FLAG_FREEING,
-				&agbp);
+		error = xfs_alloc_read_agf(ri->ri_pag, tp,
+				XFS_ALLOC_FLAG_FREEING, &agbp);
 		if (error)
-			goto out_drop;
+			return error;
 
-		rcur = xfs_refcountbt_init_cursor(mp, tp, agbp, pag);
+		rcur = xfs_refcountbt_init_cursor(mp, tp, agbp, ri->ri_pag);
 		rcur->bc_ag.refc.nr_ops = nr_ops;
 		rcur->bc_ag.refc.shape_changes = shape_changes;
 	}
@@ -1375,7 +1371,7 @@ xfs_refcount_finish_one(
 		error = xfs_refcount_adjust(rcur, &bno, &ri->ri_blockcount,
 				XFS_REFCOUNT_ADJUST_INCREASE);
 		if (error)
-			goto out_drop;
+			return error;
 		if (ri->ri_blockcount > 0)
 			error = xfs_refcount_continue_op(rcur, ri, bno);
 		break;
@@ -1383,31 +1379,29 @@ xfs_refcount_finish_one(
 		error = xfs_refcount_adjust(rcur, &bno, &ri->ri_blockcount,
 				XFS_REFCOUNT_ADJUST_DECREASE);
 		if (error)
-			goto out_drop;
+			return error;
 		if (ri->ri_blockcount > 0)
 			error = xfs_refcount_continue_op(rcur, ri, bno);
 		break;
 	case XFS_REFCOUNT_ALLOC_COW:
 		error = __xfs_refcount_cow_alloc(rcur, bno, ri->ri_blockcount);
 		if (error)
-			goto out_drop;
+			return error;
 		ri->ri_blockcount = 0;
 		break;
 	case XFS_REFCOUNT_FREE_COW:
 		error = __xfs_refcount_cow_free(rcur, bno, ri->ri_blockcount);
 		if (error)
-			goto out_drop;
+			return error;
 		ri->ri_blockcount = 0;
 		break;
 	default:
 		ASSERT(0);
-		error = -EFSCORRUPTED;
+		return -EFSCORRUPTED;
 	}
 	if (!error && ri->ri_blockcount > 0)
-		trace_xfs_refcount_finish_one_leftover(mp, pag->pag_agno,
+		trace_xfs_refcount_finish_one_leftover(mp, ri->ri_pag->pag_agno,
 				ri->ri_type, bno, ri->ri_blockcount);
-out_drop:
-	xfs_perag_put(pag);
 	return error;
 }
 
@@ -1435,6 +1429,7 @@ __xfs_refcount_add(
 	ri->ri_startblock = startblock;
 	ri->ri_blockcount = blockcount;
 
+	xfs_refcount_update_get_group(tp->t_mountp, ri);
 	xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_REFCOUNT, &ri->ri_list);
 }
 
diff --git a/fs/xfs/libxfs/xfs_refcount.h b/fs/xfs/libxfs/xfs_refcount.h
index c633477ce3ce..c89f0fcd1ee3 100644
--- a/fs/xfs/libxfs/xfs_refcount.h
+++ b/fs/xfs/libxfs/xfs_refcount.h
@@ -50,6 +50,7 @@ enum xfs_refcount_intent_type {
 
 struct xfs_refcount_intent {
 	struct list_head			ri_list;
+	struct xfs_perag			*ri_pag;
 	enum xfs_refcount_intent_type		ri_type;
 	xfs_extlen_t				ri_blockcount;
 	xfs_fsblock_t				ri_startblock;
@@ -67,6 +68,9 @@ xfs_refcount_check_domain(
 	return true;
 }
 
+void xfs_refcount_update_get_group(struct xfs_mount *mp,
+		struct xfs_refcount_intent *ri);
+
 void xfs_refcount_increase_extent(struct xfs_trans *tp,
 		struct xfs_bmbt_irec *irec);
 void xfs_refcount_decrease_extent(struct xfs_trans *tp,
diff --git a/fs/xfs/xfs_refcount_item.c b/fs/xfs/xfs_refcount_item.c
index 48d771a76add..4c4706a15056 100644
--- a/fs/xfs/xfs_refcount_item.c
+++ b/fs/xfs/xfs_refcount_item.c
@@ -20,6 +20,7 @@
 #include "xfs_error.h"
 #include "xfs_log_priv.h"
 #include "xfs_log_recover.h"
+#include "xfs_ag.h"
 
 struct kmem_cache	*xfs_cui_cache;
 struct kmem_cache	*xfs_cud_cache;
@@ -279,14 +280,13 @@ xfs_refcount_update_diff_items(
 	const struct list_head		*a,
 	const struct list_head		*b)
 {
-	struct xfs_mount		*mp = priv;
 	struct xfs_refcount_intent	*ra;
 	struct xfs_refcount_intent	*rb;
 
 	ra = container_of(a, struct xfs_refcount_intent, ri_list);
 	rb = container_of(b, struct xfs_refcount_intent, ri_list);
-	return  XFS_FSB_TO_AGNO(mp, ra->ri_startblock) -
-		XFS_FSB_TO_AGNO(mp, rb->ri_startblock);
+
+	return ra->ri_pag->pag_agno - rb->ri_pag->pag_agno;
 }
 
 /* Set the phys extent flags for this reverse mapping. */
@@ -365,6 +365,26 @@ xfs_refcount_update_create_done(
 	return &xfs_trans_get_cud(tp, CUI_ITEM(intent))->cud_item;
 }
 
+/* Take an active ref to the AG containing the space we're refcounting. */
+void
+xfs_refcount_update_get_group(
+	struct xfs_mount		*mp,
+	struct xfs_refcount_intent	*ri)
+{
+	xfs_agnumber_t			agno;
+
+	agno = XFS_FSB_TO_AGNO(mp, ri->ri_startblock);
+	ri->ri_pag = xfs_perag_get(mp, agno);
+}
+
+/* Release an active AG ref after finishing refcounting work. */
+static inline void
+xfs_refcount_update_put_group(
+	struct xfs_refcount_intent	*ri)
+{
+	xfs_perag_put(ri->ri_pag);
+}
+
 /* Process a deferred refcount update. */
 STATIC int
 xfs_refcount_update_finish_item(
@@ -386,6 +406,8 @@ xfs_refcount_update_finish_item(
 		       ri->ri_type == XFS_REFCOUNT_DECREASE);
 		return -EAGAIN;
 	}
+
+	xfs_refcount_update_put_group(ri);
 	kmem_cache_free(xfs_refcount_intent_cache, ri);
 	return error;
 }
@@ -406,6 +428,8 @@ xfs_refcount_update_cancel_item(
 	struct xfs_refcount_intent	*ri;
 
 	ri = container_of(item, struct xfs_refcount_intent, ri_list);
+
+	xfs_refcount_update_put_group(ri);
 	kmem_cache_free(xfs_refcount_intent_cache, ri);
 }
 
@@ -520,9 +544,13 @@ xfs_cui_item_recover(
 
 		fake.ri_startblock = pmap->pe_startblock;
 		fake.ri_blockcount = pmap->pe_len;
-		if (!requeue_only)
+
+		if (!requeue_only) {
+			xfs_refcount_update_get_group(mp, &fake);
 			error = xfs_trans_log_finish_refcount_update(tp, cudp,
 					&fake, &rcur);
+			xfs_refcount_update_put_group(&fake);
+		}
 		if (error == -EFSCORRUPTED)
 			XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp,
 					&cuip->cui_format,


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCHSET v24.0 0/1] xfs: pass perag references around when possible
  2022-12-30 21:13 [NYE DELUGE 1/4] xfs: all pending online scrub improvements Darrick J. Wong
                   ` (2 preceding siblings ...)
  2022-12-30 22:11 ` [PATCHSET v24.0 0/5] xfs: make intent items take a perag reference Darrick J. Wong
@ 2022-12-30 22:11 ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 1/1] xfs: create a function to duplicate an active perag reference Darrick J. Wong
  2022-12-30 22:11 ` [PATCHSET v24.0 0/5] xfs: drain deferred work items when scrubbing Darrick J. Wong
                   ` (18 subsequent siblings)
  22 siblings, 1 reply; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

Hi all,

Avoid the cost of perag radix tree lookups by passing around active perag
references when possible.
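
To illustrate the shape of the change (hypothetical function, not code
from this series): a callee that used to take a mount and an AG number,
and therefore had to do its own radix tree lookup, now takes the
caller's active reference directly.

	/* Before: every call pays for a radix tree lookup and refcount. */
	int xfs_frobnicate_ag(struct xfs_mount *mp, xfs_agnumber_t agno)
	{
		struct xfs_perag	*pag = xfs_perag_get(mp, agno);
		int			error;

		error = xfs_frobnicate(pag);
		xfs_perag_put(pag);
		return error;
	}

	/* After: reuse the active reference the caller already holds. */
	int xfs_frobnicate_ag(struct xfs_perag *pag)
	{
		return xfs_frobnicate(pag);
	}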

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=pass-perag-refs

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=pass-perag-refs
---
 fs/xfs/libxfs/xfs_ag.c             |   15 +++++++++++++++
 fs/xfs/libxfs/xfs_ag.h             |    1 +
 fs/xfs/libxfs/xfs_alloc_btree.c    |    4 +---
 fs/xfs/libxfs/xfs_ialloc_btree.c   |    4 +---
 fs/xfs/libxfs/xfs_refcount_btree.c |    5 +----
 fs/xfs/libxfs/xfs_rmap_btree.c     |    5 +----
 fs/xfs/xfs_iunlink_item.c          |    4 +---
 fs/xfs/xfs_iwalk.c                 |    3 +--
 fs/xfs/xfs_trace.h                 |    1 +
 9 files changed, 23 insertions(+), 19 deletions(-)


^ permalink raw reply	[flat|nested] 220+ messages in thread

* [PATCH 1/1] xfs: create a function to duplicate an active perag reference
  2022-12-30 22:11 ` [PATCHSET v24.0 0/1] xfs: pass perag references around when possible Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  0 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

There are a few object constructor functions throughout XFS where a
caller provides an active perag reference and the constructor wants to
give the new object its own active reference.  Replace the open-coded
atomic_inc logic with a common function that does this.

This new function adds a few safeguards -- it checks that there's at
least one active reference to the perag structure passed in, and it
records the refcount bump in the ftrace information.  This makes it much
easier to debug refcounting problems.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_ag.c             |   15 +++++++++++++++
 fs/xfs/libxfs/xfs_ag.h             |    1 +
 fs/xfs/libxfs/xfs_alloc_btree.c    |    4 +---
 fs/xfs/libxfs/xfs_ialloc_btree.c   |    4 +---
 fs/xfs/libxfs/xfs_refcount_btree.c |    5 +----
 fs/xfs/libxfs/xfs_rmap_btree.c     |    5 +----
 fs/xfs/xfs_iunlink_item.c          |    4 +---
 fs/xfs/xfs_iwalk.c                 |    3 +--
 fs/xfs/xfs_trace.h                 |    1 +
 9 files changed, 23 insertions(+), 19 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_ag.c b/fs/xfs/libxfs/xfs_ag.c
index 8de4143a5899..fed965831f2d 100644
--- a/fs/xfs/libxfs/xfs_ag.c
+++ b/fs/xfs/libxfs/xfs_ag.c
@@ -57,6 +57,21 @@ xfs_perag_get(
 	return pag;
 }
 
+/* Get our own reference to a perag, given an existing active reference. */
+struct xfs_perag *
+xfs_perag_bump(
+	struct xfs_perag	*pag)
+{
+	if (!atomic_inc_not_zero(&pag->pag_ref)) {
+		ASSERT(0);
+		return NULL;
+	}
+
+	trace_xfs_perag_bump(pag->pag_mount, pag->pag_agno,
+			atomic_read(&pag->pag_ref), _RET_IP_);
+	return pag;
+}
+
 /*
  * search from @first to find the next perag with the given tag set.
  */
diff --git a/fs/xfs/libxfs/xfs_ag.h b/fs/xfs/libxfs/xfs_ag.h
index 191b22b9a35b..d61b07e60802 100644
--- a/fs/xfs/libxfs/xfs_ag.h
+++ b/fs/xfs/libxfs/xfs_ag.h
@@ -112,6 +112,7 @@ int xfs_initialize_perag_data(struct xfs_mount *mp, xfs_agnumber_t agno);
 void xfs_free_perag(struct xfs_mount *mp);
 
 struct xfs_perag *xfs_perag_get(struct xfs_mount *mp, xfs_agnumber_t agno);
+struct xfs_perag *xfs_perag_bump(struct xfs_perag *pag);
 struct xfs_perag *xfs_perag_get_tag(struct xfs_mount *mp, xfs_agnumber_t agno,
 		unsigned int tag);
 void xfs_perag_put(struct xfs_perag *pag);
diff --git a/fs/xfs/libxfs/xfs_alloc_btree.c b/fs/xfs/libxfs/xfs_alloc_btree.c
index 549a3cba0234..0e78e00e02f9 100644
--- a/fs/xfs/libxfs/xfs_alloc_btree.c
+++ b/fs/xfs/libxfs/xfs_alloc_btree.c
@@ -492,9 +492,7 @@ xfs_allocbt_init_common(
 		cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_abtb_2);
 	}
 
-	/* take a reference for the cursor */
-	atomic_inc(&pag->pag_ref);
-	cur->bc_ag.pag = pag;
+	cur->bc_ag.pag = xfs_perag_bump(pag);
 
 	if (xfs_has_crc(mp))
 		cur->bc_flags |= XFS_BTREE_CRC_BLOCKS;
diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
index 2dbe553d87fb..fb10760fd686 100644
--- a/fs/xfs/libxfs/xfs_ialloc_btree.c
+++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
@@ -450,9 +450,7 @@ xfs_inobt_init_common(
 	if (xfs_has_crc(mp))
 		cur->bc_flags |= XFS_BTREE_CRC_BLOCKS;
 
-	/* take a reference for the cursor */
-	atomic_inc(&pag->pag_ref);
-	cur->bc_ag.pag = pag;
+	cur->bc_ag.pag = xfs_perag_bump(pag);
 	return cur;
 }
 
diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c
index 3d8e62da2ccc..f5bdac3cf19f 100644
--- a/fs/xfs/libxfs/xfs_refcount_btree.c
+++ b/fs/xfs/libxfs/xfs_refcount_btree.c
@@ -340,10 +340,7 @@ xfs_refcountbt_init_common(
 
 	cur->bc_flags |= XFS_BTREE_CRC_BLOCKS;
 
-	/* take a reference for the cursor */
-	atomic_inc(&pag->pag_ref);
-	cur->bc_ag.pag = pag;
-
+	cur->bc_ag.pag = xfs_perag_bump(pag);
 	cur->bc_ag.refc.nr_ops = 0;
 	cur->bc_ag.refc.shape_changes = 0;
 	cur->bc_ops = &xfs_refcountbt_ops;
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
index 7f83f62e51e0..12c26c42c162 100644
--- a/fs/xfs/libxfs/xfs_rmap_btree.c
+++ b/fs/xfs/libxfs/xfs_rmap_btree.c
@@ -460,10 +460,7 @@ xfs_rmapbt_init_common(
 	cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_rmap_2);
 	cur->bc_ops = &xfs_rmapbt_ops;
 
-	/* take a reference for the cursor */
-	atomic_inc(&pag->pag_ref);
-	cur->bc_ag.pag = pag;
-
+	cur->bc_ag.pag = xfs_perag_bump(pag);
 	return cur;
 }
 
diff --git a/fs/xfs/xfs_iunlink_item.c b/fs/xfs/xfs_iunlink_item.c
index 43005ce8bd48..5024a59f0c75 100644
--- a/fs/xfs/xfs_iunlink_item.c
+++ b/fs/xfs/xfs_iunlink_item.c
@@ -168,9 +168,7 @@ xfs_iunlink_log_inode(
 	iup->ip = ip;
 	iup->next_agino = next_agino;
 	iup->old_agino = ip->i_next_unlinked;
-
-	atomic_inc(&pag->pag_ref);
-	iup->pag = pag;
+	iup->pag = xfs_perag_bump(pag);
 
 	xfs_trans_add_item(tp, &iup->item);
 	tp->t_flags |= XFS_TRANS_DIRTY;
diff --git a/fs/xfs/xfs_iwalk.c b/fs/xfs/xfs_iwalk.c
index 7558486f4937..594ccadb729f 100644
--- a/fs/xfs/xfs_iwalk.c
+++ b/fs/xfs/xfs_iwalk.c
@@ -670,8 +670,7 @@ xfs_iwalk_threaded(
 		 * perag is being handed off to async work, so take another
 		 * reference for the async work to release.
 		 */
-		atomic_inc(&pag->pag_ref);
-		iwag->pag = pag;
+		iwag->pag = xfs_perag_bump(pag);
 		iwag->iwalk_fn = iwalk_fn;
 		iwag->data = data;
 		iwag->startino = startino;
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 6b0e9ae7c513..0448b992a561 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -187,6 +187,7 @@ DEFINE_EVENT(xfs_perag_class, name,	\
 		 unsigned long caller_ip),					\
 	TP_ARGS(mp, agno, refcount, caller_ip))
 DEFINE_PERAG_REF_EVENT(xfs_perag_get);
+DEFINE_PERAG_REF_EVENT(xfs_perag_bump);
 DEFINE_PERAG_REF_EVENT(xfs_perag_get_tag);
 DEFINE_PERAG_REF_EVENT(xfs_perag_put);
 DEFINE_PERAG_REF_EVENT(xfs_perag_set_inode_tag);


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCHSET v24.0 0/5] xfs: drain deferred work items when scrubbing
  2022-12-30 21:13 [NYE DELUGE 1/4] xfs: all pending online scrub improvements Darrick J. Wong
                   ` (3 preceding siblings ...)
  2022-12-30 22:11 ` [PATCHSET v24.0 0/1] xfs: pass perag references around when possible Darrick J. Wong
@ 2022-12-30 22:11 ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 3/5] xfs: clean up scrub context if scrub setup returns -EDEADLOCK Darrick J. Wong
                     ` (4 more replies)
  2022-12-30 22:11 ` [PATCHSET v24.0 0/8] xfs: standardize btree record checking code Darrick J. Wong
                   ` (17 subsequent siblings)
  22 siblings, 5 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

Hi all,

The design doc for XFS online fsck contains a long discussion of the
eventual consistency models in use for XFS metadata.  In that chapter,
we note that it is possible for scrub to collide with a chain of
deferred space metadata updates, and propose a lightweight solution:
the use of a pending-intents counter so that scrub can wait for the
system to drain all chains.
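
At its core that counter is just an atomic count paired with a
waitqueue, roughly along these lines (names are illustrative; see the
xfs_drain.[ch] files added by this series for the real interface):

	struct xfs_drain {
		atomic_t		dr_count;	/* pending intent items */
		struct wait_queue_head	dr_waiters;	/* scrub sleeps here */
	};

	static inline void xfs_drain_bump(struct xfs_drain *dr)
	{
		atomic_inc(&dr->dr_count);
	}

	static inline void xfs_drain_drop(struct xfs_drain *dr)
	{
		if (atomic_dec_and_test(&dr->dr_count))
			wake_up(&dr->dr_waiters);
	}

	/* Scrub side: sleep until every pending intent has been finished. */
	static inline int xfs_drain_wait(struct xfs_drain *dr)
	{
		return wait_event_killable(dr->dr_waiters,
				atomic_read(&dr->dr_count) == 0);
	}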

This patchset implements that scrub drain.  The first patch implements
the basic mechanism, and the subsequent patches reduce the runtime
overhead by converting the implementation to use sloppy counters and
introducing jump labels to avoid walking into scrub hooks when scrub
isn't running.  This last paradigm repeats elsewhere in this megaseries.
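
The jump label part is the usual static branch pattern, sketched here
with made-up symbol names: the hook in the hot path compiles down to a
no-op until scrub patches the key in.

	DEFINE_STATIC_KEY_FALSE(xfs_foo_hooks_switch);

	/* Hot path: skipped entirely unless a scrub has enabled the key. */
	static inline void xfs_foo_hook(struct xfs_perag *pag)
	{
		if (static_branch_unlikely(&xfs_foo_hooks_switch))
			xfs_foo_hook_slowpath(pag);
	}

	/* Scrub setup and teardown toggle the key around the check. */
	static_branch_inc(&xfs_foo_hooks_switch);
	/* ...run the scrubber... */
	static_branch_dec(&xfs_foo_hooks_switch);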

v23.1: make intent items take an active ref to the perag structure and
       document why we bump and drop the intent counts when we do

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-drain-intents

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-drain-intents
---
 fs/xfs/Kconfig             |    5 ++
 fs/xfs/Makefile            |    2 +
 fs/xfs/libxfs/xfs_ag.c     |    4 +
 fs/xfs/libxfs/xfs_ag.h     |    8 +++
 fs/xfs/libxfs/xfs_defer.c  |    6 +-
 fs/xfs/scrub/agheader.c    |    9 +++
 fs/xfs/scrub/alloc.c       |    3 +
 fs/xfs/scrub/bmap.c        |    3 +
 fs/xfs/scrub/btree.c       |    1 
 fs/xfs/scrub/common.c      |  129 ++++++++++++++++++++++++++++++++++++++++----
 fs/xfs/scrub/common.h      |   15 +++++
 fs/xfs/scrub/dabtree.c     |    1 
 fs/xfs/scrub/fscounters.c  |    7 ++
 fs/xfs/scrub/health.c      |    2 +
 fs/xfs/scrub/ialloc.c      |    2 +
 fs/xfs/scrub/inode.c       |    3 +
 fs/xfs/scrub/quota.c       |    3 +
 fs/xfs/scrub/refcount.c    |    9 +++
 fs/xfs/scrub/repair.c      |    3 +
 fs/xfs/scrub/rmap.c        |    3 +
 fs/xfs/scrub/scrub.c       |   63 ++++++++++++++++-----
 fs/xfs/scrub/scrub.h       |    6 ++
 fs/xfs/scrub/trace.h       |   69 ++++++++++++++++++++++++
 fs/xfs/xfs_bmap_item.c     |   10 +++
 fs/xfs/xfs_drain.c         |  121 +++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_drain.h         |   80 +++++++++++++++++++++++++++
 fs/xfs/xfs_extfree_item.c  |    2 +
 fs/xfs/xfs_linux.h         |    1 
 fs/xfs/xfs_refcount_item.c |    2 +
 fs/xfs/xfs_rmap_item.c     |    2 +
 fs/xfs/xfs_trace.h         |   71 ++++++++++++++++++++++++
 31 files changed, 614 insertions(+), 31 deletions(-)
 create mode 100644 fs/xfs/xfs_drain.c
 create mode 100644 fs/xfs/xfs_drain.h


^ permalink raw reply	[flat|nested] 220+ messages in thread

* [PATCH 1/5] xfs: add a tracepoint to report incorrect extent refcounts
  2022-12-30 22:11 ` [PATCHSET v24.0 0/5] xfs: drain deferred work items when scrubbing Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 3/5] xfs: clean up scrub context if scrub setup returns -EDEADLOCK Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 2/5] xfs: allow queued AG intents to drain before scrubbing Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 4/5] xfs: minimize overhead of drain wakeups by using jump labels Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 5/5] xfs: scrub should use ECHRNG to signal that the drain is needed Darrick J. Wong
  4 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Add a new tracepoint so that I can see exactly what and where we failed
the refcount check.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/refcount.c |    5 ++++-
 fs/xfs/scrub/trace.h    |   35 +++++++++++++++++++++++++++++++++++
 2 files changed, 39 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/scrub/refcount.c b/fs/xfs/scrub/refcount.c
index d9c1b3cea4a5..ffa6eda8b7d4 100644
--- a/fs/xfs/scrub/refcount.c
+++ b/fs/xfs/scrub/refcount.c
@@ -13,6 +13,7 @@
 #include "scrub/scrub.h"
 #include "scrub/common.h"
 #include "scrub/btree.h"
+#include "scrub/trace.h"
 #include "xfs_trans_resv.h"
 #include "xfs_mount.h"
 #include "xfs_ag.h"
@@ -300,8 +301,10 @@ xchk_refcountbt_xref_rmap(
 		goto out_free;
 
 	xchk_refcountbt_process_rmap_fragments(&refchk);
-	if (irec->rc_refcount != refchk.seen)
+	if (irec->rc_refcount != refchk.seen) {
+		trace_xchk_refcount_incorrect(sc->sa.pag, irec, refchk.seen);
 		xchk_btree_xref_set_corrupt(sc, sc->sa.rmap_cur, 0);
+	}
 
 out_free:
 	list_for_each_entry_safe(frag, n, &refchk.fragments, list) {
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index 93ece6df02e3..403c0e62257e 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -30,6 +30,9 @@ TRACE_DEFINE_ENUM(XFS_BTNUM_FINOi);
 TRACE_DEFINE_ENUM(XFS_BTNUM_RMAPi);
 TRACE_DEFINE_ENUM(XFS_BTNUM_REFCi);
 
+TRACE_DEFINE_ENUM(XFS_REFC_DOMAIN_SHARED);
+TRACE_DEFINE_ENUM(XFS_REFC_DOMAIN_COW);
+
 TRACE_DEFINE_ENUM(XFS_SCRUB_TYPE_PROBE);
 TRACE_DEFINE_ENUM(XFS_SCRUB_TYPE_SB);
 TRACE_DEFINE_ENUM(XFS_SCRUB_TYPE_AGF);
@@ -657,6 +660,38 @@ TRACE_EVENT(xchk_fscounters_within_range,
 		  __entry->old_value)
 )
 
+TRACE_EVENT(xchk_refcount_incorrect,
+	TP_PROTO(struct xfs_perag *pag, const struct xfs_refcount_irec *irec,
+		 xfs_nlink_t seen),
+	TP_ARGS(pag, irec, seen),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_agnumber_t, agno)
+		__field(enum xfs_refc_domain, domain)
+		__field(xfs_agblock_t, startblock)
+		__field(xfs_extlen_t, blockcount)
+		__field(xfs_nlink_t, refcount)
+		__field(xfs_nlink_t, seen)
+	),
+	TP_fast_assign(
+		__entry->dev = pag->pag_mount->m_super->s_dev;
+		__entry->agno = pag->pag_agno;
+		__entry->domain = irec->rc_domain;
+		__entry->startblock = irec->rc_startblock;
+		__entry->blockcount = irec->rc_blockcount;
+		__entry->refcount = irec->rc_refcount;
+		__entry->seen = seen;
+	),
+	TP_printk("dev %d:%d agno 0x%x dom %s agbno 0x%x fsbcount 0x%x refcount %u seen %u",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->agno,
+		  __print_symbolic(__entry->domain, XFS_REFC_DOMAIN_STRINGS),
+		  __entry->startblock,
+		  __entry->blockcount,
+		  __entry->refcount,
+		  __entry->seen)
+)
+
 /* repair tracepoints */
 #if IS_ENABLED(CONFIG_XFS_ONLINE_REPAIR)
 


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 2/5] xfs: allow queued AG intents to drain before scrubbing
  2022-12-30 22:11 ` [PATCHSET v24.0 0/5] xfs: drain deferred work items when scrubbing Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 3/5] xfs: clean up scrub context if scrub setup returns -EDEADLOCK Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 1/5] xfs: add a tracepoint to report incorrect extent refcounts Darrick J. Wong
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

When a writer thread executes a chain of log intent items, the AG header
buffer locks will cycle during a transaction roll to get from one intent
item to the next in a chain.  Although scrub takes all AG header buffer
locks, this isn't sufficient to guard against scrub checking an AG while
that writer thread is in the middle of finishing a chain because there's
no higher level locking primitive guarding allocation groups.

When there's a collision, cross-referencing between data structures
(e.g. rmapbt and refcountbt) yields false corruption events; if repair
is running, this results in incorrect repairs, which is catastrophic.

Fix this by adding a count of active intents to the perag structure and
making scrub wait until it holds both AG header buffer locks and the
intent counter has reached zero.

One quirk of the drain code is that deferred bmap updates also bump and
drop the intent counter.  A fundamental decision made during the design
phase of the reverse mapping feature is that updates to the rmapbt
records are always made by the same code that updates the primary
metadata.  In other words, callers of bmapi functions expect that the
bmapi functions will queue deferred rmap updates.

Some parts of the reflink code queue deferred refcount (CUI) and bmap
(BUI) updates in the same head transaction, but the deferred work
manager completely finishes the CUI before the BUI work is started.  As
a result, the CUI drops the intent count long before the deferred rmap
(RUI) update even has a chance to bump the intent count.  The only way
to keep the intent count elevated between the CUI and RUI is for the BUI
to bump the counter until the RUI has been created.

A second quirk of the intent drain code is that deferred work items must
increment the intent counter as soon as the work item is added to the
transaction.  When a BUI completes and queues an RUI, the RUI must
increment the counter before the BUI decrements it.  The only way to
accomplish this is to require that the counter be bumped as soon as the
deferred work item is created in memory.
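
Expressed as a sketch (hypothetical helper names), the rule is that the
bump lives in the intent constructor rather than in the finishing code,
so a follow-on intent always raises the count before its parent lowers
it:

	/* Constructing any deferred work item: bump before queueing. */
	xfs_foo_intent_hold(mp, new_item);	/* perag ref + intent count */
	xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_FOO, &new_item->fi_list);

	/*
	 * Finishing a BUI: xfs_bmap_finish_one() queues the follow-on
	 * rmap intent (which bumps the count), and only afterwards does
	 * the BUI drop its own count, so the AG never looks idle.
	 */
	error = xfs_bmap_finish_one(tp, bi);
	xfs_foo_intent_rele(bi);		/* drop ref + intent count */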

In the next patches we'll improve on this facility, but this patch
provides the basic functionality.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/Kconfig             |    4 ++
 fs/xfs/Makefile            |    2 +
 fs/xfs/libxfs/xfs_ag.c     |    4 ++
 fs/xfs/libxfs/xfs_ag.h     |    8 +++
 fs/xfs/libxfs/xfs_defer.c  |    6 ++-
 fs/xfs/scrub/common.c      |  103 +++++++++++++++++++++++++++++++++++++++-----
 fs/xfs/scrub/health.c      |    2 +
 fs/xfs/scrub/refcount.c    |    2 +
 fs/xfs/xfs_bmap_item.c     |   10 ++++
 fs/xfs/xfs_drain.c         |   96 +++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_drain.h         |   77 +++++++++++++++++++++++++++++++++
 fs/xfs/xfs_extfree_item.c  |    2 +
 fs/xfs/xfs_linux.h         |    1 
 fs/xfs/xfs_refcount_item.c |    2 +
 fs/xfs/xfs_rmap_item.c     |    2 +
 fs/xfs/xfs_trace.h         |   71 ++++++++++++++++++++++++++++++
 16 files changed, 379 insertions(+), 13 deletions(-)
 create mode 100644 fs/xfs/xfs_drain.c
 create mode 100644 fs/xfs/xfs_drain.h


diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig
index 9fac5ea8d0e4..ab24e683b440 100644
--- a/fs/xfs/Kconfig
+++ b/fs/xfs/Kconfig
@@ -93,10 +93,14 @@ config XFS_RT
 
 	  If unsure, say N.
 
+config XFS_DRAIN_INTENTS
+	bool
+
 config XFS_ONLINE_SCRUB
 	bool "XFS online metadata check support"
 	default n
 	depends on XFS_FS
+	select XFS_DRAIN_INTENTS
 	help
 	  If you say Y here you will be able to check metadata on a
 	  mounted XFS filesystem.  This feature is intended to reduce
diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 03135a1c31b6..ea0725cfb6fb 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -135,6 +135,8 @@ ifeq ($(CONFIG_MEMORY_FAILURE),y)
 xfs-$(CONFIG_FS_DAX)		+= xfs_notify_failure.o
 endif
 
+xfs-$(CONFIG_XFS_DRAIN_INTENTS)	+= xfs_drain.o
+
 # online scrub/repair
 ifeq ($(CONFIG_XFS_ONLINE_SCRUB),y)
 
diff --git a/fs/xfs/libxfs/xfs_ag.c b/fs/xfs/libxfs/xfs_ag.c
index fed965831f2d..8b1bb228cba6 100644
--- a/fs/xfs/libxfs/xfs_ag.c
+++ b/fs/xfs/libxfs/xfs_ag.c
@@ -207,6 +207,7 @@ xfs_free_perag(
 		spin_unlock(&mp->m_perag_lock);
 		ASSERT(pag);
 		XFS_IS_CORRUPT(pag->pag_mount, atomic_read(&pag->pag_ref) != 0);
+		xfs_drain_free(&pag->pag_intents);
 
 		cancel_delayed_work_sync(&pag->pag_blockgc_work);
 		xfs_buf_hash_destroy(pag);
@@ -328,6 +329,7 @@ xfs_initialize_perag(
 		spin_lock_init(&pag->pag_state_lock);
 		INIT_DELAYED_WORK(&pag->pag_blockgc_work, xfs_blockgc_worker);
 		INIT_RADIX_TREE(&pag->pag_ici_root, GFP_ATOMIC);
+		xfs_drain_init(&pag->pag_intents);
 		init_waitqueue_head(&pag->pagb_wait);
 		pag->pagb_count = 0;
 		pag->pagb_tree = RB_ROOT;
@@ -360,6 +362,7 @@ xfs_initialize_perag(
 	return 0;
 
 out_remove_pag:
+	xfs_drain_free(&pag->pag_intents);
 	radix_tree_delete(&mp->m_perag_tree, index);
 out_free_pag:
 	kmem_free(pag);
@@ -370,6 +373,7 @@ xfs_initialize_perag(
 		if (!pag)
 			break;
 		xfs_buf_hash_destroy(pag);
+		xfs_drain_free(&pag->pag_intents);
 		kmem_free(pag);
 	}
 	return error;
diff --git a/fs/xfs/libxfs/xfs_ag.h b/fs/xfs/libxfs/xfs_ag.h
index d61b07e60802..5b4b8658685f 100644
--- a/fs/xfs/libxfs/xfs_ag.h
+++ b/fs/xfs/libxfs/xfs_ag.h
@@ -103,6 +103,14 @@ struct xfs_perag {
 	/* background prealloc block trimming */
 	struct delayed_work	pag_blockgc_work;
 
+	/*
+	 * We use xfs_drain to track the number of deferred log intent items
+	 * that have been queued (but not yet processed) so that waiters (e.g.
+	 * scrub) will not lock resources when other threads are in the middle
+	 * of processing a chain of intent items only to find momentary
+	 * inconsistencies.
+	 */
+	struct xfs_drain	pag_intents;
 #endif /* __KERNEL__ */
 };
 
diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
index 5a321b783398..bcfb6a4203cd 100644
--- a/fs/xfs/libxfs/xfs_defer.c
+++ b/fs/xfs/libxfs/xfs_defer.c
@@ -397,6 +397,7 @@ xfs_defer_cancel_list(
 		list_for_each_safe(pwi, n, &dfp->dfp_work) {
 			list_del(pwi);
 			dfp->dfp_count--;
+			trace_xfs_defer_cancel_item(mp, dfp, pwi);
 			ops->cancel_item(pwi);
 		}
 		ASSERT(dfp->dfp_count == 0);
@@ -476,6 +477,7 @@ xfs_defer_finish_one(
 	list_for_each_safe(li, n, &dfp->dfp_work) {
 		list_del(li);
 		dfp->dfp_count--;
+		trace_xfs_defer_finish_item(tp->t_mountp, dfp, li);
 		error = ops->finish_item(tp, dfp->dfp_done, li, &state);
 		if (error == -EAGAIN) {
 			int		ret;
@@ -623,7 +625,7 @@ xfs_defer_add(
 	struct list_head		*li)
 {
 	struct xfs_defer_pending	*dfp = NULL;
-	const struct xfs_defer_op_type	*ops;
+	const struct xfs_defer_op_type	*ops = defer_op_types[type];
 
 	ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES);
 	BUILD_BUG_ON(ARRAY_SIZE(defer_op_types) != XFS_DEFER_OPS_TYPE_MAX);
@@ -636,7 +638,6 @@ xfs_defer_add(
 	if (!list_empty(&tp->t_dfops)) {
 		dfp = list_last_entry(&tp->t_dfops,
 				struct xfs_defer_pending, dfp_list);
-		ops = defer_op_types[dfp->dfp_type];
 		if (dfp->dfp_type != type ||
 		    (ops->max_items && dfp->dfp_count >= ops->max_items))
 			dfp = NULL;
@@ -653,6 +654,7 @@ xfs_defer_add(
 	}
 
 	list_add_tail(li, &dfp->dfp_work);
+	trace_xfs_defer_add_item(tp->t_mountp, dfp, li);
 	dfp->dfp_count++;
 }
 
diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index 613260b04a3d..453d8c3f2370 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -396,26 +396,19 @@ want_ag_read_header_failure(
 }
 
 /*
- * Grab the perag structure and all the headers for an AG.
+ * Grab the AG header buffers for the attached perag structure.
  *
  * The headers should be released by xchk_ag_free, but as a fail safe we attach
  * all the buffers we grab to the scrub transaction so they'll all be freed
- * when we cancel it.  Returns ENOENT if we can't grab the perag structure.
+ * when we cancel it.
  */
-int
-xchk_ag_read_headers(
+static inline int
+xchk_perag_read_headers(
 	struct xfs_scrub	*sc,
-	xfs_agnumber_t		agno,
 	struct xchk_ag		*sa)
 {
-	struct xfs_mount	*mp = sc->mp;
 	int			error;
 
-	ASSERT(!sa->pag);
-	sa->pag = xfs_perag_get(mp, agno);
-	if (!sa->pag)
-		return -ENOENT;
-
 	error = xfs_ialloc_read_agi(sa->pag, sc->tp, &sa->agi_bp);
 	if (error && want_ag_read_header_failure(sc, XFS_SCRUB_TYPE_AGI))
 		return error;
@@ -427,6 +420,94 @@ xchk_ag_read_headers(
 	return 0;
 }
 
+/*
+ * Grab the AG headers for the attached perag structure and wait for pending
+ * intents to drain.
+ */
+static int
+xchk_perag_lock(
+	struct xfs_scrub	*sc)
+{
+	struct xchk_ag		*sa = &sc->sa;
+	int			error = 0;
+
+	ASSERT(sa->pag != NULL);
+	ASSERT(sa->agi_bp == NULL);
+	ASSERT(sa->agf_bp == NULL);
+
+	do {
+		if (xchk_should_terminate(sc, &error))
+			return error;
+
+		error = xchk_perag_read_headers(sc, sa);
+		if (error)
+			return error;
+
+		/*
+		 * Decide if this AG is quiet enough for all metadata to be
+		 * consistent with each other.  XFS allows the AG header buffer
+		 * locks to cycle across transaction rolls while processing
+		 * chains of deferred ops, which means that there could be
+		 * other threads in the middle of processing a chain of
+		 * deferred ops.  For regular operations we are careful about
+		 * ordering operations to prevent collisions between threads
+		 * (which is why we don't need a per-AG lock), but scrub and
+		 * repair have to serialize against chained operations.
+		 *
+		 * We just locked all the AG header buffers; now take a look
+		 * to see if there are any intents in progress.  If there are,
+		 * drop the AG headers and wait for the intents to drain.
+		 * Since we hold all the AG header locks for the duration of
+		 * the scrub, this is the only time we have to sample the
+		 * intents counter; any threads increasing it after this point
+		 * can't possibly be in the middle of a chain of AG metadata
+		 * updates.
+		 *
+		 * Obviously, this should be slanted against scrub and in favor
+		 * of runtime threads.
+		 */
+		if (!xfs_perag_intents_busy(sa->pag))
+			return 0;
+
+		if (sa->agf_bp) {
+			xfs_trans_brelse(sc->tp, sa->agf_bp);
+			sa->agf_bp = NULL;
+		}
+
+		if (sa->agi_bp) {
+			xfs_trans_brelse(sc->tp, sa->agi_bp);
+			sa->agi_bp = NULL;
+		}
+
+		error = xfs_perag_drain_intents(sa->pag);
+		if (error == -ERESTARTSYS)
+			error = -EINTR;
+	} while (!error);
+
+	return error;
+}
+
+/*
+ * Grab the per-AG structure, grab all AG header buffers, and wait until there
+ * aren't any pending intents.  Returns -ENOENT if we can't grab the perag
+ * structure.
+ */
+int
+xchk_ag_read_headers(
+	struct xfs_scrub	*sc,
+	xfs_agnumber_t		agno,
+	struct xchk_ag		*sa)
+{
+	struct xfs_mount	*mp = sc->mp;
+
+	ASSERT(!sa->pag);
+	sa->pag = xfs_perag_get(mp, agno);
+	if (!sa->pag)
+		return -ENOENT;
+
+	return xchk_perag_lock(sc);
+}
+
 /* Release all the AG btree cursors. */
 void
 xchk_ag_btcur_free(
diff --git a/fs/xfs/scrub/health.c b/fs/xfs/scrub/health.c
index aa65ec88a0c0..f7c5a109615f 100644
--- a/fs/xfs/scrub/health.c
+++ b/fs/xfs/scrub/health.c
@@ -7,6 +7,8 @@
 #include "xfs_fs.h"
 #include "xfs_shared.h"
 #include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
 #include "xfs_btree.h"
 #include "xfs_trans_resv.h"
 #include "xfs_mount.h"
diff --git a/fs/xfs/scrub/refcount.c b/fs/xfs/scrub/refcount.c
index ffa6eda8b7d4..080487f99e5f 100644
--- a/fs/xfs/scrub/refcount.c
+++ b/fs/xfs/scrub/refcount.c
@@ -7,6 +7,8 @@
 #include "xfs_fs.h"
 #include "xfs_shared.h"
 #include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
 #include "xfs_btree.h"
 #include "xfs_rmap.h"
 #include "xfs_refcount.h"
diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
index 32ccd4bb9f46..e13184afebaf 100644
--- a/fs/xfs/xfs_bmap_item.c
+++ b/fs/xfs/xfs_bmap_item.c
@@ -374,6 +374,15 @@ xfs_bmap_update_get_group(
 
 	agno = XFS_FSB_TO_AGNO(mp, bi->bi_bmap.br_startblock);
 	bi->bi_pag = xfs_perag_get(mp, agno);
+
+	/*
+	 * Bump the intent count on behalf of the deferred rmap intent item
+	 * that we will queue when we finish this bmap work.  This rmap item
+	 * will bump the intent count before the bmap intent drops the intent
+	 * count, ensuring that the intent count remains nonzero across the
+	 * transaction roll.
+	 */
+	xfs_perag_bump_intents(bi->bi_pag);
 }
 
 /* Release an active AG ref after finishing mapping work. */
@@ -381,6 +390,7 @@ static inline void
 xfs_bmap_update_put_group(
 	struct xfs_bmap_intent	*bi)
 {
+	xfs_perag_drop_intents(bi->bi_pag);
 	xfs_perag_put(bi->bi_pag);
 }
 
diff --git a/fs/xfs/xfs_drain.c b/fs/xfs/xfs_drain.c
new file mode 100644
index 000000000000..e8fced914f88
--- /dev/null
+++ b/fs/xfs/xfs_drain.c
@@ -0,0 +1,96 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_ag.h"
+#include "xfs_trace.h"
+
+void
+xfs_drain_init(
+	struct xfs_drain	*dr)
+{
+	atomic_set(&dr->dr_count, 0);
+	init_waitqueue_head(&dr->dr_waiters);
+}
+
+void
+xfs_drain_free(struct xfs_drain	*dr)
+{
+	ASSERT(atomic_read(&dr->dr_count) == 0);
+}
+
+/* Increase the pending intent count. */
+static inline void xfs_drain_bump(struct xfs_drain *dr)
+{
+	atomic_inc(&dr->dr_count);
+}
+
+/* Decrease the pending intent count, and wake any waiters, if appropriate. */
+static inline void xfs_drain_drop(struct xfs_drain *dr)
+{
+	if (atomic_dec_and_test(&dr->dr_count) &&
+	    wq_has_sleeper(&dr->dr_waiters))
+		wake_up(&dr->dr_waiters);
+}
+
+/* Are there work items pending? */
+static inline bool xfs_drain_busy(struct xfs_drain *dr)
+{
+	return atomic_read(&dr->dr_count) > 0;
+}
+
+/*
+ * Wait for the pending intent count for a drain to hit zero.
+ *
+ * Callers must not hold any locks that would prevent intents from being
+ * finished.
+ */
+static inline int xfs_drain_wait(struct xfs_drain *dr)
+{
+	return wait_event_killable(dr->dr_waiters, !xfs_drain_busy(dr));
+}
+
+/* Add an item to the pending count. */
+void
+xfs_perag_bump_intents(
+	struct xfs_perag	*pag)
+{
+	trace_xfs_perag_bump_intents(pag, __return_address);
+	xfs_drain_bump(&pag->pag_intents);
+}
+
+/* Remove an item from the pending count. */
+void
+xfs_perag_drop_intents(
+	struct xfs_perag	*pag)
+{
+	trace_xfs_perag_drop_intents(pag, __return_address);
+	xfs_drain_drop(&pag->pag_intents);
+}
+
+/*
+ * Wait for the pending intent count for AG metadata to hit zero.
+ * Callers must not hold any AG header buffers.
+ */
+int
+xfs_perag_drain_intents(
+	struct xfs_perag	*pag)
+{
+	trace_xfs_perag_wait_intents(pag, __return_address);
+	return xfs_drain_wait(&pag->pag_intents);
+}
+
+/* Might someone else be processing intents for this AG? */
+bool
+xfs_perag_intents_busy(
+	struct xfs_perag	*pag)
+{
+	return xfs_drain_busy(&pag->pag_intents);
+}
diff --git a/fs/xfs/xfs_drain.h b/fs/xfs/xfs_drain.h
new file mode 100644
index 000000000000..f01a2b5c7337
--- /dev/null
+++ b/fs/xfs/xfs_drain.h
@@ -0,0 +1,77 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2022 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef XFS_DRAIN_H_
+#define XFS_DRAIN_H_
+
+struct xfs_perag;
+
+#ifdef CONFIG_XFS_DRAIN_INTENTS
+/*
+ * Passive drain mechanism.  This data structure tracks a count of some items
+ * and contains a waitqueue for callers who would like to wake up when the
+ * count hits zero.
+ */
+struct xfs_drain {
+	/* Number of items pending in some part of the filesystem. */
+	atomic_t		dr_count;
+
+	/* Queue to wait for dri_count to go to zero */
+	/* Queue to wait for dr_count to go to zero */
+};
+
+void xfs_drain_init(struct xfs_drain *dr);
+void xfs_drain_free(struct xfs_drain *dr);
+
+/*
+ * Deferred Work Intent Drains
+ * ===========================
+ *
+ * When a writer thread executes a chain of log intent items, the AG header
+ * buffer locks will cycle during a transaction roll to get from one intent
+ * item to the next in a chain.  Although scrub takes all AG header buffer
+ * locks, this isn't sufficient to guard against scrub checking an AG while
+ * that writer thread is in the middle of finishing a chain because there's no
+ * higher level locking primitive guarding allocation groups.
+ *
+ * When there's a collision, cross-referencing between data structures (e.g.
+ * rmapbt and refcountbt) yields false corruption events; if repair is running,
+ * this results in incorrect repairs, which is catastrophic.
+ *
+ * The solution is to add to the perag structure a count of active intents and
+ * make scrub wait until it has both AG header buffer locks and the intent
+ * counter reaches zero.  It is critical that deferred work threads hold the
+ * AGI or AGF buffers when decrementing the intent counter.
+ *
+ * Given a list of deferred work items, the deferred work manager will complete
+ * a work item and all the sub-items that the parent item creates before moving
+ * on to the next work item in the list.  This is also true for all levels of
+ * sub-items.  Writer threads are permitted to queue multiple work items
+ * targeting the same AG, so a deferred work item (such as a BUI) that creates
+ * sub-items (such as RUIs) must bump the intent counter and maintain it until
+ * the sub-items can themselves bump the intent counter.
+ *
+ * Therefore, the intent count tracks entire lifetimes of deferred work items.
+ * All functions that create work items must increment the intent counter as
+ * soon as the item is added to the transaction and cannot drop the counter
+ * until the item is finished or cancelled.
+ */
+void xfs_perag_bump_intents(struct xfs_perag *pag);
+void xfs_perag_drop_intents(struct xfs_perag *pag);
+
+int xfs_perag_drain_intents(struct xfs_perag *pag);
+bool xfs_perag_intents_busy(struct xfs_perag *pag);
+#else
+struct xfs_drain { /* empty */ };
+
+#define xfs_drain_free(dr)		((void)0)
+#define xfs_drain_init(dr)		((void)0)
+
+static inline void xfs_perag_bump_intents(struct xfs_perag *pag) { }
+static inline void xfs_perag_drop_intents(struct xfs_perag *pag) { }
+
+#endif /* CONFIG_XFS_DRAIN_INTENTS */
+
+#endif /* XFS_DRAIN_H_ */
diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
index 8db9d9abb54a..cec637de322e 100644
--- a/fs/xfs/xfs_extfree_item.c
+++ b/fs/xfs/xfs_extfree_item.c
@@ -470,6 +470,7 @@ xfs_extent_free_get_group(
 
 	agno = XFS_FSB_TO_AGNO(mp, xefi->xefi_startblock);
 	xefi->xefi_pag = xfs_perag_get(mp, agno);
+	xfs_perag_bump_intents(xefi->xefi_pag);
 }
 
 /* Release an active AG ref after some freeing work. */
@@ -477,6 +478,7 @@ static inline void
 xfs_extent_free_put_group(
 	struct xfs_extent_free_item	*xefi)
 {
+	xfs_perag_drop_intents(xefi->xefi_pag);
 	xfs_perag_put(xefi->xefi_pag);
 }
 
diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h
index f9878021e7d0..51e84f824a7c 100644
--- a/fs/xfs/xfs_linux.h
+++ b/fs/xfs/xfs_linux.h
@@ -79,6 +79,7 @@ typedef __u32			xfs_nlink_t;
 #include "xfs_cksum.h"
 #include "xfs_buf.h"
 #include "xfs_message.h"
+#include "xfs_drain.h"
 
 #ifdef __BIG_ENDIAN
 #define XFS_NATIVE_HOST 1
diff --git a/fs/xfs/xfs_refcount_item.c b/fs/xfs/xfs_refcount_item.c
index 4c4706a15056..5c6eecc5318a 100644
--- a/fs/xfs/xfs_refcount_item.c
+++ b/fs/xfs/xfs_refcount_item.c
@@ -375,6 +375,7 @@ xfs_refcount_update_get_group(
 
 	agno = XFS_FSB_TO_AGNO(mp, ri->ri_startblock);
 	ri->ri_pag = xfs_perag_get(mp, agno);
+	xfs_perag_bump_intents(ri->ri_pag);
 }
 
 /* Release an active AG ref after finishing refcounting work. */
@@ -382,6 +383,7 @@ static inline void
 xfs_refcount_update_put_group(
 	struct xfs_refcount_intent	*ri)
 {
+	xfs_perag_drop_intents(ri->ri_pag);
 	xfs_perag_put(ri->ri_pag);
 }
 
diff --git a/fs/xfs/xfs_rmap_item.c b/fs/xfs/xfs_rmap_item.c
index 10b971d24b5f..38915e92bf2b 100644
--- a/fs/xfs/xfs_rmap_item.c
+++ b/fs/xfs/xfs_rmap_item.c
@@ -400,6 +400,7 @@ xfs_rmap_update_get_group(
 
 	agno = XFS_FSB_TO_AGNO(mp, ri->ri_bmap.br_startblock);
 	ri->ri_pag = xfs_perag_get(mp, agno);
+	xfs_perag_bump_intents(ri->ri_pag);
 }
 
 /* Release an active AG ref after finishing rmapping work. */
@@ -407,6 +408,7 @@ static inline void
 xfs_rmap_update_put_group(
 	struct xfs_rmap_intent	*ri)
 {
+	xfs_perag_drop_intents(ri->ri_pag);
 	xfs_perag_put(ri->ri_pag);
 }
 
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 0448b992a561..6941deb80244 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -2679,6 +2679,44 @@ DEFINE_BMAP_FREE_DEFERRED_EVENT(xfs_bmap_free_deferred);
 DEFINE_BMAP_FREE_DEFERRED_EVENT(xfs_agfl_free_defer);
 DEFINE_BMAP_FREE_DEFERRED_EVENT(xfs_agfl_free_deferred);
 
+DECLARE_EVENT_CLASS(xfs_defer_pending_item_class,
+	TP_PROTO(struct xfs_mount *mp, struct xfs_defer_pending *dfp,
+		 void *item),
+	TP_ARGS(mp, dfp, item),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(int, type)
+		__field(void *, intent)
+		__field(void *, item)
+		__field(char, committed)
+		__field(int, nr)
+	),
+	TP_fast_assign(
+		__entry->dev = mp ? mp->m_super->s_dev : 0;
+		__entry->type = dfp->dfp_type;
+		__entry->intent = dfp->dfp_intent;
+		__entry->item = item;
+		__entry->committed = dfp->dfp_done != NULL;
+		__entry->nr = dfp->dfp_count;
+	),
+	TP_printk("dev %d:%d optype %d intent %p item %p committed %d nr %d",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->type,
+		  __entry->intent,
+		  __entry->item,
+		  __entry->committed,
+		  __entry->nr)
+)
+#define DEFINE_DEFER_PENDING_ITEM_EVENT(name) \
+DEFINE_EVENT(xfs_defer_pending_item_class, name, \
+	TP_PROTO(struct xfs_mount *mp, struct xfs_defer_pending *dfp, \
+		 void *item), \
+	TP_ARGS(mp, dfp, item))
+
+DEFINE_DEFER_PENDING_ITEM_EVENT(xfs_defer_add_item);
+DEFINE_DEFER_PENDING_ITEM_EVENT(xfs_defer_cancel_item);
+DEFINE_DEFER_PENDING_ITEM_EVENT(xfs_defer_finish_item);
+
 /* rmap tracepoints */
 DECLARE_EVENT_CLASS(xfs_rmap_class,
 	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
@@ -4318,6 +4356,39 @@ TRACE_EVENT(xfs_force_shutdown,
 		__entry->line_num)
 );
 
+#ifdef CONFIG_XFS_DRAIN_INTENTS
+DECLARE_EVENT_CLASS(xfs_perag_intents_class,
+	TP_PROTO(struct xfs_perag *pag, void *caller_ip),
+	TP_ARGS(pag, caller_ip),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_agnumber_t, agno)
+		__field(long, nr_intents)
+		__field(void *, caller_ip)
+	),
+	TP_fast_assign(
+		__entry->dev = pag->pag_mount->m_super->s_dev;
+		__entry->agno = pag->pag_agno;
+		__entry->nr_intents = atomic_read(&pag->pag_intents.dr_count);
+		__entry->caller_ip = caller_ip;
+	),
+	TP_printk("dev %d:%d agno 0x%x intents %ld caller %pS",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->agno,
+		  __entry->nr_intents,
+		  __entry->caller_ip)
+);
+
+#define DEFINE_PERAG_INTENTS_EVENT(name)	\
+DEFINE_EVENT(xfs_perag_intents_class, name,					\
+	TP_PROTO(struct xfs_perag *pag, void *caller_ip), \
+	TP_ARGS(pag, caller_ip))
+DEFINE_PERAG_INTENTS_EVENT(xfs_perag_bump_intents);
+DEFINE_PERAG_INTENTS_EVENT(xfs_perag_drop_intents);
+DEFINE_PERAG_INTENTS_EVENT(xfs_perag_wait_intents);
+
+#endif /* CONFIG_XFS_DRAIN_INTENTS */
+
 #endif /* _TRACE_XFS_H */
 
 #undef TRACE_INCLUDE_PATH


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 3/5] xfs: clean up scrub context if scrub setup returns -EDEADLOCK
  2022-12-30 22:11 ` [PATCHSET v24.0 0/5] xfs: drain deferred work items when scrubbing Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 2/5] xfs: allow queued AG intents to drain before scrubbing Darrick J. Wong
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

It has been a longstanding convention that online scrub and repair
functions can return -EDEADLOCK to signal that they weren't able to
obtain some necessary resource.  When this happens, the scrub framework
is supposed to release all resources attached to the scrub context, set
the TRY_HARDER flag in the scrub context flags, and try again.  On that
retry, individual scrub functions are supposed to take all the
resources they (incorrectly) speculated were not necessary.
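
For anyone who hasn't read the scrub framework recently, the retry
convention looks roughly like this minimal userspace sketch; scrub_ctx,
run_scrub(), and the fake_* helpers are invented for illustration and are
not the XFS functions:

#include <errno.h>
#include <stdio.h>

#define XCHK_TRY_HARDER	(1 << 0)	/* can't get resources, try again */

struct scrub_ctx {
	unsigned int	flags;
};

/* Pretend that setup can't get a resource until TRY_HARDER is set. */
static int fake_setup(struct scrub_ctx *sc)
{
	return (sc->flags & XCHK_TRY_HARDER) ? 0 : -EDEADLOCK;
}

static int fake_scrub(struct scrub_ctx *sc) { return 0; }
static int fake_teardown(struct scrub_ctx *sc) { return 0; }

static int run_scrub(struct scrub_ctx *sc)
{
	int error;

retry_op:
	/* Set up for the operation. */
	error = fake_setup(sc);
	if (error == -EDEADLOCK && !(sc->flags & XCHK_TRY_HARDER))
		goto try_harder;
	if (error)
		return error;

	/* Scrub for errors. */
	error = fake_scrub(sc);
	if (error == -EDEADLOCK && !(sc->flags & XCHK_TRY_HARDER))
		goto try_harder;
	return error;

try_harder:
	/* Drop everything, then retry with worst-case resource acquisition. */
	error = fake_teardown(sc);
	if (error)
		return error;
	sc->flags |= XCHK_TRY_HARDER;
	goto retry_op;
}

int main(void)
{
	struct scrub_ctx	sc = { .flags = 0 };

	printf("scrub returned %d\n", run_scrub(&sc));
	return 0;
}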

We're about to make it so that the functions that lock and wait for a
filesystem AG can also return -EDEADLOCK to signal that we need to try
again with the drain waiters enabled.  Therefore, refactor
xfs_scrub_metadata to support this behavior for ->setup() functions.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/scrub.c |   28 ++++++++++++++++------------
 1 file changed, 16 insertions(+), 12 deletions(-)


diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 07a7a75f987f..50db13c5f626 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -491,23 +491,16 @@ xfs_scrub_metadata(
 
 	/* Set up for the operation. */
 	error = sc->ops->setup(sc);
+	if (error == -EDEADLOCK && !(sc->flags & XCHK_TRY_HARDER))
+		goto try_harder;
 	if (error)
 		goto out_teardown;
 
 	/* Scrub for errors. */
 	error = sc->ops->scrub(sc);
-	if (!(sc->flags & XCHK_TRY_HARDER) && error == -EDEADLOCK) {
-		/*
-		 * Scrubbers return -EDEADLOCK to mean 'try harder'.
-		 * Tear down everything we hold, then set up again with
-		 * preparation for worst-case scenarios.
-		 */
-		error = xchk_teardown(sc, 0);
-		if (error)
-			goto out_sc;
-		sc->flags |= XCHK_TRY_HARDER;
-		goto retry_op;
-	} else if (error || (sm->sm_flags & XFS_SCRUB_OFLAG_INCOMPLETE))
+	if (error == -EDEADLOCK && !(sc->flags & XCHK_TRY_HARDER))
+		goto try_harder;
+	if (error || (sm->sm_flags & XFS_SCRUB_OFLAG_INCOMPLETE))
 		goto out_teardown;
 
 	xchk_update_health(sc);
@@ -565,4 +558,15 @@ xfs_scrub_metadata(
 		error = 0;
 	}
 	return error;
+try_harder:
+	/*
+	 * Scrubbers return -EDEADLOCK to mean 'try harder'.  Tear down
+	 * everything we hold, then set up again with preparation for
+	 * worst-case scenarios.
+	 */
+	error = xchk_teardown(sc, 0);
+	if (error)
+		goto out_sc;
+	sc->flags |= XCHK_TRY_HARDER;
+	goto retry_op;
 }


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 4/5] xfs: minimize overhead of drain wakeups by using jump labels
  2022-12-30 22:11 ` [PATCHSET v24.0 0/5] xfs: drain deferred work items when scrubbing Darrick J. Wong
                     ` (2 preceding siblings ...)
  2022-12-30 22:11   ` [PATCH 1/5] xfs: add a tracepoint to report incorrect extent refcounts Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 5/5] xfs: scrub should use ECHRNG to signal that the drain is needed Darrick J. Wong
  4 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

To reduce the runtime overhead even further when online fsck isn't
running, use a static branch key to decide if we call wake_up on the
drain.  For compilers that support jump labels, the call to wake_up is
replaced by a nop sled when nobody is waiting for intents to drain.
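
As a rough userspace analogue of that pattern (this is not kernel code: a
plain atomic flag stands in for the static key, which in the real code is
a counter so that concurrent scrubbers nest correctly, and pthread
primitives stand in for the waitqueue):

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_int	dr_count;		/* pending intent items */
static atomic_bool	waiter_present;		/* stand-in for the static key */
static pthread_mutex_t	dr_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t	dr_waiters = PTHREAD_COND_INITIALIZER;

static void drain_bump(void)
{
	atomic_fetch_add(&dr_count, 1);
}

static void drain_drop(void)
{
	/* Only pay for the wakeup if a scrubber has announced itself. */
	if (atomic_fetch_sub(&dr_count, 1) == 1 &&
	    atomic_load(&waiter_present)) {
		pthread_mutex_lock(&dr_lock);
		pthread_cond_broadcast(&dr_waiters);
		pthread_mutex_unlock(&dr_lock);
	}
}

static void drain_wait(void)
{
	atomic_store(&waiter_present, true);	/* static_branch_inc() */
	pthread_mutex_lock(&dr_lock);
	while (atomic_load(&dr_count) > 0)
		pthread_cond_wait(&dr_waiters, &dr_lock);
	pthread_mutex_unlock(&dr_lock);
	atomic_store(&waiter_present, false);	/* static_branch_dec() */
}

int main(void)
{
	drain_bump();
	drain_drop();
	drain_wait();		/* returns immediately, nothing pending */
	printf("drained\n");
	return 0;
}

In the kernel, the waiter_present test disappears entirely when nobody is
scrubbing, because the static branch is patched into a nop sled.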

From my initial microbenchmarking, every transition of the static key
between the on and off states takes about 22000ns to complete; this is
paid entirely by the xfs_scrub process.  When the static key is off
(which it should be when fsck isn't running), the nop sled adds an
overhead of approximately 0.36ns to runtime code.

For the few compilers that don't support jump labels, runtime code pays
the cost of calling wake_up on an empty waitqueue, which was observed to
be about 30ns.  However, most architectures that have sufficient memory
and CPU capacity to run XFS also support jump labels, so this is not
much of a worry.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/Kconfig            |    1 +
 fs/xfs/scrub/agheader.c   |    9 +++++++++
 fs/xfs/scrub/alloc.c      |    3 +++
 fs/xfs/scrub/bmap.c       |    3 +++
 fs/xfs/scrub/common.c     |   24 ++++++++++++++++++++++++
 fs/xfs/scrub/common.h     |   15 +++++++++++++++
 fs/xfs/scrub/fscounters.c |    7 +++++++
 fs/xfs/scrub/ialloc.c     |    2 ++
 fs/xfs/scrub/inode.c      |    3 +++
 fs/xfs/scrub/quota.c      |    3 +++
 fs/xfs/scrub/refcount.c   |    2 ++
 fs/xfs/scrub/rmap.c       |    3 +++
 fs/xfs/scrub/scrub.c      |   25 +++++++++++++++++++++----
 fs/xfs/scrub/scrub.h      |    5 ++++-
 fs/xfs/scrub/trace.h      |   33 +++++++++++++++++++++++++++++++++
 fs/xfs/xfs_drain.c        |   27 ++++++++++++++++++++++++++-
 fs/xfs/xfs_drain.h        |    3 +++
 17 files changed, 162 insertions(+), 6 deletions(-)


diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig
index ab24e683b440..05bc865142b8 100644
--- a/fs/xfs/Kconfig
+++ b/fs/xfs/Kconfig
@@ -95,6 +95,7 @@ config XFS_RT
 
 config XFS_DRAIN_INTENTS
 	bool
+	select JUMP_LABEL if HAVE_ARCH_JUMP_LABEL
 
 config XFS_ONLINE_SCRUB
 	bool "XFS online metadata check support"
diff --git a/fs/xfs/scrub/agheader.c b/fs/xfs/scrub/agheader.c
index 4dd52b15f09c..3dd9151a20ad 100644
--- a/fs/xfs/scrub/agheader.c
+++ b/fs/xfs/scrub/agheader.c
@@ -18,6 +18,15 @@
 #include "scrub/scrub.h"
 #include "scrub/common.h"
 
+int
+xchk_setup_agheader(
+	struct xfs_scrub	*sc)
+{
+	if (xchk_need_fshook_drain(sc))
+		xchk_fshooks_enable(sc, XCHK_FSHOOKS_DRAIN);
+	return xchk_setup_fs(sc);
+}
+
 /* Superblock */
 
 /* Cross-reference with the other btrees. */
diff --git a/fs/xfs/scrub/alloc.c b/fs/xfs/scrub/alloc.c
index 3b38f4e2a537..d0509219722f 100644
--- a/fs/xfs/scrub/alloc.c
+++ b/fs/xfs/scrub/alloc.c
@@ -24,6 +24,9 @@ int
 xchk_setup_ag_allocbt(
 	struct xfs_scrub	*sc)
 {
+	if (xchk_need_fshook_drain(sc))
+		xchk_fshooks_enable(sc, XCHK_FSHOOKS_DRAIN);
+
 	return xchk_setup_ag_btree(sc, false);
 }
 
diff --git a/fs/xfs/scrub/bmap.c b/fs/xfs/scrub/bmap.c
index d50d0eab196a..5c4b25585b8c 100644
--- a/fs/xfs/scrub/bmap.c
+++ b/fs/xfs/scrub/bmap.c
@@ -31,6 +31,9 @@ xchk_setup_inode_bmap(
 {
 	int			error;
 
+	if (xchk_need_fshook_drain(sc))
+		xchk_fshooks_enable(sc, XCHK_FSHOOKS_DRAIN);
+
 	error = xchk_get_inode(sc);
 	if (error)
 		goto out;
diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index 453d8c3f2370..2c8ce015f3a9 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -479,6 +479,8 @@ xchk_perag_lock(
 			sa->agi_bp = NULL;
 		}
 
+		if (!(sc->flags & XCHK_FSHOOKS_DRAIN))
+			return -EDEADLOCK;
 		error = xfs_perag_drain_intents(sa->pag);
 		if (error == -ERESTARTSYS)
 			error = -EINTR;
@@ -992,3 +994,25 @@ xchk_start_reaping(
 	}
 	sc->flags &= ~XCHK_REAPING_DISABLED;
 }
+
+/*
+ * Enable filesystem hooks (i.e. runtime code patching) before starting a scrub
+ * operation.  Callers must not hold any locks that intersect with the CPU
+ * hotplug lock (e.g. writeback locks) because code patching must halt the CPUs
+ * to change kernel code.
+ */
+void
+xchk_fshooks_enable(
+	struct xfs_scrub	*sc,
+	unsigned int		scrub_fshooks)
+{
+	ASSERT(!(scrub_fshooks & ~XCHK_FSHOOKS_ALL));
+	ASSERT(!(sc->flags & scrub_fshooks));
+
+	trace_xchk_fshooks_enable(sc, scrub_fshooks);
+
+	if (scrub_fshooks & XCHK_FSHOOKS_DRAIN)
+		xfs_drain_wait_enable();
+
+	sc->flags |= scrub_fshooks;
+}
diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
index b73648d81d23..4de5677390a4 100644
--- a/fs/xfs/scrub/common.h
+++ b/fs/xfs/scrub/common.h
@@ -72,6 +72,7 @@ bool xchk_should_check_xref(struct xfs_scrub *sc, int *error,
 			   struct xfs_btree_cur **curpp);
 
 /* Setup functions */
+int xchk_setup_agheader(struct xfs_scrub *sc);
 int xchk_setup_fs(struct xfs_scrub *sc);
 int xchk_setup_ag_allocbt(struct xfs_scrub *sc);
 int xchk_setup_ag_iallocbt(struct xfs_scrub *sc);
@@ -151,4 +152,18 @@ int xchk_ilock_inverted(struct xfs_inode *ip, uint lock_mode);
 void xchk_stop_reaping(struct xfs_scrub *sc);
 void xchk_start_reaping(struct xfs_scrub *sc);
 
+/*
+ * Setting up a hook to wait for intents to drain is costly -- we have to take
+ * the CPU hotplug lock and force an i-cache flush on all CPUs once to set it
+ * up, and again to tear it down.  These costs add up quickly, so we only want
+ * to enable the drain waiter if the drain actually detected a conflict with
+ * running intent chains.
+ */
+static inline bool xchk_need_fshook_drain(struct xfs_scrub *sc)
+{
+	return sc->flags & XCHK_TRY_HARDER;
+}
+
+void xchk_fshooks_enable(struct xfs_scrub *sc, unsigned int scrub_fshooks);
+
 #endif	/* __XFS_SCRUB_COMMON_H__ */
diff --git a/fs/xfs/scrub/fscounters.c b/fs/xfs/scrub/fscounters.c
index 4777e7b89fdc..63755ba4fc0e 100644
--- a/fs/xfs/scrub/fscounters.c
+++ b/fs/xfs/scrub/fscounters.c
@@ -128,6 +128,13 @@ xchk_setup_fscounters(
 	struct xchk_fscounters	*fsc;
 	int			error;
 
+	/*
+	 * If the AGF doesn't track btreeblks, we have to lock the AGF to count
+	 * btree block usage by walking the actual btrees.
+	 */
+	if (!xfs_has_lazysbcount(sc->mp))
+		xchk_fshooks_enable(sc, XCHK_FSHOOKS_DRAIN);
+
 	sc->buf = kzalloc(sizeof(struct xchk_fscounters), XCHK_GFP_FLAGS);
 	if (!sc->buf)
 		return -ENOMEM;
diff --git a/fs/xfs/scrub/ialloc.c b/fs/xfs/scrub/ialloc.c
index e312be7cd375..fd5bc289de4c 100644
--- a/fs/xfs/scrub/ialloc.c
+++ b/fs/xfs/scrub/ialloc.c
@@ -32,6 +32,8 @@ int
 xchk_setup_ag_iallocbt(
 	struct xfs_scrub	*sc)
 {
+	if (xchk_need_fshook_drain(sc))
+		xchk_fshooks_enable(sc, XCHK_FSHOOKS_DRAIN);
 	return xchk_setup_ag_btree(sc, sc->flags & XCHK_TRY_HARDER);
 }
 
diff --git a/fs/xfs/scrub/inode.c b/fs/xfs/scrub/inode.c
index 7a2f38e5202c..8c972ee15a30 100644
--- a/fs/xfs/scrub/inode.c
+++ b/fs/xfs/scrub/inode.c
@@ -32,6 +32,9 @@ xchk_setup_inode(
 {
 	int			error;
 
+	if (xchk_need_fshook_drain(sc))
+		xchk_fshooks_enable(sc, XCHK_FSHOOKS_DRAIN);
+
 	/*
 	 * Try to get the inode.  If the verifiers fail, we try again
 	 * in raw mode.
diff --git a/fs/xfs/scrub/quota.c b/fs/xfs/scrub/quota.c
index 9eeac8565394..7b21e1012eff 100644
--- a/fs/xfs/scrub/quota.c
+++ b/fs/xfs/scrub/quota.c
@@ -53,6 +53,9 @@ xchk_setup_quota(
 	if (!xfs_this_quota_on(sc->mp, dqtype))
 		return -ENOENT;
 
+	if (xchk_need_fshook_drain(sc))
+		xchk_fshooks_enable(sc, XCHK_FSHOOKS_DRAIN);
+
 	error = xchk_setup_fs(sc);
 	if (error)
 		return error;
diff --git a/fs/xfs/scrub/refcount.c b/fs/xfs/scrub/refcount.c
index 080487f99e5f..9423aad28511 100644
--- a/fs/xfs/scrub/refcount.c
+++ b/fs/xfs/scrub/refcount.c
@@ -27,6 +27,8 @@ int
 xchk_setup_ag_refcountbt(
 	struct xfs_scrub	*sc)
 {
+	if (xchk_need_fshook_drain(sc))
+		xchk_fshooks_enable(sc, XCHK_FSHOOKS_DRAIN);
 	return xchk_setup_ag_btree(sc, false);
 }
 
diff --git a/fs/xfs/scrub/rmap.c b/fs/xfs/scrub/rmap.c
index 229826b2e1c0..afc4f840b6bc 100644
--- a/fs/xfs/scrub/rmap.c
+++ b/fs/xfs/scrub/rmap.c
@@ -24,6 +24,9 @@ int
 xchk_setup_ag_rmapbt(
 	struct xfs_scrub	*sc)
 {
+	if (xchk_need_fshook_drain(sc))
+		xchk_fshooks_enable(sc, XCHK_FSHOOKS_DRAIN);
+
 	return xchk_setup_ag_btree(sc, false);
 }
 
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 50db13c5f626..8f8a4eb758ea 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -145,6 +145,21 @@ xchk_probe(
 
 /* Scrub setup and teardown */
 
+static inline void
+xchk_fshooks_disable(
+	struct xfs_scrub	*sc)
+{
+	if (!(sc->flags & XCHK_FSHOOKS_ALL))
+		return;
+
+	trace_xchk_fshooks_disable(sc, sc->flags & XCHK_FSHOOKS_ALL);
+
+	if (sc->flags & XCHK_FSHOOKS_DRAIN)
+		xfs_drain_wait_disable();
+
+	sc->flags &= ~XCHK_FSHOOKS_ALL;
+}
+
 /* Free all the resources and finish the transactions. */
 STATIC int
 xchk_teardown(
@@ -177,6 +192,8 @@ xchk_teardown(
 		kvfree(sc->buf);
 		sc->buf = NULL;
 	}
+
+	xchk_fshooks_disable(sc);
 	return error;
 }
 
@@ -191,25 +208,25 @@ static const struct xchk_meta_ops meta_scrub_ops[] = {
 	},
 	[XFS_SCRUB_TYPE_SB] = {		/* superblock */
 		.type	= ST_PERAG,
-		.setup	= xchk_setup_fs,
+		.setup	= xchk_setup_agheader,
 		.scrub	= xchk_superblock,
 		.repair	= xrep_superblock,
 	},
 	[XFS_SCRUB_TYPE_AGF] = {	/* agf */
 		.type	= ST_PERAG,
-		.setup	= xchk_setup_fs,
+		.setup	= xchk_setup_agheader,
 		.scrub	= xchk_agf,
 		.repair	= xrep_agf,
 	},
 	[XFS_SCRUB_TYPE_AGFL]= {	/* agfl */
 		.type	= ST_PERAG,
-		.setup	= xchk_setup_fs,
+		.setup	= xchk_setup_agheader,
 		.scrub	= xchk_agfl,
 		.repair	= xrep_agfl,
 	},
 	[XFS_SCRUB_TYPE_AGI] = {	/* agi */
 		.type	= ST_PERAG,
-		.setup	= xchk_setup_fs,
+		.setup	= xchk_setup_agheader,
 		.scrub	= xchk_agi,
 		.repair	= xrep_agi,
 	},
diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h
index b4d391b4c938..4ff4b19bee3d 100644
--- a/fs/xfs/scrub/scrub.h
+++ b/fs/xfs/scrub/scrub.h
@@ -96,9 +96,12 @@ struct xfs_scrub {
 
 /* XCHK state flags grow up from zero, XREP state flags grown down from 2^31 */
 #define XCHK_TRY_HARDER		(1 << 0)  /* can't get resources, try again */
-#define XCHK_REAPING_DISABLED	(1 << 2)  /* background block reaping paused */
+#define XCHK_REAPING_DISABLED	(1 << 1)  /* background block reaping paused */
+#define XCHK_FSHOOKS_DRAIN	(1 << 2)  /* defer ops draining enabled */
 #define XREP_ALREADY_FIXED	(1 << 31) /* checking our repair work */
 
+#define XCHK_FSHOOKS_ALL	(XCHK_FSHOOKS_DRAIN)
+
 /* Metadata scrubbers */
 int xchk_tester(struct xfs_scrub *sc);
 int xchk_superblock(struct xfs_scrub *sc);
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index 403c0e62257e..034b80371da5 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -96,6 +96,12 @@ TRACE_DEFINE_ENUM(XFS_SCRUB_TYPE_FSCOUNTERS);
 	{ XFS_SCRUB_OFLAG_WARNING,		"warning" }, \
 	{ XFS_SCRUB_OFLAG_NO_REPAIR_NEEDED,	"norepair" }
 
+#define XFS_SCRUB_STATE_STRINGS \
+	{ XCHK_TRY_HARDER,			"try_harder" }, \
+	{ XCHK_REAPING_DISABLED,		"reaping_disabled" }, \
+	{ XCHK_FSHOOKS_DRAIN,			"fshooks_drain" }, \
+	{ XREP_ALREADY_FIXED,			"already_fixed" }
+
 DECLARE_EVENT_CLASS(xchk_class,
 	TP_PROTO(struct xfs_inode *ip, struct xfs_scrub_metadata *sm,
 		 int error),
@@ -142,6 +148,33 @@ DEFINE_SCRUB_EVENT(xchk_deadlock_retry);
 DEFINE_SCRUB_EVENT(xrep_attempt);
 DEFINE_SCRUB_EVENT(xrep_done);
 
+DECLARE_EVENT_CLASS(xchk_fshook_class,
+	TP_PROTO(struct xfs_scrub *sc, unsigned int fshooks),
+	TP_ARGS(sc, fshooks),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(unsigned int, type)
+		__field(unsigned int, fshooks)
+	),
+	TP_fast_assign(
+		__entry->dev = sc->mp->m_super->s_dev;
+		__entry->type = sc->sm->sm_type;
+		__entry->fshooks = fshooks;
+	),
+	TP_printk("dev %d:%d type %s fshooks '%s'",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __print_symbolic(__entry->type, XFS_SCRUB_TYPE_STRINGS),
+		  __print_flags(__entry->fshooks, "|", XFS_SCRUB_STATE_STRINGS))
+)
+
+#define DEFINE_SCRUB_FSHOOK_EVENT(name) \
+DEFINE_EVENT(xchk_fshook_class, name, \
+	TP_PROTO(struct xfs_scrub *sc, unsigned int fshooks), \
+	TP_ARGS(sc, fshooks))
+
+DEFINE_SCRUB_FSHOOK_EVENT(xchk_fshooks_enable);
+DEFINE_SCRUB_FSHOOK_EVENT(xchk_fshooks_disable);
+
 TRACE_EVENT(xchk_op_error,
 	TP_PROTO(struct xfs_scrub *sc, xfs_agnumber_t agno,
 		 xfs_agblock_t bno, int error, void *ret_ip),
diff --git a/fs/xfs/xfs_drain.c b/fs/xfs/xfs_drain.c
index e8fced914f88..9b463e1183f6 100644
--- a/fs/xfs/xfs_drain.c
+++ b/fs/xfs/xfs_drain.c
@@ -12,6 +12,31 @@
 #include "xfs_ag.h"
 #include "xfs_trace.h"
 
+/*
+ * Use a static key here to reduce the overhead of xfs_drain_drop.  If the
+ * compiler supports jump labels, the static branch will be replaced by a nop
+ * sled when there are no xfs_drain_wait callers.  Online fsck is currently
+ * the only caller, so this is a reasonable tradeoff.
+ *
+ * Note: Patching the kernel code requires taking the cpu hotplug lock.  Other
+ * parts of the kernel allocate memory with that lock held, which means that
+ * XFS callers cannot hold any locks that might be used by memory reclaim or
+ * writeback when calling the static_branch_{inc,dec} functions.
+ */
+static DEFINE_STATIC_KEY_FALSE(xfs_drain_waiter_hook);
+
+void
+xfs_drain_wait_disable(void)
+{
+	static_branch_dec(&xfs_drain_waiter_hook);
+}
+
+void
+xfs_drain_wait_enable(void)
+{
+	static_branch_inc(&xfs_drain_waiter_hook);
+}
+
 void
 xfs_drain_init(
 	struct xfs_drain	*dr)
@@ -36,7 +61,7 @@ static inline void xfs_drain_bump(struct xfs_drain *dr)
 static inline void xfs_drain_drop(struct xfs_drain *dr)
 {
 	if (atomic_dec_and_test(&dr->dr_count) &&
-	    wq_has_sleeper(&dr->dr_waiters))
+	    static_branch_unlikely(&xfs_drain_waiter_hook))
 		wake_up(&dr->dr_waiters);
 }
 
diff --git a/fs/xfs/xfs_drain.h b/fs/xfs/xfs_drain.h
index f01a2b5c7337..a980df6d3508 100644
--- a/fs/xfs/xfs_drain.h
+++ b/fs/xfs/xfs_drain.h
@@ -25,6 +25,9 @@ struct xfs_drain {
 void xfs_drain_init(struct xfs_drain *dr);
 void xfs_drain_free(struct xfs_drain *dr);
 
+void xfs_drain_wait_disable(void);
+void xfs_drain_wait_enable(void);
+
 /*
  * Deferred Work Intent Drains
  * ===========================


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 5/5] xfs: scrub should use ECHRNG to signal that the drain is needed
  2022-12-30 22:11 ` [PATCHSET v24.0 0/5] xfs: drain deferred work items when scrubbing Darrick J. Wong
                     ` (3 preceding siblings ...)
  2022-12-30 22:11   ` [PATCH 4/5] xfs: minimize overhead of drain wakeups by using jump labels Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  4 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

In the previous patch, we added jump labels to the intent drain code so
that regular filesystem operations need not pay the price of checking
for someone (scrub) waiting on intents to drain from some part of the
filesystem when that someone isn't running.

However, I observed that xfs/285 now spends a lot more time pushing the
AIL from the inode btree scrubber than it used to.  This is because the
inobt scrubber will try to push the AIL to get logged inode cores
written to the filesystem when it sees a weird discrepancy between the
ondisk inode and the inobt records.  This AIL push is triggered when the
setup function sees TRY_HARDER is set; and the requisite EDEADLOCK
return is initiated when the discrepancy is seen.

The solution to this performance slowdown is to use a different result
code (ECHRNG) for scrub code to signal that it needs to wait for
deferred intent work items to drain out of some part of the filesystem.
When this happens, set a new scrub state flag (XCHK_NEED_DRAIN) so that
setup functions will activate the jump label.
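
Roughly, the retry dispatch ends up distinguishing the two signals like
the sketch below; this is an illustration with made-up flag values, not
the actual xfs_scrub_metadata code:

#include <errno.h>
#include <stdbool.h>

#define XCHK_TRY_HARDER		(1 << 0)	/* grab resources the hard way */
#define XCHK_NEED_DRAIN		(1 << 3)	/* enable the intent drain waiter */

struct scrub_ctx {
	unsigned int	flags;
};

/*
 * Return true if the caller should tear everything down and retry.
 * -EDEADLOCK keeps its old meaning (lock everything up front), while
 * -ECHRNG only asks for the drain hook, avoiding the AIL pushes that the
 * TRY_HARDER setup paths perform.
 */
static bool want_retry(struct scrub_ctx *sc, int error)
{
	switch (error) {
	case -EDEADLOCK:
		if (sc->flags & XCHK_TRY_HARDER)
			return false;
		sc->flags |= XCHK_TRY_HARDER;
		return true;
	case -ECHRNG:
		if (sc->flags & XCHK_NEED_DRAIN)
			return false;
		sc->flags |= XCHK_NEED_DRAIN;
		return true;
	default:
		return false;
	}
}

int main(void)
{
	struct scrub_ctx	sc = { .flags = 0 };

	return want_retry(&sc, -ECHRNG) ? 0 : 1;
}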

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/btree.c   |    1 +
 fs/xfs/scrub/common.c  |    4 +++-
 fs/xfs/scrub/common.h  |    2 +-
 fs/xfs/scrub/dabtree.c |    1 +
 fs/xfs/scrub/repair.c  |    3 +++
 fs/xfs/scrub/scrub.c   |   10 ++++++++++
 fs/xfs/scrub/scrub.h   |    1 +
 fs/xfs/scrub/trace.h   |    1 +
 8 files changed, 21 insertions(+), 2 deletions(-)


diff --git a/fs/xfs/scrub/btree.c b/fs/xfs/scrub/btree.c
index 0fd36d5b4646..ebbf1c5fd0c6 100644
--- a/fs/xfs/scrub/btree.c
+++ b/fs/xfs/scrub/btree.c
@@ -36,6 +36,7 @@ __xchk_btree_process_error(
 
 	switch (*error) {
 	case -EDEADLOCK:
+	case -ECHRNG:
 		/* Used to restart an op with deadlock avoidance. */
 		trace_xchk_deadlock_retry(sc->ip, sc->sm, *error);
 		break;
diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index 2c8ce015f3a9..b21d675dd158 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -75,6 +75,7 @@ __xchk_process_error(
 	case 0:
 		return true;
 	case -EDEADLOCK:
+	case -ECHRNG:
 		/* Used to restart an op with deadlock avoidance. */
 		trace_xchk_deadlock_retry(
 				sc->ip ? sc->ip : XFS_I(file_inode(sc->file)),
@@ -130,6 +131,7 @@ __xchk_fblock_process_error(
 	case 0:
 		return true;
 	case -EDEADLOCK:
+	case -ECHRNG:
 		/* Used to restart an op with deadlock avoidance. */
 		trace_xchk_deadlock_retry(sc->ip, sc->sm, *error);
 		break;
@@ -480,7 +482,7 @@ xchk_perag_lock(
 		}
 
 		if (!(sc->flags & XCHK_FSHOOKS_DRAIN))
-			return -EDEADLOCK;
+			return -ECHRNG;
 		error = xfs_perag_drain_intents(sa->pag);
 		if (error == -ERESTARTSYS)
 			error = -EINTR;
diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
index 4de5677390a4..0efe6b947d88 100644
--- a/fs/xfs/scrub/common.h
+++ b/fs/xfs/scrub/common.h
@@ -161,7 +161,7 @@ void xchk_start_reaping(struct xfs_scrub *sc);
  */
 static inline bool xchk_need_fshook_drain(struct xfs_scrub *sc)
 {
-	return sc->flags & XCHK_TRY_HARDER;
+	return sc->flags & XCHK_NEED_DRAIN;
 }
 
 void xchk_fshooks_enable(struct xfs_scrub *sc, unsigned int scrub_fshooks);
diff --git a/fs/xfs/scrub/dabtree.c b/fs/xfs/scrub/dabtree.c
index d17cee177085..957a0b1a2f0b 100644
--- a/fs/xfs/scrub/dabtree.c
+++ b/fs/xfs/scrub/dabtree.c
@@ -39,6 +39,7 @@ xchk_da_process_error(
 
 	switch (*error) {
 	case -EDEADLOCK:
+	case -ECHRNG:
 		/* Used to restart an op with deadlock avoidance. */
 		trace_xchk_deadlock_retry(sc->ip, sc->sm, *error);
 		break;
diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
index a0b85bdd4c5a..446ffe987ca0 100644
--- a/fs/xfs/scrub/repair.c
+++ b/fs/xfs/scrub/repair.c
@@ -60,6 +60,9 @@ xrep_attempt(
 		sc->sm->sm_flags &= ~XFS_SCRUB_FLAGS_OUT;
 		sc->flags |= XREP_ALREADY_FIXED;
 		return -EAGAIN;
+	case -ECHRNG:
+		sc->flags |= XCHK_NEED_DRAIN;
+		return -EAGAIN;
 	case -EDEADLOCK:
 		/* Tell the caller to try again having grabbed all the locks. */
 		if (!(sc->flags & XCHK_TRY_HARDER)) {
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 8f8a4eb758ea..7a3557a69fe0 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -510,6 +510,8 @@ xfs_scrub_metadata(
 	error = sc->ops->setup(sc);
 	if (error == -EDEADLOCK && !(sc->flags & XCHK_TRY_HARDER))
 		goto try_harder;
+	if (error == -ECHRNG && !(sc->flags & XCHK_NEED_DRAIN))
+		goto need_drain;
 	if (error)
 		goto out_teardown;
 
@@ -517,6 +519,8 @@ xfs_scrub_metadata(
 	error = sc->ops->scrub(sc);
 	if (error == -EDEADLOCK && !(sc->flags & XCHK_TRY_HARDER))
 		goto try_harder;
+	if (error == -ECHRNG && !(sc->flags & XCHK_NEED_DRAIN))
+		goto need_drain;
 	if (error || (sm->sm_flags & XFS_SCRUB_OFLAG_INCOMPLETE))
 		goto out_teardown;
 
@@ -575,6 +579,12 @@ xfs_scrub_metadata(
 		error = 0;
 	}
 	return error;
+need_drain:
+	error = xchk_teardown(sc, 0);
+	if (error)
+		goto out_sc;
+	sc->flags |= XCHK_NEED_DRAIN;
+	goto retry_op;
 try_harder:
 	/*
 	 * Scrubbers return -EDEADLOCK to mean 'try harder'.  Tear down
diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h
index 4ff4b19bee3d..85c055c2ddc5 100644
--- a/fs/xfs/scrub/scrub.h
+++ b/fs/xfs/scrub/scrub.h
@@ -98,6 +98,7 @@ struct xfs_scrub {
 #define XCHK_TRY_HARDER		(1 << 0)  /* can't get resources, try again */
 #define XCHK_REAPING_DISABLED	(1 << 1)  /* background block reaping paused */
 #define XCHK_FSHOOKS_DRAIN	(1 << 2)  /* defer ops draining enabled */
+#define XCHK_NEED_DRAIN		(1 << 3)  /* scrub needs to use intent drain */
 #define XREP_ALREADY_FIXED	(1 << 31) /* checking our repair work */
 
 #define XCHK_FSHOOKS_ALL	(XCHK_FSHOOKS_DRAIN)
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index 034b80371da5..cd9cfe98f14f 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -100,6 +100,7 @@ TRACE_DEFINE_ENUM(XFS_SCRUB_TYPE_FSCOUNTERS);
 	{ XCHK_TRY_HARDER,			"try_harder" }, \
 	{ XCHK_REAPING_DISABLED,		"reaping_disabled" }, \
 	{ XCHK_FSHOOKS_DRAIN,			"fshooks_drain" }, \
+	{ XCHK_NEED_DRAIN,			"need_drain" }, \
 	{ XREP_ALREADY_FIXED,			"already_fixed" }
 
 DECLARE_EVENT_CLASS(xchk_class,


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCHSET v24.0 0/8] xfs: standardize btree record checking code
  2022-12-30 21:13 [NYE DELUGE 1/4] xfs: all pending online scrub improvements Darrick J. Wong
                   ` (4 preceding siblings ...)
  2022-12-30 22:11 ` [PATCHSET v24.0 0/5] xfs: drain deferred work items when scrubbing Darrick J. Wong
@ 2022-12-30 22:11 ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 4/8] xfs: return a failure address from xfs_rmap_irec_offset_unpack Darrick J. Wong
                     ` (7 more replies)
  2022-12-30 22:11 ` [PATCHSET v24.0 0/3] xfs: hoist scrub record checks into libxfs Darrick J. Wong
                   ` (16 subsequent siblings)
  22 siblings, 8 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

Hi all,

While I was cleaning things up for 6.1, I noticed that the btree
_query_range and _query_all functions don't perform the same checking
that the _get_rec functions perform.  In fact, they don't perform /any/
sanity checking, which means that callers aren't warned about impossible
records.

Therefore, hoist the record validation and complaint logging code into
separate functions, and call them from any place where we convert an
ondisk record into an incore record.  For online scrub, we can replace
checking code with a call to the record checking functions in libxfs,
thereby reducing the size of the codebase.
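
To make the shape of the change concrete, here is a self-contained sketch
of the pattern using invented demo_* types (assuming a little-endian host
for the byte swap); the real code uses the xfs_*_rec_incore structures and
returns an xfs_failaddr_t rather than a bool:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define DEMO_AG_BLOCKS	1024u		/* pretend AG size in blocks */

/* Illustrative stand-ins, not the real xfs_alloc record layouts. */
struct demo_rec_ondisk {
	uint32_t	startblock_be;
	uint32_t	blockcount_be;
};

struct demo_rec_incore {
	uint32_t	startblock;
	uint32_t	blockcount;
};

/* One helper converts the ondisk btree record to its incore form... */
static void demo_btrec_to_irec(const struct demo_rec_ondisk *rec,
			       struct demo_rec_incore *irec)
{
	irec->startblock = __builtin_bswap32(rec->startblock_be);
	irec->blockcount = __builtin_bswap32(rec->blockcount_be);
}

/* ...and one checker is shared by _get_rec and the _query_* paths. */
static bool demo_check_irec(const struct demo_rec_incore *irec)
{
	if (irec->blockcount == 0)
		return false;
	/* valid extent range, including overflow */
	if (irec->startblock >= DEMO_AG_BLOCKS ||
	    irec->blockcount > DEMO_AG_BLOCKS - irec->startblock)
		return false;
	return true;
}

int main(void)
{
	struct demo_rec_ondisk	rec = {
		.startblock_be	= __builtin_bswap32(8),
		.blockcount_be	= __builtin_bswap32(0),	/* impossible record */
	};
	struct demo_rec_incore	irec;

	demo_btrec_to_irec(&rec, &irec);
	printf("record is %s\n", demo_check_irec(&irec) ? "ok" : "corrupt");
	return 0;
}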

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=btree-complain-bad-records

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=btree-complain-bad-records
---
 fs/xfs/libxfs/xfs_alloc.c        |   82 ++++++++++++++++++++++++---------
 fs/xfs/libxfs/xfs_alloc.h        |    6 ++
 fs/xfs/libxfs/xfs_bmap.c         |   31 ++++++++++++
 fs/xfs/libxfs/xfs_bmap.h         |    2 +
 fs/xfs/libxfs/xfs_ialloc.c       |   77 +++++++++++++++++++++----------
 fs/xfs/libxfs/xfs_ialloc.h       |    2 +
 fs/xfs/libxfs/xfs_ialloc_btree.c |    2 -
 fs/xfs/libxfs/xfs_ialloc_btree.h |    2 -
 fs/xfs/libxfs/xfs_inode_fork.c   |    3 +
 fs/xfs/libxfs/xfs_refcount.c     |   73 +++++++++++++++++++----------
 fs/xfs/libxfs/xfs_refcount.h     |    2 +
 fs/xfs/libxfs/xfs_rmap.c         |   95 ++++++++++++++++++++++++--------------
 fs/xfs/libxfs/xfs_rmap.h         |   12 +++--
 fs/xfs/scrub/alloc.c             |   24 +++++-----
 fs/xfs/scrub/bmap.c              |    6 ++
 fs/xfs/scrub/ialloc.c            |   24 ++--------
 fs/xfs/scrub/refcount.c          |   14 +-----
 fs/xfs/scrub/rmap.c              |   44 ++----------------
 18 files changed, 303 insertions(+), 198 deletions(-)


^ permalink raw reply	[flat|nested] 220+ messages in thread

* [PATCH 1/8] xfs: standardize ondisk to incore conversion for free space btrees
  2022-12-30 22:11 ` [PATCHSET v24.0 0/8] xfs: standardize btree record checking code Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 4/8] xfs: return a failure address from xfs_rmap_irec_offset_unpack Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 2/8] xfs: standardize ondisk to incore conversion for inode btrees Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 3/8] xfs: standardize ondisk to incore conversion for refcount btrees Darrick J. Wong
                     ` (4 subsequent siblings)
  7 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create a xfs_alloc_btrec_to_irec function to convert an ondisk record to
an incore record, and a xfs_alloc_check_irec function to detect
corruption.  Replace all the open-coded logic with calls to the new
helpers and bubble up corruption reports.
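
As an illustration of what bubbling up the corruption report means for the
query paths, here is a hedged userspace sketch with invented demo_* names
(EFSCORRUPTED aliased to EUCLEAN the same way the kernel does): a bad
record now fails the whole range query instead of being handed silently to
the callback's caller.

#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define EFSCORRUPTED	EUCLEAN		/* same aliasing the kernel uses */
#define DEMO_AG_BLOCKS	1024u

/* Invented demo types; the real code uses xfs_alloc_rec_incore et al. */
struct demo_rec_ondisk { uint32_t startblock_be; uint32_t blockcount_be; };
struct demo_rec_incore { uint32_t startblock; uint32_t blockcount; };

static void demo_btrec_to_irec(const struct demo_rec_ondisk *rec,
			       struct demo_rec_incore *irec)
{
	irec->startblock = __builtin_bswap32(rec->startblock_be);
	irec->blockcount = __builtin_bswap32(rec->blockcount_be);
}

static bool demo_check_irec(const struct demo_rec_incore *irec)
{
	return irec->blockcount != 0 &&
	       irec->startblock < DEMO_AG_BLOCKS &&
	       irec->blockcount <= DEMO_AG_BLOCKS - irec->startblock;
}

/* Range query callback: reject impossible records before callers see them. */
static int demo_query_range_helper(const struct demo_rec_ondisk *rec,
				   void *priv)
{
	struct demo_rec_incore	irec;
	unsigned int		*nr_seen = priv;

	demo_btrec_to_irec(rec, &irec);
	if (!demo_check_irec(&irec))
		return -EFSCORRUPTED;	/* bubble the report up the stack */

	(*nr_seen)++;
	return 0;
}

int main(void)
{
	struct demo_rec_ondisk	recs[] = {
		{ __builtin_bswap32(8),  __builtin_bswap32(4) },
		{ __builtin_bswap32(16), __builtin_bswap32(0) },	/* corrupt */
	};
	unsigned int		nr_seen = 0;
	int			error = 0;

	for (unsigned int i = 0; i < 2 && !error; i++)
		error = demo_query_range_helper(&recs[i], &nr_seen);

	printf("saw %u records, error %d\n", nr_seen, error);
	return 0;
}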

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_alloc.c |   56 +++++++++++++++++++++++++++++++++++----------
 fs/xfs/libxfs/xfs_alloc.h |    6 +++++
 fs/xfs/scrub/alloc.c      |   24 ++++++++++---------
 3 files changed, 61 insertions(+), 25 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 199f22ddc379..13b668673243 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -237,6 +237,34 @@ xfs_alloc_update(
 	return xfs_btree_update(cur, &rec);
 }
 
+/* Convert the ondisk btree record to its incore representation. */
+void
+xfs_alloc_btrec_to_irec(
+	const union xfs_btree_rec	*rec,
+	struct xfs_alloc_rec_incore	*irec)
+{
+	irec->ar_startblock = be32_to_cpu(rec->alloc.ar_startblock);
+	irec->ar_blockcount = be32_to_cpu(rec->alloc.ar_blockcount);
+}
+
+/* Simple checks for free space records. */
+xfs_failaddr_t
+xfs_alloc_check_irec(
+	struct xfs_btree_cur		*cur,
+	const struct xfs_alloc_rec_incore *irec)
+{
+	struct xfs_perag		*pag = cur->bc_ag.pag;
+
+	if (irec->ar_blockcount == 0)
+		return __this_address;
+
+	/* check for valid extent range, including overflow */
+	if (!xfs_verify_agbext(pag, irec->ar_startblock, irec->ar_blockcount))
+		return __this_address;
+
+	return NULL;
+}
+
 /*
  * Get the data from the pointed-to record.
  */
@@ -247,34 +275,34 @@ xfs_alloc_get_rec(
 	xfs_extlen_t		*len,	/* output: length of extent */
 	int			*stat)	/* output: success/failure */
 {
+	struct xfs_alloc_rec_incore irec;
 	struct xfs_mount	*mp = cur->bc_mp;
 	struct xfs_perag	*pag = cur->bc_ag.pag;
 	union xfs_btree_rec	*rec;
+	xfs_failaddr_t		fa;
 	int			error;
 
 	error = xfs_btree_get_rec(cur, &rec, stat);
 	if (error || !(*stat))
 		return error;
 
-	*bno = be32_to_cpu(rec->alloc.ar_startblock);
-	*len = be32_to_cpu(rec->alloc.ar_blockcount);
-
-	if (*len == 0)
-		goto out_bad_rec;
-
-	/* check for valid extent range, including overflow */
-	if (!xfs_verify_agbext(pag, *bno, *len))
+	xfs_alloc_btrec_to_irec(rec, &irec);
+	fa = xfs_alloc_check_irec(cur, &irec);
+	if (fa)
 		goto out_bad_rec;
 
+	*bno = irec.ar_startblock;
+	*len = irec.ar_blockcount;
 	return 0;
 
 out_bad_rec:
 	xfs_warn(mp,
-		"%s Freespace BTree record corruption in AG %d detected!",
+		"%s Freespace BTree record corruption in AG %d detected at %pS!",
 		cur->bc_btnum == XFS_BTNUM_BNO ? "Block" : "Size",
-		pag->pag_agno);
+		pag->pag_agno, fa);
 	xfs_warn(mp,
-		"start block 0x%x block count 0x%x", *bno, *len);
+		"start block 0x%x block count 0x%x", irec.ar_startblock,
+		irec.ar_blockcount);
 	return -EFSCORRUPTED;
 }
 
@@ -3450,8 +3478,10 @@ xfs_alloc_query_range_helper(
 	struct xfs_alloc_query_range_info	*query = priv;
 	struct xfs_alloc_rec_incore		irec;
 
-	irec.ar_startblock = be32_to_cpu(rec->alloc.ar_startblock);
-	irec.ar_blockcount = be32_to_cpu(rec->alloc.ar_blockcount);
+	xfs_alloc_btrec_to_irec(rec, &irec);
+	if (xfs_alloc_check_irec(cur, &irec) != NULL)
+		return -EFSCORRUPTED;
+
 	return query->fn(cur, &irec, query->priv);
 }
 
diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
index f84f3966e849..becd06e5d0b8 100644
--- a/fs/xfs/libxfs/xfs_alloc.h
+++ b/fs/xfs/libxfs/xfs_alloc.h
@@ -170,6 +170,12 @@ xfs_alloc_get_rec(
 	xfs_extlen_t		*len,	/* output: length of extent */
 	int			*stat);	/* output: success/failure */
 
+union xfs_btree_rec;
+void xfs_alloc_btrec_to_irec(const union xfs_btree_rec *rec,
+		struct xfs_alloc_rec_incore *irec);
+xfs_failaddr_t xfs_alloc_check_irec(struct xfs_btree_cur *cur,
+		const struct xfs_alloc_rec_incore *irec);
+
 int xfs_read_agf(struct xfs_perag *pag, struct xfs_trans *tp, int flags,
 		struct xfs_buf **agfbpp);
 int xfs_alloc_read_agf(struct xfs_perag *pag, struct xfs_trans *tp, int flags,
diff --git a/fs/xfs/scrub/alloc.c b/fs/xfs/scrub/alloc.c
index d0509219722f..fb4f96716f6a 100644
--- a/fs/xfs/scrub/alloc.c
+++ b/fs/xfs/scrub/alloc.c
@@ -78,9 +78,11 @@ xchk_allocbt_xref_other(
 STATIC void
 xchk_allocbt_xref(
 	struct xfs_scrub	*sc,
-	xfs_agblock_t		agbno,
-	xfs_extlen_t		len)
+	const struct xfs_alloc_rec_incore *irec)
 {
+	xfs_agblock_t		agbno = irec->ar_startblock;
+	xfs_extlen_t		len = irec->ar_blockcount;
+
 	if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT)
 		return;
 
@@ -93,20 +95,18 @@ xchk_allocbt_xref(
 /* Scrub a bnobt/cntbt record. */
 STATIC int
 xchk_allocbt_rec(
-	struct xchk_btree	*bs,
-	const union xfs_btree_rec *rec)
+	struct xchk_btree		*bs,
+	const union xfs_btree_rec	*rec)
 {
-	struct xfs_perag	*pag = bs->cur->bc_ag.pag;
-	xfs_agblock_t		bno;
-	xfs_extlen_t		len;
+	struct xfs_alloc_rec_incore	irec;
 
-	bno = be32_to_cpu(rec->alloc.ar_startblock);
-	len = be32_to_cpu(rec->alloc.ar_blockcount);
-
-	if (!xfs_verify_agbext(pag, bno, len))
+	xfs_alloc_btrec_to_irec(rec, &irec);
+	if (xfs_alloc_check_irec(bs->cur, &irec) != NULL) {
 		xchk_btree_set_corrupt(bs->sc, bs->cur, 0);
+		return 0;
+	}
 
-	xchk_allocbt_xref(bs->sc, bno, len);
+	xchk_allocbt_xref(bs->sc, &irec);
 
 	return 0;
 }


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 2/8] xfs: standardize ondisk to incore conversion for inode btrees
  2022-12-30 22:11 ` [PATCHSET v24.0 0/8] xfs: standardize btree record checking code Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 4/8] xfs: return a failure address from xfs_rmap_irec_offset_unpack Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 1/8] xfs: standardize ondisk to incore conversion for free space btrees Darrick J. Wong
                     ` (5 subsequent siblings)
  7 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create a xfs_inobt_check_irec function to detect corruption in btree
records.  Fix all xfs_inobt_btrec_to_irec callsites to call the new
helper and bubble up corruption reports.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_ialloc.c       |   61 ++++++++++++++++++++++++--------------
 fs/xfs/libxfs/xfs_ialloc.h       |    2 +
 fs/xfs/libxfs/xfs_ialloc_btree.c |    2 +
 fs/xfs/libxfs/xfs_ialloc_btree.h |    2 +
 fs/xfs/scrub/ialloc.c            |   24 ++-------------
 5 files changed, 47 insertions(+), 44 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 5118dedf9267..010d1f514742 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -95,33 +95,21 @@ xfs_inobt_btrec_to_irec(
 	irec->ir_free = be64_to_cpu(rec->inobt.ir_free);
 }
 
-/*
- * Get the data from the pointed-to record.
- */
-int
-xfs_inobt_get_rec(
-	struct xfs_btree_cur		*cur,
-	struct xfs_inobt_rec_incore	*irec,
-	int				*stat)
+/* Simple checks for inode records. */
+xfs_failaddr_t
+xfs_inobt_check_irec(
+	struct xfs_btree_cur			*cur,
+	const struct xfs_inobt_rec_incore	*irec)
 {
-	struct xfs_mount		*mp = cur->bc_mp;
-	union xfs_btree_rec		*rec;
-	int				error;
 	uint64_t			realfree;
 
-	error = xfs_btree_get_rec(cur, &rec, stat);
-	if (error || *stat == 0)
-		return error;
-
-	xfs_inobt_btrec_to_irec(mp, rec, irec);
-
 	if (!xfs_verify_agino(cur->bc_ag.pag, irec->ir_startino))
-		goto out_bad_rec;
+		return __this_address;
 	if (irec->ir_count < XFS_INODES_PER_HOLEMASK_BIT ||
 	    irec->ir_count > XFS_INODES_PER_CHUNK)
-		goto out_bad_rec;
+		return __this_address;
 	if (irec->ir_freecount > XFS_INODES_PER_CHUNK)
-		goto out_bad_rec;
+		return __this_address;
 
 	/* if there are no holes, return the first available offset */
 	if (!xfs_inobt_issparse(irec->ir_holemask))
@@ -129,15 +117,41 @@ xfs_inobt_get_rec(
 	else
 		realfree = irec->ir_free & xfs_inobt_irec_to_allocmask(irec);
 	if (hweight64(realfree) != irec->ir_freecount)
+		return __this_address;
+
+	return NULL;
+}
+
+/*
+ * Get the data from the pointed-to record.
+ */
+int
+xfs_inobt_get_rec(
+	struct xfs_btree_cur		*cur,
+	struct xfs_inobt_rec_incore	*irec,
+	int				*stat)
+{
+	struct xfs_mount		*mp = cur->bc_mp;
+	union xfs_btree_rec		*rec;
+	xfs_failaddr_t			fa;
+	int				error;
+
+	error = xfs_btree_get_rec(cur, &rec, stat);
+	if (error || *stat == 0)
+		return error;
+
+	xfs_inobt_btrec_to_irec(mp, rec, irec);
+	fa = xfs_inobt_check_irec(cur, irec);
+	if (fa)
 		goto out_bad_rec;
 
 	return 0;
 
 out_bad_rec:
 	xfs_warn(mp,
-		"%s Inode BTree record corruption in AG %d detected!",
+		"%s Inode BTree record corruption in AG %d detected at %pS!",
 		cur->bc_btnum == XFS_BTNUM_INO ? "Used" : "Free",
-		cur->bc_ag.pag->pag_agno);
+		cur->bc_ag.pag->pag_agno, fa);
 	xfs_warn(mp,
 "start inode 0x%x, count 0x%x, free 0x%x freemask 0x%llx, holemask 0x%x",
 		irec->ir_startino, irec->ir_count, irec->ir_freecount,
@@ -2705,6 +2719,9 @@ xfs_ialloc_count_inodes_rec(
 	struct xfs_ialloc_count_inodes	*ci = priv;
 
 	xfs_inobt_btrec_to_irec(cur->bc_mp, rec, &irec);
+	if (xfs_inobt_check_irec(cur, &irec) != NULL)
+		return -EFSCORRUPTED;
+
 	ci->count += irec.ir_count;
 	ci->freecount += irec.ir_freecount;
 
diff --git a/fs/xfs/libxfs/xfs_ialloc.h b/fs/xfs/libxfs/xfs_ialloc.h
index 9bbbca6ac4ed..fa67bb090c01 100644
--- a/fs/xfs/libxfs/xfs_ialloc.h
+++ b/fs/xfs/libxfs/xfs_ialloc.h
@@ -92,6 +92,8 @@ union xfs_btree_rec;
 void xfs_inobt_btrec_to_irec(struct xfs_mount *mp,
 		const union xfs_btree_rec *rec,
 		struct xfs_inobt_rec_incore *irec);
+xfs_failaddr_t xfs_inobt_check_irec(struct xfs_btree_cur *cur,
+		const struct xfs_inobt_rec_incore *irec);
 int xfs_ialloc_has_inodes_at_extent(struct xfs_btree_cur *cur,
 		xfs_agblock_t bno, xfs_extlen_t len, bool *exists);
 int xfs_ialloc_has_inode_record(struct xfs_btree_cur *cur, xfs_agino_t low,
diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
index fb10760fd686..e849faae405a 100644
--- a/fs/xfs/libxfs/xfs_ialloc_btree.c
+++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
@@ -610,7 +610,7 @@ xfs_iallocbt_maxlevels_ondisk(void)
  */
 uint64_t
 xfs_inobt_irec_to_allocmask(
-	struct xfs_inobt_rec_incore	*rec)
+	const struct xfs_inobt_rec_incore	*rec)
 {
 	uint64_t			bitmap = 0;
 	uint64_t			inodespbit;
diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.h b/fs/xfs/libxfs/xfs_ialloc_btree.h
index 26451cb76b98..6d8d6bcd594d 100644
--- a/fs/xfs/libxfs/xfs_ialloc_btree.h
+++ b/fs/xfs/libxfs/xfs_ialloc_btree.h
@@ -55,7 +55,7 @@ struct xfs_btree_cur *xfs_inobt_stage_cursor(struct xfs_mount *mp,
 extern int xfs_inobt_maxrecs(struct xfs_mount *, int, int);
 
 /* ir_holemask to inode allocation bitmap conversion */
-uint64_t xfs_inobt_irec_to_allocmask(struct xfs_inobt_rec_incore *);
+uint64_t xfs_inobt_irec_to_allocmask(const struct xfs_inobt_rec_incore *irec);
 
 #if defined(DEBUG) || defined(XFS_WARN)
 int xfs_inobt_rec_check_count(struct xfs_mount *,
diff --git a/fs/xfs/scrub/ialloc.c b/fs/xfs/scrub/ialloc.c
index fd5bc289de4c..9aec5a793397 100644
--- a/fs/xfs/scrub/ialloc.c
+++ b/fs/xfs/scrub/ialloc.c
@@ -119,15 +119,6 @@ xchk_iallocbt_chunk(
 	return true;
 }
 
-/* Count the number of free inodes. */
-static unsigned int
-xchk_iallocbt_freecount(
-	xfs_inofree_t			freemask)
-{
-	BUILD_BUG_ON(sizeof(freemask) != sizeof(__u64));
-	return hweight64(freemask);
-}
-
 /*
  * Check that an inode's allocation status matches ir_free in the inobt
  * record.  First we try querying the in-core inode state, and if the inode
@@ -431,24 +422,17 @@ xchk_iallocbt_rec(
 	int				holecount;
 	int				i;
 	int				error = 0;
-	unsigned int			real_freecount;
 	uint16_t			holemask;
 
 	xfs_inobt_btrec_to_irec(mp, rec, &irec);
-
-	if (irec.ir_count > XFS_INODES_PER_CHUNK ||
-	    irec.ir_freecount > XFS_INODES_PER_CHUNK)
-		xchk_btree_set_corrupt(bs->sc, bs->cur, 0);
-
-	real_freecount = irec.ir_freecount +
-			(XFS_INODES_PER_CHUNK - irec.ir_count);
-	if (real_freecount != xchk_iallocbt_freecount(irec.ir_free))
+	if (xfs_inobt_check_irec(bs->cur, &irec) != NULL) {
 		xchk_btree_set_corrupt(bs->sc, bs->cur, 0);
+		return 0;
+	}
 
 	agino = irec.ir_startino;
 	/* Record has to be properly aligned within the AG. */
-	if (!xfs_verify_agino(pag, agino) ||
-	    !xfs_verify_agino(pag, agino + XFS_INODES_PER_CHUNK - 1)) {
+	if (!xfs_verify_agino(pag, agino + XFS_INODES_PER_CHUNK - 1)) {
 		xchk_btree_set_corrupt(bs->sc, bs->cur, 0);
 		goto out;
 	}


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 3/8] xfs: standardize ondisk to incore conversion for refcount btrees
  2022-12-30 22:11 ` [PATCHSET v24.0 0/8] xfs: standardize btree record checking code Darrick J. Wong
                     ` (2 preceding siblings ...)
  2022-12-30 22:11   ` [PATCH 1/8] xfs: standardize ondisk to incore conversion for free space btrees Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 7/8] xfs: complain about bad records in query_range helpers Darrick J. Wong
                     ` (3 subsequent siblings)
  7 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create an xfs_refcount_check_irec function to detect corruption in btree
records.  Fix all xfs_refcount_btrec_to_irec callsites to call the new
helper and bubble up corruption reports.
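
In other words, every caller now follows a convert-then-check
convention.  Abbreviated, the pattern looks like this (a sketch
condensed from the diff below, with the warning output elided):

	struct xfs_refcount_irec	irec;
	xfs_failaddr_t			fa;

	xfs_refcount_btrec_to_irec(rec, &irec);		/* ondisk -> incore */
	fa = xfs_refcount_check_irec(cur, &irec);	/* sanity checks */
	if (fa)
		return -EFSCORRUPTED;	/* bubble the corruption upwards */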

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_refcount.c |   45 +++++++++++++++++++++++++++++-------------
 fs/xfs/libxfs/xfs_refcount.h |    2 ++
 fs/xfs/scrub/refcount.c      |   14 +++----------
 3 files changed, 36 insertions(+), 25 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
index 6dc968618e66..b77dea10c8bd 100644
--- a/fs/xfs/libxfs/xfs_refcount.c
+++ b/fs/xfs/libxfs/xfs_refcount.c
@@ -120,6 +120,30 @@ xfs_refcount_btrec_to_irec(
 	irec->rc_refcount = be32_to_cpu(rec->refc.rc_refcount);
 }
 
+/* Simple checks for refcount records. */
+xfs_failaddr_t
+xfs_refcount_check_irec(
+	struct xfs_btree_cur		*cur,
+	const struct xfs_refcount_irec	*irec)
+{
+	struct xfs_perag		*pag = cur->bc_ag.pag;
+
+	if (irec->rc_blockcount == 0 || irec->rc_blockcount > MAXREFCEXTLEN)
+		return __this_address;
+
+	if (!xfs_refcount_check_domain(irec))
+		return __this_address;
+
+	/* check for valid extent range, including overflow */
+	if (!xfs_verify_agbext(pag, irec->rc_startblock, irec->rc_blockcount))
+		return __this_address;
+
+	if (irec->rc_refcount == 0 || irec->rc_refcount > MAXREFCOUNT)
+		return __this_address;
+
+	return NULL;
+}
+
 /*
  * Get the data from the pointed-to record.
  */
@@ -132,6 +156,7 @@ xfs_refcount_get_rec(
 	struct xfs_mount		*mp = cur->bc_mp;
 	struct xfs_perag		*pag = cur->bc_ag.pag;
 	union xfs_btree_rec		*rec;
+	xfs_failaddr_t			fa;
 	int				error;
 
 	error = xfs_btree_get_rec(cur, &rec, stat);
@@ -139,17 +164,8 @@ xfs_refcount_get_rec(
 		return error;
 
 	xfs_refcount_btrec_to_irec(rec, irec);
-	if (irec->rc_blockcount == 0 || irec->rc_blockcount > MAXREFCEXTLEN)
-		goto out_bad_rec;
-
-	if (!xfs_refcount_check_domain(irec))
-		goto out_bad_rec;
-
-	/* check for valid extent range, including overflow */
-	if (!xfs_verify_agbext(pag, irec->rc_startblock, irec->rc_blockcount))
-		goto out_bad_rec;
-
-	if (irec->rc_refcount == 0 || irec->rc_refcount > MAXREFCOUNT)
+	fa = xfs_refcount_check_irec(cur, irec);
+	if (fa)
 		goto out_bad_rec;
 
 	trace_xfs_refcount_get(cur->bc_mp, pag->pag_agno, irec);
@@ -157,8 +173,8 @@ xfs_refcount_get_rec(
 
 out_bad_rec:
 	xfs_warn(mp,
-		"Refcount BTree record corruption in AG %d detected!",
-		pag->pag_agno);
+		"Refcount BTree record corruption in AG %d detected at %pS!",
+		pag->pag_agno, fa);
 	xfs_warn(mp,
 		"Start block 0x%x, block count 0x%x, references 0x%x",
 		irec->rc_startblock, irec->rc_blockcount, irec->rc_refcount);
@@ -1871,7 +1887,8 @@ xfs_refcount_recover_extent(
 	INIT_LIST_HEAD(&rr->rr_list);
 	xfs_refcount_btrec_to_irec(rec, &rr->rr_rrec);
 
-	if (XFS_IS_CORRUPT(cur->bc_mp,
+	if (xfs_refcount_check_irec(cur, &rr->rr_rrec) != NULL ||
+	    XFS_IS_CORRUPT(cur->bc_mp,
 			   rr->rr_rrec.rc_domain != XFS_REFC_DOMAIN_COW)) {
 		kfree(rr);
 		return -EFSCORRUPTED;
diff --git a/fs/xfs/libxfs/xfs_refcount.h b/fs/xfs/libxfs/xfs_refcount.h
index c89f0fcd1ee3..fc0b58d4c379 100644
--- a/fs/xfs/libxfs/xfs_refcount.h
+++ b/fs/xfs/libxfs/xfs_refcount.h
@@ -117,6 +117,8 @@ extern int xfs_refcount_has_record(struct xfs_btree_cur *cur,
 union xfs_btree_rec;
 extern void xfs_refcount_btrec_to_irec(const union xfs_btree_rec *rec,
 		struct xfs_refcount_irec *irec);
+xfs_failaddr_t xfs_refcount_check_irec(struct xfs_btree_cur *cur,
+		const struct xfs_refcount_irec *irec);
 extern int xfs_refcount_insert(struct xfs_btree_cur *cur,
 		struct xfs_refcount_irec *irec, int *stat);
 
diff --git a/fs/xfs/scrub/refcount.c b/fs/xfs/scrub/refcount.c
index 9423aad28511..c2ae5a328a6d 100644
--- a/fs/xfs/scrub/refcount.c
+++ b/fs/xfs/scrub/refcount.c
@@ -340,24 +340,16 @@ xchk_refcountbt_rec(
 {
 	struct xfs_refcount_irec irec;
 	xfs_agblock_t		*cow_blocks = bs->private;
-	struct xfs_perag	*pag = bs->cur->bc_ag.pag;
 
 	xfs_refcount_btrec_to_irec(rec, &irec);
-
-	/* Check the domain and refcount are not incompatible. */
-	if (!xfs_refcount_check_domain(&irec))
+	if (xfs_refcount_check_irec(bs->cur, &irec) != NULL) {
 		xchk_btree_set_corrupt(bs->sc, bs->cur, 0);
+		return 0;
+	}
 
 	if (irec.rc_domain == XFS_REFC_DOMAIN_COW)
 		(*cow_blocks) += irec.rc_blockcount;
 
-	/* Check the extent. */
-	if (!xfs_verify_agbext(pag, irec.rc_startblock, irec.rc_blockcount))
-		xchk_btree_set_corrupt(bs->sc, bs->cur, 0);
-
-	if (irec.rc_refcount == 0)
-		xchk_btree_set_corrupt(bs->sc, bs->cur, 0);
-
 	xchk_refcountbt_xref(bs->sc, &irec);
 
 	return 0;


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 4/8] xfs: return a failure address from xfs_rmap_irec_offset_unpack
  2022-12-30 22:11 ` [PATCHSET v24.0 0/8] xfs: standardize btree record checking code Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 2/8] xfs: standardize ondisk to incore conversion for inode btrees Darrick J. Wong
                     ` (6 subsequent siblings)
  7 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Currently, xfs_rmap_irec_offset_unpack returns only 0 or -EFSCORRUPTED.
Change this function to return the code address of a failed conversion
in preparation for the next patch, which standardizes localized record
checking and reporting code.
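
As context, xfs_failaddr_t carries the code address captured by
__this_address at the failing check, which callers can print with the
kernel's %pS format specifier to identify exactly which test tripped.
A minimal illustrative caller (hypothetical helper name and warning
text, not taken from this patch):

/*
 * Unpack an ondisk rmap offset and, on failure, report where the
 * conversion failed.  Illustration only.
 */
static int
xfs_rmap_offset_unpack_report(
	struct xfs_mount		*mp,
	const union xfs_btree_rec	*rec,
	struct xfs_rmap_irec		*irec)
{
	xfs_failaddr_t			fa;

	fa = xfs_rmap_irec_offset_unpack(be64_to_cpu(rec->rmap.rm_offset),
			irec);
	if (!fa)
		return 0;

	xfs_warn(mp, "corrupt rmap offset detected at %pS", fa);
	return -EFSCORRUPTED;
}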

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_rmap.c |    9 ++++-----
 fs/xfs/libxfs/xfs_rmap.h |    9 +++++----
 fs/xfs/scrub/rmap.c      |   11 +++++------
 3 files changed, 14 insertions(+), 15 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c
index c2624d11f041..830b38337cd5 100644
--- a/fs/xfs/libxfs/xfs_rmap.c
+++ b/fs/xfs/libxfs/xfs_rmap.c
@@ -193,7 +193,7 @@ xfs_rmap_delete(
 }
 
 /* Convert an internal btree record to an rmap record. */
-int
+xfs_failaddr_t
 xfs_rmap_btrec_to_irec(
 	const union xfs_btree_rec	*rec,
 	struct xfs_rmap_irec		*irec)
@@ -2320,11 +2320,10 @@ xfs_rmap_query_range_helper(
 {
 	struct xfs_rmap_query_range_info	*query = priv;
 	struct xfs_rmap_irec			irec;
-	int					error;
 
-	error = xfs_rmap_btrec_to_irec(rec, &irec);
-	if (error)
-		return error;
+	if (xfs_rmap_btrec_to_irec(rec, &irec) != NULL)
+		return -EFSCORRUPTED;
+
 	return query->fn(cur, &irec, query->priv);
 }
 
diff --git a/fs/xfs/libxfs/xfs_rmap.h b/fs/xfs/libxfs/xfs_rmap.h
index 1472ae570a8a..6a08c403e8b7 100644
--- a/fs/xfs/libxfs/xfs_rmap.h
+++ b/fs/xfs/libxfs/xfs_rmap.h
@@ -62,13 +62,14 @@ xfs_rmap_irec_offset_pack(
 	return x;
 }
 
-static inline int
+static inline xfs_failaddr_t
 xfs_rmap_irec_offset_unpack(
 	__u64			offset,
 	struct xfs_rmap_irec	*irec)
 {
 	if (offset & ~(XFS_RMAP_OFF_MASK | XFS_RMAP_OFF_FLAGS))
-		return -EFSCORRUPTED;
+		return __this_address;
+
 	irec->rm_offset = XFS_RMAP_OFF(offset);
 	irec->rm_flags = 0;
 	if (offset & XFS_RMAP_OFF_ATTR_FORK)
@@ -77,7 +78,7 @@ xfs_rmap_irec_offset_unpack(
 		irec->rm_flags |= XFS_RMAP_BMBT_BLOCK;
 	if (offset & XFS_RMAP_OFF_UNWRITTEN)
 		irec->rm_flags |= XFS_RMAP_UNWRITTEN;
-	return 0;
+	return NULL;
 }
 
 static inline void
@@ -192,7 +193,7 @@ int xfs_rmap_lookup_le_range(struct xfs_btree_cur *cur, xfs_agblock_t bno,
 int xfs_rmap_compare(const struct xfs_rmap_irec *a,
 		const struct xfs_rmap_irec *b);
 union xfs_btree_rec;
-int xfs_rmap_btrec_to_irec(const union xfs_btree_rec *rec,
+xfs_failaddr_t xfs_rmap_btrec_to_irec(const union xfs_btree_rec *rec,
 		struct xfs_rmap_irec *irec);
 int xfs_rmap_has_record(struct xfs_btree_cur *cur, xfs_agblock_t bno,
 		xfs_extlen_t len, bool *exists);
diff --git a/fs/xfs/scrub/rmap.c b/fs/xfs/scrub/rmap.c
index afc4f840b6bc..94650f11a4a5 100644
--- a/fs/xfs/scrub/rmap.c
+++ b/fs/xfs/scrub/rmap.c
@@ -100,11 +100,11 @@ xchk_rmapbt_rec(
 	bool			is_unwritten;
 	bool			is_bmbt;
 	bool			is_attr;
-	int			error;
 
-	error = xfs_rmap_btrec_to_irec(rec, &irec);
-	if (!xchk_btree_process_error(bs->sc, bs->cur, 0, &error))
-		goto out;
+	if (xfs_rmap_btrec_to_irec(rec, &irec) != NULL) {
+		xchk_btree_set_corrupt(bs->sc, bs->cur, 0);
+		return 0;
+	}
 
 	/* Check extent. */
 	if (irec.rm_startblock + irec.rm_blockcount <= irec.rm_startblock)
@@ -159,8 +159,7 @@ xchk_rmapbt_rec(
 	}
 
 	xchk_rmapbt_xref(bs->sc, &irec);
-out:
-	return error;
+	return 0;
 }
 
 /* Scrub the rmap btree for some AG. */


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 5/8] xfs: standardize ondisk to incore conversion for rmap btrees
  2022-12-30 22:11 ` [PATCHSET v24.0 0/8] xfs: standardize btree record checking code Darrick J. Wong
                     ` (6 preceding siblings ...)
  2022-12-30 22:11   ` [PATCH 8/8] xfs: complain about bad file mapping records in the ondisk bmbt Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  7 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create an xfs_rmap_check_irec function to detect corruption in btree
records.  Fix all xfs_rmap_btrec_to_irec callsites to call the new
helper and bubble up corruption reports.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_rmap.c |   72 ++++++++++++++++++++++++++++------------------
 fs/xfs/libxfs/xfs_rmap.h |    3 ++
 fs/xfs/scrub/rmap.c      |   39 +------------------------
 3 files changed, 49 insertions(+), 65 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c
index 830b38337cd5..5c7b081cef87 100644
--- a/fs/xfs/libxfs/xfs_rmap.c
+++ b/fs/xfs/libxfs/xfs_rmap.c
@@ -205,51 +205,66 @@ xfs_rmap_btrec_to_irec(
 			irec);
 }
 
-/*
- * Get the data from the pointed-to record.
- */
-int
-xfs_rmap_get_rec(
-	struct xfs_btree_cur	*cur,
-	struct xfs_rmap_irec	*irec,
-	int			*stat)
+/* Simple checks for rmap records. */
+xfs_failaddr_t
+xfs_rmap_check_irec(
+	struct xfs_btree_cur		*cur,
+	const struct xfs_rmap_irec	*irec)
 {
-	struct xfs_mount	*mp = cur->bc_mp;
-	struct xfs_perag	*pag = cur->bc_ag.pag;
-	union xfs_btree_rec	*rec;
-	int			error;
-
-	error = xfs_btree_get_rec(cur, &rec, stat);
-	if (error || !*stat)
-		return error;
-
-	if (xfs_rmap_btrec_to_irec(rec, irec))
-		goto out_bad_rec;
+	struct xfs_mount		*mp = cur->bc_mp;
 
 	if (irec->rm_blockcount == 0)
-		goto out_bad_rec;
+		return __this_address;
 	if (irec->rm_startblock <= XFS_AGFL_BLOCK(mp)) {
 		if (irec->rm_owner != XFS_RMAP_OWN_FS)
-			goto out_bad_rec;
+			return __this_address;
 		if (irec->rm_blockcount != XFS_AGFL_BLOCK(mp) + 1)
-			goto out_bad_rec;
+			return __this_address;
 	} else {
 		/* check for valid extent range, including overflow */
-		if (!xfs_verify_agbext(pag, irec->rm_startblock,
-					    irec->rm_blockcount))
-			goto out_bad_rec;
+		if (!xfs_verify_agbext(cur->bc_ag.pag, irec->rm_startblock,
+						       irec->rm_blockcount))
+			return __this_address;
 	}
 
 	if (!(xfs_verify_ino(mp, irec->rm_owner) ||
 	      (irec->rm_owner <= XFS_RMAP_OWN_FS &&
 	       irec->rm_owner >= XFS_RMAP_OWN_MIN)))
+		return __this_address;
+
+	return NULL;
+}
+
+/*
+ * Get the data from the pointed-to record.
+ */
+int
+xfs_rmap_get_rec(
+	struct xfs_btree_cur	*cur,
+	struct xfs_rmap_irec	*irec,
+	int			*stat)
+{
+	struct xfs_mount	*mp = cur->bc_mp;
+	struct xfs_perag	*pag = cur->bc_ag.pag;
+	union xfs_btree_rec	*rec;
+	xfs_failaddr_t		fa;
+	int			error;
+
+	error = xfs_btree_get_rec(cur, &rec, stat);
+	if (error || !*stat)
+		return error;
+
+	fa = xfs_rmap_btrec_to_irec(rec, irec);
+	if (!fa)
+		fa = xfs_rmap_check_irec(cur, irec);
+	if (fa)
 		goto out_bad_rec;
 
 	return 0;
 out_bad_rec:
 	xfs_warn(mp,
-		"Reverse Mapping BTree record corruption in AG %d detected!",
-		pag->pag_agno);
+		"Reverse Mapping BTree record corruption in AG %d detected at %pS!",
+		pag->pag_agno, fa);
 	xfs_warn(mp,
 		"Owner 0x%llx, flags 0x%x, start block 0x%x block count 0x%x",
 		irec->rm_owner, irec->rm_flags, irec->rm_startblock,
@@ -2321,7 +2336,8 @@ xfs_rmap_query_range_helper(
 	struct xfs_rmap_query_range_info	*query = priv;
 	struct xfs_rmap_irec			irec;
 
-	if (xfs_rmap_btrec_to_irec(rec, &irec) != NULL)
+	if (xfs_rmap_btrec_to_irec(rec, &irec) != NULL ||
+	    xfs_rmap_check_irec(cur, &irec) != NULL)
 		return -EFSCORRUPTED;
 
 	return query->fn(cur, &irec, query->priv);
diff --git a/fs/xfs/libxfs/xfs_rmap.h b/fs/xfs/libxfs/xfs_rmap.h
index 6a08c403e8b7..7fb298bcc15f 100644
--- a/fs/xfs/libxfs/xfs_rmap.h
+++ b/fs/xfs/libxfs/xfs_rmap.h
@@ -195,6 +195,9 @@ int xfs_rmap_compare(const struct xfs_rmap_irec *a,
 union xfs_btree_rec;
 xfs_failaddr_t xfs_rmap_btrec_to_irec(const union xfs_btree_rec *rec,
 		struct xfs_rmap_irec *irec);
+xfs_failaddr_t xfs_rmap_check_irec(struct xfs_btree_cur *cur,
+		const struct xfs_rmap_irec *irec);
+
 int xfs_rmap_has_record(struct xfs_btree_cur *cur, xfs_agblock_t bno,
 		xfs_extlen_t len, bool *exists);
 int xfs_rmap_record_exists(struct xfs_btree_cur *cur, xfs_agblock_t bno,
diff --git a/fs/xfs/scrub/rmap.c b/fs/xfs/scrub/rmap.c
index 94650f11a4a5..610b16f77e7e 100644
--- a/fs/xfs/scrub/rmap.c
+++ b/fs/xfs/scrub/rmap.c
@@ -93,43 +93,18 @@ xchk_rmapbt_rec(
 	struct xchk_btree	*bs,
 	const union xfs_btree_rec *rec)
 {
-	struct xfs_mount	*mp = bs->cur->bc_mp;
 	struct xfs_rmap_irec	irec;
-	struct xfs_perag	*pag = bs->cur->bc_ag.pag;
 	bool			non_inode;
 	bool			is_unwritten;
 	bool			is_bmbt;
 	bool			is_attr;
 
-	if (xfs_rmap_btrec_to_irec(rec, &irec) != NULL) {
+	if (xfs_rmap_btrec_to_irec(rec, &irec) != NULL ||
+	    xfs_rmap_check_irec(bs->cur, &irec) != NULL) {
 		xchk_btree_set_corrupt(bs->sc, bs->cur, 0);
 		return 0;
 	}
 
-	/* Check extent. */
-	if (irec.rm_startblock + irec.rm_blockcount <= irec.rm_startblock)
-		xchk_btree_set_corrupt(bs->sc, bs->cur, 0);
-
-	if (irec.rm_owner == XFS_RMAP_OWN_FS) {
-		/*
-		 * xfs_verify_agbno returns false for static fs metadata.
-		 * Since that only exists at the start of the AG, validate
-		 * that by hand.
-		 */
-		if (irec.rm_startblock != 0 ||
-		    irec.rm_blockcount != XFS_AGFL_BLOCK(mp) + 1)
-			xchk_btree_set_corrupt(bs->sc, bs->cur, 0);
-	} else {
-		/*
-		 * Otherwise we must point somewhere past the static metadata
-		 * but before the end of the FS.  Run the regular check.
-		 */
-		if (!xfs_verify_agbno(pag, irec.rm_startblock) ||
-		    !xfs_verify_agbno(pag, irec.rm_startblock +
-				irec.rm_blockcount - 1))
-			xchk_btree_set_corrupt(bs->sc, bs->cur, 0);
-	}
-
 	/* Check flags. */
 	non_inode = XFS_RMAP_NON_INODE_OWNER(irec.rm_owner);
 	is_bmbt = irec.rm_flags & XFS_RMAP_BMBT_BLOCK;
@@ -148,16 +123,6 @@ xchk_rmapbt_rec(
 	if (non_inode && (is_bmbt || is_unwritten || is_attr))
 		xchk_btree_set_corrupt(bs->sc, bs->cur, 0);
 
-	if (!non_inode) {
-		if (!xfs_verify_ino(mp, irec.rm_owner))
-			xchk_btree_set_corrupt(bs->sc, bs->cur, 0);
-	} else {
-		/* Non-inode owner within the magic values? */
-		if (irec.rm_owner <= XFS_RMAP_OWN_MIN ||
-		    irec.rm_owner > XFS_RMAP_OWN_FS)
-			xchk_btree_set_corrupt(bs->sc, bs->cur, 0);
-	}
-
 	xchk_rmapbt_xref(bs->sc, &irec);
 	return 0;
 }


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 6/8] xfs: standardize ondisk to incore conversion for bmap btrees
  2022-12-30 22:11 ` [PATCHSET v24.0 0/8] xfs: standardize btree record checking code Darrick J. Wong
                     ` (4 preceding siblings ...)
  2022-12-30 22:11   ` [PATCH 7/8] xfs: complain about bad records in query_range helpers Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 8/8] xfs: complain about bad file mapping records in the ondisk bmbt Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 5/8] xfs: standardize ondisk to incore conversion for rmap btrees Darrick J. Wong
  7 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Fix all xfs_bmbt_disk_get_all callsites to call xfs_bmap_validate_extent
and bubble up corruption reports.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/bmap.c |    6 ++++++
 1 file changed, 6 insertions(+)


diff --git a/fs/xfs/scrub/bmap.c b/fs/xfs/scrub/bmap.c
index 5c4b25585b8c..575f2c80d055 100644
--- a/fs/xfs/scrub/bmap.c
+++ b/fs/xfs/scrub/bmap.c
@@ -471,6 +471,12 @@ xchk_bmapbt_rec(
 		return 0;
 
 	xfs_bmbt_disk_get_all(&rec->bmbt, &irec);
+	if (xfs_bmap_validate_extent(ip, info->whichfork, &irec) != NULL) {
+		xchk_fblock_set_corrupt(bs->sc, info->whichfork,
+				irec.br_startoff);
+		return 0;
+	}
+
 	if (!xfs_iext_lookup_extent(ip, ifp, irec.br_startoff, &icur,
 				&iext_irec) ||
 	    irec.br_startoff != iext_irec.br_startoff ||


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 7/8] xfs: complain about bad records in query_range helpers
  2022-12-30 22:11 ` [PATCHSET v24.0 0/8] xfs: standardize btree record checking code Darrick J. Wong
                     ` (3 preceding siblings ...)
  2022-12-30 22:11   ` [PATCH 3/8] xfs: standardize ondisk to incore conversion for refcount btrees Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 6/8] xfs: standardize ondisk to incore conversion for bmap btrees Darrick J. Wong
                     ` (2 subsequent siblings)
  7 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

For every btree type except for the bmbt, refactor the code that
complains about bad records into a helper and make the ->query_range
helpers call it so that corruptions found via that avenue are logged.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_alloc.c    |   38 +++++++++++++++++++++++---------------
 fs/xfs/libxfs/xfs_ialloc.c   |   38 ++++++++++++++++++++++++--------------
 fs/xfs/libxfs/xfs_refcount.c |   32 +++++++++++++++++++-------------
 fs/xfs/libxfs/xfs_rmap.c     |   40 +++++++++++++++++++++++++---------------
 4 files changed, 91 insertions(+), 57 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 13b668673243..8dcefff1db33 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -265,6 +265,24 @@ xfs_alloc_check_irec(
 	return NULL;
 }
 
+static inline int
+xfs_alloc_complain_bad_rec(
+	struct xfs_btree_cur		*cur,
+	xfs_failaddr_t			fa,
+	const struct xfs_alloc_rec_incore *irec)
+{
+	struct xfs_mount		*mp = cur->bc_mp;
+
+	xfs_warn(mp,
+		"%s Freespace BTree record corruption in AG %d detected at %pS!",
+		cur->bc_btnum == XFS_BTNUM_BNO ? "Block" : "Size",
+		cur->bc_ag.pag->pag_agno, fa);
+	xfs_warn(mp,
+		"start block 0x%x block count 0x%x", irec->ar_startblock,
+		irec->ar_blockcount);
+	return -EFSCORRUPTED;
+}
+
 /*
  * Get the data from the pointed-to record.
  */
@@ -276,8 +294,6 @@ xfs_alloc_get_rec(
 	int			*stat)	/* output: success/failure */
 {
 	struct xfs_alloc_rec_incore irec;
-	struct xfs_mount	*mp = cur->bc_mp;
-	struct xfs_perag	*pag = cur->bc_ag.pag;
 	union xfs_btree_rec	*rec;
 	xfs_failaddr_t		fa;
 	int			error;
@@ -289,21 +305,11 @@ xfs_alloc_get_rec(
 	xfs_alloc_btrec_to_irec(rec, &irec);
 	fa = xfs_alloc_check_irec(cur, &irec);
 	if (fa)
-		goto out_bad_rec;
+		return xfs_alloc_complain_bad_rec(cur, fa, &irec);
 
 	*bno = irec.ar_startblock;
 	*len = irec.ar_blockcount;
 	return 0;
-
-out_bad_rec:
-	xfs_warn(mp,
-		"%s Freespace BTree record corruption in AG %d detected at %pS!",
-		cur->bc_btnum == XFS_BTNUM_BNO ? "Block" : "Size",
-		pag->pag_agno, fa);
-	xfs_warn(mp,
-		"start block 0x%x block count 0x%x", irec.ar_startblock,
-		irec.ar_blockcount);
-	return -EFSCORRUPTED;
 }
 
 /*
@@ -3477,10 +3483,12 @@ xfs_alloc_query_range_helper(
 {
 	struct xfs_alloc_query_range_info	*query = priv;
 	struct xfs_alloc_rec_incore		irec;
+	xfs_failaddr_t				fa;
 
 	xfs_alloc_btrec_to_irec(rec, &irec);
-	if (xfs_alloc_check_irec(cur, &irec) != NULL)
-		return -EFSCORRUPTED;
+	fa = xfs_alloc_check_irec(cur, &irec);
+	if (fa)
+		return xfs_alloc_complain_bad_rec(cur, fa, &irec);
 
 	return query->fn(cur, &irec, query->priv);
 }
diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 010d1f514742..b6f76935504e 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -122,6 +122,25 @@ xfs_inobt_check_irec(
 	return NULL;
 }
 
+static inline int
+xfs_inobt_complain_bad_rec(
+	struct xfs_btree_cur		*cur,
+	xfs_failaddr_t			fa,
+	const struct xfs_inobt_rec_incore *irec)
+{
+	struct xfs_mount		*mp = cur->bc_mp;
+
+	xfs_warn(mp,
+		"%s Inode BTree record corruption in AG %d detected at %pS!",
+		cur->bc_btnum == XFS_BTNUM_INO ? "Used" : "Free",
+		cur->bc_ag.pag->pag_agno, fa);
+	xfs_warn(mp,
+"start inode 0x%x, count 0x%x, free 0x%x freemask 0x%llx, holemask 0x%x",
+		irec->ir_startino, irec->ir_count, irec->ir_freecount,
+		irec->ir_free, irec->ir_holemask);
+	return -EFSCORRUPTED;
+}
+
 /*
  * Get the data from the pointed-to record.
  */
@@ -143,20 +162,9 @@ xfs_inobt_get_rec(
 	xfs_inobt_btrec_to_irec(mp, rec, irec);
 	fa = xfs_inobt_check_irec(cur, irec);
 	if (fa)
-		goto out_bad_rec;
+		return xfs_inobt_complain_bad_rec(cur, fa, irec);
 
 	return 0;
-
-out_bad_rec:
-	xfs_warn(mp,
-		"%s Inode BTree record corruption in AG %d detected at %pS!",
-		cur->bc_btnum == XFS_BTNUM_INO ? "Used" : "Free",
-		cur->bc_ag.pag->pag_agno, fa);
-	xfs_warn(mp,
-"start inode 0x%x, count 0x%x, free 0x%x freemask 0x%llx, holemask 0x%x",
-		irec->ir_startino, irec->ir_count, irec->ir_freecount,
-		irec->ir_free, irec->ir_holemask);
-	return -EFSCORRUPTED;
 }
 
 /*
@@ -2717,10 +2725,12 @@ xfs_ialloc_count_inodes_rec(
 {
 	struct xfs_inobt_rec_incore	irec;
 	struct xfs_ialloc_count_inodes	*ci = priv;
+	xfs_failaddr_t			fa;
 
 	xfs_inobt_btrec_to_irec(cur->bc_mp, rec, &irec);
-	if (xfs_inobt_check_irec(cur, &irec) != NULL)
-		return -EFSCORRUPTED;
+	fa = xfs_inobt_check_irec(cur, &irec);
+	if (fa)
+		return xfs_inobt_complain_bad_rec(cur, fa, &irec);
 
 	ci->count += irec.ir_count;
 	ci->freecount += irec.ir_freecount;
diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
index b77dea10c8bd..335f84bef81c 100644
--- a/fs/xfs/libxfs/xfs_refcount.c
+++ b/fs/xfs/libxfs/xfs_refcount.c
@@ -144,6 +144,23 @@ xfs_refcount_check_irec(
 	return NULL;
 }
 
+static inline int
+xfs_refcount_complain_bad_rec(
+	struct xfs_btree_cur		*cur,
+	xfs_failaddr_t			fa,
+	const struct xfs_refcount_irec	*irec)
+{
+	struct xfs_mount		*mp = cur->bc_mp;
+
+	xfs_warn(mp,
+ "Refcount BTree record corruption in AG %d detected at %pS!",
+				cur->bc_ag.pag->pag_agno, fa);
+	xfs_warn(mp,
+		"Start block 0x%x, block count 0x%x, references 0x%x",
+		irec->rc_startblock, irec->rc_blockcount, irec->rc_refcount);
+	return -EFSCORRUPTED;
+}
+
 /*
  * Get the data from the pointed-to record.
  */
@@ -153,8 +170,6 @@ xfs_refcount_get_rec(
 	struct xfs_refcount_irec	*irec,
 	int				*stat)
 {
-	struct xfs_mount		*mp = cur->bc_mp;
-	struct xfs_perag		*pag = cur->bc_ag.pag;
 	union xfs_btree_rec		*rec;
 	xfs_failaddr_t			fa;
 	int				error;
@@ -166,19 +181,10 @@ xfs_refcount_get_rec(
 	xfs_refcount_btrec_to_irec(rec, irec);
 	fa = xfs_refcount_check_irec(cur, irec);
 	if (fa)
-		goto out_bad_rec;
+		return xfs_refcount_complain_bad_rec(cur, fa, irec);
 
-	trace_xfs_refcount_get(cur->bc_mp, pag->pag_agno, irec);
+	trace_xfs_refcount_get(cur->bc_mp, cur->bc_ag.pag->pag_agno, irec);
 	return 0;
-
-out_bad_rec:
-	xfs_warn(mp,
-		"Refcount BTree record corruption in AG %d detected at %pS!",
-		pag->pag_agno, fa);
-	xfs_warn(mp,
-		"Start block 0x%x, block count 0x%x, references 0x%x",
-		irec->rc_startblock, irec->rc_blockcount, irec->rc_refcount);
-	return -EFSCORRUPTED;
 }
 
 /*
diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c
index 5c7b081cef87..641114a023f2 100644
--- a/fs/xfs/libxfs/xfs_rmap.c
+++ b/fs/xfs/libxfs/xfs_rmap.c
@@ -235,6 +235,24 @@ xfs_rmap_check_irec(
 	return NULL;
 }
 
+static inline int
+xfs_rmap_complain_bad_rec(
+	struct xfs_btree_cur		*cur,
+	xfs_failaddr_t			fa,
+	const struct xfs_rmap_irec	*irec)
+{
+	struct xfs_mount		*mp = cur->bc_mp;
+
+	xfs_warn(mp,
+		"Reverse Mapping BTree record corruption in AG %d detected at %pS!",
+		cur->bc_ag.pag->pag_agno, fa);
+	xfs_warn(mp,
+		"Owner 0x%llx, flags 0x%x, start block 0x%x block count 0x%x",
+		irec->rm_owner, irec->rm_flags, irec->rm_startblock,
+		irec->rm_blockcount);
+	return -EFSCORRUPTED;
+}
+
 /*
  * Get the data from the pointed-to record.
  */
@@ -244,8 +262,6 @@ xfs_rmap_get_rec(
 	struct xfs_rmap_irec	*irec,
 	int			*stat)
 {
-	struct xfs_mount	*mp = cur->bc_mp;
-	struct xfs_perag	*pag = cur->bc_ag.pag;
 	union xfs_btree_rec	*rec;
 	xfs_failaddr_t		fa;
 	int			error;
@@ -258,18 +274,9 @@ xfs_rmap_get_rec(
 	if (!fa)
 		fa = xfs_rmap_check_irec(cur, irec);
 	if (fa)
-		goto out_bad_rec;
+		return xfs_rmap_complain_bad_rec(cur, fa, irec);
 
 	return 0;
-out_bad_rec:
-	xfs_warn(mp,
-		"Reverse Mapping BTree record corruption in AG %d detected at %pS!",
-		pag->pag_agno, fa);
-	xfs_warn(mp,
-		"Owner 0x%llx, flags 0x%x, start block 0x%x block count 0x%x",
-		irec->rm_owner, irec->rm_flags, irec->rm_startblock,
-		irec->rm_blockcount);
-	return -EFSCORRUPTED;
 }
 
 struct xfs_find_left_neighbor_info {
@@ -2335,10 +2342,13 @@ xfs_rmap_query_range_helper(
 {
 	struct xfs_rmap_query_range_info	*query = priv;
 	struct xfs_rmap_irec			irec;
+	xfs_failaddr_t				fa;
 
-	if (xfs_rmap_btrec_to_irec(rec, &irec) != NULL ||
-	    xfs_rmap_check_irec(cur, &irec) != NULL)
-		return -EFSCORRUPTED;
+	fa = xfs_rmap_btrec_to_irec(rec, &irec);
+	if (!fa)
+		fa = xfs_rmap_check_irec(cur, &irec);
+	if (fa)
+		return xfs_rmap_complain_bad_rec(cur, fa, &irec);
 
 	return query->fn(cur, &irec, query->priv);
 }


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 8/8] xfs: complain about bad file mapping records in the ondisk bmbt
  2022-12-30 22:11 ` [PATCHSET v24.0 0/8] xfs: standardize btree record checking code Darrick J. Wong
                     ` (5 preceding siblings ...)
  2022-12-30 22:11   ` [PATCH 6/8] xfs: standardize ondisk to incore conversion for bmap btrees Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 5/8] xfs: standardize ondisk to incore conversion for rmap btrees Darrick J. Wong
  7 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Similar to what we've just done for the other btrees, create a function
to log corrupt bmbt records and call it whenever we encounter a bad
record in the ondisk btree.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_bmap.c       |   31 ++++++++++++++++++++++++++++++-
 fs/xfs/libxfs/xfs_bmap.h       |    2 ++
 fs/xfs/libxfs/xfs_inode_fork.c |    3 ++-
 3 files changed, 34 insertions(+), 2 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 45dfa5a56154..d9083cbeb20e 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -1099,6 +1099,34 @@ struct xfs_iread_state {
 	xfs_extnum_t		loaded;
 };
 
+int
+xfs_bmap_complain_bad_rec(
+	struct xfs_inode		*ip,
+	int				whichfork,
+	xfs_failaddr_t			fa,
+	const struct xfs_bmbt_irec	*irec)
+{
+	struct xfs_mount		*mp = ip->i_mount;
+	const char			*forkname;
+
+	switch (whichfork) {
+	case XFS_DATA_FORK:	forkname = "data"; break;
+	case XFS_ATTR_FORK:	forkname = "attr"; break;
+	case XFS_COW_FORK:	forkname = "CoW"; break;
+	default:		forkname = "???"; break;
+	}
+
+	xfs_warn(mp,
+ "Bmap BTree record corruption in inode 0x%llx %s fork detected at %pS!",
+				ip->i_ino, forkname, fa);
+	xfs_warn(mp,
+		"Offset 0x%llx, start block 0x%llx, block count 0x%llx state 0x%x",
+		irec->br_startoff, irec->br_startblock, irec->br_blockcount,
+		irec->br_state);
+
+	return -EFSCORRUPTED;
+}
+
 /* Stuff every bmbt record from this block into the incore extent map. */
 static int
 xfs_iread_bmbt_block(
@@ -1141,7 +1169,8 @@ xfs_iread_bmbt_block(
 			xfs_inode_verifier_error(ip, -EFSCORRUPTED,
 					"xfs_iread_extents(2)", frp,
 					sizeof(*frp), fa);
-			return -EFSCORRUPTED;
+			return xfs_bmap_complain_bad_rec(ip, whichfork, fa,
+					&new);
 		}
 		xfs_iext_insert(ip, &ir->icur, &new,
 				xfs_bmap_fork_to_state(whichfork));
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index 0cd86781fcd5..7af24f2ef8a2 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -258,6 +258,8 @@ static inline uint32_t xfs_bmap_fork_to_state(int whichfork)
 
 xfs_failaddr_t xfs_bmap_validate_extent(struct xfs_inode *ip, int whichfork,
 		struct xfs_bmbt_irec *irec);
+int xfs_bmap_complain_bad_rec(struct xfs_inode *ip, int whichfork,
+		xfs_failaddr_t fa, const struct xfs_bmbt_irec *irec);
 
 int	xfs_bmapi_remap(struct xfs_trans *tp, struct xfs_inode *ip,
 		xfs_fileoff_t bno, xfs_filblks_t len, xfs_fsblock_t startblock,
diff --git a/fs/xfs/libxfs/xfs_inode_fork.c b/fs/xfs/libxfs/xfs_inode_fork.c
index 6b21760184d9..ff37eecec4b0 100644
--- a/fs/xfs/libxfs/xfs_inode_fork.c
+++ b/fs/xfs/libxfs/xfs_inode_fork.c
@@ -140,7 +140,8 @@ xfs_iformat_extents(
 				xfs_inode_verifier_error(ip, -EFSCORRUPTED,
 						"xfs_iformat_extents(2)",
 						dp, sizeof(*dp), fa);
-				return -EFSCORRUPTED;
+				return xfs_bmap_complain_bad_rec(ip, whichfork,
+						fa, &new);
 			}
 
 			xfs_iext_insert(ip, &icur, &new, state);


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCHSET v24.0 0/3] xfs: hoist scrub record checks into libxfs
  2022-12-30 21:13 [NYE DELUGE 1/4] xfs: all pending online scrub improvements Darrick J. Wong
                   ` (5 preceding siblings ...)
  2022-12-30 22:11 ` [PATCHSET v24.0 0/8] xfs: standardize btree record checking code Darrick J. Wong
@ 2022-12-30 22:11 ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 3/3] xfs: hoist inode record alignment checks from scrub Darrick J. Wong
                     ` (2 more replies)
  2022-12-30 22:11 ` [PATCHSET v24.0 0/2] xfs: fix rmap btree key flag handling Darrick J. Wong
                   ` (15 subsequent siblings)
  22 siblings, 3 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

Hi all,

There are a few things about btree records that scrub checked but the
libxfs _get_rec functions didn't.  Move these bits into libxfs so that
everyone can benefit.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=btree-hoist-scrub-checks

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=btree-hoist-scrub-checks
---
 fs/xfs/libxfs/xfs_ialloc.c |    4 ++++
 fs/xfs/libxfs/xfs_rmap.c   |   27 +++++++++++++++++++++++++++
 fs/xfs/scrub/ialloc.c      |    6 ------
 fs/xfs/scrub/rmap.c        |   22 ----------------------
 4 files changed, 31 insertions(+), 28 deletions(-)


^ permalink raw reply	[flat|nested] 220+ messages in thread

* [PATCH 1/3] xfs: hoist rmap record flag checks from scrub
  2022-12-30 22:11 ` [PATCHSET v24.0 0/3] xfs: hoist scrub record checks into libxfs Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 3/3] xfs: hoist inode record alignment checks from scrub Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 2/3] xfs: hoist rmap record flag " Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  2 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Move the rmap record flag checks from xchk_rmapbt_rec into
xfs_rmap_check_irec so that they are applied everywhere.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_rmap.c |   22 ++++++++++++++++++++++
 fs/xfs/scrub/rmap.c      |   22 ----------------------
 2 files changed, 22 insertions(+), 22 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c
index 641114a023f2..e66ecd794a84 100644
--- a/fs/xfs/libxfs/xfs_rmap.c
+++ b/fs/xfs/libxfs/xfs_rmap.c
@@ -212,6 +212,10 @@ xfs_rmap_check_irec(
 	const struct xfs_rmap_irec	*irec)
 {
 	struct xfs_mount		*mp = cur->bc_mp;
+	bool				is_inode;
+	bool				is_unwritten;
+	bool				is_bmbt;
+	bool				is_attr;
 
 	if (irec->rm_blockcount == 0)
 		return __this_address;
@@ -232,6 +236,24 @@ xfs_rmap_check_irec(
 	       irec->rm_owner >= XFS_RMAP_OWN_MIN)))
 		return __this_address;
 
+	/* Check flags. */
+	is_inode = !XFS_RMAP_NON_INODE_OWNER(irec->rm_owner);
+	is_bmbt = irec->rm_flags & XFS_RMAP_BMBT_BLOCK;
+	is_attr = irec->rm_flags & XFS_RMAP_ATTR_FORK;
+	is_unwritten = irec->rm_flags & XFS_RMAP_UNWRITTEN;
+
+	if (is_bmbt && irec->rm_offset != 0)
+		return __this_address;
+
+	if (!is_inode && irec->rm_offset != 0)
+		return __this_address;
+
+	if (is_unwritten && (is_bmbt || !is_inode || is_attr))
+		return __this_address;
+
+	if (!is_inode && (is_bmbt || is_unwritten || is_attr))
+		return __this_address;
+
 	return NULL;
 }
 
diff --git a/fs/xfs/scrub/rmap.c b/fs/xfs/scrub/rmap.c
index 610b16f77e7e..a039008dc078 100644
--- a/fs/xfs/scrub/rmap.c
+++ b/fs/xfs/scrub/rmap.c
@@ -94,10 +94,6 @@ xchk_rmapbt_rec(
 	const union xfs_btree_rec *rec)
 {
 	struct xfs_rmap_irec	irec;
-	bool			non_inode;
-	bool			is_unwritten;
-	bool			is_bmbt;
-	bool			is_attr;
 
 	if (xfs_rmap_btrec_to_irec(rec, &irec) != NULL ||
 	    xfs_rmap_check_irec(bs->cur, &irec) != NULL) {
@@ -105,24 +101,6 @@ xchk_rmapbt_rec(
 		return 0;
 	}
 
-	/* Check flags. */
-	non_inode = XFS_RMAP_NON_INODE_OWNER(irec.rm_owner);
-	is_bmbt = irec.rm_flags & XFS_RMAP_BMBT_BLOCK;
-	is_attr = irec.rm_flags & XFS_RMAP_ATTR_FORK;
-	is_unwritten = irec.rm_flags & XFS_RMAP_UNWRITTEN;
-
-	if (is_bmbt && irec.rm_offset != 0)
-		xchk_btree_set_corrupt(bs->sc, bs->cur, 0);
-
-	if (non_inode && irec.rm_offset != 0)
-		xchk_btree_set_corrupt(bs->sc, bs->cur, 0);
-
-	if (is_unwritten && (is_bmbt || non_inode || is_attr))
-		xchk_btree_set_corrupt(bs->sc, bs->cur, 0);
-
-	if (non_inode && (is_bmbt || is_unwritten || is_attr))
-		xchk_btree_set_corrupt(bs->sc, bs->cur, 0);
-
 	xchk_rmapbt_xref(bs->sc, &irec);
 	return 0;
 }


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 2/3] xfs: hoist rmap record flag checks from scrub
  2022-12-30 22:11 ` [PATCHSET v24.0 0/3] xfs: hoist scrub record checks into libxfs Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 3/3] xfs: hoist inode record alignment checks from scrub Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 1/3] " Darrick J. Wong
  2 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Teach xfs_rmap_check_irec to verify that inode-owned, non-bmbt rmap
records carry a valid fork offset, so that this validation is applied
everywhere.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_rmap.c |    5 +++++
 1 file changed, 5 insertions(+)


diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c
index e66ecd794a84..da008d317f83 100644
--- a/fs/xfs/libxfs/xfs_rmap.c
+++ b/fs/xfs/libxfs/xfs_rmap.c
@@ -254,6 +254,11 @@ xfs_rmap_check_irec(
 	if (!is_inode && (is_bmbt || is_unwritten || is_attr))
 		return __this_address;
 
+	/* Check for a valid fork offset, if applicable. */
+	if (is_inode && !is_bmbt &&
+	    !xfs_verify_fileext(mp, irec->rm_offset, irec->rm_blockcount))
+		return __this_address;
+
 	return NULL;
 }
 


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 3/3] xfs: hoist inode record alignment checks from scrub
  2022-12-30 22:11 ` [PATCHSET v24.0 0/3] xfs: hoist scrub record checks into libxfs Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 2/3] xfs: hoist rmap record flag " Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 1/3] " Darrick J. Wong
  2 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Move the inobt record alignment checks from xchk_iallocbt_rec into
xfs_inobt_check_irec so that they are applied everywhere.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_ialloc.c |    4 ++++
 fs/xfs/scrub/ialloc.c      |    6 ------
 2 files changed, 4 insertions(+), 6 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index b6f76935504e..2451db4c687c 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -103,8 +103,12 @@ xfs_inobt_check_irec(
 {
 	uint64_t			realfree;
 
+	/* Record has to be properly aligned within the AG. */
 	if (!xfs_verify_agino(cur->bc_ag.pag, irec->ir_startino))
 		return __this_address;
+	if (!xfs_verify_agino(cur->bc_ag.pag,
+				irec->ir_startino + XFS_INODES_PER_CHUNK - 1))
+		return __this_address;
 	if (irec->ir_count < XFS_INODES_PER_HOLEMASK_BIT ||
 	    irec->ir_count > XFS_INODES_PER_CHUNK)
 		return __this_address;
diff --git a/fs/xfs/scrub/ialloc.c b/fs/xfs/scrub/ialloc.c
index 9aec5a793397..b85f0cd00bc2 100644
--- a/fs/xfs/scrub/ialloc.c
+++ b/fs/xfs/scrub/ialloc.c
@@ -413,7 +413,6 @@ xchk_iallocbt_rec(
 	const union xfs_btree_rec	*rec)
 {
 	struct xfs_mount		*mp = bs->cur->bc_mp;
-	struct xfs_perag		*pag = bs->cur->bc_ag.pag;
 	struct xchk_iallocbt		*iabt = bs->private;
 	struct xfs_inobt_rec_incore	irec;
 	uint64_t			holes;
@@ -431,11 +430,6 @@ xchk_iallocbt_rec(
 	}
 
 	agino = irec.ir_startino;
-	/* Record has to be properly aligned within the AG. */
-	if (!xfs_verify_agino(pag, agino + XFS_INODES_PER_CHUNK - 1)) {
-		xchk_btree_set_corrupt(bs->sc, bs->cur, 0);
-		goto out;
-	}
 
 	xchk_iallocbt_rec_alignment(bs, &irec);
 	if (bs->sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT)


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCHSET v24.0 0/2] xfs: fix rmap btree key flag handling
  2022-12-30 21:13 [NYE DELUGE 1/4] xfs: all pending online scrub improvements Darrick J. Wong
                   ` (6 preceding siblings ...)
  2022-12-30 22:11 ` [PATCHSET v24.0 0/3] xfs: hoist scrub record checks into libxfs Darrick J. Wong
@ 2022-12-30 22:11 ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 1/2] xfs: fix rm_offset flag handling in rmap keys Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 2/2] xfs: detect unwritten bit set in rmapbt node block keys Darrick J. Wong
  2022-12-30 22:11 ` [PATCHSET v24.0 0/2] xfs: enhance btree key scrubbing Darrick J. Wong
                   ` (14 subsequent siblings)
  22 siblings, 2 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

Hi all,

This series fixes numerous flag handling bugs in the rmapbt key code.
The most serious transgression is that key comparisons completely strip
out all flag bits from rm_offset, including the ones that participate in
record lookups.  The second problem is that for years we've been letting
the unwritten flag (which is an attribute of a specific record and not
part of the record key) escape from leaf records into key records.

The solution to the second problem is to filter attribute flags when
creating keys from records, and the solution to the first problem is to
preserve *only* the flags used for key lookups.  The ATTR and BMBT flags
are a part of the lookup key, and the UNWRITTEN flag is a record
attribute.

This has worked for years without generating user complaints because
ATTR and BMBT extents cannot be shared, so key comparisons succeed
solely on rm_startblock.  Only file data fork extents can be shared, and
those records never set any of the three flag bits, so comparisons that
dig into rm_owner and rm_offset work just fine.

A filesystem written with an unpatched kernel and mounted on a patched
kernel will work correctly because the ATTR/BMBT flags have been
conveyed into keys correctly all along, and we still ignore the
UNWRITTEN flag in any key record.  This was what doomed my previous
attempt to correct this problem in 2019.

A filesystem written with a patched kernel and mounted on an unpatched
kernel will also work correctly because unpatched kernels ignore all
flags.

With this patchset applied, the scrub code gains the ability to detect
rmap btrees with incorrectly set attr and bmbt flags in the key records.
After three years of testing, I haven't encountered any problems.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=rmap-btree-fix-key-handling

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=rmap-btree-fix-key-handling
---
 fs/xfs/libxfs/xfs_rmap_btree.c |   40 +++++++++++++++++++++++-------
 fs/xfs/scrub/btree.c           |   10 ++++++++
 fs/xfs/scrub/btree.h           |    2 ++
 fs/xfs/scrub/rmap.c            |   53 ++++++++++++++++++++++++++++++++++++++++
 4 files changed, 95 insertions(+), 10 deletions(-)


^ permalink raw reply	[flat|nested] 220+ messages in thread

* [PATCH 1/2] xfs: fix rm_offset flag handling in rmap keys
  2022-12-30 22:11 ` [PATCHSET v24.0 0/2] xfs: fix rmap btree key flag handling Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 2/2] xfs: detect unwritten bit set in rmapbt node block keys Darrick J. Wong
  1 sibling, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Keys for extent interval records in the reverse mapping btree are
supposed to be computed as follows:

(physical block, owner, fork, is_btree, offset)

This provides users the ability to look up a reverse mapping from a file
block mapping record -- start with the physical block; then if there are
multiple records for the same block, move on to the owner; then the
inode fork type; and so on to the file offset.
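
Expressed as code, the intended ordering boils down to something like
the following sketch (the helper name is hypothetical; the field names,
xfs_rmap_irec_offset_pack, and XFS_RMAP_OFF_UNWRITTEN are the real ones
used elsewhere in this series):

/*
 * Illustrative comparison in the intended lookup order: physical block,
 * then owner, then the packed offset with the fork and bmbt flags
 * included.  The unwritten bit is a record attribute, not a key field,
 * so it is masked out.  Not the actual btree comparison code.
 */
static int64_t
xfs_rmap_key_cmp_sketch(
	struct xfs_rmap_irec	*a,
	struct xfs_rmap_irec	*b)
{
	uint64_t		ao, bo;

	if (a->rm_startblock != b->rm_startblock)
		return a->rm_startblock < b->rm_startblock ? -1 : 1;
	if (a->rm_owner != b->rm_owner)
		return a->rm_owner < b->rm_owner ? -1 : 1;

	ao = xfs_rmap_irec_offset_pack(a) & ~XFS_RMAP_OFF_UNWRITTEN;
	bo = xfs_rmap_irec_offset_pack(b) & ~XFS_RMAP_OFF_UNWRITTEN;
	if (ao != bo)
		return ao < bo ? -1 : 1;
	return 0;
}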

Unfortunately, the code that creates rmap lookup keys from rmap records
forgot to mask off the record attribute flags, leading to ondisk keys
that look like this:

(physical block, owner, fork, is_btree, unwritten state, offset)

Fortunately, this has all worked ok for the past six years because the
key comparison functions incorrectly ignore the fork/bmbt/unwritten
information that's encoded in the on-disk offset.  This means that
lookup comparisons are only done with:

(physical block, owner, offset)

Queries can (theoretically) return incorrect results because of this
omission.  On consistent filesystems this isn't an issue because xattr
and bmbt blocks cannot be shared and hence the comparisons succeed
purely on the contents of the rm_startblock field.  For the one case
where we support sharing (written data fork blocks) all flag bits are
zero, so the omission in the comparison has no ill effects.

Unfortunately, this bug prevents scrub from detecting incorrect fork and
bmbt flag bits in the rmap btree, so we really do need to fix the
compare code.  Old filesystems with the unwritten bit erroneously set in
the rmap key struct will work fine on new kernels since we still ignore
the unwritten bit.  New filesystems on older kernels will work fine
since the old kernels never paid attention to the unwritten bit.

A previous version of this patch forgot to keep the (un)written state
flag masked during the comparison and caused a major regression in
5.9.x since unwritten extent conversion can update an rmap record
without requiring key updates.

Note that blocks cannot go directly from data fork to attr fork without
being deallocated and reallocated, nor can they be added to or removed
from a bmbt without a free/alloc cycle, so this should not cause any
regressions.

Found by fuzzing keys[1].attrfork = ones on xfs/371.

Fixes: 4b8ed67794fe ("xfs: add rmap btree operations")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_rmap_btree.c |   40 ++++++++++++++++++++++++++++++----------
 1 file changed, 30 insertions(+), 10 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
index 12c26c42c162..e18f89a68da9 100644
--- a/fs/xfs/libxfs/xfs_rmap_btree.c
+++ b/fs/xfs/libxfs/xfs_rmap_btree.c
@@ -156,6 +156,16 @@ xfs_rmapbt_get_maxrecs(
 	return cur->bc_mp->m_rmap_mxr[level != 0];
 }
 
+/*
+ * Convert the ondisk record's offset field into the ondisk key's offset field.
+ * Fork and bmbt are significant parts of the rmap record key, but written
+ * status is merely a record attribute.
+ */
+static inline __be64 ondisk_rec_offset_to_key(const union xfs_btree_rec *rec)
+{
+	return rec->rmap.rm_offset & ~cpu_to_be64(XFS_RMAP_OFF_UNWRITTEN);
+}
+
 STATIC void
 xfs_rmapbt_init_key_from_rec(
 	union xfs_btree_key		*key,
@@ -163,7 +173,7 @@ xfs_rmapbt_init_key_from_rec(
 {
 	key->rmap.rm_startblock = rec->rmap.rm_startblock;
 	key->rmap.rm_owner = rec->rmap.rm_owner;
-	key->rmap.rm_offset = rec->rmap.rm_offset;
+	key->rmap.rm_offset = ondisk_rec_offset_to_key(rec);
 }
 
 /*
@@ -186,7 +196,7 @@ xfs_rmapbt_init_high_key_from_rec(
 	key->rmap.rm_startblock = rec->rmap.rm_startblock;
 	be32_add_cpu(&key->rmap.rm_startblock, adj);
 	key->rmap.rm_owner = rec->rmap.rm_owner;
-	key->rmap.rm_offset = rec->rmap.rm_offset;
+	key->rmap.rm_offset = ondisk_rec_offset_to_key(rec);
 	if (XFS_RMAP_NON_INODE_OWNER(be64_to_cpu(rec->rmap.rm_owner)) ||
 	    XFS_RMAP_IS_BMBT_BLOCK(be64_to_cpu(rec->rmap.rm_offset)))
 		return;
@@ -219,6 +229,16 @@ xfs_rmapbt_init_ptr_from_cur(
 	ptr->s = agf->agf_roots[cur->bc_btnum];
 }
 
+/*
+ * Mask the appropriate parts of the ondisk key field for a key comparison.
+ * Fork and bmbt are significant parts of the rmap record key, but written
+ * status is merely a record attribute.
+ */
+static inline uint64_t offset_keymask(uint64_t offset)
+{
+	return offset & ~XFS_RMAP_OFF_UNWRITTEN;
+}
+
 STATIC int64_t
 xfs_rmapbt_key_diff(
 	struct xfs_btree_cur		*cur,
@@ -240,8 +260,8 @@ xfs_rmapbt_key_diff(
 	else if (y > x)
 		return -1;
 
-	x = XFS_RMAP_OFF(be64_to_cpu(kp->rm_offset));
-	y = rec->rm_offset;
+	x = offset_keymask(be64_to_cpu(kp->rm_offset));
+	y = offset_keymask(xfs_rmap_irec_offset_pack(rec));
 	if (x > y)
 		return 1;
 	else if (y > x)
@@ -272,8 +292,8 @@ xfs_rmapbt_diff_two_keys(
 	else if (y > x)
 		return -1;
 
-	x = XFS_RMAP_OFF(be64_to_cpu(kp1->rm_offset));
-	y = XFS_RMAP_OFF(be64_to_cpu(kp2->rm_offset));
+	x = offset_keymask(be64_to_cpu(kp1->rm_offset));
+	y = offset_keymask(be64_to_cpu(kp2->rm_offset));
 	if (x > y)
 		return 1;
 	else if (y > x)
@@ -387,8 +407,8 @@ xfs_rmapbt_keys_inorder(
 		return 1;
 	else if (a > b)
 		return 0;
-	a = XFS_RMAP_OFF(be64_to_cpu(k1->rmap.rm_offset));
-	b = XFS_RMAP_OFF(be64_to_cpu(k2->rmap.rm_offset));
+	a = offset_keymask(be64_to_cpu(k1->rmap.rm_offset));
+	b = offset_keymask(be64_to_cpu(k2->rmap.rm_offset));
 	if (a <= b)
 		return 1;
 	return 0;
@@ -417,8 +437,8 @@ xfs_rmapbt_recs_inorder(
 		return 1;
 	else if (a > b)
 		return 0;
-	a = XFS_RMAP_OFF(be64_to_cpu(r1->rmap.rm_offset));
-	b = XFS_RMAP_OFF(be64_to_cpu(r2->rmap.rm_offset));
+	a = offset_keymask(be64_to_cpu(r1->rmap.rm_offset));
+	b = offset_keymask(be64_to_cpu(r2->rmap.rm_offset));
 	if (a <= b)
 		return 1;
 	return 0;


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 2/2] xfs: detect unwritten bit set in rmapbt node block keys
  2022-12-30 22:11 ` [PATCHSET v24.0 0/2] xfs: fix rmap btree key flag handling Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 1/2] xfs: fix rm_offset flag handling in rmap keys Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  1 sibling, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

In the last patch, we changed the rmapbt code to remove the UNWRITTEN
bit when creating an rmapbt key from an rmapbt record, and we changed
the rmapbt key comparison code to start considering the ATTR and BMBT
flags during lookup.  This brought the behavior of the rmapbt
implementation in line with its specification.

However, there may exist filesystems that have the unwritten bit still
set in the rmapbt keys.  We should detect these situations and flag the
rmapbt as one that would benefit from optimization.  Eventually, online
repair will be able to do something in response to this.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/btree.c |   10 +++++++++
 fs/xfs/scrub/btree.h |    2 ++
 fs/xfs/scrub/rmap.c  |   53 ++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 65 insertions(+)


diff --git a/fs/xfs/scrub/btree.c b/fs/xfs/scrub/btree.c
index ebbf1c5fd0c6..634c504bac20 100644
--- a/fs/xfs/scrub/btree.c
+++ b/fs/xfs/scrub/btree.c
@@ -119,6 +119,16 @@ xchk_btree_xref_set_corrupt(
 			__return_address);
 }
 
+void
+xchk_btree_set_preen(
+	struct xfs_scrub	*sc,
+	struct xfs_btree_cur	*cur,
+	int			level)
+{
+	__xchk_btree_set_corrupt(sc, cur, level, XFS_SCRUB_OFLAG_PREEN,
+			__return_address);
+}
+
 /*
  * Make sure this record is in order and doesn't stray outside of the parent
  * keys.
diff --git a/fs/xfs/scrub/btree.h b/fs/xfs/scrub/btree.h
index da61a53a0b61..26c499925b5e 100644
--- a/fs/xfs/scrub/btree.h
+++ b/fs/xfs/scrub/btree.h
@@ -19,6 +19,8 @@ bool xchk_btree_xref_process_error(struct xfs_scrub *sc,
 /* Check for btree corruption. */
 void xchk_btree_set_corrupt(struct xfs_scrub *sc,
 		struct xfs_btree_cur *cur, int level);
+void xchk_btree_set_preen(struct xfs_scrub *sc, struct xfs_btree_cur *cur,
+		int level);
 
 /* Check for btree xref discrepancies. */
 void xchk_btree_xref_set_corrupt(struct xfs_scrub *sc,
diff --git a/fs/xfs/scrub/rmap.c b/fs/xfs/scrub/rmap.c
index a039008dc078..215730a9d9bf 100644
--- a/fs/xfs/scrub/rmap.c
+++ b/fs/xfs/scrub/rmap.c
@@ -87,6 +87,58 @@ xchk_rmapbt_xref(
 		xchk_rmapbt_xref_refc(sc, irec);
 }
 
+/*
+ * Check for bogus UNWRITTEN flags in the rmapbt node block keys.
+ *
+ * In reverse mapping records, the file mapping extent state
+ * (XFS_RMAP_OFF_UNWRITTEN) is a record attribute, not a key field.  It is not
+ * involved in lookups in any way.  In older kernels, the functions that
+ * convert rmapbt records to keys forgot to filter out the extent state bit,
+ * even though the key comparison functions have filtered the flag correctly.
+ * If we spot an rmap key with the unwritten bit set in rm_offset, we should
+ * mark the btree as needing optimization to rebuild the btree without those
+ * flags.
+ */
+STATIC void
+xchk_rmapbt_check_unwritten_in_keyflags(
+	struct xchk_btree	*bs)
+{
+	struct xfs_scrub	*sc = bs->sc;
+	struct xfs_btree_cur	*cur = bs->cur;
+	struct xfs_btree_block	*keyblock;
+	union xfs_btree_key	*lkey, *hkey;
+	__be64			badflag = cpu_to_be64(XFS_RMAP_OFF_UNWRITTEN);
+	unsigned int		level;
+
+	if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_PREEN)
+		return;
+
+	for (level = 1; level < cur->bc_nlevels; level++) {
+		struct xfs_buf	*bp;
+		unsigned int	ptr;
+
+		/* Only check the first time we've seen this node block. */
+		if (cur->bc_levels[level].ptr > 1)
+			continue;
+
+		keyblock = xfs_btree_get_block(cur, level, &bp);
+		for (ptr = 1; ptr <= be16_to_cpu(keyblock->bb_numrecs); ptr++) {
+			lkey = xfs_btree_key_addr(cur, ptr, keyblock);
+
+			if (lkey->rmap.rm_offset & badflag) {
+				xchk_btree_set_preen(sc, cur, level);
+				break;
+			}
+
+			hkey = xfs_btree_high_key_addr(cur, ptr, keyblock);
+			if (hkey->rmap.rm_offset & badflag) {
+				xchk_btree_set_preen(sc, cur, level);
+				break;
+			}
+		}
+	}
+}
+
 /* Scrub an rmapbt record. */
 STATIC int
 xchk_rmapbt_rec(
@@ -101,6 +153,7 @@ xchk_rmapbt_rec(
 		return 0;
 	}
 
+	xchk_rmapbt_check_unwritten_in_keyflags(bs);
 	xchk_rmapbt_xref(bs->sc, &irec);
 	return 0;
 }


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCHSET v24.0 0/2] xfs: enhance btree key scrubbing
  2022-12-30 21:13 [NYE DELUGE 1/4] xfs: all pending online scrub improvements Darrick J. Wong
                   ` (7 preceding siblings ...)
  2022-12-30 22:11 ` [PATCHSET v24.0 0/2] xfs: fix rmap btree key flag handling Darrick J. Wong
@ 2022-12-30 22:11 ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 2/2] xfs: always scrub record/key order of interior records Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 1/2] xfs: check btree keys reflect the child block Darrick J. Wong
  2022-12-30 22:11 ` [PATCHSET v24.0 0/6] xfs: detect incorrect gaps in refcount btree Darrick J. Wong
                   ` (13 subsequent siblings)
  22 siblings, 2 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: Dave Chinner, linux-xfs

Hi all,

This series fixes the scrub btree block checker to ensure that the keys
in the parent block accurately represent the block, and check the
ordering of all interior key records.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-btree-key-enhancements
---
 fs/xfs/scrub/btree.c |   63 ++++++++++++++++++++++++++++++++++++++++++++------
 fs/xfs/scrub/btree.h |    8 ++++++
 2 files changed, 63 insertions(+), 8 deletions(-)


^ permalink raw reply	[flat|nested] 220+ messages in thread

* [PATCH 1/2] xfs: check btree keys reflect the child block
  2022-12-30 22:11 ` [PATCHSET v24.0 0/2] xfs: enhance btree key scrubbing Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 2/2] xfs: always scrub record/key order of interior records Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  1 sibling, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: Dave Chinner, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

When scrub is checking a non-root btree block, it should make sure that
the keys in the parent btree block accurately capture the keyspace that
the child block stores.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/scrub/btree.c |   49 ++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 48 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/scrub/btree.c b/fs/xfs/scrub/btree.c
index 634c504bac20..615f52e56f4e 100644
--- a/fs/xfs/scrub/btree.c
+++ b/fs/xfs/scrub/btree.c
@@ -529,6 +529,48 @@ xchk_btree_check_minrecs(
 		xchk_btree_set_corrupt(bs->sc, cur, level);
 }
 
+/*
+ * If this btree block has a parent, make sure that the parent's keys capture
+ * the keyspace contained in this block.
+ */
+STATIC void
+xchk_btree_block_check_keys(
+	struct xchk_btree	*bs,
+	int			level,
+	struct xfs_btree_block	*block)
+{
+	union xfs_btree_key	block_key;
+	union xfs_btree_key	*block_high_key;
+	union xfs_btree_key	*parent_low_key, *parent_high_key;
+	struct xfs_btree_cur	*cur = bs->cur;
+	struct xfs_btree_block	*parent_block;
+	struct xfs_buf		*bp;
+
+	if (level == cur->bc_nlevels - 1)
+		return;
+
+	xfs_btree_get_keys(cur, block, &block_key);
+
+	/* Make sure the low key of this block matches the parent. */
+	parent_block = xfs_btree_get_block(cur, level + 1, &bp);
+	parent_low_key = xfs_btree_key_addr(cur, cur->bc_levels[level + 1].ptr,
+			parent_block);
+	if (cur->bc_ops->diff_two_keys(cur, &block_key, parent_low_key)) {
+		xchk_btree_set_corrupt(bs->sc, bs->cur, level);
+		return;
+	}
+
+	if (!(cur->bc_flags & XFS_BTREE_OVERLAPPING))
+		return;
+
+	/* Make sure the high key of this block matches the parent. */
+	parent_high_key = xfs_btree_high_key_addr(cur,
+			cur->bc_levels[level + 1].ptr, parent_block);
+	block_high_key = xfs_btree_high_key_from_key(cur, &block_key);
+	if (cur->bc_ops->diff_two_keys(cur, block_high_key, parent_high_key))
+		xchk_btree_set_corrupt(bs->sc, bs->cur, level);
+}
+
 /*
  * Grab and scrub a btree block given a btree pointer.  Returns block
  * and buffer pointers (if applicable) if they're ok to use.
@@ -580,7 +622,12 @@ xchk_btree_get_block(
 	 * Check the block's siblings; this function absorbs error codes
 	 * for us.
 	 */
-	return xchk_btree_block_check_siblings(bs, *pblock);
+	error = xchk_btree_block_check_siblings(bs, *pblock);
+	if (error)
+		return error;
+
+	xchk_btree_block_check_keys(bs, level, *pblock);
+	return 0;
 }
 
 /*


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 2/2] xfs: always scrub record/key order of interior records
  2022-12-30 22:11 ` [PATCHSET v24.0 0/2] xfs: enhance btree key scrubbing Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 1/2] xfs: check btree keys reflect the child block Darrick J. Wong
  1 sibling, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

In commit d47fef9342d0, we removed the firstrec and firstkey fields of
struct xchk_btree because Christoph thought they were unnecessary, since
we could use the record index in the btree cursor instead.  That
reasoning is incorrect: bc_ptrs (now bc_levels[].ptr) tracks the cursor
position within a specific btree block, not within the entire level.

The end result is that scrub no longer detects situations where the
rightmost record of a block is identical to the leftmost record of that
block's right sibling.  Fix this regression by reintroducing record
validity booleans so that order checking skips *only* the leftmost
record/key in each level.
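
Here's a toy userspace model of the regression (made-up record values,
not the kernel code): a check gated on the per-block record index never
fires across a block boundary, whereas carrying the previous record plus
a validity flag catches the duplicate in the right sibling.

#include <stdio.h>
#include <stdbool.h>

static const int blocks[2][3] = {
        { 5, 8, 12 },   /* leaf block 0 */
        { 12, 15, 20 }, /* leaf block 1: first record repeats 12 */
};

int main(void)
{
        int last = 0;
        bool last_valid = false;        /* replaces the "index > 1" test */
        int old_flagged = 0, new_flagged = 0;

        for (int b = 0; b < 2; b++) {
                for (int i = 0; i < 3; i++) {
                        int rec = blocks[b][i];

                        /*
                         * old check: the per-block index resets in every
                         * block, so this never compares across a boundary
                         */
                        if (i > 0 && !(last < rec))
                                old_flagged++;

                        /* new check: skips only the very first record */
                        if (last_valid && !(last < rec))
                                new_flagged++;

                        last = rec;
                        last_valid = true;
                }
        }
        printf("old check flagged %d, new check flagged %d\n",
               old_flagged, new_flagged);
        return 0;
}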

Fixes: d47fef9342d0 ("xfs: don't track firstrec/firstkey separately in xchk_btree")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/btree.c |   14 ++++++++------
 fs/xfs/scrub/btree.h |    8 +++++++-
 2 files changed, 15 insertions(+), 7 deletions(-)


diff --git a/fs/xfs/scrub/btree.c b/fs/xfs/scrub/btree.c
index 615f52e56f4e..2dfa3e1d5841 100644
--- a/fs/xfs/scrub/btree.c
+++ b/fs/xfs/scrub/btree.c
@@ -151,11 +151,12 @@ xchk_btree_rec(
 
 	trace_xchk_btree_rec(bs->sc, cur, 0);
 
-	/* If this isn't the first record, are they in order? */
-	if (cur->bc_levels[0].ptr > 1 &&
+	/* Are all records across all record blocks in order? */
+	if (bs->lastrec_valid &&
 	    !cur->bc_ops->recs_inorder(cur, &bs->lastrec, rec))
 		xchk_btree_set_corrupt(bs->sc, cur, 0);
 	memcpy(&bs->lastrec, rec, cur->bc_ops->rec_len);
+	bs->lastrec_valid = true;
 
 	if (cur->bc_nlevels == 1)
 		return;
@@ -198,11 +199,12 @@ xchk_btree_key(
 
 	trace_xchk_btree_key(bs->sc, cur, level);
 
-	/* If this isn't the first key, are they in order? */
-	if (cur->bc_levels[level].ptr > 1 &&
-	    !cur->bc_ops->keys_inorder(cur, &bs->lastkey[level - 1], key))
+	/* Are all low keys across all node blocks in order? */
+	if (bs->lastkey[level - 1].valid &&
+	    !cur->bc_ops->keys_inorder(cur, &bs->lastkey[level - 1].key, key))
 		xchk_btree_set_corrupt(bs->sc, cur, level);
-	memcpy(&bs->lastkey[level - 1], key, cur->bc_ops->key_len);
+	memcpy(&bs->lastkey[level - 1].key, key, cur->bc_ops->key_len);
+	bs->lastkey[level - 1].valid = true;
 
 	if (level + 1 >= cur->bc_nlevels)
 		return;
diff --git a/fs/xfs/scrub/btree.h b/fs/xfs/scrub/btree.h
index 26c499925b5e..def8da9c4b4a 100644
--- a/fs/xfs/scrub/btree.h
+++ b/fs/xfs/scrub/btree.h
@@ -31,6 +31,11 @@ typedef int (*xchk_btree_rec_fn)(
 	struct xchk_btree		*bs,
 	const union xfs_btree_rec	*rec);
 
+struct xchk_btree_key {
+	union xfs_btree_key		key;
+	bool				valid;
+};
+
 struct xchk_btree {
 	/* caller-provided scrub state */
 	struct xfs_scrub		*sc;
@@ -40,11 +45,12 @@ struct xchk_btree {
 	void				*private;
 
 	/* internal scrub state */
+	bool				lastrec_valid;
 	union xfs_btree_rec		lastrec;
 	struct list_head		to_check;
 
 	/* this element must come last! */
-	union xfs_btree_key		lastkey[];
+	struct xchk_btree_key		lastkey[];
 };
 
 /*


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCHSET v24.0 0/6] xfs: detect incorrect gaps in refcount btree
  2022-12-30 21:13 [NYE DELUGE 1/4] xfs: all pending online scrub improvements Darrick J. Wong
                   ` (8 preceding siblings ...)
  2022-12-30 22:11 ` [PATCHSET v24.0 0/2] xfs: enhance btree key scrubbing Darrick J. Wong
@ 2022-12-30 22:11 ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 3/6] xfs: replace xfs_btree_has_record with a general keyspace scanner Darrick J. Wong
                     ` (5 more replies)
  2022-12-30 22:11 ` [PATCHSET v24.0 0/4] xfs: detect incorrect gaps in inode btree Darrick J. Wong
                   ` (12 subsequent siblings)
  22 siblings, 6 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: Dave Chinner, Dave Chinner, linux-xfs

Hi all,

The next few patchsets address a deficiency in scrub that I found while
QAing the refcount btree scrubber.  If there's a gap between refcount
records, we need to cross-reference that gap with the reverse mappings
to ensure that there are no overlapping records in the rmap btree.  If
we find any overlapping rmaps in the gap, then the refcount btree is not
consistent.  This property is not specific to the refcount btree; all of
the btree scrubbers need this sort of keyspace scanning logic to detect
inconsistencies.

To do this accurately, we need to scan the keyspace of a btree (which
we already do) and tell the caller whether that keyspace is empty,
sparse, or fully covered by records.  The first few patches add the
keyspace scanner to the generic btree code, along with the ability to
mask off parts of btree keys, because when we scan the rmapbt we only
care about space usage, not the owners.

The final patch closes the scanning gap in the refcountbt scanner.
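
In userspace terms, the classification we're after is roughly the
following sketch (toy extents and plain integers, not the kernel
implementation):

#include <stdio.h>

enum recpacking { EMPTY, SPARSE, FULL };

struct ext { unsigned start, len; };

/*
 * Classify how much of [lo, hi] is covered by a sorted, non-overlapping
 * record set: none of it, some of it, or all of it.
 */
static enum recpacking classify(const struct ext *r, int nr,
                                unsigned lo, unsigned hi)
{
        unsigned next = lo;     /* first block not yet known to be covered */
        int seen = 0;

        for (int i = 0; i < nr; i++) {
                unsigned end = r[i].start + r[i].len - 1;

                if (end < lo || r[i].start > hi)
                        continue;       /* outside the query range */
                seen = 1;
                if (r[i].start > next)
                        return SPARSE;  /* hole before this record */
                if (end >= hi)
                        return FULL;    /* covered through the end */
                next = end + 1;
        }
        return seen ? SPARSE : EMPTY;   /* hole at the end, or no records */
}

int main(void)
{
        struct ext recs[] = { {10, 5}, {15, 10} };      /* covers 10..24 */

        printf("%d %d %d\n",
               classify(recs, 2, 30, 40),       /* EMPTY  */
               classify(recs, 2, 5, 20),        /* SPARSE */
               classify(recs, 2, 12, 20));      /* FULL   */
        return 0;
}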

v23.1: create helpers for the key extraction and comparison functions,
       improve documentation, and eliminate the ->mask_key indirect
       calls

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-refcount-gaps

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-detect-refcount-gaps
---
 fs/xfs/libxfs/xfs_alloc.c          |   11 +-
 fs/xfs/libxfs/xfs_alloc.h          |    4 -
 fs/xfs/libxfs/xfs_alloc_btree.c    |   28 ++++-
 fs/xfs/libxfs/xfs_bmap_btree.c     |   19 +++
 fs/xfs/libxfs/xfs_btree.c          |  208 ++++++++++++++++++++++++++----------
 fs/xfs/libxfs/xfs_btree.h          |  141 ++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_ialloc_btree.c   |   22 +++-
 fs/xfs/libxfs/xfs_refcount.c       |   11 +-
 fs/xfs/libxfs/xfs_refcount.h       |    4 -
 fs/xfs/libxfs/xfs_refcount_btree.c |   21 +++-
 fs/xfs/libxfs/xfs_rmap.c           |   15 ++-
 fs/xfs/libxfs/xfs_rmap.h           |    4 -
 fs/xfs/libxfs/xfs_rmap_btree.c     |   61 ++++++++---
 fs/xfs/libxfs/xfs_types.h          |   12 ++
 fs/xfs/scrub/agheader.c            |    5 +
 fs/xfs/scrub/alloc.c               |    7 +
 fs/xfs/scrub/bmap.c                |   11 +-
 fs/xfs/scrub/btree.c               |   24 ++--
 fs/xfs/scrub/ialloc.c              |    2 
 fs/xfs/scrub/inode.c               |    1 
 fs/xfs/scrub/refcount.c            |  124 ++++++++++++++++++++-
 fs/xfs/scrub/rmap.c                |    6 +
 fs/xfs/scrub/scrub.h               |    2 
 23 files changed, 612 insertions(+), 131 deletions(-)


^ permalink raw reply	[flat|nested] 220+ messages in thread

* [PATCH 1/6] xfs: refactor converting btree irec to btree key
  2022-12-30 22:11 ` [PATCHSET v24.0 0/6] xfs: detect incorrect gaps in refcount btree Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 3/6] xfs: replace xfs_btree_has_record with a general keyspace scanner Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 2/6] xfs: refactor ->diff_two_keys callsites Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 5/6] xfs: check the reference counts of gaps in the refcount btree Darrick J. Wong
                     ` (2 subsequent siblings)
  5 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: Dave Chinner, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

We keep doing these conversions to support btree queries, so refactor
this into a helper.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_btree.c |   23 +++++++++++++++--------
 1 file changed, 15 insertions(+), 8 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index 35f574421670..d02634c44bff 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -4923,6 +4923,19 @@ xfs_btree_overlapped_query_range(
 	return error;
 }
 
+static inline void
+xfs_btree_key_from_irec(
+	struct xfs_btree_cur		*cur,
+	union xfs_btree_key		*key,
+	const union xfs_btree_irec	*irec)
+{
+	union xfs_btree_rec		rec;
+
+	cur->bc_rec = *irec;
+	cur->bc_ops->init_rec_from_cur(cur, &rec);
+	cur->bc_ops->init_key_from_rec(key, &rec);
+}
+
 /*
  * Query a btree for all records overlapping a given interval of keys.  The
  * supplied function will be called with each record found; return one of the
@@ -4937,18 +4950,12 @@ xfs_btree_query_range(
 	xfs_btree_query_range_fn	fn,
 	void				*priv)
 {
-	union xfs_btree_rec		rec;
 	union xfs_btree_key		low_key;
 	union xfs_btree_key		high_key;
 
 	/* Find the keys of both ends of the interval. */
-	cur->bc_rec = *high_rec;
-	cur->bc_ops->init_rec_from_cur(cur, &rec);
-	cur->bc_ops->init_key_from_rec(&high_key, &rec);
-
-	cur->bc_rec = *low_rec;
-	cur->bc_ops->init_rec_from_cur(cur, &rec);
-	cur->bc_ops->init_key_from_rec(&low_key, &rec);
+	xfs_btree_key_from_irec(cur, &high_key, high_rec);
+	xfs_btree_key_from_irec(cur, &low_key, low_rec);
 
 	/* Enforce low key < high key. */
 	if (cur->bc_ops->diff_two_keys(cur, &low_key, &high_key) > 0)


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 2/6] xfs: refactor ->diff_two_keys callsites
  2022-12-30 22:11 ` [PATCHSET v24.0 0/6] xfs: detect incorrect gaps in refcount btree Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 3/6] xfs: replace xfs_btree_has_record with a general keyspace scanner Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 1/6] xfs: refactor converting btree irec to btree key Darrick J. Wong
                     ` (3 subsequent siblings)
  5 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: Dave Chinner, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create wrapper functions around ->diff_two_keys so that we don't have to
remember what the return values mean, and adjust some of the code
comments to reflect the longtime code behavior.  We're going to
introduce more uses of ->diff_two_keys in the next patch, so reduce the
cognitive load for readers by doing this refactoring now.
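
The shape of the cleanup, in miniature: a toy integer comparator
standing in for the real xfs_btree_ops method, so the callsites read as
named predicates instead of raw sign tests.

#include <stdio.h>
#include <stdbool.h>

static long long cmp_keys(unsigned a, unsigned b)
{
        return (long long)a - b;        /* <0, 0, >0 like ->diff_two_keys */
}

static bool key_lt(unsigned a, unsigned b) { return cmp_keys(a, b) < 0; }
static bool key_gt(unsigned a, unsigned b) { return cmp_keys(a, b) > 0; }
static bool key_eq(unsigned a, unsigned b) { return cmp_keys(a, b) == 0; }
static bool key_le(unsigned a, unsigned b) { return !key_gt(a, b); }
static bool key_ge(unsigned a, unsigned b) { return !key_lt(a, b); }

int main(void)
{
        unsigned low = 5, high = 9;

        /* "Enforce low key <= high key" reads directly off the predicate. */
        if (!key_le(low, high))
                puts("-EINVAL");
        else
                printf("lt=%d gt=%d eq=%d ge=%d\n",
                       key_lt(low, high), key_gt(low, high),
                       key_eq(low, high), key_ge(low, high));
        return 0;
}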

Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_btree.c |   61 +++++++++++++++++++--------------------------
 fs/xfs/libxfs/xfs_btree.h |   55 +++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/btree.c      |   24 +++++++++---------
 3 files changed, 93 insertions(+), 47 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index d02634c44bff..7661d5bc1650 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -2067,8 +2067,7 @@ xfs_btree_get_leaf_keys(
 		for (n = 2; n <= xfs_btree_get_numrecs(block); n++) {
 			rec = xfs_btree_rec_addr(cur, n, block);
 			cur->bc_ops->init_high_key_from_rec(&hkey, rec);
-			if (cur->bc_ops->diff_two_keys(cur, &hkey, &max_hkey)
-					> 0)
+			if (xfs_btree_keycmp_gt(cur, &hkey, &max_hkey))
 				max_hkey = hkey;
 		}
 
@@ -2096,7 +2095,7 @@ xfs_btree_get_node_keys(
 		max_hkey = xfs_btree_high_key_addr(cur, 1, block);
 		for (n = 2; n <= xfs_btree_get_numrecs(block); n++) {
 			hkey = xfs_btree_high_key_addr(cur, n, block);
-			if (cur->bc_ops->diff_two_keys(cur, hkey, max_hkey) > 0)
+			if (xfs_btree_keycmp_gt(cur, hkey, max_hkey))
 				max_hkey = hkey;
 		}
 
@@ -2183,8 +2182,8 @@ __xfs_btree_updkeys(
 		nlkey = xfs_btree_key_addr(cur, ptr, block);
 		nhkey = xfs_btree_high_key_addr(cur, ptr, block);
 		if (!force_all &&
-		    !(cur->bc_ops->diff_two_keys(cur, nlkey, lkey) != 0 ||
-		      cur->bc_ops->diff_two_keys(cur, nhkey, hkey) != 0))
+		    xfs_btree_keycmp_eq(cur, nlkey, lkey) &&
+		    xfs_btree_keycmp_eq(cur, nhkey, hkey))
 			break;
 		xfs_btree_copy_keys(cur, nlkey, lkey, 1);
 		xfs_btree_log_keys(cur, bp, ptr, ptr);
@@ -4702,7 +4701,6 @@ xfs_btree_simple_query_range(
 {
 	union xfs_btree_rec		*recp;
 	union xfs_btree_key		rec_key;
-	int64_t				diff;
 	int				stat;
 	bool				firstrec = true;
 	int				error;
@@ -4732,20 +4730,17 @@ xfs_btree_simple_query_range(
 		if (error || !stat)
 			break;
 
-		/* Skip if high_key(rec) < low_key. */
+		/* Skip if low_key > high_key(rec). */
 		if (firstrec) {
 			cur->bc_ops->init_high_key_from_rec(&rec_key, recp);
 			firstrec = false;
-			diff = cur->bc_ops->diff_two_keys(cur, low_key,
-					&rec_key);
-			if (diff > 0)
+			if (xfs_btree_keycmp_gt(cur, low_key, &rec_key))
 				goto advloop;
 		}
 
-		/* Stop if high_key < low_key(rec). */
+		/* Stop if low_key(rec) > high_key. */
 		cur->bc_ops->init_key_from_rec(&rec_key, recp);
-		diff = cur->bc_ops->diff_two_keys(cur, &rec_key, high_key);
-		if (diff > 0)
+		if (xfs_btree_keycmp_gt(cur, &rec_key, high_key))
 			break;
 
 		/* Callback */
@@ -4799,8 +4794,6 @@ xfs_btree_overlapped_query_range(
 	union xfs_btree_key		*hkp;
 	union xfs_btree_rec		*recp;
 	struct xfs_btree_block		*block;
-	int64_t				ldiff;
-	int64_t				hdiff;
 	int				level;
 	struct xfs_buf			*bp;
 	int				i;
@@ -4840,25 +4833,23 @@ xfs_btree_overlapped_query_range(
 					block);
 
 			cur->bc_ops->init_high_key_from_rec(&rec_hkey, recp);
-			ldiff = cur->bc_ops->diff_two_keys(cur, &rec_hkey,
-					low_key);
-
 			cur->bc_ops->init_key_from_rec(&rec_key, recp);
-			hdiff = cur->bc_ops->diff_two_keys(cur, high_key,
-					&rec_key);
 
 			/*
+			 * If (query's high key < record's low key), then there
+			 * are no more interesting records in this block.  Pop
+			 * up to the leaf level to find more record blocks.
+			 *
 			 * If (record's high key >= query's low key) and
 			 *    (query's high key >= record's low key), then
 			 * this record overlaps the query range; callback.
 			 */
-			if (ldiff >= 0 && hdiff >= 0) {
-				error = fn(cur, recp, priv);
-				if (error)
-					break;
-			} else if (hdiff < 0) {
-				/* Record is larger than high key; pop. */
+			if (xfs_btree_keycmp_lt(cur, high_key, &rec_key))
 				goto pop_up;
+			if (xfs_btree_keycmp_ge(cur, &rec_hkey, low_key)) {
+				error = fn(cur, recp, priv);
+				if (error)
+					break;
 			}
 			cur->bc_levels[level].ptr++;
 			continue;
@@ -4870,15 +4861,18 @@ xfs_btree_overlapped_query_range(
 				block);
 		pp = xfs_btree_ptr_addr(cur, cur->bc_levels[level].ptr, block);
 
-		ldiff = cur->bc_ops->diff_two_keys(cur, hkp, low_key);
-		hdiff = cur->bc_ops->diff_two_keys(cur, high_key, lkp);
-
 		/*
+		 * If (query's high key < pointer's low key), then there are no
+		 * more interesting keys in this block.  Pop up one leaf level
+		 * to continue looking for records.
+		 *
 		 * If (pointer's high key >= query's low key) and
 		 *    (query's high key >= pointer's low key), then
 		 * this record overlaps the query range; follow pointer.
 		 */
-		if (ldiff >= 0 && hdiff >= 0) {
+		if (xfs_btree_keycmp_lt(cur, high_key, lkp))
+			goto pop_up;
+		if (xfs_btree_keycmp_ge(cur, hkp, low_key)) {
 			level--;
 			error = xfs_btree_lookup_get_block(cur, level, pp,
 					&block);
@@ -4893,9 +4887,6 @@ xfs_btree_overlapped_query_range(
 #endif
 			cur->bc_levels[level].ptr = 1;
 			continue;
-		} else if (hdiff < 0) {
-			/* The low key is larger than the upper range; pop. */
-			goto pop_up;
 		}
 		cur->bc_levels[level].ptr++;
 	}
@@ -4957,8 +4948,8 @@ xfs_btree_query_range(
 	xfs_btree_key_from_irec(cur, &high_key, high_rec);
 	xfs_btree_key_from_irec(cur, &low_key, low_rec);
 
-	/* Enforce low key < high key. */
-	if (cur->bc_ops->diff_two_keys(cur, &low_key, &high_key) > 0)
+	/* Enforce low key <= high key. */
+	if (!xfs_btree_keycmp_le(cur, &low_key, &high_key))
 		return -EINVAL;
 
 	if (!(cur->bc_flags & XFS_BTREE_OVERLAPPING))
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index 29c4b4ccb909..f5aa4b893ee7 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -546,6 +546,61 @@ int xfs_btree_has_record(struct xfs_btree_cur *cur,
 bool xfs_btree_has_more_records(struct xfs_btree_cur *cur);
 struct xfs_ifork *xfs_btree_ifork_ptr(struct xfs_btree_cur *cur);
 
+/* Key comparison helpers */
+static inline bool
+xfs_btree_keycmp_lt(
+	struct xfs_btree_cur		*cur,
+	const union xfs_btree_key	*key1,
+	const union xfs_btree_key	*key2)
+{
+	return cur->bc_ops->diff_two_keys(cur, key1, key2) < 0;
+}
+
+static inline bool
+xfs_btree_keycmp_gt(
+	struct xfs_btree_cur		*cur,
+	const union xfs_btree_key	*key1,
+	const union xfs_btree_key	*key2)
+{
+	return cur->bc_ops->diff_two_keys(cur, key1, key2) > 0;
+}
+
+static inline bool
+xfs_btree_keycmp_eq(
+	struct xfs_btree_cur		*cur,
+	const union xfs_btree_key	*key1,
+	const union xfs_btree_key	*key2)
+{
+	return cur->bc_ops->diff_two_keys(cur, key1, key2) == 0;
+}
+
+static inline bool
+xfs_btree_keycmp_le(
+	struct xfs_btree_cur		*cur,
+	const union xfs_btree_key	*key1,
+	const union xfs_btree_key	*key2)
+{
+	return !xfs_btree_keycmp_gt(cur, key1, key2);
+}
+
+static inline bool
+xfs_btree_keycmp_ge(
+	struct xfs_btree_cur		*cur,
+	const union xfs_btree_key	*key1,
+	const union xfs_btree_key	*key2)
+{
+	return !xfs_btree_keycmp_lt(cur, key1, key2);
+}
+
+static inline bool
+xfs_btree_keycmp_ne(
+	struct xfs_btree_cur		*cur,
+	const union xfs_btree_key	*key1,
+	const union xfs_btree_key	*key2)
+{
+	return !xfs_btree_keycmp_eq(cur, key1, key2);
+}
+
 /* Does this cursor point to the last block in the given level? */
 static inline bool
 xfs_btree_islastblock(
diff --git a/fs/xfs/scrub/btree.c b/fs/xfs/scrub/btree.c
index 2dfa3e1d5841..8ae42dff632f 100644
--- a/fs/xfs/scrub/btree.c
+++ b/fs/xfs/scrub/btree.c
@@ -161,20 +161,20 @@ xchk_btree_rec(
 	if (cur->bc_nlevels == 1)
 		return;
 
-	/* Is this at least as large as the parent low key? */
+	/* Is low_key(rec) at least as large as the parent low key? */
 	cur->bc_ops->init_key_from_rec(&key, rec);
 	keyblock = xfs_btree_get_block(cur, 1, &bp);
 	keyp = xfs_btree_key_addr(cur, cur->bc_levels[1].ptr, keyblock);
-	if (cur->bc_ops->diff_two_keys(cur, &key, keyp) < 0)
+	if (xfs_btree_keycmp_lt(cur, &key, keyp))
 		xchk_btree_set_corrupt(bs->sc, cur, 1);
 
 	if (!(cur->bc_flags & XFS_BTREE_OVERLAPPING))
 		return;
 
-	/* Is this no larger than the parent high key? */
+	/* Is high_key(rec) no larger than the parent high key? */
 	cur->bc_ops->init_high_key_from_rec(&hkey, rec);
 	keyp = xfs_btree_high_key_addr(cur, cur->bc_levels[1].ptr, keyblock);
-	if (cur->bc_ops->diff_two_keys(cur, keyp, &hkey) < 0)
+	if (xfs_btree_keycmp_lt(cur, keyp, &hkey))
 		xchk_btree_set_corrupt(bs->sc, cur, 1);
 }
 
@@ -209,20 +209,20 @@ xchk_btree_key(
 	if (level + 1 >= cur->bc_nlevels)
 		return;
 
-	/* Is this at least as large as the parent low key? */
+	/* Is this block's low key at least as large as the parent low key? */
 	keyblock = xfs_btree_get_block(cur, level + 1, &bp);
 	keyp = xfs_btree_key_addr(cur, cur->bc_levels[level + 1].ptr, keyblock);
-	if (cur->bc_ops->diff_two_keys(cur, key, keyp) < 0)
+	if (xfs_btree_keycmp_lt(cur, key, keyp))
 		xchk_btree_set_corrupt(bs->sc, cur, level);
 
 	if (!(cur->bc_flags & XFS_BTREE_OVERLAPPING))
 		return;
 
-	/* Is this no larger than the parent high key? */
+	/* Is this block's high key no larger than the parent high key? */
 	key = xfs_btree_high_key_addr(cur, cur->bc_levels[level].ptr, block);
 	keyp = xfs_btree_high_key_addr(cur, cur->bc_levels[level + 1].ptr,
 			keyblock);
-	if (cur->bc_ops->diff_two_keys(cur, keyp, key) < 0)
+	if (xfs_btree_keycmp_lt(cur, keyp, key))
 		xchk_btree_set_corrupt(bs->sc, cur, level);
 }
 
@@ -557,7 +557,7 @@ xchk_btree_block_check_keys(
 	parent_block = xfs_btree_get_block(cur, level + 1, &bp);
 	parent_low_key = xfs_btree_key_addr(cur, cur->bc_levels[level + 1].ptr,
 			parent_block);
-	if (cur->bc_ops->diff_two_keys(cur, &block_key, parent_low_key)) {
+	if (xfs_btree_keycmp_ne(cur, &block_key, parent_low_key)) {
 		xchk_btree_set_corrupt(bs->sc, bs->cur, level);
 		return;
 	}
@@ -569,7 +569,7 @@ xchk_btree_block_check_keys(
 	parent_high_key = xfs_btree_high_key_addr(cur,
 			cur->bc_levels[level + 1].ptr, parent_block);
 	block_high_key = xfs_btree_high_key_from_key(cur, &block_key);
-	if (cur->bc_ops->diff_two_keys(cur, block_high_key, parent_high_key))
+	if (xfs_btree_keycmp_ne(cur, block_high_key, parent_high_key))
 		xchk_btree_set_corrupt(bs->sc, bs->cur, level);
 }
 
@@ -661,7 +661,7 @@ xchk_btree_block_keys(
 	parent_keys = xfs_btree_key_addr(cur, cur->bc_levels[level + 1].ptr,
 			parent_block);
 
-	if (cur->bc_ops->diff_two_keys(cur, &block_keys, parent_keys) != 0)
+	if (xfs_btree_keycmp_ne(cur, &block_keys, parent_keys))
 		xchk_btree_set_corrupt(bs->sc, cur, 1);
 
 	if (!(cur->bc_flags & XFS_BTREE_OVERLAPPING))
@@ -672,7 +672,7 @@ xchk_btree_block_keys(
 	high_pk = xfs_btree_high_key_addr(cur, cur->bc_levels[level + 1].ptr,
 			parent_block);
 
-	if (cur->bc_ops->diff_two_keys(cur, high_bk, high_pk) != 0)
+	if (xfs_btree_keycmp_ne(cur, high_bk, high_pk))
 		xchk_btree_set_corrupt(bs->sc, cur, 1);
 }
 


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 3/6] xfs: replace xfs_btree_has_record with a general keyspace scanner
  2022-12-30 22:11 ` [PATCHSET v24.0 0/6] xfs: detect incorrect gaps in refcount btree Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 2/6] xfs: refactor ->diff_two_keys callsites Darrick J. Wong
                     ` (4 subsequent siblings)
  5 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

The current implementation of xfs_btree_has_record returns true if it
finds /any/ record within the given range.  Unfortunately, that's not
sufficient for scrub.  We want to be able to tell whether a given range
of a btree's keyspace is devoid of records, is totally mapped to
records, or is somewhere in between.  By forcing the answer into a
boolean, we conflated sparseness and fullness, which caused scrub to
return incorrect results.  Fix the API so that we can tell the caller
which of those three states describes the range.
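
Concretely, the adjacency question that the new ->keys_contiguous method
answers looks roughly like this toy version (plain integers instead of
real btree key fields):

#include <stdio.h>
#include <stdint.h>

enum key_contig { KEY_GAP, KEY_CONTIGUOUS, KEY_OVERLAP };

/*
 * Given the high key of one record and the low key of the next, decide
 * whether there is a gap between them, they are back to back, or they
 * overlap.
 */
static enum key_contig keys_adjacent(uint64_t high, uint64_t low)
{
        high++;                         /* first value after the previous record */
        if (high < low)
                return KEY_GAP;         /* something could fit in between */
        if (high == low)
                return KEY_CONTIGUOUS;  /* records are back to back */
        return KEY_OVERLAP;             /* records share part of the keyspace */
}

int main(void)
{
        printf("%d %d %d\n",
               keys_adjacent(9, 15),    /* GAP:        9 then 15 */
               keys_adjacent(9, 10),    /* CONTIGUOUS: 9 then 10 */
               keys_adjacent(9, 8));    /* OVERLAP:    9 then 8  */
        return 0;
}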

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_alloc.c          |   11 ++--
 fs/xfs/libxfs/xfs_alloc.h          |    4 +
 fs/xfs/libxfs/xfs_alloc_btree.c    |   12 ++++
 fs/xfs/libxfs/xfs_bmap_btree.c     |   11 ++++
 fs/xfs/libxfs/xfs_btree.c          |  108 ++++++++++++++++++++++++++++++++----
 fs/xfs/libxfs/xfs_btree.h          |   44 ++++++++++++++-
 fs/xfs/libxfs/xfs_ialloc_btree.c   |   12 ++++
 fs/xfs/libxfs/xfs_refcount.c       |   11 ++--
 fs/xfs/libxfs/xfs_refcount.h       |    4 +
 fs/xfs/libxfs/xfs_refcount_btree.c |   11 ++++
 fs/xfs/libxfs/xfs_rmap.c           |   12 +++-
 fs/xfs/libxfs/xfs_rmap.h           |    4 +
 fs/xfs/libxfs/xfs_rmap_btree.c     |   16 +++++
 fs/xfs/libxfs/xfs_types.h          |   12 ++++
 fs/xfs/scrub/alloc.c               |    6 +-
 fs/xfs/scrub/refcount.c            |    8 +--
 fs/xfs/scrub/rmap.c                |    6 +-
 17 files changed, 249 insertions(+), 43 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 8dcefff1db33..31f61d88878d 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -3530,13 +3530,16 @@ xfs_alloc_query_all(
 	return xfs_btree_query_all(cur, xfs_alloc_query_range_helper, &query);
 }
 
-/* Is there a record covering a given extent? */
+/*
+ * Scan part of the keyspace of the free space and tell us if the area has no
+ * records, is fully mapped by records, or is partially filled.
+ */
 int
-xfs_alloc_has_record(
+xfs_alloc_has_records(
 	struct xfs_btree_cur	*cur,
 	xfs_agblock_t		bno,
 	xfs_extlen_t		len,
-	bool			*exists)
+	enum xbtree_recpacking	*outcome)
 {
 	union xfs_btree_irec	low;
 	union xfs_btree_irec	high;
@@ -3546,7 +3549,7 @@ xfs_alloc_has_record(
 	memset(&high, 0xFF, sizeof(high));
 	high.a.ar_startblock = bno + len - 1;
 
-	return xfs_btree_has_record(cur, &low, &high, exists);
+	return xfs_btree_has_records(cur, &low, &high, outcome);
 }
 
 /*
diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
index becd06e5d0b8..6d17f8d36a37 100644
--- a/fs/xfs/libxfs/xfs_alloc.h
+++ b/fs/xfs/libxfs/xfs_alloc.h
@@ -202,8 +202,8 @@ int xfs_alloc_query_range(struct xfs_btree_cur *cur,
 int xfs_alloc_query_all(struct xfs_btree_cur *cur, xfs_alloc_query_range_fn fn,
 		void *priv);
 
-int xfs_alloc_has_record(struct xfs_btree_cur *cur, xfs_agblock_t bno,
-		xfs_extlen_t len, bool *exist);
+int xfs_alloc_has_records(struct xfs_btree_cur *cur, xfs_agblock_t bno,
+		xfs_extlen_t len, enum xbtree_recpacking *outcome);
 
 typedef int (*xfs_agfl_walk_fn)(struct xfs_mount *mp, xfs_agblock_t bno,
 		void *priv);
diff --git a/fs/xfs/libxfs/xfs_alloc_btree.c b/fs/xfs/libxfs/xfs_alloc_btree.c
index 0e78e00e02f9..46fe70ab0a0e 100644
--- a/fs/xfs/libxfs/xfs_alloc_btree.c
+++ b/fs/xfs/libxfs/xfs_alloc_btree.c
@@ -423,6 +423,16 @@ xfs_cntbt_recs_inorder(
 		 be32_to_cpu(r2->alloc.ar_startblock));
 }
 
+STATIC enum xbtree_key_contig
+xfs_allocbt_keys_contiguous(
+	struct xfs_btree_cur		*cur,
+	const union xfs_btree_key	*key1,
+	const union xfs_btree_key	*key2)
+{
+	return xbtree_key_contig(be32_to_cpu(key1->alloc.ar_startblock),
+				 be32_to_cpu(key2->alloc.ar_startblock));
+}
+
 static const struct xfs_btree_ops xfs_bnobt_ops = {
 	.rec_len		= sizeof(xfs_alloc_rec_t),
 	.key_len		= sizeof(xfs_alloc_key_t),
@@ -443,6 +453,7 @@ static const struct xfs_btree_ops xfs_bnobt_ops = {
 	.diff_two_keys		= xfs_bnobt_diff_two_keys,
 	.keys_inorder		= xfs_bnobt_keys_inorder,
 	.recs_inorder		= xfs_bnobt_recs_inorder,
+	.keys_contiguous	= xfs_allocbt_keys_contiguous,
 };
 
 static const struct xfs_btree_ops xfs_cntbt_ops = {
@@ -465,6 +476,7 @@ static const struct xfs_btree_ops xfs_cntbt_ops = {
 	.diff_two_keys		= xfs_cntbt_diff_two_keys,
 	.keys_inorder		= xfs_cntbt_keys_inorder,
 	.recs_inorder		= xfs_cntbt_recs_inorder,
+	.keys_contiguous	= NULL, /* not needed right now */
 };
 
 /* Allocate most of a new allocation btree cursor. */
diff --git a/fs/xfs/libxfs/xfs_bmap_btree.c b/fs/xfs/libxfs/xfs_bmap_btree.c
index cfa052d40105..45b5696fe8cb 100644
--- a/fs/xfs/libxfs/xfs_bmap_btree.c
+++ b/fs/xfs/libxfs/xfs_bmap_btree.c
@@ -518,6 +518,16 @@ xfs_bmbt_recs_inorder(
 		xfs_bmbt_disk_get_startoff(&r2->bmbt);
 }
 
+STATIC enum xbtree_key_contig
+xfs_bmbt_keys_contiguous(
+	struct xfs_btree_cur		*cur,
+	const union xfs_btree_key	*key1,
+	const union xfs_btree_key	*key2)
+{
+	return xbtree_key_contig(be64_to_cpu(key1->bmbt.br_startoff),
+				 be64_to_cpu(key2->bmbt.br_startoff));
+}
+
 static const struct xfs_btree_ops xfs_bmbt_ops = {
 	.rec_len		= sizeof(xfs_bmbt_rec_t),
 	.key_len		= sizeof(xfs_bmbt_key_t),
@@ -538,6 +548,7 @@ static const struct xfs_btree_ops xfs_bmbt_ops = {
 	.buf_ops		= &xfs_bmbt_buf_ops,
 	.keys_inorder		= xfs_bmbt_keys_inorder,
 	.recs_inorder		= xfs_bmbt_recs_inorder,
+	.keys_contiguous	= xfs_bmbt_keys_contiguous,
 };
 
 /*
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index 7661d5bc1650..2258af10e41a 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -5011,34 +5011,116 @@ xfs_btree_diff_two_ptrs(
 	return (int64_t)be32_to_cpu(a->s) - be32_to_cpu(b->s);
 }
 
-/* If there's an extent, we're done. */
+struct xfs_btree_has_records {
+	/* Keys for the start and end of the range we want to know about. */
+	union xfs_btree_key		start_key;
+	union xfs_btree_key		end_key;
+
+	/* Highest record key we've seen so far. */
+	union xfs_btree_key		high_key;
+
+	enum xbtree_recpacking		outcome;
+};
+
 STATIC int
-xfs_btree_has_record_helper(
+xfs_btree_has_records_helper(
 	struct xfs_btree_cur		*cur,
 	const union xfs_btree_rec	*rec,
 	void				*priv)
 {
-	return -ECANCELED;
+	union xfs_btree_key		rec_key;
+	union xfs_btree_key		rec_high_key;
+	struct xfs_btree_has_records	*info = priv;
+	enum xbtree_key_contig		key_contig;
+
+	cur->bc_ops->init_key_from_rec(&rec_key, rec);
+
+	if (info->outcome == XBTREE_RECPACKING_EMPTY) {
+		info->outcome = XBTREE_RECPACKING_SPARSE;
+
+		/*
+		 * If the first record we find does not overlap the start key,
+		 * then there is a hole at the start of the search range.
+		 * Classify this as sparse and stop immediately.
+		 */
+		if (xfs_btree_keycmp_lt(cur, &info->start_key, &rec_key))
+			return -ECANCELED;
+	} else {
+		/*
+		 * If a subsequent record does not overlap with the any record
+		 * we've seen so far, there is a hole in the middle of the
+		 * search range.  Classify this as sparse and stop.
+		 * If the keys overlap and this btree does not allow overlap,
+		 * signal corruption.
+		 */
+		key_contig = cur->bc_ops->keys_contiguous(cur, &info->high_key,
+					&rec_key);
+		if (key_contig == XBTREE_KEY_OVERLAP &&
+				!(cur->bc_flags & XFS_BTREE_OVERLAPPING))
+			return -EFSCORRUPTED;
+		if (key_contig == XBTREE_KEY_GAP)
+			return -ECANCELED;
+	}
+
+	/*
+	 * If high_key(rec) is larger than any other high key we've seen,
+	 * remember it for later.
+	 */
+	cur->bc_ops->init_high_key_from_rec(&rec_high_key, rec);
+	if (xfs_btree_keycmp_gt(cur, &rec_high_key, &info->high_key))
+		info->high_key = rec_high_key; /* struct copy */
+
+	return 0;
 }
 
-/* Is there a record covering a given range of keys? */
+/*
+ * Scan part of the keyspace of a btree and tell us if that keyspace does not
+ * map to any records; is fully mapped to records; or is partially mapped to
+ * records.  This is the btree record equivalent to determining if a file is
+ * sparse.
+ */
 int
-xfs_btree_has_record(
+xfs_btree_has_records(
 	struct xfs_btree_cur		*cur,
 	const union xfs_btree_irec	*low,
 	const union xfs_btree_irec	*high,
-	bool				*exists)
+	enum xbtree_recpacking		*outcome)
 {
+	struct xfs_btree_has_records	info = {
+		.outcome		= XBTREE_RECPACKING_EMPTY,
+	};
 	int				error;
 
+	/* Not all btrees support this operation. */
+	if (!cur->bc_ops->keys_contiguous) {
+		ASSERT(0);
+		return -EOPNOTSUPP;
+	}
+
+	xfs_btree_key_from_irec(cur, &info.start_key, low);
+	xfs_btree_key_from_irec(cur, &info.end_key, high);
+
 	error = xfs_btree_query_range(cur, low, high,
-			&xfs_btree_has_record_helper, NULL);
-	if (error == -ECANCELED) {
-		*exists = true;
-		return 0;
-	}
-	*exists = false;
-	return error;
+			xfs_btree_has_records_helper, &info);
+	if (error == -ECANCELED)
+		goto out;
+	if (error)
+		return error;
+
+	if (info.outcome == XBTREE_RECPACKING_EMPTY)
+		goto out;
+
+	/*
+	 * If the largest high_key(rec) we saw during the walk is greater than
+	 * the end of the search range, classify this as full.  Otherwise,
+	 * there is a hole at the end of the search range.
+	 */
+	if (xfs_btree_keycmp_ge(cur, &info.high_key, &info.end_key))
+		info.outcome = XBTREE_RECPACKING_FULL;
+
+out:
+	*outcome = info.outcome;
+	return 0;
 }
 
 /* Are there more records in this btree? */
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index f5aa4b893ee7..66431f351bb2 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -90,6 +90,27 @@ uint32_t xfs_btree_magic(int crc, xfs_btnum_t btnum);
 #define XFS_BTREE_STATS_ADD(cur, stat, val)	\
 	XFS_STATS_ADD_OFF((cur)->bc_mp, (cur)->bc_statoff + __XBTS_ ## stat, val)
 
+enum xbtree_key_contig {
+	XBTREE_KEY_GAP = 0,
+	XBTREE_KEY_CONTIGUOUS,
+	XBTREE_KEY_OVERLAP,
+};
+
+/*
+ * Decide if these two numeric btree key fields are contiguous, overlapping,
+ * or if there's a gap between them.  @x should be the field from the high
+ * key and @y should be the field from the low key.
+ */
+static inline enum xbtree_key_contig xbtree_key_contig(uint64_t x, uint64_t y)
+{
+	x++;
+	if (x < y)
+		return XBTREE_KEY_GAP;
+	if (x == y)
+		return XBTREE_KEY_CONTIGUOUS;
+	return XBTREE_KEY_OVERLAP;
+}
+
 struct xfs_btree_ops {
 	/* size of the key and record structures */
 	size_t	key_len;
@@ -157,6 +178,19 @@ struct xfs_btree_ops {
 	int	(*recs_inorder)(struct xfs_btree_cur *cur,
 				const union xfs_btree_rec *r1,
 				const union xfs_btree_rec *r2);
+
+	/*
+	 * Are these two btree keys immediately adjacent?
+	 *
+	 * Given two btree keys @key1 and @key2, decide if it is impossible for
+	 * there to be a third btree key K satisfying the relationship
+	 * @key1 < K < @key2.  To determine if two btree records are
+	 * immediately adjacent, @key1 should be the high key of the first
+	 * record and @key2 should be the low key of the second record.
+	 */
+	enum xbtree_key_contig (*keys_contiguous)(struct xfs_btree_cur *cur,
+			       const union xfs_btree_key *key1,
+			       const union xfs_btree_key *key2);
 };
 
 /*
@@ -540,9 +574,15 @@ void xfs_btree_get_keys(struct xfs_btree_cur *cur,
 		struct xfs_btree_block *block, union xfs_btree_key *key);
 union xfs_btree_key *xfs_btree_high_key_from_key(struct xfs_btree_cur *cur,
 		union xfs_btree_key *key);
-int xfs_btree_has_record(struct xfs_btree_cur *cur,
+typedef bool (*xfs_btree_key_gap_fn)(struct xfs_btree_cur *cur,
+		const union xfs_btree_key *key1,
+		const union xfs_btree_key *key2);
+
+int xfs_btree_has_records(struct xfs_btree_cur *cur,
 		const union xfs_btree_irec *low,
-		const union xfs_btree_irec *high, bool *exists);
+		const union xfs_btree_irec *high,
+		enum xbtree_recpacking *outcome);
+
 bool xfs_btree_has_more_records(struct xfs_btree_cur *cur);
 struct xfs_ifork *xfs_btree_ifork_ptr(struct xfs_btree_cur *cur);
 
diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
index e849faae405a..e59bd6d3db03 100644
--- a/fs/xfs/libxfs/xfs_ialloc_btree.c
+++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
@@ -383,6 +383,16 @@ xfs_inobt_recs_inorder(
 		be32_to_cpu(r2->inobt.ir_startino);
 }
 
+STATIC enum xbtree_key_contig
+xfs_inobt_keys_contiguous(
+	struct xfs_btree_cur		*cur,
+	const union xfs_btree_key	*key1,
+	const union xfs_btree_key	*key2)
+{
+	return xbtree_key_contig(be32_to_cpu(key1->inobt.ir_startino),
+				 be32_to_cpu(key2->inobt.ir_startino));
+}
+
 static const struct xfs_btree_ops xfs_inobt_ops = {
 	.rec_len		= sizeof(xfs_inobt_rec_t),
 	.key_len		= sizeof(xfs_inobt_key_t),
@@ -402,6 +412,7 @@ static const struct xfs_btree_ops xfs_inobt_ops = {
 	.diff_two_keys		= xfs_inobt_diff_two_keys,
 	.keys_inorder		= xfs_inobt_keys_inorder,
 	.recs_inorder		= xfs_inobt_recs_inorder,
+	.keys_contiguous	= xfs_inobt_keys_contiguous,
 };
 
 static const struct xfs_btree_ops xfs_finobt_ops = {
@@ -423,6 +434,7 @@ static const struct xfs_btree_ops xfs_finobt_ops = {
 	.diff_two_keys		= xfs_inobt_diff_two_keys,
 	.keys_inorder		= xfs_inobt_keys_inorder,
 	.recs_inorder		= xfs_inobt_recs_inorder,
+	.keys_contiguous	= xfs_inobt_keys_contiguous,
 };
 
 /*
diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
index 335f84bef81c..94377b59ba44 100644
--- a/fs/xfs/libxfs/xfs_refcount.c
+++ b/fs/xfs/libxfs/xfs_refcount.c
@@ -1998,14 +1998,17 @@ xfs_refcount_recover_cow_leftovers(
 	return error;
 }
 
-/* Is there a record covering a given extent? */
+/*
+ * Scan part of the keyspace of the refcount records and tell us if the area
+ * has no records, is fully mapped by records, or is partially filled.
+ */
 int
-xfs_refcount_has_record(
+xfs_refcount_has_records(
 	struct xfs_btree_cur	*cur,
 	enum xfs_refc_domain	domain,
 	xfs_agblock_t		bno,
 	xfs_extlen_t		len,
-	bool			*exists)
+	enum xbtree_recpacking	*outcome)
 {
 	union xfs_btree_irec	low;
 	union xfs_btree_irec	high;
@@ -2016,7 +2019,7 @@ xfs_refcount_has_record(
 	high.rc.rc_startblock = bno + len - 1;
 	low.rc.rc_domain = high.rc.rc_domain = domain;
 
-	return xfs_btree_has_record(cur, &low, &high, exists);
+	return xfs_btree_has_records(cur, &low, &high, outcome);
 }
 
 int __init
diff --git a/fs/xfs/libxfs/xfs_refcount.h b/fs/xfs/libxfs/xfs_refcount.h
index fc0b58d4c379..783cd89ca195 100644
--- a/fs/xfs/libxfs/xfs_refcount.h
+++ b/fs/xfs/libxfs/xfs_refcount.h
@@ -111,9 +111,9 @@ extern int xfs_refcount_recover_cow_leftovers(struct xfs_mount *mp,
  */
 #define XFS_REFCOUNT_ITEM_OVERHEAD	32
 
-extern int xfs_refcount_has_record(struct xfs_btree_cur *cur,
+extern int xfs_refcount_has_records(struct xfs_btree_cur *cur,
 		enum xfs_refc_domain domain, xfs_agblock_t bno,
-		xfs_extlen_t len, bool *exists);
+		xfs_extlen_t len, enum xbtree_recpacking *outcome);
 union xfs_btree_rec;
 extern void xfs_refcount_btrec_to_irec(const union xfs_btree_rec *rec,
 		struct xfs_refcount_irec *irec);
diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c
index f5bdac3cf19f..26e28ac24238 100644
--- a/fs/xfs/libxfs/xfs_refcount_btree.c
+++ b/fs/xfs/libxfs/xfs_refcount_btree.c
@@ -300,6 +300,16 @@ xfs_refcountbt_recs_inorder(
 		be32_to_cpu(r2->refc.rc_startblock);
 }
 
+STATIC enum xbtree_key_contig
+xfs_refcountbt_keys_contiguous(
+	struct xfs_btree_cur		*cur,
+	const union xfs_btree_key	*key1,
+	const union xfs_btree_key	*key2)
+{
+	return xbtree_key_contig(be32_to_cpu(key1->refc.rc_startblock),
+				 be32_to_cpu(key2->refc.rc_startblock));
+}
+
 static const struct xfs_btree_ops xfs_refcountbt_ops = {
 	.rec_len		= sizeof(struct xfs_refcount_rec),
 	.key_len		= sizeof(struct xfs_refcount_key),
@@ -319,6 +329,7 @@ static const struct xfs_btree_ops xfs_refcountbt_ops = {
 	.diff_two_keys		= xfs_refcountbt_diff_two_keys,
 	.keys_inorder		= xfs_refcountbt_keys_inorder,
 	.recs_inorder		= xfs_refcountbt_recs_inorder,
+	.keys_contiguous	= xfs_refcountbt_keys_contiguous,
 };
 
 /*
diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c
index da008d317f83..e616b964f11c 100644
--- a/fs/xfs/libxfs/xfs_rmap.c
+++ b/fs/xfs/libxfs/xfs_rmap.c
@@ -2709,13 +2709,17 @@ xfs_rmap_compare(
 		return 0;
 }
 
-/* Is there a record covering a given extent? */
+/*
+ * Scan the physical storage part of the keyspace of the reverse mapping index
+ * and tell us if the area has no records, is fully mapped by records, or is
+ * partially filled.
+ */
 int
-xfs_rmap_has_record(
+xfs_rmap_has_records(
 	struct xfs_btree_cur	*cur,
 	xfs_agblock_t		bno,
 	xfs_extlen_t		len,
-	bool			*exists)
+	enum xbtree_recpacking	*outcome)
 {
 	union xfs_btree_irec	low;
 	union xfs_btree_irec	high;
@@ -2725,7 +2729,7 @@ xfs_rmap_has_record(
 	memset(&high, 0xFF, sizeof(high));
 	high.r.rm_startblock = bno + len - 1;
 
-	return xfs_btree_has_record(cur, &low, &high, exists);
+	return xfs_btree_has_records(cur, &low, &high, outcome);
 }
 
 /*
diff --git a/fs/xfs/libxfs/xfs_rmap.h b/fs/xfs/libxfs/xfs_rmap.h
index 7fb298bcc15f..4cbe50cf522e 100644
--- a/fs/xfs/libxfs/xfs_rmap.h
+++ b/fs/xfs/libxfs/xfs_rmap.h
@@ -198,8 +198,8 @@ xfs_failaddr_t xfs_rmap_btrec_to_irec(const union xfs_btree_rec *rec,
 xfs_failaddr_t xfs_rmap_check_irec(struct xfs_btree_cur *cur,
 		const struct xfs_rmap_irec *irec);
 
-int xfs_rmap_has_record(struct xfs_btree_cur *cur, xfs_agblock_t bno,
-		xfs_extlen_t len, bool *exists);
+int xfs_rmap_has_records(struct xfs_btree_cur *cur, xfs_agblock_t bno,
+		xfs_extlen_t len, enum xbtree_recpacking *outcome);
 int xfs_rmap_record_exists(struct xfs_btree_cur *cur, xfs_agblock_t bno,
 		xfs_extlen_t len, const struct xfs_owner_info *oinfo,
 		bool *has_rmap);
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
index e18f89a68da9..1733865026d4 100644
--- a/fs/xfs/libxfs/xfs_rmap_btree.c
+++ b/fs/xfs/libxfs/xfs_rmap_btree.c
@@ -444,6 +444,21 @@ xfs_rmapbt_recs_inorder(
 	return 0;
 }
 
+STATIC enum xbtree_key_contig
+xfs_rmapbt_keys_contiguous(
+	struct xfs_btree_cur		*cur,
+	const union xfs_btree_key	*key1,
+	const union xfs_btree_key	*key2)
+{
+	/*
+	 * We only support checking contiguity of the physical space component.
+	 * If any callers ever need more specificity than that, they'll have to
+	 * implement it here.
+	 */
+	return xbtree_key_contig(be32_to_cpu(key1->rmap.rm_startblock),
+				 be32_to_cpu(key2->rmap.rm_startblock));
+}
+
 static const struct xfs_btree_ops xfs_rmapbt_ops = {
 	.rec_len		= sizeof(struct xfs_rmap_rec),
 	.key_len		= 2 * sizeof(struct xfs_rmap_key),
@@ -463,6 +478,7 @@ static const struct xfs_btree_ops xfs_rmapbt_ops = {
 	.diff_two_keys		= xfs_rmapbt_diff_two_keys,
 	.keys_inorder		= xfs_rmapbt_keys_inorder,
 	.recs_inorder		= xfs_rmapbt_recs_inorder,
+	.keys_contiguous	= xfs_rmapbt_keys_contiguous,
 };
 
 static struct xfs_btree_cur *
diff --git a/fs/xfs/libxfs/xfs_types.h b/fs/xfs/libxfs/xfs_types.h
index 5ebdda7e1078..851220021484 100644
--- a/fs/xfs/libxfs/xfs_types.h
+++ b/fs/xfs/libxfs/xfs_types.h
@@ -204,6 +204,18 @@ enum xfs_ag_resv_type {
 	XFS_AG_RESV_RMAPBT,
 };
 
+/* Results of scanning a btree keyspace to check occupancy. */
+enum xbtree_recpacking {
+	/* None of the keyspace maps to records. */
+	XBTREE_RECPACKING_EMPTY = 0,
+
+	/* Some, but not all, of the keyspace maps to records. */
+	XBTREE_RECPACKING_SPARSE,
+
+	/* The entire keyspace maps to records. */
+	XBTREE_RECPACKING_FULL,
+};
+
 /*
  * Type verifier functions
  */
diff --git a/fs/xfs/scrub/alloc.c b/fs/xfs/scrub/alloc.c
index fb4f96716f6a..c72001f6bad9 100644
--- a/fs/xfs/scrub/alloc.c
+++ b/fs/xfs/scrub/alloc.c
@@ -144,15 +144,15 @@ xchk_xref_is_used_space(
 	xfs_agblock_t		agbno,
 	xfs_extlen_t		len)
 {
-	bool			is_freesp;
+	enum xbtree_recpacking	outcome;
 	int			error;
 
 	if (!sc->sa.bno_cur || xchk_skip_xref(sc->sm))
 		return;
 
-	error = xfs_alloc_has_record(sc->sa.bno_cur, agbno, len, &is_freesp);
+	error = xfs_alloc_has_records(sc->sa.bno_cur, agbno, len, &outcome);
 	if (!xchk_should_check_xref(sc, &error, &sc->sa.bno_cur))
 		return;
-	if (is_freesp)
+	if (outcome != XBTREE_RECPACKING_EMPTY)
 		xchk_btree_xref_set_corrupt(sc, sc->sa.bno_cur, 0);
 }
diff --git a/fs/xfs/scrub/refcount.c b/fs/xfs/scrub/refcount.c
index c2ae5a328a6d..220b2850659e 100644
--- a/fs/xfs/scrub/refcount.c
+++ b/fs/xfs/scrub/refcount.c
@@ -457,16 +457,16 @@ xchk_xref_is_not_shared(
 	xfs_agblock_t		agbno,
 	xfs_extlen_t		len)
 {
-	bool			shared;
+	enum xbtree_recpacking	outcome;
 	int			error;
 
 	if (!sc->sa.refc_cur || xchk_skip_xref(sc->sm))
 		return;
 
-	error = xfs_refcount_has_record(sc->sa.refc_cur, XFS_REFC_DOMAIN_SHARED,
-			agbno, len, &shared);
+	error = xfs_refcount_has_records(sc->sa.refc_cur,
+			XFS_REFC_DOMAIN_SHARED, agbno, len, &outcome);
 	if (!xchk_should_check_xref(sc, &error, &sc->sa.refc_cur))
 		return;
-	if (shared)
+	if (outcome != XBTREE_RECPACKING_EMPTY)
 		xchk_btree_xref_set_corrupt(sc, sc->sa.refc_cur, 0);
 }
diff --git a/fs/xfs/scrub/rmap.c b/fs/xfs/scrub/rmap.c
index 215730a9d9bf..9ac3bc760d6c 100644
--- a/fs/xfs/scrub/rmap.c
+++ b/fs/xfs/scrub/rmap.c
@@ -219,15 +219,15 @@ xchk_xref_has_no_owner(
 	xfs_agblock_t		bno,
 	xfs_extlen_t		len)
 {
-	bool			has_rmap;
+	enum xbtree_recpacking	outcome;
 	int			error;
 
 	if (!sc->sa.rmap_cur || xchk_skip_xref(sc->sm))
 		return;
 
-	error = xfs_rmap_has_record(sc->sa.rmap_cur, bno, len, &has_rmap);
+	error = xfs_rmap_has_records(sc->sa.rmap_cur, bno, len, &outcome);
 	if (!xchk_should_check_xref(sc, &error, &sc->sa.rmap_cur))
 		return;
-	if (has_rmap)
+	if (outcome != XBTREE_RECPACKING_EMPTY)
 		xchk_btree_xref_set_corrupt(sc, sc->sa.rmap_cur, 0);
 }


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 4/6] xfs: implement masked btree key comparisons for _has_records scans
  2022-12-30 22:11 ` [PATCHSET v24.0 0/6] xfs: detect incorrect gaps in refcount btree Darrick J. Wong
                     ` (3 preceding siblings ...)
  2022-12-30 22:11   ` [PATCH 5/6] xfs: check the reference counts of gaps in the refcount btree Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 6/6] xfs: ensure that all metadata and data blocks are not cow staging extents Darrick J. Wong
  5 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

For keyspace fullness scans, we want to be able to mask off the parts
of the key that we don't care about.  For most btree types we /do/ want
the full keyspace, but for checking that a given range of space is fully
covered by rmapbt records (even if they have different or multiple
owners) we need this masking so that we only track the sparseness of
rm_startblock, not of the whole keyspace (which is extremely sparse).

Augment the ->diff_two_keys and ->keys_contiguous helpers to take a
third union xfs_btree_key argument, and wire up xfs_rmap_has_records to
pass this through.  This third "mask" argument should contain a nonzero
value in each structure field that should be used in the key comparisons
done during the scan.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_alloc.c          |    2 +
 fs/xfs/libxfs/xfs_alloc_btree.c    |   18 ++++++++++---
 fs/xfs/libxfs/xfs_bmap_btree.c     |   10 ++++++-
 fs/xfs/libxfs/xfs_btree.c          |   24 ++++++++++++++---
 fs/xfs/libxfs/xfs_btree.h          |   50 ++++++++++++++++++++++++++++++++----
 fs/xfs/libxfs/xfs_ialloc_btree.c   |   12 ++++++---
 fs/xfs/libxfs/xfs_refcount.c       |    2 +
 fs/xfs/libxfs/xfs_refcount_btree.c |   12 ++++++---
 fs/xfs/libxfs/xfs_rmap.c           |    5 +++-
 fs/xfs/libxfs/xfs_rmap_btree.c     |   47 +++++++++++++++++++++++-----------
 10 files changed, 142 insertions(+), 40 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 31f61d88878d..e0ddae7a62ec 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -3549,7 +3549,7 @@ xfs_alloc_has_records(
 	memset(&high, 0xFF, sizeof(high));
 	high.a.ar_startblock = bno + len - 1;
 
-	return xfs_btree_has_records(cur, &low, &high, outcome);
+	return xfs_btree_has_records(cur, &low, &high, NULL, outcome);
 }
 
 /*
diff --git a/fs/xfs/libxfs/xfs_alloc_btree.c b/fs/xfs/libxfs/xfs_alloc_btree.c
index 46fe70ab0a0e..a91e2a81ba2c 100644
--- a/fs/xfs/libxfs/xfs_alloc_btree.c
+++ b/fs/xfs/libxfs/xfs_alloc_btree.c
@@ -260,20 +260,27 @@ STATIC int64_t
 xfs_bnobt_diff_two_keys(
 	struct xfs_btree_cur		*cur,
 	const union xfs_btree_key	*k1,
-	const union xfs_btree_key	*k2)
+	const union xfs_btree_key	*k2,
+	const union xfs_btree_key	*mask)
 {
+	ASSERT(!mask || mask->alloc.ar_startblock);
+
 	return (int64_t)be32_to_cpu(k1->alloc.ar_startblock) -
-			  be32_to_cpu(k2->alloc.ar_startblock);
+			be32_to_cpu(k2->alloc.ar_startblock);
 }
 
 STATIC int64_t
 xfs_cntbt_diff_two_keys(
 	struct xfs_btree_cur		*cur,
 	const union xfs_btree_key	*k1,
-	const union xfs_btree_key	*k2)
+	const union xfs_btree_key	*k2,
+	const union xfs_btree_key	*mask)
 {
 	int64_t				diff;
 
+	ASSERT(!mask || (mask->alloc.ar_blockcount &&
+			 mask->alloc.ar_startblock));
+
 	diff =  be32_to_cpu(k1->alloc.ar_blockcount) -
 		be32_to_cpu(k2->alloc.ar_blockcount);
 	if (diff)
@@ -427,8 +434,11 @@ STATIC enum xbtree_key_contig
 xfs_allocbt_keys_contiguous(
 	struct xfs_btree_cur		*cur,
 	const union xfs_btree_key	*key1,
-	const union xfs_btree_key	*key2)
+	const union xfs_btree_key	*key2,
+	const union xfs_btree_key	*mask)
 {
+	ASSERT(!mask || mask->alloc.ar_startblock);
+
 	return xbtree_key_contig(be32_to_cpu(key1->alloc.ar_startblock),
 				 be32_to_cpu(key2->alloc.ar_startblock));
 }
diff --git a/fs/xfs/libxfs/xfs_bmap_btree.c b/fs/xfs/libxfs/xfs_bmap_btree.c
index 45b5696fe8cb..e53c5bd42e86 100644
--- a/fs/xfs/libxfs/xfs_bmap_btree.c
+++ b/fs/xfs/libxfs/xfs_bmap_btree.c
@@ -400,11 +400,14 @@ STATIC int64_t
 xfs_bmbt_diff_two_keys(
 	struct xfs_btree_cur		*cur,
 	const union xfs_btree_key	*k1,
-	const union xfs_btree_key	*k2)
+	const union xfs_btree_key	*k2,
+	const union xfs_btree_key	*mask)
 {
 	uint64_t			a = be64_to_cpu(k1->bmbt.br_startoff);
 	uint64_t			b = be64_to_cpu(k2->bmbt.br_startoff);
 
+	ASSERT(!mask || mask->bmbt.br_startoff);
+
 	/*
 	 * Note: This routine previously casted a and b to int64 and subtracted
 	 * them to generate a result.  This lead to problems if b was the
@@ -522,8 +525,11 @@ STATIC enum xbtree_key_contig
 xfs_bmbt_keys_contiguous(
 	struct xfs_btree_cur		*cur,
 	const union xfs_btree_key	*key1,
-	const union xfs_btree_key	*key2)
+	const union xfs_btree_key	*key2,
+	const union xfs_btree_key	*mask)
 {
+	ASSERT(!mask || mask->bmbt.br_startoff);
+
 	return xbtree_key_contig(be64_to_cpu(key1->bmbt.br_startoff),
 				 be64_to_cpu(key2->bmbt.br_startoff));
 }
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index 2258af10e41a..99b79de7efcd 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -5016,6 +5016,9 @@ struct xfs_btree_has_records {
 	union xfs_btree_key		start_key;
 	union xfs_btree_key		end_key;
 
+	/* Mask for key comparisons, if desired. */
+	const union xfs_btree_key	*key_mask;
+
 	/* Highest record key we've seen so far. */
 	union xfs_btree_key		high_key;
 
@@ -5043,7 +5046,8 @@ xfs_btree_has_records_helper(
 		 * then there is a hole at the start of the search range.
 		 * Classify this as sparse and stop immediately.
 		 */
-		if (xfs_btree_keycmp_lt(cur, &info->start_key, &rec_key))
+		if (xfs_btree_masked_keycmp_lt(cur, &info->start_key, &rec_key,
+					info->key_mask))
 			return -ECANCELED;
 	} else {
 		/*
@@ -5054,7 +5058,7 @@ xfs_btree_has_records_helper(
 		 * signal corruption.
 		 */
 		key_contig = cur->bc_ops->keys_contiguous(cur, &info->high_key,
-					&rec_key);
+					&rec_key, info->key_mask);
 		if (key_contig == XBTREE_KEY_OVERLAP &&
 				!(cur->bc_flags & XFS_BTREE_OVERLAPPING))
 			return -EFSCORRUPTED;
@@ -5067,7 +5071,8 @@ xfs_btree_has_records_helper(
 	 * remember it for later.
 	 */
 	cur->bc_ops->init_high_key_from_rec(&rec_high_key, rec);
-	if (xfs_btree_keycmp_gt(cur, &rec_high_key, &info->high_key))
+	if (xfs_btree_masked_keycmp_gt(cur, &rec_high_key, &info->high_key,
+				info->key_mask))
 		info->high_key = rec_high_key; /* struct copy */
 
 	return 0;
@@ -5078,16 +5083,26 @@ xfs_btree_has_records_helper(
  * map to any records; is fully mapped to records; or is partially mapped to
  * records.  This is the btree record equivalent to determining if a file is
  * sparse.
+ *
+ * For most btree types, the record scan should use all available btree key
+ * fields to compare the keys encountered.  These callers should pass NULL for
+ * @mask.  However, some callers (e.g.  scanning physical space in the rmapbt)
+ * want to ignore some part of the btree record keyspace when performing the
+ * comparison.  These callers should pass in a union xfs_btree_key object with
+ * the fields that *should* be a part of the comparison set to any nonzero
+ * value, and the rest zeroed.
  */
 int
 xfs_btree_has_records(
 	struct xfs_btree_cur		*cur,
 	const union xfs_btree_irec	*low,
 	const union xfs_btree_irec	*high,
+	const union xfs_btree_key	*mask,
 	enum xbtree_recpacking		*outcome)
 {
 	struct xfs_btree_has_records	info = {
 		.outcome		= XBTREE_RECPACKING_EMPTY,
+		.key_mask		= mask,
 	};
 	int				error;
 
@@ -5115,7 +5130,8 @@ xfs_btree_has_records(
 	 * the end of the search range, classify this as full.  Otherwise,
 	 * there is a hole at the end of the search range.
 	 */
-	if (xfs_btree_keycmp_ge(cur, &info.high_key, &info.end_key))
+	if (xfs_btree_masked_keycmp_ge(cur, &info.high_key, &info.end_key,
+				mask))
 		info.outcome = XBTREE_RECPACKING_FULL;
 
 out:
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index 66431f351bb2..a2aa36b23e25 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -161,11 +161,14 @@ struct xfs_btree_ops {
 
 	/*
 	 * Difference between key2 and key1 -- positive if key1 > key2,
-	 * negative if key1 < key2, and zero if equal.
+	 * negative if key1 < key2, and zero if equal.  If the @mask parameter
+	 * is non NULL, each key field to be used in the comparison must
+	 * contain a nonzero value.
 	 */
 	int64_t (*diff_two_keys)(struct xfs_btree_cur *cur,
 				 const union xfs_btree_key *key1,
-				 const union xfs_btree_key *key2);
+				 const union xfs_btree_key *key2,
+				 const union xfs_btree_key *mask);
 
 	const struct xfs_buf_ops	*buf_ops;
 
@@ -187,10 +190,13 @@ struct xfs_btree_ops {
 	 * @key1 < K < @key2.  To determine if two btree records are
 	 * immediately adjacent, @key1 should be the high key of the first
 	 * record and @key2 should be the low key of the second record.
+	 * If the @mask parameter is non NULL, each key field to be used in the
+	 * comparison must contain a nonzero value.
 	 */
 	enum xbtree_key_contig (*keys_contiguous)(struct xfs_btree_cur *cur,
 			       const union xfs_btree_key *key1,
-			       const union xfs_btree_key *key2);
+			       const union xfs_btree_key *key2,
+			       const union xfs_btree_key *mask);
 };
 
 /*
@@ -581,6 +587,7 @@ typedef bool (*xfs_btree_key_gap_fn)(struct xfs_btree_cur *cur,
 int xfs_btree_has_records(struct xfs_btree_cur *cur,
 		const union xfs_btree_irec *low,
 		const union xfs_btree_irec *high,
+		const union xfs_btree_key *mask,
 		enum xbtree_recpacking *outcome);
 
 bool xfs_btree_has_more_records(struct xfs_btree_cur *cur);
@@ -593,7 +600,7 @@ xfs_btree_keycmp_lt(
 	const union xfs_btree_key	*key1,
 	const union xfs_btree_key	*key2)
 {
-	return cur->bc_ops->diff_two_keys(cur, key1, key2) < 0;
+	return cur->bc_ops->diff_two_keys(cur, key1, key2, NULL) < 0;
 }
 
 static inline bool
@@ -602,7 +609,7 @@ xfs_btree_keycmp_gt(
 	const union xfs_btree_key	*key1,
 	const union xfs_btree_key	*key2)
 {
-	return cur->bc_ops->diff_two_keys(cur, key1, key2) > 0;
+	return cur->bc_ops->diff_two_keys(cur, key1, key2, NULL) > 0;
 }
 
 static inline bool
@@ -611,7 +618,7 @@ xfs_btree_keycmp_eq(
 	const union xfs_btree_key	*key1,
 	const union xfs_btree_key	*key2)
 {
-	return cur->bc_ops->diff_two_keys(cur, key1, key2) == 0;
+	return cur->bc_ops->diff_two_keys(cur, key1, key2, NULL) == 0;
 }
 
 static inline bool
@@ -641,6 +648,37 @@ xfs_btree_keycmp_ne(
 	return !xfs_btree_keycmp_eq(cur, key1, key2);
 }
 
+/* Masked key comparison helpers */
+static inline bool
+xfs_btree_masked_keycmp_lt(
+	struct xfs_btree_cur		*cur,
+	const union xfs_btree_key	*key1,
+	const union xfs_btree_key	*key2,
+	const union xfs_btree_key	*mask)
+{
+	return cur->bc_ops->diff_two_keys(cur, key1, key2, mask) < 0;
+}
+
+static inline bool
+xfs_btree_masked_keycmp_gt(
+	struct xfs_btree_cur		*cur,
+	const union xfs_btree_key	*key1,
+	const union xfs_btree_key	*key2,
+	const union xfs_btree_key	*mask)
+{
+	return cur->bc_ops->diff_two_keys(cur, key1, key2, mask) > 0;
+}
+
+static inline bool
+xfs_btree_masked_keycmp_ge(
+	struct xfs_btree_cur		*cur,
+	const union xfs_btree_key	*key1,
+	const union xfs_btree_key	*key2,
+	const union xfs_btree_key	*mask)
+{
+	return !xfs_btree_masked_keycmp_lt(cur, key1, key2, mask);
+}
+
 /* Does this cursor point to the last block in the given level? */
 static inline bool
 xfs_btree_islastblock(
diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
index e59bd6d3db03..2b7571d50afb 100644
--- a/fs/xfs/libxfs/xfs_ialloc_btree.c
+++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
@@ -269,10 +269,13 @@ STATIC int64_t
 xfs_inobt_diff_two_keys(
 	struct xfs_btree_cur		*cur,
 	const union xfs_btree_key	*k1,
-	const union xfs_btree_key	*k2)
+	const union xfs_btree_key	*k2,
+	const union xfs_btree_key	*mask)
 {
+	ASSERT(!mask || mask->inobt.ir_startino);
+
 	return (int64_t)be32_to_cpu(k1->inobt.ir_startino) -
-			  be32_to_cpu(k2->inobt.ir_startino);
+			be32_to_cpu(k2->inobt.ir_startino);
 }
 
 static xfs_failaddr_t
@@ -387,8 +390,11 @@ STATIC enum xbtree_key_contig
 xfs_inobt_keys_contiguous(
 	struct xfs_btree_cur		*cur,
 	const union xfs_btree_key	*key1,
-	const union xfs_btree_key	*key2)
+	const union xfs_btree_key	*key2,
+	const union xfs_btree_key	*mask)
 {
+	ASSERT(!mask || mask->inobt.ir_startino);
+
 	return xbtree_key_contig(be32_to_cpu(key1->inobt.ir_startino),
 				 be32_to_cpu(key2->inobt.ir_startino));
 }
diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
index 94377b59ba44..c1c65774dcc2 100644
--- a/fs/xfs/libxfs/xfs_refcount.c
+++ b/fs/xfs/libxfs/xfs_refcount.c
@@ -2019,7 +2019,7 @@ xfs_refcount_has_records(
 	high.rc.rc_startblock = bno + len - 1;
 	low.rc.rc_domain = high.rc.rc_domain = domain;
 
-	return xfs_btree_has_records(cur, &low, &high, outcome);
+	return xfs_btree_has_records(cur, &low, &high, NULL, outcome);
 }
 
 int __init
diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c
index 26e28ac24238..2ec45e2ffbe1 100644
--- a/fs/xfs/libxfs/xfs_refcount_btree.c
+++ b/fs/xfs/libxfs/xfs_refcount_btree.c
@@ -202,10 +202,13 @@ STATIC int64_t
 xfs_refcountbt_diff_two_keys(
 	struct xfs_btree_cur		*cur,
 	const union xfs_btree_key	*k1,
-	const union xfs_btree_key	*k2)
+	const union xfs_btree_key	*k2,
+	const union xfs_btree_key	*mask)
 {
+	ASSERT(!mask || mask->refc.rc_startblock);
+
 	return (int64_t)be32_to_cpu(k1->refc.rc_startblock) -
-			  be32_to_cpu(k2->refc.rc_startblock);
+			be32_to_cpu(k2->refc.rc_startblock);
 }
 
 STATIC xfs_failaddr_t
@@ -304,8 +307,11 @@ STATIC enum xbtree_key_contig
 xfs_refcountbt_keys_contiguous(
 	struct xfs_btree_cur		*cur,
 	const union xfs_btree_key	*key1,
-	const union xfs_btree_key	*key2)
+	const union xfs_btree_key	*key2,
+	const union xfs_btree_key	*mask)
 {
+	ASSERT(!mask || mask->refc.rc_startblock);
+
 	return xbtree_key_contig(be32_to_cpu(key1->refc.rc_startblock),
 				 be32_to_cpu(key2->refc.rc_startblock));
 }
diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c
index e616b964f11c..308b81f321eb 100644
--- a/fs/xfs/libxfs/xfs_rmap.c
+++ b/fs/xfs/libxfs/xfs_rmap.c
@@ -2721,6 +2721,9 @@ xfs_rmap_has_records(
 	xfs_extlen_t		len,
 	enum xbtree_recpacking	*outcome)
 {
+	union xfs_btree_key	mask = {
+		.rmap.rm_startblock = cpu_to_be32(-1U),
+	};
 	union xfs_btree_irec	low;
 	union xfs_btree_irec	high;
 
@@ -2729,7 +2732,7 @@ xfs_rmap_has_records(
 	memset(&high, 0xFF, sizeof(high));
 	high.r.rm_startblock = bno + len - 1;
 
-	return xfs_btree_has_records(cur, &low, &high, outcome);
+	return xfs_btree_has_records(cur, &low, &high, &mask, outcome);
 }
 
 /*
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
index 1733865026d4..2c90a05ca814 100644
--- a/fs/xfs/libxfs/xfs_rmap_btree.c
+++ b/fs/xfs/libxfs/xfs_rmap_btree.c
@@ -273,31 +273,43 @@ STATIC int64_t
 xfs_rmapbt_diff_two_keys(
 	struct xfs_btree_cur		*cur,
 	const union xfs_btree_key	*k1,
-	const union xfs_btree_key	*k2)
+	const union xfs_btree_key	*k2,
+	const union xfs_btree_key	*mask)
 {
 	const struct xfs_rmap_key	*kp1 = &k1->rmap;
 	const struct xfs_rmap_key	*kp2 = &k2->rmap;
 	int64_t				d;
 	__u64				x, y;
 
+	/* Doesn't make sense to mask off the physical space part */
+	ASSERT(!mask || mask->rmap.rm_startblock);
+
 	d = (int64_t)be32_to_cpu(kp1->rm_startblock) -
-		       be32_to_cpu(kp2->rm_startblock);
+		     be32_to_cpu(kp2->rm_startblock);
 	if (d)
 		return d;
 
-	x = be64_to_cpu(kp1->rm_owner);
-	y = be64_to_cpu(kp2->rm_owner);
-	if (x > y)
-		return 1;
-	else if (y > x)
-		return -1;
+	if (!mask || mask->rmap.rm_owner) {
+		x = be64_to_cpu(kp1->rm_owner);
+		y = be64_to_cpu(kp2->rm_owner);
+		if (x > y)
+			return 1;
+		else if (y > x)
+			return -1;
+	}
+
+	if (!mask || mask->rmap.rm_offset) {
+		/* Doesn't make sense to allow offset but not owner */
+		ASSERT(!mask || mask->rmap.rm_owner);
+
+		x = offset_keymask(be64_to_cpu(kp1->rm_offset));
+		y = offset_keymask(be64_to_cpu(kp2->rm_offset));
+		if (x > y)
+			return 1;
+		else if (y > x)
+			return -1;
+	}
 
-	x = offset_keymask(be64_to_cpu(kp1->rm_offset));
-	y = offset_keymask(be64_to_cpu(kp2->rm_offset));
-	if (x > y)
-		return 1;
-	else if (y > x)
-		return -1;
 	return 0;
 }
 
@@ -448,13 +460,18 @@ STATIC enum xbtree_key_contig
 xfs_rmapbt_keys_contiguous(
 	struct xfs_btree_cur		*cur,
 	const union xfs_btree_key	*key1,
-	const union xfs_btree_key	*key2)
+	const union xfs_btree_key	*key2,
+	const union xfs_btree_key	*mask)
 {
+	ASSERT(!mask || mask->rmap.rm_startblock);
+
 	/*
 	 * We only support checking contiguity of the physical space component.
 	 * If any callers ever need more specificity than that, they'll have to
 	 * implement it here.
 	 */
+	ASSERT(!mask || (!mask->rmap.rm_owner && !mask->rmap.rm_offset));
+
 	return xbtree_key_contig(be32_to_cpu(key1->rmap.rm_startblock),
 				 be32_to_cpu(key2->rmap.rm_startblock));
 }


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 5/6] xfs: check the reference counts of gaps in the refcount btree
  2022-12-30 22:11 ` [PATCHSET v24.0 0/6] xfs: detect incorrect gaps in refcount btree Darrick J. Wong
                     ` (2 preceding siblings ...)
  2022-12-30 22:11   ` [PATCH 1/6] xfs: refactor converting btree irec to btree key Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 4/6] xfs: implement masked btree key comparisons for _has_records scans Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 6/6] xfs: ensure that all metadata and data blocks are not cow staging extents Darrick J. Wong
  5 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: Dave Chinner, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Gaps in the reference count btree are also significant -- for these
regions, there must not be any overlapping reverse mappings.  We don't
currently check this, so make the refcount scrubber more complete.
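
As an illustration of the check (my sketch, not code from this patch):
within a span of blocks that the refcount btree claims is unshared, no
two reverse mappings may overlap.  The standalone model below uses plain
userspace types; the sorted record walk stands in for what an
xfs_rmap_query_range call would provide.

/*
 * Illustrative sketch only: models the gap check with userspace types.
 * Assumes the records are sorted by rm_startblock, the order in which an
 * rmap range query visits them.
 */
#include <stdbool.h>
#include <stdint.h>

struct rmap_rec {
	uint32_t	rm_startblock;
	uint32_t	rm_blockcount;
};

/*
 * Return true if any two mappings in an "unshared" gap overlap, which
 * would mean a shared extent exists without a matching refcount record.
 */
static bool
gap_has_overlapping_rmaps(const struct rmap_rec *recs, unsigned int nr)
{
	uint32_t	next_bno = 0;

	for (unsigned int i = 0; i < nr; i++) {
		if (i > 0 && recs[i].rm_startblock < next_bno)
			return true;	/* two owners map the same block */
		next_bno = recs[i].rm_startblock + recs[i].rm_blockcount;
	}
	return false;
}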

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/scrub/refcount.c |   95 +++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 90 insertions(+), 5 deletions(-)


diff --git a/fs/xfs/scrub/refcount.c b/fs/xfs/scrub/refcount.c
index 220b2850659e..10ef377873f6 100644
--- a/fs/xfs/scrub/refcount.c
+++ b/fs/xfs/scrub/refcount.c
@@ -332,6 +332,64 @@ xchk_refcountbt_xref(
 	xchk_refcountbt_xref_rmap(sc, irec);
 }
 
+struct xchk_refcbt_records {
+	/* The next AG block where we aren't expecting shared extents. */
+	xfs_agblock_t		next_unshared_agbno;
+
+	/* Number of CoW blocks we expect. */
+	xfs_agblock_t		cow_blocks;
+
+	/* Was the last record a shared or CoW staging extent? */
+	enum xfs_refc_domain	prev_domain;
+};
+
+STATIC int
+xchk_refcountbt_rmap_check_gap(
+	struct xfs_btree_cur		*cur,
+	const struct xfs_rmap_irec	*rec,
+	void				*priv)
+{
+	xfs_agblock_t			*next_bno = priv;
+
+	if (*next_bno != NULLAGBLOCK && rec->rm_startblock < *next_bno)
+		return -ECANCELED;
+
+	*next_bno = rec->rm_startblock + rec->rm_blockcount;
+	return 0;
+}
+
+/*
+ * Make sure that a gap in the reference count records does not correspond to
+ * overlapping records (i.e. shared extents) in the reverse mappings.
+ */
+static inline void
+xchk_refcountbt_xref_gaps(
+	struct xfs_scrub	*sc,
+	struct xchk_refcbt_records *rrc,
+	xfs_agblock_t		bno)
+{
+	struct xfs_rmap_irec	low;
+	struct xfs_rmap_irec	high;
+	xfs_agblock_t		next_bno = NULLAGBLOCK;
+	int			error;
+
+	if (bno <= rrc->next_unshared_agbno || !sc->sa.rmap_cur ||
+            xchk_skip_xref(sc->sm))
+		return;
+
+	memset(&low, 0, sizeof(low));
+	low.rm_startblock = rrc->next_unshared_agbno;
+	memset(&high, 0xFF, sizeof(high));
+	high.rm_startblock = bno - 1;
+
+	error = xfs_rmap_query_range(sc->sa.rmap_cur, &low, &high,
+			xchk_refcountbt_rmap_check_gap, &next_bno);
+	if (error == -ECANCELED)
+		xchk_btree_xref_set_corrupt(sc, sc->sa.rmap_cur, 0);
+	else
+		xchk_should_check_xref(sc, &error, &sc->sa.rmap_cur);
+}
+
 /* Scrub a refcountbt record. */
 STATIC int
 xchk_refcountbt_rec(
@@ -339,7 +397,7 @@ xchk_refcountbt_rec(
 	const union xfs_btree_rec *rec)
 {
 	struct xfs_refcount_irec irec;
-	xfs_agblock_t		*cow_blocks = bs->private;
+	struct xchk_refcbt_records *rrc = bs->private;
 
 	xfs_refcount_btrec_to_irec(rec, &irec);
 	if (xfs_refcount_check_irec(bs->cur, &irec) != NULL) {
@@ -348,10 +406,27 @@ xchk_refcountbt_rec(
 	}
 
 	if (irec.rc_domain == XFS_REFC_DOMAIN_COW)
-		(*cow_blocks) += irec.rc_blockcount;
+		rrc->cow_blocks += irec.rc_blockcount;
+
+	/* Shared records always come before CoW records. */
+	if (irec.rc_domain == XFS_REFC_DOMAIN_SHARED &&
+	    rrc->prev_domain == XFS_REFC_DOMAIN_COW)
+		xchk_btree_set_corrupt(bs->sc, bs->cur, 0);
+	rrc->prev_domain = irec.rc_domain;
 
 	xchk_refcountbt_xref(bs->sc, &irec);
 
+	/*
+	 * If this is a record for a shared extent, check that all blocks
+	 * between the previous record and this one have at most one reverse
+	 * mapping.
+	 */
+	if (irec.rc_domain == XFS_REFC_DOMAIN_SHARED) {
+		xchk_refcountbt_xref_gaps(bs->sc, rrc, irec.rc_startblock);
+		rrc->next_unshared_agbno = irec.rc_startblock +
+					   irec.rc_blockcount;
+	}
+
 	return 0;
 }
 
@@ -393,15 +468,25 @@ int
 xchk_refcountbt(
 	struct xfs_scrub	*sc)
 {
-	xfs_agblock_t		cow_blocks = 0;
+	struct xchk_refcbt_records rrc = {
+		.cow_blocks		= 0,
+		.next_unshared_agbno	= 0,
+		.prev_domain		= XFS_REFC_DOMAIN_SHARED,
+	};
 	int			error;
 
 	error = xchk_btree(sc, sc->sa.refc_cur, xchk_refcountbt_rec,
-			&XFS_RMAP_OINFO_REFC, &cow_blocks);
+			&XFS_RMAP_OINFO_REFC, &rrc);
 	if (error)
 		return error;
 
-	xchk_refcount_xref_rmap(sc, cow_blocks);
+	/*
+	 * Check that all blocks between the last refcount > 1 record and the
+	 * end of the AG have at most one reverse mapping.
+	 */
+	xchk_refcountbt_xref_gaps(sc, &rrc, sc->mp->m_sb.sb_agblocks);
+
+	xchk_refcount_xref_rmap(sc, rrc.cow_blocks);
 
 	return 0;
 }


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 6/6] xfs: ensure that all metadata and data blocks are not cow staging extents
  2022-12-30 22:11 ` [PATCHSET v24.0 0/6] xfs: detect incorrect gaps in refcount btree Darrick J. Wong
                     ` (4 preceding siblings ...)
  2022-12-30 22:11   ` [PATCH 4/6] xfs: implement masked btree key comparisons for _has_records scans Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  5 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: Dave Chinner, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Make sure that all filesystem metadata blocks and file data blocks are
not also marked as CoW staging extents.  The extra checking added here
was inspired by an actual VM host filesystem corruption incident due to
bugs in the CoW handling of 4.x kernels.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/scrub/agheader.c |    5 +++++
 fs/xfs/scrub/alloc.c    |    1 +
 fs/xfs/scrub/bmap.c     |   11 ++++++++---
 fs/xfs/scrub/ialloc.c   |    2 +-
 fs/xfs/scrub/inode.c    |    1 +
 fs/xfs/scrub/refcount.c |   21 +++++++++++++++++++++
 fs/xfs/scrub/scrub.h    |    2 ++
 7 files changed, 39 insertions(+), 4 deletions(-)


diff --git a/fs/xfs/scrub/agheader.c b/fs/xfs/scrub/agheader.c
index 3dd9151a20ad..520ec054e4a6 100644
--- a/fs/xfs/scrub/agheader.c
+++ b/fs/xfs/scrub/agheader.c
@@ -53,6 +53,7 @@ xchk_superblock_xref(
 	xchk_xref_is_not_inode_chunk(sc, agbno, 1);
 	xchk_xref_is_owned_by(sc, agbno, 1, &XFS_RMAP_OINFO_FS);
 	xchk_xref_is_not_shared(sc, agbno, 1);
+	xchk_xref_is_not_cow_staging(sc, agbno, 1);
 
 	/* scrub teardown will take care of sc->sa for us */
 }
@@ -517,6 +518,7 @@ xchk_agf_xref(
 	xchk_xref_is_owned_by(sc, agbno, 1, &XFS_RMAP_OINFO_FS);
 	xchk_agf_xref_btreeblks(sc);
 	xchk_xref_is_not_shared(sc, agbno, 1);
+	xchk_xref_is_not_cow_staging(sc, agbno, 1);
 	xchk_agf_xref_refcblks(sc);
 
 	/* scrub teardown will take care of sc->sa for us */
@@ -644,6 +646,7 @@ xchk_agfl_block_xref(
 	xchk_xref_is_not_inode_chunk(sc, agbno, 1);
 	xchk_xref_is_owned_by(sc, agbno, 1, &XFS_RMAP_OINFO_AG);
 	xchk_xref_is_not_shared(sc, agbno, 1);
+	xchk_xref_is_not_cow_staging(sc, agbno, 1);
 }
 
 /* Scrub an AGFL block. */
@@ -700,6 +703,7 @@ xchk_agfl_xref(
 	xchk_xref_is_not_inode_chunk(sc, agbno, 1);
 	xchk_xref_is_owned_by(sc, agbno, 1, &XFS_RMAP_OINFO_FS);
 	xchk_xref_is_not_shared(sc, agbno, 1);
+	xchk_xref_is_not_cow_staging(sc, agbno, 1);
 
 	/*
 	 * Scrub teardown will take care of sc->sa for us.  Leave sc->sa
@@ -855,6 +859,7 @@ xchk_agi_xref(
 	xchk_agi_xref_icounts(sc);
 	xchk_xref_is_owned_by(sc, agbno, 1, &XFS_RMAP_OINFO_FS);
 	xchk_xref_is_not_shared(sc, agbno, 1);
+	xchk_xref_is_not_cow_staging(sc, agbno, 1);
 	xchk_agi_xref_fiblocks(sc);
 
 	/* scrub teardown will take care of sc->sa for us */
diff --git a/fs/xfs/scrub/alloc.c b/fs/xfs/scrub/alloc.c
index c72001f6bad9..e9f8d29544aa 100644
--- a/fs/xfs/scrub/alloc.c
+++ b/fs/xfs/scrub/alloc.c
@@ -90,6 +90,7 @@ xchk_allocbt_xref(
 	xchk_xref_is_not_inode_chunk(sc, agbno, len);
 	xchk_xref_has_no_owner(sc, agbno, len);
 	xchk_xref_is_not_shared(sc, agbno, len);
+	xchk_xref_is_not_cow_staging(sc, agbno, len);
 }
 
 /* Scrub a bnobt/cntbt record. */
diff --git a/fs/xfs/scrub/bmap.c b/fs/xfs/scrub/bmap.c
index 575f2c80d055..abc2da0b1824 100644
--- a/fs/xfs/scrub/bmap.c
+++ b/fs/xfs/scrub/bmap.c
@@ -328,12 +328,17 @@ xchk_bmap_iextent_xref(
 	xchk_bmap_xref_rmap(info, irec, agbno);
 	switch (info->whichfork) {
 	case XFS_DATA_FORK:
-		if (xfs_is_reflink_inode(info->sc->ip))
-			break;
-		fallthrough;
+		if (!xfs_is_reflink_inode(info->sc->ip))
+			xchk_xref_is_not_shared(info->sc, agbno,
+					irec->br_blockcount);
+		xchk_xref_is_not_cow_staging(info->sc, agbno,
+				irec->br_blockcount);
+		break;
 	case XFS_ATTR_FORK:
 		xchk_xref_is_not_shared(info->sc, agbno,
 				irec->br_blockcount);
+		xchk_xref_is_not_cow_staging(info->sc, agbno,
+				irec->br_blockcount);
 		break;
 	case XFS_COW_FORK:
 		xchk_xref_is_cow_staging(info->sc, agbno,
diff --git a/fs/xfs/scrub/ialloc.c b/fs/xfs/scrub/ialloc.c
index b85f0cd00bc2..5f04030b86c8 100644
--- a/fs/xfs/scrub/ialloc.c
+++ b/fs/xfs/scrub/ialloc.c
@@ -115,7 +115,7 @@ xchk_iallocbt_chunk(
 		xchk_btree_set_corrupt(bs->sc, bs->cur, 0);
 
 	xchk_iallocbt_chunk_xref(bs->sc, irec, agino, bno, len);
-
+	xchk_xref_is_not_cow_staging(bs->sc, bno, len);
 	return true;
 }
 
diff --git a/fs/xfs/scrub/inode.c b/fs/xfs/scrub/inode.c
index 8c972ee15a30..95694eca3851 100644
--- a/fs/xfs/scrub/inode.c
+++ b/fs/xfs/scrub/inode.c
@@ -558,6 +558,7 @@ xchk_inode_xref(
 	xchk_inode_xref_finobt(sc, ino);
 	xchk_xref_is_owned_by(sc, agbno, 1, &XFS_RMAP_OINFO_INODES);
 	xchk_xref_is_not_shared(sc, agbno, 1);
+	xchk_xref_is_not_cow_staging(sc, agbno, 1);
 	xchk_inode_xref_bmap(sc, dip);
 
 out_free:
diff --git a/fs/xfs/scrub/refcount.c b/fs/xfs/scrub/refcount.c
index 10ef377873f6..e99c1e1246f8 100644
--- a/fs/xfs/scrub/refcount.c
+++ b/fs/xfs/scrub/refcount.c
@@ -555,3 +555,24 @@ xchk_xref_is_not_shared(
 	if (outcome != XBTREE_RECPACKING_EMPTY)
 		xchk_btree_xref_set_corrupt(sc, sc->sa.refc_cur, 0);
 }
+
+/* xref check that the extent is not being used for CoW staging. */
+void
+xchk_xref_is_not_cow_staging(
+	struct xfs_scrub	*sc,
+	xfs_agblock_t		agbno,
+	xfs_extlen_t		len)
+{
+	enum xbtree_recpacking	outcome;
+	int			error;
+
+	if (!sc->sa.refc_cur || xchk_skip_xref(sc->sm))
+		return;
+
+	error = xfs_refcount_has_records(sc->sa.refc_cur, XFS_REFC_DOMAIN_COW,
+			agbno, len, &outcome);
+	if (!xchk_should_check_xref(sc, &error, &sc->sa.refc_cur))
+		return;
+	if (outcome != XBTREE_RECPACKING_EMPTY)
+		xchk_btree_xref_set_corrupt(sc, sc->sa.refc_cur, 0);
+}
diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h
index 85c055c2ddc5..a331838e22ff 100644
--- a/fs/xfs/scrub/scrub.h
+++ b/fs/xfs/scrub/scrub.h
@@ -166,6 +166,8 @@ void xchk_xref_is_cow_staging(struct xfs_scrub *sc, xfs_agblock_t bno,
 		xfs_extlen_t len);
 void xchk_xref_is_not_shared(struct xfs_scrub *sc, xfs_agblock_t bno,
 		xfs_extlen_t len);
+void xchk_xref_is_not_cow_staging(struct xfs_scrub *sc, xfs_agblock_t bno,
+		xfs_extlen_t len);
 #ifdef CONFIG_XFS_RT
 void xchk_xref_is_used_rt_space(struct xfs_scrub *sc, xfs_rtblock_t rtbno,
 		xfs_extlen_t len);


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCHSET v24.0 0/4] xfs: detect incorrect gaps in inode btree
  2022-12-30 21:13 [NYE DELUGE 1/4] xfs: all pending online scrub improvements Darrick J. Wong
                   ` (9 preceding siblings ...)
  2022-12-30 22:11 ` [PATCHSET v24.0 0/6] xfs: detect incorrect gaps in refcount btree Darrick J. Wong
@ 2022-12-30 22:11 ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 2/4] xfs: clean up broken early-exit code in the inode btree scrubber Darrick J. Wong
                     ` (3 more replies)
  2022-12-30 22:11 ` [PATCHSET v24.0 0/2] xfs: detect incorrect gaps in rmap btree Darrick J. Wong
                   ` (11 subsequent siblings)
  22 siblings, 4 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

Hi all,

This series continues the corrections for a couple of problems I found
in the inode btree scrubber.  The first problem is that we don't
directly check that the inobt records correspond with the finobt
records, and vice versa.  The second problem occurs on
filesystems with sparse inode chunks -- the cross-referencing we do
detects sparseness, but it doesn't actually check the consistency
between the inobt hole records and the rmap data.
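
To make the first problem concrete, here is a rough model (my sketch,
not code from this series) of when an inobt record demands a matching
finobt record for a given inode; the exact rules are spelled out in
patch 3/4 below.

/*
 * Rough model of the inobt -> finobt correspondence rule.  Inputs
 * describe one inode within an inobt record; the return value says
 * whether the finobt must also have a record covering that inode.
 */
#include <stdbool.h>
#include <stdint.h>

static bool
finobt_record_required(uint64_t ir_free, uint64_t all_free_mask,
		       bool inode_is_free, bool inode_is_hole)
{
	if (ir_free == 0)		/* chunk fully allocated */
		return false;
	if (ir_free == all_free_mask)	/* chunk entirely free */
		return false;
	if (inode_is_hole)		/* sparse hole, not a real inode */
		return false;
	return inode_is_free;		/* finobt only tracks free inodes */
}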

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-inobt-gaps

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-detect-inobt-gaps
---
 fs/xfs/libxfs/xfs_ialloc.c |   84 ++++++++------
 fs/xfs/libxfs/xfs_ialloc.h |    5 -
 fs/xfs/scrub/ialloc.c      |  268 ++++++++++++++++++++++++++++++++++++--------
 3 files changed, 269 insertions(+), 88 deletions(-)


^ permalink raw reply	[flat|nested] 220+ messages in thread

* [PATCH 1/4] xfs: remove pointless shadow variable from xfs_difree_inobt
  2022-12-30 22:11 ` [PATCHSET v24.0 0/4] xfs: detect incorrect gaps in inode btree Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 2/4] xfs: clean up broken early-exit code in the inode btree scrubber Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 4/4] xfs: convert xfs_ialloc_has_inodes_at_extent to return keyfill scan results Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 3/4] xfs: directly cross-reference the inode btrees with each other Darrick J. Wong
  3 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

In xfs_difree_inobt, the pag passed in was previously used to look up
the AGI buffer.  There's no need to extract it again, so remove the
shadow variable and shut up -Wshadow.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_ialloc.c |    2 --
 1 file changed, 2 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 2451db4c687c..aab83f17d1a5 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -1980,8 +1980,6 @@ xfs_difree_inobt(
 	 */
 	if (!xfs_has_ikeep(mp) && rec.ir_free == XFS_INOBT_ALL_FREE &&
 	    mp->m_sb.sb_inopblock <= XFS_INODES_PER_CHUNK) {
-		struct xfs_perag	*pag = agbp->b_pag;
-
 		xic->deleted = true;
 		xic->first_ino = XFS_AGINO_TO_INO(mp, pag->pag_agno,
 				rec.ir_startino);


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 2/4] xfs: clean up broken early-exit code in the inode btree scrubber
  2022-12-30 22:11 ` [PATCHSET v24.0 0/4] xfs: detect incorrect gaps in inode btree Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 1/4] xfs: remove pointless shadow variable from xfs_difree_inobt Darrick J. Wong
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Corrupt inode chunks should cause us to exit early after setting the
CORRUPT flag on the scrub state.  While we're at it, collapse trivial
helpers.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/ialloc.c |   50 +++++++++++++++++++++----------------------------
 1 file changed, 21 insertions(+), 29 deletions(-)


diff --git a/fs/xfs/scrub/ialloc.c b/fs/xfs/scrub/ialloc.c
index 5f04030b86c8..e5ce6a055ffe 100644
--- a/fs/xfs/scrub/ialloc.c
+++ b/fs/xfs/scrub/ialloc.c
@@ -79,43 +79,32 @@ xchk_iallocbt_chunk_xref_other(
 		xchk_btree_xref_set_corrupt(sc, *pcur, 0);
 }
 
-/* Cross-reference with the other btrees. */
-STATIC void
-xchk_iallocbt_chunk_xref(
-	struct xfs_scrub		*sc,
+/* Is this chunk worth checking and cross-referencing? */
+STATIC bool
+xchk_iallocbt_chunk(
+	struct xchk_btree		*bs,
 	struct xfs_inobt_rec_incore	*irec,
 	xfs_agino_t			agino,
-	xfs_agblock_t			agbno,
 	xfs_extlen_t			len)
 {
-	if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT)
-		return;
+	struct xfs_scrub		*sc = bs->sc;
+	struct xfs_mount		*mp = bs->cur->bc_mp;
+	struct xfs_perag		*pag = bs->cur->bc_ag.pag;
+	xfs_agblock_t			agbno;
+
+	agbno = XFS_AGINO_TO_AGBNO(mp, agino);
+
+	if (!xfs_verify_agbext(pag, agbno, len))
+		xchk_btree_set_corrupt(bs->sc, bs->cur, 0);
+
+	if (bs->sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT)
+		return false;
 
 	xchk_xref_is_used_space(sc, agbno, len);
 	xchk_iallocbt_chunk_xref_other(sc, irec, agino);
 	xchk_xref_is_owned_by(sc, agbno, len, &XFS_RMAP_OINFO_INODES);
 	xchk_xref_is_not_shared(sc, agbno, len);
-}
-
-/* Is this chunk worth checking? */
-STATIC bool
-xchk_iallocbt_chunk(
-	struct xchk_btree		*bs,
-	struct xfs_inobt_rec_incore	*irec,
-	xfs_agino_t			agino,
-	xfs_extlen_t			len)
-{
-	struct xfs_mount		*mp = bs->cur->bc_mp;
-	struct xfs_perag		*pag = bs->cur->bc_ag.pag;
-	xfs_agblock_t			bno;
-
-	bno = XFS_AGINO_TO_AGBNO(mp, agino);
-
-	if (!xfs_verify_agbext(pag, bno, len))
-		xchk_btree_set_corrupt(bs->sc, bs->cur, 0);
-
-	xchk_iallocbt_chunk_xref(bs->sc, irec, agino, bno, len);
-	xchk_xref_is_not_cow_staging(bs->sc, bno, len);
+	xchk_xref_is_not_cow_staging(sc, agbno, len);
 	return true;
 }
 
@@ -463,7 +452,7 @@ xchk_iallocbt_rec(
 		if (holemask & 1)
 			holecount += XFS_INODES_PER_HOLEMASK_BIT;
 		else if (!xchk_iallocbt_chunk(bs, &irec, agino, len))
-			break;
+			goto out;
 		holemask >>= 1;
 		agino += XFS_INODES_PER_HOLEMASK_BIT;
 	}
@@ -473,6 +462,9 @@ xchk_iallocbt_rec(
 		xchk_btree_set_corrupt(bs->sc, bs->cur, 0);
 
 check_clusters:
+	if (bs->sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT)
+		goto out;
+
 	error = xchk_iallocbt_check_clusters(bs, &irec);
 	if (error)
 		goto out;


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 3/4] xfs: directly cross-reference the inode btrees with each other
  2022-12-30 22:11 ` [PATCHSET v24.0 0/4] xfs: detect incorrect gaps in inode btree Darrick J. Wong
                     ` (2 preceding siblings ...)
  2022-12-30 22:11   ` [PATCH 4/4] xfs: convert xfs_ialloc_has_inodes_at_extent to return keyfill scan results Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  3 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Improve the cross-referencing of the two inode btrees by directly
checking the free and hole state of each inode with the other btree.
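
For reference (my illustration, not part of the diff), the per-inode
free and hole state that both directions of the cross-reference rely on
is derived from a record roughly as follows; field names mirror struct
xfs_inobt_rec_incore and the constant stands in for
XFS_INODES_PER_HOLEMASK_BIT.

/*
 * Illustrative sketch: derive the free/hole state of one inode from an
 * inode btree record.  64 inodes per chunk and 16 holemask bits give 4
 * inodes per holemask bit.
 */
#include <stdbool.h>
#include <stdint.h>

#define INODES_PER_HOLEMASK_BIT	4

struct inobt_rec {
	uint32_t	ir_startino;
	uint16_t	ir_holemask;
	uint64_t	ir_free;
};

static void
inode_state(const struct inobt_rec *irec, uint32_t agino,
	    bool *is_free, bool *is_hole)
{
	unsigned int	rec_idx = agino - irec->ir_startino;
	unsigned int	hole_idx = rec_idx / INODES_PER_HOLEMASK_BIT;

	*is_free = irec->ir_free & (1ULL << rec_idx);
	*is_hole = irec->ir_holemask & (1U << hole_idx);
}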

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/ialloc.c |  225 +++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 198 insertions(+), 27 deletions(-)


diff --git a/fs/xfs/scrub/ialloc.c b/fs/xfs/scrub/ialloc.c
index e5ce6a055ffe..65a6c01df235 100644
--- a/fs/xfs/scrub/ialloc.c
+++ b/fs/xfs/scrub/ialloc.c
@@ -51,32 +51,201 @@ struct xchk_iallocbt {
 };
 
 /*
- * If we're checking the finobt, cross-reference with the inobt.
- * Otherwise we're checking the inobt; if there is an finobt, make sure
- * we have a record or not depending on freecount.
+ * Does the finobt have a record for this inode with the same hole/free state?
+ * This is a bit complicated because of the following:
+ *
+ * - The finobt need not have a record if all inodes in the inobt record are
+ *   allocated.
+ * - The finobt need not have a record if all inodes in the inobt record are
+ *   free.
+ * - The finobt need not have a record if the inobt record says this is a hole.
+ *   This likely doesn't happen in practice.
  */
-static inline void
-xchk_iallocbt_chunk_xref_other(
+STATIC int
+xchk_inobt_xref_finobt(
+	struct xfs_scrub	*sc,
+	struct xfs_inobt_rec_incore *irec,
+	xfs_agino_t		agino,
+	bool			free,
+	bool			hole)
+{
+	struct xfs_inobt_rec_incore frec;
+	struct xfs_btree_cur	*cur = sc->sa.fino_cur;
+	bool			ffree, fhole;
+	unsigned int		frec_idx, fhole_idx;
+	int			has_record;
+	int			error;
+
+	ASSERT(cur->bc_btnum == XFS_BTNUM_FINO);
+
+	error = xfs_inobt_lookup(cur, agino, XFS_LOOKUP_LE, &has_record);
+	if (error)
+		return error;
+	if (!has_record)
+		goto no_record;
+
+	error = xfs_inobt_get_rec(cur, &frec, &has_record);
+	if (!has_record)
+		return -EFSCORRUPTED;
+
+	if (frec.ir_startino + XFS_INODES_PER_CHUNK <= agino)
+		goto no_record;
+
+	/* There's a finobt record; free and hole status must match. */
+	frec_idx = agino - frec.ir_startino;
+	ffree = frec.ir_free & (1ULL << frec_idx);
+	fhole_idx = frec_idx / XFS_INODES_PER_HOLEMASK_BIT;
+	fhole = frec.ir_holemask & (1U << fhole_idx);
+
+	if (ffree != free)
+		xchk_btree_xref_set_corrupt(sc, cur, 0);
+	if (fhole != hole)
+		xchk_btree_xref_set_corrupt(sc, cur, 0);
+	return 0;
+
+no_record:
+	/* inobt record is fully allocated */
+	if (irec->ir_free == 0)
+		return 0;
+
+	/* inobt record is totally unallocated */
+	if (irec->ir_free == XFS_INOBT_ALL_FREE)
+		return 0;
+
+	/* inobt record says this is a hole */
+	if (hole)
+		return 0;
+
+	/* finobt doesn't care about allocated inodes */
+	if (!free)
+		return 0;
+
+	xchk_btree_xref_set_corrupt(sc, cur, 0);
+	return 0;
+}
+
+/*
+ * Make sure that each inode of this part of an inobt record has the same
+ * sparse and free status as the finobt.
+ */
+STATIC void
+xchk_inobt_chunk_xref_finobt(
 	struct xfs_scrub		*sc,
 	struct xfs_inobt_rec_incore	*irec,
-	xfs_agino_t			agino)
+	xfs_agino_t			agino,
+	unsigned int			nr_inodes)
 {
-	struct xfs_btree_cur		**pcur;
-	bool				has_irec;
+	xfs_agino_t			i;
+	unsigned int			rec_idx;
 	int				error;
 
-	if (sc->sm->sm_type == XFS_SCRUB_TYPE_FINOBT)
-		pcur = &sc->sa.ino_cur;
-	else
-		pcur = &sc->sa.fino_cur;
-	if (!(*pcur))
+	ASSERT(sc->sm->sm_type == XFS_SCRUB_TYPE_INOBT);
+
+	if (!sc->sa.fino_cur || xchk_skip_xref(sc->sm))
 		return;
-	error = xfs_ialloc_has_inode_record(*pcur, agino, agino, &has_irec);
-	if (!xchk_should_check_xref(sc, &error, pcur))
+
+	for (i = agino, rec_idx = agino - irec->ir_startino;
+	     i < agino + nr_inodes;
+	     i++, rec_idx++) {
+		bool			free, hole;
+		unsigned int		hole_idx;
+
+		free = irec->ir_free & (1ULL << rec_idx);
+		hole_idx = rec_idx / XFS_INODES_PER_HOLEMASK_BIT;
+		hole = irec->ir_holemask & (1U << hole_idx);
+
+		error = xchk_inobt_xref_finobt(sc, irec, i, free, hole);
+		if (!xchk_should_check_xref(sc, &error, &sc->sa.fino_cur))
+			return;
+	}
+}
+
+/*
+ * Does the inobt have a record for this inode with the same hole/free state?
+ * The inobt must always have a record if there's a finobt record.
+ */
+STATIC int
+xchk_finobt_xref_inobt(
+	struct xfs_scrub	*sc,
+	struct xfs_inobt_rec_incore *frec,
+	xfs_agino_t		agino,
+	bool			ffree,
+	bool			fhole)
+{
+	struct xfs_inobt_rec_incore irec;
+	struct xfs_btree_cur	*cur = sc->sa.ino_cur;
+	bool			free, hole;
+	unsigned int		rec_idx, hole_idx;
+	int			has_record;
+	int			error;
+
+	ASSERT(cur->bc_btnum == XFS_BTNUM_INO);
+
+	error = xfs_inobt_lookup(cur, agino, XFS_LOOKUP_LE, &has_record);
+	if (error)
+		return error;
+	if (!has_record)
+		goto no_record;
+
+	error = xfs_inobt_get_rec(cur, &irec, &has_record);
+	if (!has_record)
+		return -EFSCORRUPTED;
+
+	if (irec.ir_startino + XFS_INODES_PER_CHUNK <= agino)
+		goto no_record;
+
+	/* There's an inobt record; free and hole status must match. */
+	rec_idx = agino - irec.ir_startino;
+	free = irec.ir_free & (1ULL << rec_idx);
+	hole_idx = rec_idx / XFS_INODES_PER_HOLEMASK_BIT;
+	hole = irec.ir_holemask & (1U << hole_idx);
+
+	if (ffree != free)
+		xchk_btree_xref_set_corrupt(sc, cur, 0);
+	if (fhole != hole)
+		xchk_btree_xref_set_corrupt(sc, cur, 0);
+	return 0;
+
+no_record:
+	/* finobt should never have a record for which the inobt does not */
+	xchk_btree_xref_set_corrupt(sc, cur, 0);
+	return 0;
+}
+
+/*
+ * Make sure that each inode of this part of an finobt record has the same
+ * sparse and free status as the inobt.
+ */
+STATIC void
+xchk_finobt_chunk_xref_inobt(
+	struct xfs_scrub		*sc,
+	struct xfs_inobt_rec_incore	*frec,
+	xfs_agino_t			agino,
+	unsigned int			nr_inodes)
+{
+	xfs_agino_t			i;
+	unsigned int			rec_idx;
+	int				error;
+
+	ASSERT(sc->sm->sm_type == XFS_SCRUB_TYPE_FINOBT);
+
+	if (!sc->sa.ino_cur || xchk_skip_xref(sc->sm))
 		return;
-	if (((irec->ir_freecount > 0 && !has_irec) ||
-	     (irec->ir_freecount == 0 && has_irec)))
-		xchk_btree_xref_set_corrupt(sc, *pcur, 0);
+
+	for (i = agino, rec_idx = agino - frec->ir_startino;
+	     i < agino + nr_inodes;
+	     i++, rec_idx++) {
+		bool			ffree, fhole;
+		unsigned int		hole_idx;
+
+		ffree = frec->ir_free & (1ULL << rec_idx);
+		hole_idx = rec_idx / XFS_INODES_PER_HOLEMASK_BIT;
+		fhole = frec->ir_holemask & (1U << hole_idx);
+
+		error = xchk_finobt_xref_inobt(sc, frec, i, ffree, fhole);
+		if (!xchk_should_check_xref(sc, &error, &sc->sa.ino_cur))
+			return;
+	}
 }
 
 /* Is this chunk worth checking and cross-referencing? */
@@ -85,14 +254,16 @@ xchk_iallocbt_chunk(
 	struct xchk_btree		*bs,
 	struct xfs_inobt_rec_incore	*irec,
 	xfs_agino_t			agino,
-	xfs_extlen_t			len)
+	unsigned int			nr_inodes)
 {
 	struct xfs_scrub		*sc = bs->sc;
 	struct xfs_mount		*mp = bs->cur->bc_mp;
 	struct xfs_perag		*pag = bs->cur->bc_ag.pag;
 	xfs_agblock_t			agbno;
+	xfs_extlen_t			len;
 
 	agbno = XFS_AGINO_TO_AGBNO(mp, agino);
+	len = XFS_B_TO_FSB(mp, nr_inodes * mp->m_sb.sb_inodesize);
 
 	if (!xfs_verify_agbext(pag, agbno, len))
 		xchk_btree_set_corrupt(bs->sc, bs->cur, 0);
@@ -101,7 +272,10 @@ xchk_iallocbt_chunk(
 		return false;
 
 	xchk_xref_is_used_space(sc, agbno, len);
-	xchk_iallocbt_chunk_xref_other(sc, irec, agino);
+	if (sc->sm->sm_type == XFS_SCRUB_TYPE_INOBT)
+		xchk_inobt_chunk_xref_finobt(sc, irec, agino, nr_inodes);
+	else
+		xchk_finobt_chunk_xref_inobt(sc, irec, agino, nr_inodes);
 	xchk_xref_is_owned_by(sc, agbno, len, &XFS_RMAP_OINFO_INODES);
 	xchk_xref_is_not_shared(sc, agbno, len);
 	xchk_xref_is_not_cow_staging(sc, agbno, len);
@@ -406,7 +580,6 @@ xchk_iallocbt_rec(
 	struct xfs_inobt_rec_incore	irec;
 	uint64_t			holes;
 	xfs_agino_t			agino;
-	xfs_extlen_t			len;
 	int				holecount;
 	int				i;
 	int				error = 0;
@@ -428,12 +601,11 @@ xchk_iallocbt_rec(
 
 	/* Handle non-sparse inodes */
 	if (!xfs_inobt_issparse(irec.ir_holemask)) {
-		len = XFS_B_TO_FSB(mp,
-				XFS_INODES_PER_CHUNK * mp->m_sb.sb_inodesize);
 		if (irec.ir_count != XFS_INODES_PER_CHUNK)
 			xchk_btree_set_corrupt(bs->sc, bs->cur, 0);
 
-		if (!xchk_iallocbt_chunk(bs, &irec, agino, len))
+		if (!xchk_iallocbt_chunk(bs, &irec, agino,
+					XFS_INODES_PER_CHUNK))
 			goto out;
 		goto check_clusters;
 	}
@@ -441,8 +613,6 @@ xchk_iallocbt_rec(
 	/* Check each chunk of a sparse inode cluster. */
 	holemask = irec.ir_holemask;
 	holecount = 0;
-	len = XFS_B_TO_FSB(mp,
-			XFS_INODES_PER_HOLEMASK_BIT * mp->m_sb.sb_inodesize);
 	holes = ~xfs_inobt_irec_to_allocmask(&irec);
 	if ((holes & irec.ir_free) != holes ||
 	    irec.ir_freecount > irec.ir_count)
@@ -451,7 +621,8 @@ xchk_iallocbt_rec(
 	for (i = 0; i < XFS_INOBT_HOLEMASK_BITS; i++) {
 		if (holemask & 1)
 			holecount += XFS_INODES_PER_HOLEMASK_BIT;
-		else if (!xchk_iallocbt_chunk(bs, &irec, agino, len))
+		else if (!xchk_iallocbt_chunk(bs, &irec, agino,
+					XFS_INODES_PER_HOLEMASK_BIT))
 			goto out;
 		holemask >>= 1;
 		agino += XFS_INODES_PER_HOLEMASK_BIT;


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 4/4] xfs: convert xfs_ialloc_has_inodes_at_extent to return keyfill scan results
  2022-12-30 22:11 ` [PATCHSET v24.0 0/4] xfs: detect incorrect gaps in inode btree Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 2/4] xfs: clean up broken early-exit code in the inode btree scrubber Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 1/4] xfs: remove pointless shadow variable from xfs_difree_inobt Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 3/4] xfs: directly cross-reference the inode btrees with each other Darrick J. Wong
  3 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Convert the xfs_ialloc_has_inodes_at_extent function to return keyfill
scan results because, for a given range of inode numbers, we might have
no indexed inodes at all; the entire region might be covered by
allocated ondisk inodes; or there might be a mix of the two.

Unfortunately, sparse inodes add to the complexity, because each inode
record can have holes, which means that we cannot use the generic btree
_scan_keyfill function because we must look for holes in individual
records to decide the result.  On the plus side, online fsck can now
detect sub-chunk discrepancies in the inobt.
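
The classification at the end of the converted function reduces to
comparing the number of ondisk-backed inodes against the size of the
queried range.  A minimal standalone sketch of that decision (local
names, not the kernel enum):

/* Illustrative sketch of the keyfill outcome decision. */
enum recpacking { RECPACKING_EMPTY, RECPACKING_SPARSE, RECPACKING_FULL };

static enum recpacking
classify_inode_range(unsigned int allocated, unsigned int nr_inodes)
{
	if (allocated == 0)
		return RECPACKING_EMPTY;	/* no ondisk inodes at all */
	if (allocated == nr_inodes)
		return RECPACKING_FULL;		/* every inode backed ondisk */
	return RECPACKING_SPARSE;		/* a mix, from sparse chunks */
}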

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_ialloc.c |   82 +++++++++++++++++++++++++++-----------------
 fs/xfs/libxfs/xfs_ialloc.h |    5 +--
 fs/xfs/scrub/ialloc.c      |   17 +++++----
 3 files changed, 62 insertions(+), 42 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index aab83f17d1a5..d5de1eed97e2 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -2656,44 +2656,50 @@ xfs_ialloc_read_agi(
 	return 0;
 }
 
-/* Is there an inode record covering a given range of inode numbers? */
-int
-xfs_ialloc_has_inode_record(
-	struct xfs_btree_cur	*cur,
-	xfs_agino_t		low,
-	xfs_agino_t		high,
-	bool			*exists)
+/* How many inodes are backed by inode clusters ondisk? */
+STATIC int
+xfs_ialloc_count_ondisk(
+	struct xfs_btree_cur		*cur,
+	xfs_agino_t			low,
+	xfs_agino_t			high,
+	unsigned int			*allocated)
 {
 	struct xfs_inobt_rec_incore	irec;
-	xfs_agino_t		agino;
-	uint16_t		holemask;
-	int			has_record;
-	int			i;
-	int			error;
+	unsigned int			ret = 0;
+	int				has_record;
+	int				error;
 
-	*exists = false;
 	error = xfs_inobt_lookup(cur, low, XFS_LOOKUP_LE, &has_record);
-	while (error == 0 && has_record) {
+	if (error)
+		return error;
+
+	while (has_record) {
+		unsigned int		i, hole_idx;
+
 		error = xfs_inobt_get_rec(cur, &irec, &has_record);
-		if (error || irec.ir_startino > high)
+		if (error)
+			return error;
+		if (irec.ir_startino > high)
 			break;
 
-		agino = irec.ir_startino;
-		holemask = irec.ir_holemask;
-		for (i = 0; i < XFS_INOBT_HOLEMASK_BITS; holemask >>= 1,
-				i++, agino += XFS_INODES_PER_HOLEMASK_BIT) {
-			if (holemask & 1)
+		for (i = 0; i < XFS_INODES_PER_CHUNK; i++) {
+			if (irec.ir_startino + i < low)
 				continue;
-			if (agino + XFS_INODES_PER_HOLEMASK_BIT > low &&
-					agino <= high) {
-				*exists = true;
-				return 0;
-			}
+			if (irec.ir_startino + i > high)
+				break;
+
+			hole_idx = i / XFS_INODES_PER_HOLEMASK_BIT;
+			if (!(irec.ir_holemask & (1U << hole_idx)))
+				ret++;
 		}
 
 		error = xfs_btree_increment(cur, 0, &has_record);
+		if (error)
+			return error;
 	}
-	return error;
+
+	*allocated = ret;
+	return 0;
 }
 
 /* Is there an inode record covering a given extent? */
@@ -2702,15 +2708,27 @@ xfs_ialloc_has_inodes_at_extent(
 	struct xfs_btree_cur	*cur,
 	xfs_agblock_t		bno,
 	xfs_extlen_t		len,
-	bool			*exists)
+	enum xbtree_recpacking	*outcome)
 {
-	xfs_agino_t		low;
-	xfs_agino_t		high;
+	xfs_agino_t		agino;
+	xfs_agino_t		last_agino;
+	unsigned int		allocated;
+	int			error;
 
-	low = XFS_AGB_TO_AGINO(cur->bc_mp, bno);
-	high = XFS_AGB_TO_AGINO(cur->bc_mp, bno + len) - 1;
+	agino = XFS_AGB_TO_AGINO(cur->bc_mp, bno);
+	last_agino = XFS_AGB_TO_AGINO(cur->bc_mp, bno + len) - 1;
 
-	return xfs_ialloc_has_inode_record(cur, low, high, exists);
+	error = xfs_ialloc_count_ondisk(cur, agino, last_agino, &allocated);
+	if (error)
+		return error;
+
+	if (allocated == 0)
+		*outcome = XBTREE_RECPACKING_EMPTY;
+	else if (allocated == last_agino - agino + 1)
+		*outcome = XBTREE_RECPACKING_FULL;
+	else
+		*outcome = XBTREE_RECPACKING_SPARSE;
+	return 0;
 }
 
 struct xfs_ialloc_count_inodes {
diff --git a/fs/xfs/libxfs/xfs_ialloc.h b/fs/xfs/libxfs/xfs_ialloc.h
index fa67bb090c01..fa4d506086b9 100644
--- a/fs/xfs/libxfs/xfs_ialloc.h
+++ b/fs/xfs/libxfs/xfs_ialloc.h
@@ -95,9 +95,8 @@ void xfs_inobt_btrec_to_irec(struct xfs_mount *mp,
 xfs_failaddr_t xfs_inobt_check_irec(struct xfs_btree_cur *cur,
 		const struct xfs_inobt_rec_incore *irec);
 int xfs_ialloc_has_inodes_at_extent(struct xfs_btree_cur *cur,
-		xfs_agblock_t bno, xfs_extlen_t len, bool *exists);
-int xfs_ialloc_has_inode_record(struct xfs_btree_cur *cur, xfs_agino_t low,
-		xfs_agino_t high, bool *exists);
+		xfs_agblock_t bno, xfs_extlen_t len,
+		enum xbtree_recpacking *outcome);
 int xfs_ialloc_count_inodes(struct xfs_btree_cur *cur, xfs_agino_t *count,
 		xfs_agino_t *freecount);
 int xfs_inobt_insert_rec(struct xfs_btree_cur *cur, uint16_t holemask,
diff --git a/fs/xfs/scrub/ialloc.c b/fs/xfs/scrub/ialloc.c
index 65a6c01df235..598112471d07 100644
--- a/fs/xfs/scrub/ialloc.c
+++ b/fs/xfs/scrub/ialloc.c
@@ -765,18 +765,18 @@ xchk_xref_inode_check(
 	xfs_agblock_t		agbno,
 	xfs_extlen_t		len,
 	struct xfs_btree_cur	**icur,
-	bool			should_have_inodes)
+	enum xbtree_recpacking	expected)
 {
-	bool			has_inodes;
+	enum xbtree_recpacking	outcome;
 	int			error;
 
 	if (!(*icur) || xchk_skip_xref(sc->sm))
 		return;
 
-	error = xfs_ialloc_has_inodes_at_extent(*icur, agbno, len, &has_inodes);
+	error = xfs_ialloc_has_inodes_at_extent(*icur, agbno, len, &outcome);
 	if (!xchk_should_check_xref(sc, &error, icur))
 		return;
-	if (has_inodes != should_have_inodes)
+	if (outcome != expected)
 		xchk_btree_xref_set_corrupt(sc, *icur, 0);
 }
 
@@ -787,8 +787,10 @@ xchk_xref_is_not_inode_chunk(
 	xfs_agblock_t		agbno,
 	xfs_extlen_t		len)
 {
-	xchk_xref_inode_check(sc, agbno, len, &sc->sa.ino_cur, false);
-	xchk_xref_inode_check(sc, agbno, len, &sc->sa.fino_cur, false);
+	xchk_xref_inode_check(sc, agbno, len, &sc->sa.ino_cur,
+			XBTREE_RECPACKING_EMPTY);
+	xchk_xref_inode_check(sc, agbno, len, &sc->sa.fino_cur,
+			XBTREE_RECPACKING_EMPTY);
 }
 
 /* xref check that the extent is covered by inodes */
@@ -798,5 +800,6 @@ xchk_xref_is_inode_chunk(
 	xfs_agblock_t		agbno,
 	xfs_extlen_t		len)
 {
-	xchk_xref_inode_check(sc, agbno, len, &sc->sa.ino_cur, true);
+	xchk_xref_inode_check(sc, agbno, len, &sc->sa.ino_cur,
+			XBTREE_RECPACKING_FULL);
 }


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCHSET v24.0 0/2] xfs: detect incorrect gaps in rmap btree
  2022-12-30 21:13 [NYE DELUGE 1/4] xfs: all pending online scrub improvements Darrick J. Wong
                   ` (10 preceding siblings ...)
  2022-12-30 22:11 ` [PATCHSET v24.0 0/4] xfs: detect incorrect gaps in inode btree Darrick J. Wong
@ 2022-12-30 22:11 ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 1/2] xfs: teach scrub to check for sole ownership of metadata objects Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 2/2] xfs: ensure that single-owner file blocks are not owned by others Darrick J. Wong
  2022-12-30 22:11 ` [PATCHSET v24.0 0/4] xfs: fix iget/irele usage in online fsck Darrick J. Wong
                   ` (10 subsequent siblings)
  22 siblings, 2 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

Hi all,

Following in the theme of the last two patchsets, this one strengthens
the rmap btree record checking so that scrub can count the number of
space records that map to a given owner and that do not map to a given
owner.  This enables us to determine exclusive ownership of space that
can't be shared.
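
Roughly speaking (an illustrative model, not the patch code), the new
scan classifies every rmap record overlapping the queried extent into
one of three buckets, and exclusive ownership holds only when the
counts come out as shown below.

/* Illustrative model of the three counters the owner scan accumulates. */
#include <stdbool.h>

struct rmap_matches_model {
	unsigned long long	matches;	/* records with the expected owner */
	unsigned long long	nono_matches;	/* records with some other owner */
	unsigned long long	badno_matches;	/* other-owner records that conflict
						 * with the expected owner */
};

static inline bool
owns_exclusively(const struct rmap_matches_model *res)
{
	return res->matches == 1 && res->nono_matches == 0 &&
	       res->badno_matches == 0;
}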

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-rmapbt-gaps

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-detect-rmapbt-gaps
---
 fs/xfs/libxfs/xfs_rmap.c |  198 ++++++++++++++++++++++++++++++++--------------
 fs/xfs/libxfs/xfs_rmap.h |   18 +++-
 fs/xfs/scrub/agheader.c  |   10 +-
 fs/xfs/scrub/bmap.c      |   14 +++
 fs/xfs/scrub/btree.c     |    2 
 fs/xfs/scrub/ialloc.c    |    4 -
 fs/xfs/scrub/inode.c     |    2 
 fs/xfs/scrub/rmap.c      |   45 ++++++----
 fs/xfs/scrub/scrub.h     |    2 
 9 files changed, 198 insertions(+), 97 deletions(-)


^ permalink raw reply	[flat|nested] 220+ messages in thread

* [PATCH 1/2] xfs: teach scrub to check for sole ownership of metadata objects
  2022-12-30 22:11 ` [PATCHSET v24.0 0/2] xfs: detect incorrect gaps in rmap btree Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 2/2] xfs: ensure that single-owner file blocks are not owned by others Darrick J. Wong
  1 sibling, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Strengthen online scrub's checking even further by enabling us to check
that a range of blocks is owned solely by a given owner.
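
One detail worth calling out in the helper added below: each candidate
rmap record is trimmed to the queried block range before it is compared
against the expected owner.  A standalone sketch of just that trimming
step (types simplified, names mine):

/*
 * Illustrative sketch of trimming an rmap record to the [bno, bno + len)
 * query window before comparing it against the expected owner record.
 */
#include <stdint.h>

struct win_rec {
	uint32_t	startblock;
	uint32_t	blockcount;
	uint64_t	offset;		/* file offset, if this maps file data */
};

static void
trim_to_window(struct win_rec *check, uint32_t bno, uint32_t len,
	       int is_file_data)
{
	int64_t		delta;

	/* Clip the part that starts before the window. */
	delta = (int64_t)bno - check->startblock;
	if (delta > 0) {
		check->startblock += delta;
		check->blockcount -= delta;
		if (is_file_data)
			check->offset += delta;
	}

	/* Clip the part that extends past the window. */
	delta = ((int64_t)check->startblock + check->blockcount) -
		((int64_t)bno + len);
	if (delta > 0)
		check->blockcount -= delta;
}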

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_rmap.c |  198 ++++++++++++++++++++++++++++++++--------------
 fs/xfs/libxfs/xfs_rmap.h |   18 +++-
 fs/xfs/scrub/agheader.c  |   10 +-
 fs/xfs/scrub/btree.c     |    2 
 fs/xfs/scrub/ialloc.c    |    4 -
 fs/xfs/scrub/inode.c     |    2 
 fs/xfs/scrub/rmap.c      |   45 ++++++----
 fs/xfs/scrub/scrub.h     |    2 
 8 files changed, 185 insertions(+), 96 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_rmap.c b/fs/xfs/libxfs/xfs_rmap.c
index 308b81f321eb..a9f13d877822 100644
--- a/fs/xfs/libxfs/xfs_rmap.c
+++ b/fs/xfs/libxfs/xfs_rmap.c
@@ -2735,65 +2735,141 @@ xfs_rmap_has_records(
 	return xfs_btree_has_records(cur, &low, &high, &mask, outcome);
 }
 
-/*
- * Is there a record for this owner completely covering a given physical
- * extent?  If so, *has_rmap will be set to true.  If there is no record
- * or the record only covers part of the range, we set *has_rmap to false.
- * This function doesn't perform range lookups or offset checks, so it is
- * not suitable for checking data fork blocks.
- */
-int
-xfs_rmap_record_exists(
-	struct xfs_btree_cur		*cur,
-	xfs_agblock_t			bno,
-	xfs_extlen_t			len,
-	const struct xfs_owner_info	*oinfo,
-	bool				*has_rmap)
-{
-	uint64_t			owner;
-	uint64_t			offset;
-	unsigned int			flags;
-	int				has_record;
-	struct xfs_rmap_irec		irec;
-	int				error;
+struct xfs_rmap_ownercount {
+	/* Owner that we're looking for. */
+	struct xfs_rmap_irec	good;
 
-	xfs_owner_info_unpack(oinfo, &owner, &offset, &flags);
-	ASSERT(XFS_RMAP_NON_INODE_OWNER(owner) ||
-	       (flags & XFS_RMAP_BMBT_BLOCK));
+	/* rmap search keys */
+	struct xfs_rmap_irec	low;
+	struct xfs_rmap_irec	high;
 
-	error = xfs_rmap_lookup_le(cur, bno, owner, offset, flags, &irec,
-			&has_record);
-	if (error)
-		return error;
-	if (!has_record) {
-		*has_rmap = false;
-		return 0;
-	}
+	struct xfs_rmap_matches	*results;
 
-	*has_rmap = (irec.rm_owner == owner && irec.rm_startblock <= bno &&
-		     irec.rm_startblock + irec.rm_blockcount >= bno + len);
-	return 0;
-}
-
-struct xfs_rmap_key_state {
-	uint64_t			owner;
-	uint64_t			offset;
-	unsigned int			flags;
+	/* Stop early if we find a nonmatch? */
+	bool			stop_on_nonmatch;
 };
 
-/* For each rmap given, figure out if it doesn't match the key we want. */
+/* Does this rmap represent space that can have multiple owners? */
+static inline bool
+xfs_rmap_shareable(
+	struct xfs_mount		*mp,
+	const struct xfs_rmap_irec	*rmap)
+{
+	if (!xfs_has_reflink(mp))
+		return false;
+	if (XFS_RMAP_NON_INODE_OWNER(rmap->rm_owner))
+		return false;
+	if (rmap->rm_flags & (XFS_RMAP_ATTR_FORK |
+			      XFS_RMAP_BMBT_BLOCK))
+		return false;
+	return true;
+}
+
+static inline void
+xfs_rmap_ownercount_init(
+	struct xfs_rmap_ownercount	*roc,
+	xfs_agblock_t			bno,
+	xfs_extlen_t			len,
+	const struct xfs_owner_info	*oinfo,
+	struct xfs_rmap_matches		*results)
+{
+	memset(roc, 0, sizeof(*roc));
+	roc->results = results;
+
+	roc->low.rm_startblock = bno;
+	memset(&roc->high, 0xFF, sizeof(roc->high));
+	roc->high.rm_startblock = bno + len - 1;
+
+	memset(results, 0, sizeof(*results));
+	roc->good.rm_startblock = bno;
+	roc->good.rm_blockcount = len;
+	roc->good.rm_owner = oinfo->oi_owner;
+	roc->good.rm_offset = oinfo->oi_offset;
+	if (oinfo->oi_flags & XFS_OWNER_INFO_ATTR_FORK)
+		roc->good.rm_flags |= XFS_RMAP_ATTR_FORK;
+	if (oinfo->oi_flags & XFS_OWNER_INFO_BMBT_BLOCK)
+		roc->good.rm_flags |= XFS_RMAP_BMBT_BLOCK;
+}
+
+/* Figure out if this is a match for the owner. */
 STATIC int
-xfs_rmap_has_other_keys_helper(
+xfs_rmap_count_owners_helper(
 	struct xfs_btree_cur		*cur,
 	const struct xfs_rmap_irec	*rec,
 	void				*priv)
 {
-	struct xfs_rmap_key_state	*rks = priv;
+	struct xfs_rmap_ownercount	*roc = priv;
+	struct xfs_rmap_irec		check = *rec;
+	unsigned int			keyflags;
+	bool				filedata;
+	int64_t				delta;
 
-	if (rks->owner == rec->rm_owner && rks->offset == rec->rm_offset &&
-	    ((rks->flags & rec->rm_flags) & XFS_RMAP_KEY_FLAGS) == rks->flags)
-		return 0;
-	return -ECANCELED;
+	filedata = !XFS_RMAP_NON_INODE_OWNER(check.rm_owner) &&
+		   !(check.rm_flags & XFS_RMAP_BMBT_BLOCK);
+
+	/* Trim the part of check that comes before the comparison range. */
+	delta = (int64_t)roc->good.rm_startblock - check.rm_startblock;
+	if (delta > 0) {
+		check.rm_startblock += delta;
+		check.rm_blockcount -= delta;
+		if (filedata)
+			check.rm_offset += delta;
+	}
+
+	/* Trim the part of check that comes after the comparison range. */
+	delta = (check.rm_startblock + check.rm_blockcount) -
+		(roc->good.rm_startblock + roc->good.rm_blockcount);
+	if (delta > 0)
+		check.rm_blockcount -= delta;
+
+	/* Don't care about unwritten status for establishing ownership. */
+	keyflags = check.rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK);
+
+	if (check.rm_startblock	== roc->good.rm_startblock &&
+	    check.rm_blockcount	== roc->good.rm_blockcount &&
+	    check.rm_owner	== roc->good.rm_owner &&
+	    check.rm_offset	== roc->good.rm_offset &&
+	    keyflags		== roc->good.rm_flags) {
+		roc->results->matches++;
+	} else {
+		roc->results->nono_matches++;
+		if (xfs_rmap_shareable(cur->bc_mp, &roc->good) ^
+		    xfs_rmap_shareable(cur->bc_mp, &check))
+			roc->results->badno_matches++;
+	}
+
+	if (roc->results->nono_matches && roc->stop_on_nonmatch)
+		return -ECANCELED;
+
+	return 0;
+}
+
+/* Count the number of owners and non-owners of this range of blocks. */
+int
+xfs_rmap_count_owners(
+	struct xfs_btree_cur		*cur,
+	xfs_agblock_t			bno,
+	xfs_extlen_t			len,
+	const struct xfs_owner_info	*oinfo,
+	struct xfs_rmap_matches		*results)
+{
+	struct xfs_rmap_ownercount	roc;
+	int				error;
+
+	xfs_rmap_ownercount_init(&roc, bno, len, oinfo, results);
+	error = xfs_rmap_query_range(cur, &roc.low, &roc.high,
+			xfs_rmap_count_owners_helper, &roc);
+	if (error)
+		return error;
+
+	/*
+	 * There can't be any non-owner rmaps that conflict with the given
+	 * owner if we didn't find any rmaps matching the owner.
+	 */
+	if (!results->matches)
+		results->badno_matches = 0;
+
+	return 0;
 }
 
 /*
@@ -2806,28 +2882,26 @@ xfs_rmap_has_other_keys(
 	xfs_agblock_t			bno,
 	xfs_extlen_t			len,
 	const struct xfs_owner_info	*oinfo,
-	bool				*has_rmap)
+	bool				*has_other)
 {
-	struct xfs_rmap_irec		low = {0};
-	struct xfs_rmap_irec		high;
-	struct xfs_rmap_key_state	rks;
+	struct xfs_rmap_matches		res;
+	struct xfs_rmap_ownercount	roc;
 	int				error;
 
-	xfs_owner_info_unpack(oinfo, &rks.owner, &rks.offset, &rks.flags);
-	*has_rmap = false;
+	xfs_rmap_ownercount_init(&roc, bno, len, oinfo, &res);
+	roc.stop_on_nonmatch = true;
 
-	low.rm_startblock = bno;
-	memset(&high, 0xFF, sizeof(high));
-	high.rm_startblock = bno + len - 1;
-
-	error = xfs_rmap_query_range(cur, &low, &high,
-			xfs_rmap_has_other_keys_helper, &rks);
+	error = xfs_rmap_query_range(cur, &roc.low, &roc.high,
+			xfs_rmap_count_owners_helper, &roc);
 	if (error == -ECANCELED) {
-		*has_rmap = true;
+		*has_other = true;
 		return 0;
 	}
+	if (error)
+		return error;
 
-	return error;
+	*has_other = false;
+	return 0;
 }
 
 const struct xfs_owner_info XFS_RMAP_OINFO_SKIP_UPDATE = {
diff --git a/fs/xfs/libxfs/xfs_rmap.h b/fs/xfs/libxfs/xfs_rmap.h
index 4cbe50cf522e..ced605d69324 100644
--- a/fs/xfs/libxfs/xfs_rmap.h
+++ b/fs/xfs/libxfs/xfs_rmap.h
@@ -200,12 +200,24 @@ xfs_failaddr_t xfs_rmap_check_irec(struct xfs_btree_cur *cur,
 
 int xfs_rmap_has_records(struct xfs_btree_cur *cur, xfs_agblock_t bno,
 		xfs_extlen_t len, enum xbtree_recpacking *outcome);
-int xfs_rmap_record_exists(struct xfs_btree_cur *cur, xfs_agblock_t bno,
+
+struct xfs_rmap_matches {
+	/* Number of owner matches. */
+	unsigned long long	matches;
+
+	/* Number of non-owner matches. */
+	unsigned long long	nono_matches;
+
+	/* Number of non-owner matches that conflict with the owner matches. */
+	unsigned long long	badno_matches;
+};
+
+int xfs_rmap_count_owners(struct xfs_btree_cur *cur, xfs_agblock_t bno,
 		xfs_extlen_t len, const struct xfs_owner_info *oinfo,
-		bool *has_rmap);
+		struct xfs_rmap_matches *rmatch);
 int xfs_rmap_has_other_keys(struct xfs_btree_cur *cur, xfs_agblock_t bno,
 		xfs_extlen_t len, const struct xfs_owner_info *oinfo,
-		bool *has_rmap);
+		bool *has_other);
 int xfs_rmap_map_raw(struct xfs_btree_cur *cur, struct xfs_rmap_irec *rmap);
 
 extern const struct xfs_owner_info XFS_RMAP_OINFO_SKIP_UPDATE;
diff --git a/fs/xfs/scrub/agheader.c b/fs/xfs/scrub/agheader.c
index 520ec054e4a6..75de0ba4fcef 100644
--- a/fs/xfs/scrub/agheader.c
+++ b/fs/xfs/scrub/agheader.c
@@ -51,7 +51,7 @@ xchk_superblock_xref(
 
 	xchk_xref_is_used_space(sc, agbno, 1);
 	xchk_xref_is_not_inode_chunk(sc, agbno, 1);
-	xchk_xref_is_owned_by(sc, agbno, 1, &XFS_RMAP_OINFO_FS);
+	xchk_xref_is_only_owned_by(sc, agbno, 1, &XFS_RMAP_OINFO_FS);
 	xchk_xref_is_not_shared(sc, agbno, 1);
 	xchk_xref_is_not_cow_staging(sc, agbno, 1);
 
@@ -515,7 +515,7 @@ xchk_agf_xref(
 	xchk_agf_xref_freeblks(sc);
 	xchk_agf_xref_cntbt(sc);
 	xchk_xref_is_not_inode_chunk(sc, agbno, 1);
-	xchk_xref_is_owned_by(sc, agbno, 1, &XFS_RMAP_OINFO_FS);
+	xchk_xref_is_only_owned_by(sc, agbno, 1, &XFS_RMAP_OINFO_FS);
 	xchk_agf_xref_btreeblks(sc);
 	xchk_xref_is_not_shared(sc, agbno, 1);
 	xchk_xref_is_not_cow_staging(sc, agbno, 1);
@@ -644,7 +644,7 @@ xchk_agfl_block_xref(
 
 	xchk_xref_is_used_space(sc, agbno, 1);
 	xchk_xref_is_not_inode_chunk(sc, agbno, 1);
-	xchk_xref_is_owned_by(sc, agbno, 1, &XFS_RMAP_OINFO_AG);
+	xchk_xref_is_only_owned_by(sc, agbno, 1, &XFS_RMAP_OINFO_AG);
 	xchk_xref_is_not_shared(sc, agbno, 1);
 	xchk_xref_is_not_cow_staging(sc, agbno, 1);
 }
@@ -701,7 +701,7 @@ xchk_agfl_xref(
 
 	xchk_xref_is_used_space(sc, agbno, 1);
 	xchk_xref_is_not_inode_chunk(sc, agbno, 1);
-	xchk_xref_is_owned_by(sc, agbno, 1, &XFS_RMAP_OINFO_FS);
+	xchk_xref_is_only_owned_by(sc, agbno, 1, &XFS_RMAP_OINFO_FS);
 	xchk_xref_is_not_shared(sc, agbno, 1);
 	xchk_xref_is_not_cow_staging(sc, agbno, 1);
 
@@ -857,7 +857,7 @@ xchk_agi_xref(
 	xchk_xref_is_used_space(sc, agbno, 1);
 	xchk_xref_is_not_inode_chunk(sc, agbno, 1);
 	xchk_agi_xref_icounts(sc);
-	xchk_xref_is_owned_by(sc, agbno, 1, &XFS_RMAP_OINFO_FS);
+	xchk_xref_is_only_owned_by(sc, agbno, 1, &XFS_RMAP_OINFO_FS);
 	xchk_xref_is_not_shared(sc, agbno, 1);
 	xchk_xref_is_not_cow_staging(sc, agbno, 1);
 	xchk_agi_xref_fiblocks(sc);
diff --git a/fs/xfs/scrub/btree.c b/fs/xfs/scrub/btree.c
index 8ae42dff632f..24ea77e46ebd 100644
--- a/fs/xfs/scrub/btree.c
+++ b/fs/xfs/scrub/btree.c
@@ -402,7 +402,7 @@ xchk_btree_check_block_owner(
 	if (!bs->sc->sa.bno_cur && btnum == XFS_BTNUM_BNO)
 		bs->cur = NULL;
 
-	xchk_xref_is_owned_by(bs->sc, agbno, 1, bs->oinfo);
+	xchk_xref_is_only_owned_by(bs->sc, agbno, 1, bs->oinfo);
 	if (!bs->sc->sa.rmap_cur && btnum == XFS_BTNUM_RMAP)
 		bs->cur = NULL;
 
diff --git a/fs/xfs/scrub/ialloc.c b/fs/xfs/scrub/ialloc.c
index 598112471d07..f690143af0c0 100644
--- a/fs/xfs/scrub/ialloc.c
+++ b/fs/xfs/scrub/ialloc.c
@@ -276,7 +276,7 @@ xchk_iallocbt_chunk(
 		xchk_inobt_chunk_xref_finobt(sc, irec, agino, nr_inodes);
 	else
 		xchk_finobt_chunk_xref_inobt(sc, irec, agino, nr_inodes);
-	xchk_xref_is_owned_by(sc, agbno, len, &XFS_RMAP_OINFO_INODES);
+	xchk_xref_is_only_owned_by(sc, agbno, len, &XFS_RMAP_OINFO_INODES);
 	xchk_xref_is_not_shared(sc, agbno, len);
 	xchk_xref_is_not_cow_staging(sc, agbno, len);
 	return true;
@@ -428,7 +428,7 @@ xchk_iallocbt_check_cluster(
 		return 0;
 	}
 
-	xchk_xref_is_owned_by(bs->sc, agbno, M_IGEO(mp)->blocks_per_cluster,
+	xchk_xref_is_only_owned_by(bs->sc, agbno, M_IGEO(mp)->blocks_per_cluster,
 			&XFS_RMAP_OINFO_INODES);
 
 	/* Grab the inode cluster buffer. */
diff --git a/fs/xfs/scrub/inode.c b/fs/xfs/scrub/inode.c
index 95694eca3851..3b272c86d0ad 100644
--- a/fs/xfs/scrub/inode.c
+++ b/fs/xfs/scrub/inode.c
@@ -556,7 +556,7 @@ xchk_inode_xref(
 
 	xchk_xref_is_used_space(sc, agbno, 1);
 	xchk_inode_xref_finobt(sc, ino);
-	xchk_xref_is_owned_by(sc, agbno, 1, &XFS_RMAP_OINFO_INODES);
+	xchk_xref_is_only_owned_by(sc, agbno, 1, &XFS_RMAP_OINFO_INODES);
 	xchk_xref_is_not_shared(sc, agbno, 1);
 	xchk_xref_is_not_cow_staging(sc, agbno, 1);
 	xchk_inode_xref_bmap(sc, dip);
diff --git a/fs/xfs/scrub/rmap.c b/fs/xfs/scrub/rmap.c
index 9ac3bc760d6c..7b0ad8f846ab 100644
--- a/fs/xfs/scrub/rmap.c
+++ b/fs/xfs/scrub/rmap.c
@@ -167,38 +167,29 @@ xchk_rmapbt(
 			&XFS_RMAP_OINFO_AG, NULL);
 }
 
-/* xref check that the extent is owned by a given owner */
-static inline void
-xchk_xref_check_owner(
+/* xref check that the extent is owned only by a given owner */
+void
+xchk_xref_is_only_owned_by(
 	struct xfs_scrub		*sc,
 	xfs_agblock_t			bno,
 	xfs_extlen_t			len,
-	const struct xfs_owner_info	*oinfo,
-	bool				should_have_rmap)
+	const struct xfs_owner_info	*oinfo)
 {
-	bool				has_rmap;
+	struct xfs_rmap_matches		res;
 	int				error;
 
 	if (!sc->sa.rmap_cur || xchk_skip_xref(sc->sm))
 		return;
 
-	error = xfs_rmap_record_exists(sc->sa.rmap_cur, bno, len, oinfo,
-			&has_rmap);
+	error = xfs_rmap_count_owners(sc->sa.rmap_cur, bno, len, oinfo, &res);
 	if (!xchk_should_check_xref(sc, &error, &sc->sa.rmap_cur))
 		return;
-	if (has_rmap != should_have_rmap)
+	if (res.matches != 1)
+		xchk_btree_xref_set_corrupt(sc, sc->sa.rmap_cur, 0);
+	if (res.badno_matches)
+		xchk_btree_xref_set_corrupt(sc, sc->sa.rmap_cur, 0);
+	if (res.nono_matches)
 		xchk_btree_xref_set_corrupt(sc, sc->sa.rmap_cur, 0);
-}
-
-/* xref check that the extent is owned by a given owner */
-void
-xchk_xref_is_owned_by(
-	struct xfs_scrub		*sc,
-	xfs_agblock_t			bno,
-	xfs_extlen_t			len,
-	const struct xfs_owner_info	*oinfo)
-{
-	xchk_xref_check_owner(sc, bno, len, oinfo, true);
 }
 
 /* xref check that the extent is not owned by a given owner */
@@ -209,7 +200,19 @@ xchk_xref_is_not_owned_by(
 	xfs_extlen_t			len,
 	const struct xfs_owner_info	*oinfo)
 {
-	xchk_xref_check_owner(sc, bno, len, oinfo, false);
+	struct xfs_rmap_matches		res;
+	int				error;
+
+	if (!sc->sa.rmap_cur || xchk_skip_xref(sc->sm))
+		return;
+
+	error = xfs_rmap_count_owners(sc->sa.rmap_cur, bno, len, oinfo, &res);
+	if (!xchk_should_check_xref(sc, &error, &sc->sa.rmap_cur))
+		return;
+	if (res.matches != 0)
+		xchk_btree_xref_set_corrupt(sc, sc->sa.rmap_cur, 0);
+	if (res.badno_matches)
+		xchk_btree_xref_set_corrupt(sc, sc->sa.rmap_cur, 0);
 }
 
 /* xref check that the extent has no reverse mapping at all */
diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h
index a331838e22ff..20e74179d8a7 100644
--- a/fs/xfs/scrub/scrub.h
+++ b/fs/xfs/scrub/scrub.h
@@ -156,7 +156,7 @@ void xchk_xref_is_not_inode_chunk(struct xfs_scrub *sc, xfs_agblock_t agbno,
 		xfs_extlen_t len);
 void xchk_xref_is_inode_chunk(struct xfs_scrub *sc, xfs_agblock_t agbno,
 		xfs_extlen_t len);
-void xchk_xref_is_owned_by(struct xfs_scrub *sc, xfs_agblock_t agbno,
+void xchk_xref_is_only_owned_by(struct xfs_scrub *sc, xfs_agblock_t agbno,
 		xfs_extlen_t len, const struct xfs_owner_info *oinfo);
 void xchk_xref_is_not_owned_by(struct xfs_scrub *sc, xfs_agblock_t agbno,
 		xfs_extlen_t len, const struct xfs_owner_info *oinfo);


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 2/2] xfs: ensure that single-owner file blocks are not owned by others
  2022-12-30 22:11 ` [PATCHSET v24.0 0/2] xfs: detect incorrect gaps in rmap btree Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 1/2] xfs: teach scrub to check for sole ownership of metadata objects Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  1 sibling, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

For any file fork mapping that can only have a single owner, make sure
that there are no other rmap owners for that mapping.  This patch
requires the more detailed checking provided by xfs_rmap_count_owners so
that we can know how many rmap records for a given range of space had a
matching owner, how many had a non-matching owner, and how many
conflicted with the records that have a matching owner.
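
For reference, a rough sketch of how the counting result is meant to be
read.  The field names below are the ones this patchset uses; the structure
itself comes from the earlier rmap counting patch, so the types and the
xchk_is_sole_owner() helper here are illustrative only:

	struct xfs_rmap_matches {
		/* rmap records with a matching owner */
		unsigned long long	matches;
		/* non-owner records that conflict with the matching records */
		unsigned long long	badno_matches;
		/* rmap records with a non-matching owner */
		unsigned long long	nono_matches;
	};

	/*
	 * A mapping is singly owned iff exactly one record matches the
	 * owner and nothing else overlaps it; xchk_xref_is_only_owned_by
	 * flags corruption whenever any of these conditions fail.
	 */
	static inline bool
	xchk_is_sole_owner(const struct xfs_rmap_matches *res)
	{
		return res->matches == 1 && res->badno_matches == 0 &&
		       res->nono_matches == 0;
	}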

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/bmap.c |   14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/scrub/bmap.c b/fs/xfs/scrub/bmap.c
index abc2da0b1824..b195bc0e09a4 100644
--- a/fs/xfs/scrub/bmap.c
+++ b/fs/xfs/scrub/bmap.c
@@ -308,6 +308,7 @@ xchk_bmap_iextent_xref(
 	struct xchk_bmap_info	*info,
 	struct xfs_bmbt_irec	*irec)
 {
+	struct xfs_owner_info	oinfo;
 	struct xfs_mount	*mp = info->sc->mp;
 	xfs_agnumber_t		agno;
 	xfs_agblock_t		agbno;
@@ -328,19 +329,30 @@ xchk_bmap_iextent_xref(
 	xchk_bmap_xref_rmap(info, irec, agbno);
 	switch (info->whichfork) {
 	case XFS_DATA_FORK:
-		if (!xfs_is_reflink_inode(info->sc->ip))
+		if (!xfs_is_reflink_inode(info->sc->ip)) {
+			xfs_rmap_ino_owner(&oinfo, info->sc->ip->i_ino,
+					info->whichfork, irec->br_startoff);
+			xchk_xref_is_only_owned_by(info->sc, agbno,
+					irec->br_blockcount, &oinfo);
 			xchk_xref_is_not_shared(info->sc, agbno,
 					irec->br_blockcount);
+		}
 		xchk_xref_is_not_cow_staging(info->sc, agbno,
 				irec->br_blockcount);
 		break;
 	case XFS_ATTR_FORK:
+		xfs_rmap_ino_owner(&oinfo, info->sc->ip->i_ino,
+				info->whichfork, irec->br_startoff);
+		xchk_xref_is_only_owned_by(info->sc, agbno, irec->br_blockcount,
+				&oinfo);
 		xchk_xref_is_not_shared(info->sc, agbno,
 				irec->br_blockcount);
 		xchk_xref_is_not_cow_staging(info->sc, agbno,
 				irec->br_blockcount);
 		break;
 	case XFS_COW_FORK:
+		xchk_xref_is_only_owned_by(info->sc, agbno, irec->br_blockcount,
+				&XFS_RMAP_OINFO_COW);
 		xchk_xref_is_cow_staging(info->sc, agbno,
 				irec->br_blockcount);
 		xchk_xref_is_not_shared(info->sc, agbno,


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCHSET v24.0 0/4] xfs: fix iget/irele usage in online fsck
  2022-12-30 21:13 [NYE DELUGE 1/4] xfs: all pending online scrub improvements Darrick J. Wong
                   ` (11 preceding siblings ...)
  2022-12-30 22:11 ` [PATCHSET v24.0 0/2] xfs: detect incorrect gaps in rmap btree Darrick J. Wong
@ 2022-12-30 22:11 ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 2/4] xfs: fix an inode lookup race in xchk_get_inode Darrick J. Wong
                     ` (3 more replies)
  2022-12-30 22:11 ` [PATCHSET v24.0 0/3] xfs: fix iget usage in directory scrub Darrick J. Wong
                   ` (9 subsequent siblings)
  22 siblings, 4 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

Hi all,

This patchset fixes a handful of problems relating to how we get and
release incore inodes in the online scrub code.  The first patch fixes
how we handle DONTCACHE -- our reasons for setting (or clearing) it
depend entirely on the runtime environment at irele time.  Hence we can
refactor iget and irele to use our own wrappers that set that context
appropriately.

The second patch fixes a race between the iget call in the inode core
scrubber and other writer threads that are allocating or freeing inodes
in the same AG by changing the behavior of xchk_iget (and the inode core
scrub setup function) to return either an incore inode or the AGI buffer
so that we can be sure that the inode cannot disappear on us.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iget-fixes
---
 fs/xfs/scrub/bmap.c   |    2 
 fs/xfs/scrub/common.c |  267 +++++++++++++++++++++++++++++++++++++++++--------
 fs/xfs/scrub/common.h |   10 ++
 fs/xfs/scrub/dir.c    |    2 
 fs/xfs/scrub/inode.c  |  180 ++++++++++++++++++++++++++++-----
 fs/xfs/scrub/parent.c |    9 +-
 fs/xfs/scrub/scrub.c  |    2 
 fs/xfs/xfs_icache.c   |    3 -
 fs/xfs/xfs_icache.h   |   11 +-
 9 files changed, 398 insertions(+), 88 deletions(-)


^ permalink raw reply	[flat|nested] 220+ messages in thread

* [PATCH 1/4] xfs: manage inode DONTCACHE status at irele time
  2022-12-30 22:11 ` [PATCHSET v24.0 0/4] xfs: fix iget/irele usage in online fsck Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 2/4] xfs: fix an inode lookup race in xchk_get_inode Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 4/4] xfs: retain the AGI when we can't iget an inode to scrub the core Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 3/4] xfs: rename xchk_get_inode -> xchk_iget_for_scrubbing Darrick J. Wong
  3 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Right now, there are statements scattered all over the online fsck
codebase about how we can't use XFS_IGET_DONTCACHE because of concerns
about scrub's unusual practice of releasing inodes with transactions
held.

However, iget is the wrong place to handle this -- the DONTCACHE state
doesn't matter at all until we try to *release* the inode, and here we
get things wrong in multiple ways:

First, if we /do/ have a transaction, we must NOT drop the inode,
because the inode could have dirty pages; dropping the inode will
trigger writeback, and writeback can trigger a nested transaction.

Second, if the inode already had an active reference and the DONTCACHE
flag set, the icache hit when scrub grabs another ref will not clear
DONTCACHE.  This is sort of by design, since DONTCACHE is now used to
initiate cache drops so that sysadmins can change a file's access mode
between pagecache and DAX.

Third, if we do actually have the last active reference to the inode, we
can set DONTCACHE to avoid polluting the cache.  This is the /one/ case
where we actually want that flag.

Create an xchk_irele helper to encode all that logic and switch the
online fsck code to use it.  Since this now means that nearly all
scrubbers use the same xfs_iget flags, we can wrap them too.
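
Illustrative only -- with the wrappers in place, a scrubber's get/release
pairing looks roughly like this (sc, inum, and ip are assumed to be in
scope).  The point is that no caching decision is made at lookup time;
xchk_irele applies the DONTCACHE policy at release time instead:

	/* All scrub igets are untrusted; the wrapper adds the flag. */
	error = xchk_iget(sc, inum, &ip);
	if (error)
		return error;

	/* ...examine the inode... */

	/* DONTCACHE is decided here, based on transaction and refcount. */
	xchk_irele(sc, ip);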

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/common.c |   52 +++++++++++++++++++++++++++++++++++++++++++++----
 fs/xfs/scrub/common.h |    3 +++
 fs/xfs/scrub/dir.c    |    2 +-
 fs/xfs/scrub/parent.c |    9 ++++----
 fs/xfs/scrub/scrub.c  |    2 +-
 5 files changed, 57 insertions(+), 11 deletions(-)


diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index b21d675dd158..28c43d9f1c56 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -710,6 +710,16 @@ xchk_checkpoint_log(
 	return 0;
 }
 
+/* Verify that an inode is allocated ondisk, then return its cached inode. */
+int
+xchk_iget(
+	struct xfs_scrub	*sc,
+	xfs_ino_t		inum,
+	struct xfs_inode	**ipp)
+{
+	return xfs_iget(sc->mp, sc->tp, inum, XFS_IGET_UNTRUSTED, 0, ipp);
+}
+
 /*
  * Given an inode and the scrub control structure, grab either the
  * inode referenced in the control structure or the inode passed in.
@@ -734,8 +744,7 @@ xchk_get_inode(
 	/* Look up the inode, see if the generation number matches. */
 	if (xfs_internal_inum(mp, sc->sm->sm_ino))
 		return -ENOENT;
-	error = xfs_iget(mp, NULL, sc->sm->sm_ino,
-			XFS_IGET_UNTRUSTED | XFS_IGET_DONTCACHE, 0, &ip);
+	error = xchk_iget(sc, sc->sm->sm_ino, &ip);
 	switch (error) {
 	case -ENOENT:
 		/* Inode doesn't exist, just bail out. */
@@ -757,7 +766,7 @@ xchk_get_inode(
 		 * that it no longer exists.
 		 */
 		error = xfs_imap(sc->mp, sc->tp, sc->sm->sm_ino, &imap,
-				XFS_IGET_UNTRUSTED | XFS_IGET_DONTCACHE);
+				XFS_IGET_UNTRUSTED);
 		if (error)
 			return -ENOENT;
 		error = -EFSCORRUPTED;
@@ -770,7 +779,7 @@ xchk_get_inode(
 		return error;
 	}
 	if (VFS_I(ip)->i_generation != sc->sm->sm_gen) {
-		xfs_irele(ip);
+		xchk_irele(sc, ip);
 		return -ENOENT;
 	}
 
@@ -778,6 +787,41 @@ xchk_get_inode(
 	return 0;
 }
 
+/* Release an inode, possibly dropping it in the process. */
+void
+xchk_irele(
+	struct xfs_scrub	*sc,
+	struct xfs_inode	*ip)
+{
+	if (current->journal_info != NULL) {
+		ASSERT(current->journal_info == sc->tp);
+
+		/*
+		 * If we are in a transaction, we /cannot/ drop the inode
+		 * ourselves, because the VFS will trigger writeback, which
+		 * can require a transaction.  Clear DONTCACHE to force the
+		 * inode to the LRU, where someone else can take care of
+		 * dropping it.
+		 *
+		 * Note that when we grabbed our reference to the inode, it
+		 * could have had an active ref and DONTCACHE set if a sysadmin
+		 * is trying to coerce a change in file access mode.  icache
+		 * hits do not clear DONTCACHE, so we must do it here.
+		 */
+		spin_lock(&VFS_I(ip)->i_lock);
+		VFS_I(ip)->i_state &= ~I_DONTCACHE;
+		spin_unlock(&VFS_I(ip)->i_lock);
+	} else if (atomic_read(&VFS_I(ip)->i_count) == 1) {
+		/*
+		 * If this is the last reference to the inode and the caller
+		 * permits it, set DONTCACHE to avoid thrashing.
+		 */
+		d_mark_dontcache(VFS_I(ip));
+	}
+
+	xfs_irele(ip);
+}
+
 /* Set us up to scrub a file's contents. */
 int
 xchk_setup_inode_contents(
diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
index 0efe6b947d88..7472c41d9cfe 100644
--- a/fs/xfs/scrub/common.h
+++ b/fs/xfs/scrub/common.h
@@ -137,6 +137,9 @@ int xchk_get_inode(struct xfs_scrub *sc);
 int xchk_setup_inode_contents(struct xfs_scrub *sc, unsigned int resblks);
 void xchk_buffer_recheck(struct xfs_scrub *sc, struct xfs_buf *bp);
 
+int xchk_iget(struct xfs_scrub *sc, xfs_ino_t inum, struct xfs_inode **ipp);
+void xchk_irele(struct xfs_scrub *sc, struct xfs_inode *ip);
+
 /*
  * Don't bother cross-referencing if we already found corruption or cross
  * referencing discrepancies.
diff --git a/fs/xfs/scrub/dir.c b/fs/xfs/scrub/dir.c
index d1b0f23c2c59..677b21c3c865 100644
--- a/fs/xfs/scrub/dir.c
+++ b/fs/xfs/scrub/dir.c
@@ -86,7 +86,7 @@ xchk_dir_check_ftype(
 			xfs_mode_to_ftype(VFS_I(ip)->i_mode));
 	if (ino_dtype != dtype)
 		xchk_fblock_set_corrupt(sdc->sc, XFS_DATA_FORK, offset);
-	xfs_irele(ip);
+	xchk_irele(sdc->sc, ip);
 out:
 	return error;
 }
diff --git a/fs/xfs/scrub/parent.c b/fs/xfs/scrub/parent.c
index d8dff3fd8053..2696bb49324a 100644
--- a/fs/xfs/scrub/parent.c
+++ b/fs/xfs/scrub/parent.c
@@ -131,7 +131,6 @@ xchk_parent_validate(
 	xfs_ino_t		dnum,
 	bool			*try_again)
 {
-	struct xfs_mount	*mp = sc->mp;
 	struct xfs_inode	*dp = NULL;
 	xfs_nlink_t		expected_nlink;
 	xfs_nlink_t		nlink;
@@ -168,7 +167,7 @@ xchk_parent_validate(
 	 * -EFSCORRUPTED or -EFSBADCRC then the parent is corrupt which is a
 	 *  cross referencing error.  Any other error is an operational error.
 	 */
-	error = xfs_iget(mp, sc->tp, dnum, XFS_IGET_UNTRUSTED, 0, &dp);
+	error = xchk_iget(sc, dnum, &dp);
 	if (error == -EINVAL || error == -ENOENT) {
 		error = -EFSCORRUPTED;
 		xchk_fblock_process_error(sc, XFS_DATA_FORK, 0, &error);
@@ -236,11 +235,11 @@ xchk_parent_validate(
 
 	/* Drat, parent changed.  Try again! */
 	if (dnum != dp->i_ino) {
-		xfs_irele(dp);
+		xchk_irele(sc, dp);
 		*try_again = true;
 		return 0;
 	}
-	xfs_irele(dp);
+	xchk_irele(sc, dp);
 
 	/*
 	 * '..' didn't change, so check that there was only one entry
@@ -253,7 +252,7 @@ xchk_parent_validate(
 out_unlock:
 	xfs_iunlock(dp, XFS_IOLOCK_SHARED);
 out_rele:
-	xfs_irele(dp);
+	xchk_irele(sc, dp);
 out:
 	return error;
 }
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 7a3557a69fe0..bc9638c7a379 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -181,7 +181,7 @@ xchk_teardown(
 			xfs_iunlock(sc->ip, sc->ilock_flags);
 		if (sc->ip != ip_in &&
 		    !xfs_internal_inum(sc->mp, sc->ip->i_ino))
-			xfs_irele(sc->ip);
+			xchk_irele(sc, sc->ip);
 		sc->ip = NULL;
 	}
 	if (sc->sm->sm_flags & XFS_SCRUB_IFLAG_REPAIR)


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 2/4] xfs: fix an inode lookup race in xchk_get_inode
  2022-12-30 22:11 ` [PATCHSET v24.0 0/4] xfs: fix iget/irele usage in online fsck Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 1/4] xfs: manage inode DONTCACHE status at irele time Darrick J. Wong
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

In commit d658e72b4a09, we tried to improve the robustness of xchk_get_inode in
the face of EINVAL returns from iget by calling xfs_imap to see if the
inobt itself thinks that the inode is allocated.  Unfortunately, that
commit didn't consider the possibility that the inode gets allocated
after iget but before imap.  In this case, the imap call will succeed,
but we turn that into a corruption error and tell userspace the inode is
corrupt.

Avoid this false corruption report by grabbing the AGI header and
retrying the iget before calling imap.  If the iget succeeds, we can
proceed with the usual scrub-by-handle code.  Fix all the incorrect
comments too, since unreadable/corrupt inodes no longer result in EINVAL
returns.

Fixes: d658e72b4a09 ("xfs: distinguish between corrupt inode and invalid inum in xfs_scrub_get_inode")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/common.c |  207 ++++++++++++++++++++++++++++++++++++++++---------
 fs/xfs/scrub/common.h |    4 +
 fs/xfs/xfs_icache.c   |    3 -
 fs/xfs/xfs_icache.h   |   11 ++-
 4 files changed, 182 insertions(+), 43 deletions(-)


diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index 28c43d9f1c56..70ee293bc58f 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -635,6 +635,14 @@ xchk_ag_init(
 
 /* Per-scrubber setup functions */
 
+void
+xchk_trans_cancel(
+	struct xfs_scrub	*sc)
+{
+	xfs_trans_cancel(sc->tp);
+	sc->tp = NULL;
+}
+
 /*
  * Grab an empty transaction so that we can re-grab locked buffers if
  * one of our btrees turns out to be cyclic.
@@ -720,6 +728,84 @@ xchk_iget(
 	return xfs_iget(sc->mp, sc->tp, inum, XFS_IGET_UNTRUSTED, 0, ipp);
 }
 
+/*
+ * Try to grab an inode in a manner that avoids races with physical inode
+ * allocation.  If we can't, return the locked AGI buffer so that the caller
+ * can single-step the loading process to see where things went wrong.
+ *
+ * If the iget succeeds, return 0, a NULL AGI, and the inode.
+ *
+ * If the iget fails, return the error, the locked AGI, and a NULL inode.  This
+ * can include -EINVAL and -ENOENT for invalid inode numbers or inodes that are
+ * no longer allocated; or any other corruption or runtime error.
+ *
+ * If the AGI read fails, return the error, a NULL AGI, and NULL inode.
+ *
+ * If a fatal signal is pending, return -EINTR, a NULL AGI, and a NULL inode.
+ */
+int
+xchk_iget_agi(
+	struct xfs_scrub	*sc,
+	xfs_ino_t		inum,
+	struct xfs_buf		**agi_bpp,
+	struct xfs_inode	**ipp)
+{
+	struct xfs_mount	*mp = sc->mp;
+	struct xfs_trans	*tp = sc->tp;
+	struct xfs_perag	*pag;
+	int			error;
+
+again:
+	*agi_bpp = NULL;
+	*ipp = NULL;
+	error = 0;
+
+	if (xchk_should_terminate(sc, &error))
+		return error;
+
+	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, inum));
+	error = xfs_ialloc_read_agi(pag, tp, agi_bpp);
+	xfs_perag_put(pag);
+	if (error)
+		return error;
+
+	error = xfs_iget(mp, tp, inum,
+			XFS_IGET_NORETRY | XFS_IGET_UNTRUSTED, 0, ipp);
+	if (error == -EAGAIN) {
+		/*
+		 * The inode may be in core but temporarily unavailable and may
+		 * require the AGI buffer before it can be returned.  Drop the
+		 * AGI buffer and retry the lookup.
+		 */
+		xfs_trans_brelse(tp, *agi_bpp);
+		delay(1);
+		goto again;
+	}
+	if (error)
+		return error;
+
+	/* We got the inode, so we can release the AGI. */
+	ASSERT(*ipp != NULL);
+	xfs_trans_brelse(tp, *agi_bpp);
+	*agi_bpp = NULL;
+	return 0;
+}
+
+/* Install an inode that we opened by handle for scrubbing. */
+static int
+xchk_install_handle_inode(
+	struct xfs_scrub	*sc,
+	struct xfs_inode	*ip)
+{
+	if (VFS_I(ip)->i_generation != sc->sm->sm_gen) {
+		xchk_irele(sc, ip);
+		return -ENOENT;
+	}
+
+	sc->ip = ip;
+	return 0;
+}
+
 /*
  * Given an inode and the scrub control structure, grab either the
  * inode referenced in the control structure or the inode passed in.
@@ -731,60 +817,105 @@ xchk_get_inode(
 {
 	struct xfs_imap		imap;
 	struct xfs_mount	*mp = sc->mp;
+	struct xfs_buf		*agi_bp;
 	struct xfs_inode	*ip_in = XFS_I(file_inode(sc->file));
 	struct xfs_inode	*ip = NULL;
+	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, sc->sm->sm_ino);
 	int			error;
 
+	ASSERT(sc->tp == NULL);
+
 	/* We want to scan the inode we already had opened. */
 	if (sc->sm->sm_ino == 0 || sc->sm->sm_ino == ip_in->i_ino) {
 		sc->ip = ip_in;
 		return 0;
 	}
 
-	/* Look up the inode, see if the generation number matches. */
+	/* Reject internal metadata files and obviously bad inode numbers. */
 	if (xfs_internal_inum(mp, sc->sm->sm_ino))
 		return -ENOENT;
+	if (!xfs_verify_ino(sc->mp, sc->sm->sm_ino))
+		return -ENOENT;
+
+	/* Try a regular untrusted iget. */
 	error = xchk_iget(sc, sc->sm->sm_ino, &ip);
-	switch (error) {
-	case -ENOENT:
-		/* Inode doesn't exist, just bail out. */
+	if (!error)
+		return xchk_install_handle_inode(sc, ip);
+	if (error == -ENOENT)
 		return error;
-	case 0:
-		/* Got an inode, continue. */
-		break;
-	case -EINVAL:
-		/*
-		 * -EINVAL with IGET_UNTRUSTED could mean one of several
-		 * things: userspace gave us an inode number that doesn't
-		 * correspond to fs space, or doesn't have an inobt entry;
-		 * or it could simply mean that the inode buffer failed the
-		 * read verifiers.
-		 *
-		 * Try just the inode mapping lookup -- if it succeeds, then
-		 * the inode buffer verifier failed and something needs fixing.
-		 * Otherwise, we really couldn't find it so tell userspace
-		 * that it no longer exists.
-		 */
-		error = xfs_imap(sc->mp, sc->tp, sc->sm->sm_ino, &imap,
-				XFS_IGET_UNTRUSTED);
-		if (error)
-			return -ENOENT;
+	if (error != -EINVAL)
+		goto out_error;
+
+	/*
+	 * EINVAL with IGET_UNTRUSTED probably means one of several things:
+	 * userspace gave us an inode number that doesn't correspond to fs
+	 * space; the inode btree lacks a record for this inode; or there is a
+	 * record, and it says this inode is free.
+	 *
+	 * We want to look up this inode in the inobt to distinguish two
+	 * scenarios: (1) the inobt says the inode is free, in which case
+	 * there's nothing to do; and (2) the inobt says the inode is
+	 * allocated, but loading it failed due to corruption.
+	 *
+	 * Allocate a transaction and grab the AGI to prevent inobt activity
+	 * in this AG.  Retry the iget in case someone allocated a new inode
+	 * after the first iget failed.
+	 */
+	error = xchk_trans_alloc(sc, 0);
+	if (error)
+		goto out_error;
+
+	error = xchk_iget_agi(sc, sc->sm->sm_ino, &agi_bp, &ip);
+	if (error == 0) {
+		/* Actually got the inode, so install it. */
+		xchk_trans_cancel(sc);
+		return xchk_install_handle_inode(sc, ip);
+	}
+	if (error == -ENOENT)
+		goto out_gone;
+	if (error != -EINVAL)
+		goto out_cancel;
+
+	/* Ensure that we have protected against inode allocation/freeing. */
+	if (agi_bp == NULL) {
+		ASSERT(agi_bp != NULL);
+		error = -ECANCELED;
+		goto out_cancel;
+	}
+
+	/*
+	 * Untrusted iget failed a second time.  Let's try an inobt lookup.
+	 * If the inobt thinks this inode can neither exist inside the
+	 * filesystem nor be allocated, return ENOENT to signal that the check
+	 * can be skipped.
+	 *
+	 * If the lookup returns corruption, we'll mark this inode corrupt and
+	 * exit to userspace.  There's little chance of fixing anything until
+	 * the inobt is straightened out, but there's nothing we can do here.
+	 *
+	 * If the lookup encounters any other error, exit to userspace.
+	 *
+	 * If the lookup succeeds, something else must be very wrong in the fs
+	 * such that setting up the incore inode failed in some strange way.
+	 * Treat those as corruptions.
+	 */
+	error = xfs_imap(sc->mp, sc->tp, sc->sm->sm_ino, &imap,
+			XFS_IGET_UNTRUSTED);
+	if (error == -EINVAL || error == -ENOENT)
+		goto out_gone;
+	if (!error)
 		error = -EFSCORRUPTED;
-		fallthrough;
-	default:
-		trace_xchk_op_error(sc,
-				XFS_INO_TO_AGNO(mp, sc->sm->sm_ino),
-				XFS_INO_TO_AGBNO(mp, sc->sm->sm_ino),
-				error, __return_address);
-		return error;
-	}
-	if (VFS_I(ip)->i_generation != sc->sm->sm_gen) {
-		xchk_irele(sc, ip);
-		return -ENOENT;
-	}
 
-	sc->ip = ip;
-	return 0;
+out_cancel:
+	xchk_trans_cancel(sc);
+out_error:
+	trace_xchk_op_error(sc, agno, XFS_INO_TO_AGBNO(mp, sc->sm->sm_ino),
+			error, __return_address);
+	return error;
+out_gone:
+	/* The file is gone, so there's nothing to check. */
+	xchk_trans_cancel(sc);
+	return -ENOENT;
 }
 
 /* Release an inode, possibly dropping it in the process. */
diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
index 7472c41d9cfe..6a7fe2596841 100644
--- a/fs/xfs/scrub/common.h
+++ b/fs/xfs/scrub/common.h
@@ -32,6 +32,8 @@ xchk_should_terminate(
 }
 
 int xchk_trans_alloc(struct xfs_scrub *sc, uint resblks);
+void xchk_trans_cancel(struct xfs_scrub *sc);
+
 bool xchk_process_error(struct xfs_scrub *sc, xfs_agnumber_t agno,
 		xfs_agblock_t bno, int *error);
 bool xchk_fblock_process_error(struct xfs_scrub *sc, int whichfork,
@@ -138,6 +140,8 @@ int xchk_setup_inode_contents(struct xfs_scrub *sc, unsigned int resblks);
 void xchk_buffer_recheck(struct xfs_scrub *sc, struct xfs_buf *bp);
 
 int xchk_iget(struct xfs_scrub *sc, xfs_ino_t inum, struct xfs_inode **ipp);
+int xchk_iget_agi(struct xfs_scrub *sc, xfs_ino_t inum,
+		struct xfs_buf **agi_bpp, struct xfs_inode **ipp);
 void xchk_irele(struct xfs_scrub *sc, struct xfs_inode *ip);
 
 /*
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index ddeaccc04aec..0d58d7b0d8ac 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -767,7 +767,8 @@ xfs_iget(
 	return 0;
 
 out_error_or_again:
-	if (!(flags & XFS_IGET_INCORE) && error == -EAGAIN) {
+	if (!(flags & (XFS_IGET_INCORE | XFS_IGET_NORETRY)) &&
+	    error == -EAGAIN) {
 		delay(1);
 		goto again;
 	}
diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
index 6cd180721659..87910191a9dd 100644
--- a/fs/xfs/xfs_icache.h
+++ b/fs/xfs/xfs_icache.h
@@ -34,10 +34,13 @@ struct xfs_icwalk {
 /*
  * Flags for xfs_iget()
  */
-#define XFS_IGET_CREATE		0x1
-#define XFS_IGET_UNTRUSTED	0x2
-#define XFS_IGET_DONTCACHE	0x4
-#define XFS_IGET_INCORE		0x8	/* don't read from disk or reinit */
+#define XFS_IGET_CREATE		(1U << 0)
+#define XFS_IGET_UNTRUSTED	(1U << 1)
+#define XFS_IGET_DONTCACHE	(1U << 2)
+/* don't read from disk or reinit */
+#define XFS_IGET_INCORE		(1U << 3)
+/* Return -EAGAIN immediately if the inode is unavailable. */
+#define XFS_IGET_NORETRY	(1U << 4)
 
 int xfs_iget(struct xfs_mount *mp, struct xfs_trans *tp, xfs_ino_t ino,
 	     uint flags, uint lock_flags, xfs_inode_t **ipp);


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 3/4] xfs: rename xchk_get_inode -> xchk_iget_for_scrubbing
  2022-12-30 22:11 ` [PATCHSET v24.0 0/4] xfs: fix iget/irele usage in online fsck Darrick J. Wong
                     ` (2 preceding siblings ...)
  2022-12-30 22:11   ` [PATCH 4/4] xfs: retain the AGI when we can't iget an inode to scrub the core Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  3 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Dave Chinner suggested renaming this function to make it more obvious what
it does.  The function returns an incore inode to callers that want to
scrub a metadata structure that hangs off an inode.  If the iget fails
with EINVAL, it will single-step the loading process to distinguish
between actually free inodes or impossible inumbers (ENOENT);
discrepancies between the inobt freemask and the free status in the
inode record (EFSCORRUPTED).  Any other negative errno is returned
unchanged.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/bmap.c   |    2 +-
 fs/xfs/scrub/common.c |   12 +++++++-----
 fs/xfs/scrub/common.h |    2 +-
 fs/xfs/scrub/inode.c  |    2 +-
 4 files changed, 10 insertions(+), 8 deletions(-)


diff --git a/fs/xfs/scrub/bmap.c b/fs/xfs/scrub/bmap.c
index b195bc0e09a4..fe13da54e133 100644
--- a/fs/xfs/scrub/bmap.c
+++ b/fs/xfs/scrub/bmap.c
@@ -34,7 +34,7 @@ xchk_setup_inode_bmap(
 	if (xchk_need_fshook_drain(sc))
 		xchk_fshooks_enable(sc, XCHK_FSHOOKS_DRAIN);
 
-	error = xchk_get_inode(sc);
+	error = xchk_iget_for_scrubbing(sc);
 	if (error)
 		goto out;
 
diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index 70ee293bc58f..90f53f415d99 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -807,12 +807,14 @@ xchk_install_handle_inode(
 }
 
 /*
- * Given an inode and the scrub control structure, grab either the
- * inode referenced in the control structure or the inode passed in.
- * The inode is not locked.
+ * In preparation to scrub metadata structures that hang off of an inode,
+ * grab either the inode referenced in the scrub control structure or the
+ * inode passed in.  If the inumber does not reference an allocated inode
+ * record, the function returns ENOENT to end the scrub early.  The inode
+ * is not locked.
  */
 int
-xchk_get_inode(
+xchk_iget_for_scrubbing(
 	struct xfs_scrub	*sc)
 {
 	struct xfs_imap		imap;
@@ -961,7 +963,7 @@ xchk_setup_inode_contents(
 {
 	int			error;
 
-	error = xchk_get_inode(sc);
+	error = xchk_iget_for_scrubbing(sc);
 	if (error)
 		return error;
 
diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
index 6a7fe2596841..5ef27e6bdac6 100644
--- a/fs/xfs/scrub/common.h
+++ b/fs/xfs/scrub/common.h
@@ -135,7 +135,7 @@ int xchk_count_rmap_ownedby_ag(struct xfs_scrub *sc, struct xfs_btree_cur *cur,
 		const struct xfs_owner_info *oinfo, xfs_filblks_t *blocks);
 
 int xchk_setup_ag_btree(struct xfs_scrub *sc, bool force_log);
-int xchk_get_inode(struct xfs_scrub *sc);
+int xchk_iget_for_scrubbing(struct xfs_scrub *sc);
 int xchk_setup_inode_contents(struct xfs_scrub *sc, unsigned int resblks);
 void xchk_buffer_recheck(struct xfs_scrub *sc, struct xfs_buf *bp);
 
diff --git a/fs/xfs/scrub/inode.c b/fs/xfs/scrub/inode.c
index 3b272c86d0ad..39ac7cc09fbd 100644
--- a/fs/xfs/scrub/inode.c
+++ b/fs/xfs/scrub/inode.c
@@ -39,7 +39,7 @@ xchk_setup_inode(
 	 * Try to get the inode.  If the verifiers fail, we try again
 	 * in raw mode.
 	 */
-	error = xchk_get_inode(sc);
+	error = xchk_iget_for_scrubbing(sc);
 	switch (error) {
 	case 0:
 		break;


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 4/4] xfs: retain the AGI when we can't iget an inode to scrub the core
  2022-12-30 22:11 ` [PATCHSET v24.0 0/4] xfs: fix iget/irele usage in online fsck Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 2/4] xfs: fix an inode lookup race in xchk_get_inode Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 1/4] xfs: manage inode DONTCACHE status at irele time Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 3/4] xfs: rename xchk_get_inode -> xchk_iget_for_scrubbing Darrick J. Wong
  3 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

xchk_get_inode is not quite the right function to be calling from the
inode scrubber setup function.  The common get_inode function either
gets an inode and installs it in the scrub context, or it returns an
error code explaining what happened.  This is acceptable for most file
scrubbers because it is not in their scope to fix corruptions in the
inode core and fork areas that cause iget to fail.

Dealing with these problems is within the scope of the inode scrubber,
however.  If iget fails with EFSCORRUPTED, we need xchk_inode to flag
that as corruption.  Since we can't get our hands on an incore inode, we
need to hold the AGI to prevent inode allocation activity so that
nothing changes in the inode metadata.

Looking ahead to the inode core repair patches, we will also need to
hold the AGI buffer into xrep_inode so that we can make modifications to
the xfs_dinode structure without any other thread swooping in to
allocate or free the inode.

Adapt xchk_get_inode into xchk_setup_inode since this is a one-off
use case where the error codes we check for are a little different, and
the return state is much different from the common function.

xchk_setup_inode prepares to check or repair an inode record, so it must
continue the scrub operation even if the inode/inobt verifiers cause
xfs_iget to return EFSCORRUPTED.  This is done by attaching the locked
AGI buffer to the scrub transaction and returning 0 to move on to the
actual scrub.  (Later, the online inode repair code will also want the
xfs_imap structure so that it can reset the ondisk xfs_dinode
structure.)

xchk_get_inode retrieves an inode on behalf of a scrubber that operates
on an incore inode -- data/attr/cow forks, directories, xattrs,
symlinks, parent pointers, etc.  If the inode/inobt verifiers fail and
xfs_iget returns EFSCORRUPTED, we want to exit to userspace (because the
caller should fix the inode first) and drop everything we acquired
along the way.

A behavior common to both functions is that it's possible that xfs_scrub
asked for a scrub-by-handle concurrently with the inode being freed, or that
the passed-in inumber is invalid.  In this case, we call xfs_imap to see if
the inobt index thinks the inode is allocated, and return ENOENT
("nothing to check here") to userspace if this is not the case.  The
imap lookup is why both functions call xchk_iget_agi.
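
For context, "scrub-by-handle" means a userspace request along these
lines.  This is only a rough sketch: fd, ino, gen, and report() are
stand-ins, not part of this series; the ioctl, structure, and flag names
are the existing scrub ABI:

	struct xfs_scrub_metadata	sm = {
		.sm_type	= XFS_SCRUB_TYPE_INODE,
		.sm_ino		= ino,	/* from an earlier bulkstat scan */
		.sm_gen		= gen,	/* generation sampled back then */
	};

	/* The inode may have been freed or reused since that scan. */
	if (ioctl(fd, XFS_IOC_SCRUB_METADATA, &sm) == 0 &&
	    (sm.sm_flags & XFS_SCRUB_OFLAG_CORRUPT))
		report(ino);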

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/common.c |    2 -
 fs/xfs/scrub/common.h |    1 
 fs/xfs/scrub/inode.c  |  180 +++++++++++++++++++++++++++++++++++++++++--------
 3 files changed, 153 insertions(+), 30 deletions(-)


diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index 90f53f415d99..e0c1be0161f3 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -792,7 +792,7 @@ xchk_iget_agi(
 }
 
 /* Install an inode that we opened by handle for scrubbing. */
-static int
+int
 xchk_install_handle_inode(
 	struct xfs_scrub	*sc,
 	struct xfs_inode	*ip)
diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
index 5ef27e6bdac6..07daea2c7ab4 100644
--- a/fs/xfs/scrub/common.h
+++ b/fs/xfs/scrub/common.h
@@ -143,6 +143,7 @@ int xchk_iget(struct xfs_scrub *sc, xfs_ino_t inum, struct xfs_inode **ipp);
 int xchk_iget_agi(struct xfs_scrub *sc, xfs_ino_t inum,
 		struct xfs_buf **agi_bpp, struct xfs_inode **ipp);
 void xchk_irele(struct xfs_scrub *sc, struct xfs_inode *ip);
+int xchk_install_handle_inode(struct xfs_scrub *sc, struct xfs_inode *ip);
 
 /*
  * Don't bother cross-referencing if we already found corruption or cross
diff --git a/fs/xfs/scrub/inode.c b/fs/xfs/scrub/inode.c
index 39ac7cc09fbd..51b8ba7037f3 100644
--- a/fs/xfs/scrub/inode.c
+++ b/fs/xfs/scrub/inode.c
@@ -11,8 +11,10 @@
 #include "xfs_mount.h"
 #include "xfs_btree.h"
 #include "xfs_log_format.h"
+#include "xfs_trans.h"
 #include "xfs_inode.h"
 #include "xfs_ialloc.h"
+#include "xfs_icache.h"
 #include "xfs_da_format.h"
 #include "xfs_reflink.h"
 #include "xfs_rmap.h"
@@ -20,48 +22,168 @@
 #include "scrub/scrub.h"
 #include "scrub/common.h"
 #include "scrub/btree.h"
+#include "scrub/trace.h"
 
-/*
- * Grab total control of the inode metadata.  It doesn't matter here if
- * the file data is still changing; exclusive access to the metadata is
- * the goal.
- */
-int
-xchk_setup_inode(
+/* Prepare the attached inode for scrubbing. */
+static inline int
+xchk_prepare_iscrub(
 	struct xfs_scrub	*sc)
 {
 	int			error;
 
-	if (xchk_need_fshook_drain(sc))
-		xchk_fshooks_enable(sc, XCHK_FSHOOKS_DRAIN);
-
-	/*
-	 * Try to get the inode.  If the verifiers fail, we try again
-	 * in raw mode.
-	 */
-	error = xchk_iget_for_scrubbing(sc);
-	switch (error) {
-	case 0:
-		break;
-	case -EFSCORRUPTED:
-	case -EFSBADCRC:
-		return xchk_trans_alloc(sc, 0);
-	default:
-		return error;
-	}
-
-	/* Got the inode, lock it and we're ready to go. */
 	sc->ilock_flags = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
 	xfs_ilock(sc->ip, sc->ilock_flags);
+
 	error = xchk_trans_alloc(sc, 0);
 	if (error)
-		goto out;
+		return error;
+
 	sc->ilock_flags |= XFS_ILOCK_EXCL;
 	xfs_ilock(sc->ip, XFS_ILOCK_EXCL);
+	return 0;
+}
 
-out:
-	/* scrub teardown will unlock and release the inode for us */
+/* Install this scrub-by-handle inode and prepare it for scrubbing. */
+static inline int
+xchk_install_handle_iscrub(
+	struct xfs_scrub	*sc,
+	struct xfs_inode	*ip)
+{
+	int			error;
+
+	error = xchk_install_handle_inode(sc, ip);
+	if (error)
+		return error;
+
+	return xchk_prepare_iscrub(sc);
+}
+
+/*
+ * Grab total control of the inode metadata.  In the best case, we grab the
+ * incore inode and take all locks on it.  If the incore inode cannot be
+ * constructed due to corruption problems, lock the AGI so that we can single
+ * step the loading process to fix everything that can go wrong.
+ */
+int
+xchk_setup_inode(
+	struct xfs_scrub	*sc)
+{
+	struct xfs_imap		imap;
+	struct xfs_inode	*ip;
+	struct xfs_mount	*mp = sc->mp;
+	struct xfs_inode	*ip_in = XFS_I(file_inode(sc->file));
+	struct xfs_buf		*agi_bp;
+	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, sc->sm->sm_ino);
+	int			error;
+
+	if (xchk_need_fshook_drain(sc))
+		xchk_fshooks_enable(sc, XCHK_FSHOOKS_DRAIN);
+
+	/* We want to scan the opened inode, so lock it and exit. */
+	if (sc->sm->sm_ino == 0 || sc->sm->sm_ino == ip_in->i_ino) {
+		sc->ip = ip_in;
+		return xchk_prepare_iscrub(sc);
+	}
+
+	/* Reject internal metadata files and obviously bad inode numbers. */
+	if (xfs_internal_inum(mp, sc->sm->sm_ino))
+		return -ENOENT;
+	if (!xfs_verify_ino(sc->mp, sc->sm->sm_ino))
+		return -ENOENT;
+
+	/* Try a regular untrusted iget. */
+	error = xchk_iget(sc, sc->sm->sm_ino, &ip);
+	if (!error)
+		return xchk_install_handle_iscrub(sc, ip);
+	if (error == -ENOENT)
+		return error;
+	if (error != -EFSCORRUPTED && error != -EFSBADCRC && error != -EINVAL)
+		goto out_error;
+
+	/*
+	 * EINVAL with IGET_UNTRUSTED probably means one of several things:
+	 * userspace gave us an inode number that doesn't correspond to fs
+	 * space; the inode btree lacks a record for this inode; or there is
+	 * a record, and it says this inode is free.
+	 *
+	 * EFSCORRUPTED/EFSBADCRC could mean that the inode was mappable, but
+	 * some other metadata corruption (e.g. inode forks) prevented
+	 * instantiation of the incore inode.  Or it could mean the inobt is
+	 * corrupt.
+	 *
+	 * We want to look up this inode in the inobt directly to distinguish
+	 * three different scenarios: (1) the inobt says the inode is free,
+	 * in which case there's nothing to do; (2) the inobt is corrupt so we
+	 * should flag the corruption and exit to userspace to let it fix the
+	 * inobt; and (3) the inobt says the inode is allocated, but loading it
+	 * failed due to corruption.
+	 *
+	 * Allocate a transaction and grab the AGI to prevent inobt activity in
+	 * this AG.  Retry the iget in case someone allocated a new inode after
+	 * the first iget failed.
+	 */
+	error = xchk_trans_alloc(sc, 0);
+	if (error)
+		goto out_error;
+
+	error = xchk_iget_agi(sc, sc->sm->sm_ino, &agi_bp, &ip);
+	if (error == 0) {
+		/* Actually got the incore inode, so install it and proceed. */
+		xchk_trans_cancel(sc);
+		return xchk_install_handle_iscrub(sc, ip);
+	}
+	if (error == -ENOENT)
+		goto out_gone;
+	if (error != -EFSCORRUPTED && error != -EFSBADCRC && error != -EINVAL)
+		goto out_cancel;
+
+	/* Ensure that we have protected against inode allocation/freeing. */
+	if (agi_bp == NULL) {
+		ASSERT(agi_bp != NULL);
+		error = -ECANCELED;
+		goto out_cancel;
+	}
+
+	/*
+	 * Untrusted iget failed a second time.  Let's try an inobt lookup.
+	 * If the inobt doesn't think this is an allocated inode then we'll
+	 * return ENOENT to signal that the check can be skipped.
+	 *
+	 * If the lookup signals corruption, we'll mark this inode corrupt and
+	 * exit to userspace.  There's little chance of fixing anything until
+	 * the inobt is straightened out, but there's nothing we can do here.
+	 *
+	 * If the lookup encounters a runtime error, exit to userspace.
+	 */
+	error = xfs_imap(mp, sc->tp, sc->sm->sm_ino, &imap,
+			XFS_IGET_UNTRUSTED);
+	if (error == -EINVAL || error == -ENOENT)
+		goto out_gone;
+	if (error)
+		goto out_cancel;
+
+	/*
+	 * The lookup succeeded.  Chances are the ondisk inode is corrupt and
+	 * preventing iget from reading it.  Retain the scrub transaction and
+	 * the AGI buffer to prevent anyone from allocating or freeing inodes.
+	 * This ensures that we preserve the inconsistency between the inobt
+	 * saying the inode is allocated and the icache being unable to load
+	 * the inode until we can flag the corruption in xchk_inode.  The
+	 * scrub function has to note the corruption, since we're not really
+	 * supposed to do that from the setup function.
+	 */
+	return 0;
+
+out_cancel:
+	xchk_trans_cancel(sc);
+out_error:
+	trace_xchk_op_error(sc, agno, XFS_INO_TO_AGBNO(mp, sc->sm->sm_ino),
+			error, __return_address);
 	return error;
+out_gone:
+	/* The file is gone, so there's nothing to check. */
+	xchk_trans_cancel(sc);
+	return -ENOENT;
 }
 
 /* Inode core */


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCHSET v24.0 0/3] xfs: fix iget usage in directory scrub
  2022-12-30 21:13 [NYE DELUGE 1/4] xfs: all pending online scrub improvements Darrick J. Wong
                   ` (12 preceding siblings ...)
  2022-12-30 22:11 ` [PATCHSET v24.0 0/4] xfs: fix iget/irele usage in online fsck Darrick J. Wong
@ 2022-12-30 22:11 ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 2/3] xfs: xfs_iget in the directory scrubber needs to use UNTRUSTED Darrick J. Wong
                     ` (2 more replies)
  2022-12-30 22:11 ` [PATCHSET v24.0 0/6] xfs: detect mergeable and overlapping btree records Darrick J. Wong
                   ` (8 subsequent siblings)
  22 siblings, 3 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

Hi all,

In this series, we fix some problems with how the directory scrubber
grabs child inodes.  First, we want to reduce EDEADLOCK returns by
replacing fixed-iteration loops with interruptible trylock loops.
Second, we add UNTRUSTED to the child iget call so that we can detect a
dirent that points to an unallocated inode.  Third, we fix a bug where
we weren't checking the inode pointed to by dotdot entries at all.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-dir-iget-fixes
---
 fs/xfs/scrub/common.c |   22 -----
 fs/xfs/scrub/common.h |    1 
 fs/xfs/scrub/dir.c    |   79 +++++++------------
 fs/xfs/scrub/parent.c |  203 +++++++++++++++++++++++--------------------------
 4 files changed, 126 insertions(+), 179 deletions(-)


^ permalink raw reply	[flat|nested] 220+ messages in thread

* [PATCH 1/3] xfs: make checking directory dotdot entries more reliable
  2022-12-30 22:11 ` [PATCHSET v24.0 0/3] xfs: fix iget usage in directory scrub Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 2/3] xfs: xfs_iget in the directory scrubber needs to use UNTRUSTED Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 3/3] xfs: always check the existence of a dirent's child inode Darrick J. Wong
  2 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

The current directory parent scrubbing code could be tighter in its
execution -- instead of bailing out to userspace after a couple of
seconds of waiting for the (alleged) parent directory's IOLOCK while
refusing to release the child directory's IOLOCK, we could just cycle
both locks until we get both or the child process absorbs a fatal
signal.

Note that because the usual sequence is to take IOLOCKs before grabbing
a transaction, we have to use the _nowait variants on both inodes to
avoid an ABBA deadlock.  Since parent pointer checking is the only place
in scrub that needs this kind of functionality, move it to parent.c as a
private function.

Furthermore, if the child directory's parent changes during the lock
cycling, we know that the new parent has stamped the correct parent into
the dotdot entry, so we can conclude that the parent entry is correct.

This eliminates an entire source of -EDEADLOCK-based "retry harder"
scrub executions.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/common.c |   22 -----
 fs/xfs/scrub/common.h |    1 
 fs/xfs/scrub/parent.c |  203 +++++++++++++++++++++++--------------------------
 3 files changed, 97 insertions(+), 129 deletions(-)


diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index e0c1be0161f3..002bb90559ff 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -1126,28 +1126,6 @@ xchk_metadata_inode_forks(
 	return 0;
 }
 
-/*
- * Try to lock an inode in violation of the usual locking order rules.  For
- * example, trying to get the IOLOCK while in transaction context, or just
- * plain breaking AG-order or inode-order inode locking rules.  Either way,
- * the only way to avoid an ABBA deadlock is to use trylock and back off if
- * we can't.
- */
-int
-xchk_ilock_inverted(
-	struct xfs_inode	*ip,
-	uint			lock_mode)
-{
-	int			i;
-
-	for (i = 0; i < 20; i++) {
-		if (xfs_ilock_nowait(ip, lock_mode))
-			return 0;
-		delay(1);
-	}
-	return -EDEADLOCK;
-}
-
 /* Pause background reaping of resources. */
 void
 xchk_stop_reaping(
diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
index 07daea2c7ab4..9cfc2660dbb4 100644
--- a/fs/xfs/scrub/common.h
+++ b/fs/xfs/scrub/common.h
@@ -156,7 +156,6 @@ static inline bool xchk_skip_xref(struct xfs_scrub_metadata *sm)
 }
 
 int xchk_metadata_inode_forks(struct xfs_scrub *sc);
-int xchk_ilock_inverted(struct xfs_inode *ip, uint lock_mode);
 void xchk_stop_reaping(struct xfs_scrub *sc);
 void xchk_start_reaping(struct xfs_scrub *sc);
 
diff --git a/fs/xfs/scrub/parent.c b/fs/xfs/scrub/parent.c
index 2696bb49324a..0c23fd49716b 100644
--- a/fs/xfs/scrub/parent.c
+++ b/fs/xfs/scrub/parent.c
@@ -120,6 +120,48 @@ xchk_parent_count_parent_dentries(
 	return error;
 }
 
+/*
+ * Try to iolock the parent dir @dp in shared mode and the child dir @sc->ip
+ * exclusively.
+ */
+STATIC int
+xchk_parent_lock_two_dirs(
+	struct xfs_scrub	*sc,
+	struct xfs_inode	*dp)
+{
+	int			error = 0;
+
+	/* Callers shouldn't do this, but protect ourselves anyway. */
+	if (dp == sc->ip) {
+		ASSERT(dp != sc->ip);
+		return -EINVAL;
+	}
+
+	xfs_iunlock(sc->ip, sc->ilock_flags);
+	sc->ilock_flags = 0;
+	while (true) {
+		if (xchk_should_terminate(sc, &error))
+			return error;
+
+		/*
+		 * Normal XFS takes the IOLOCK before grabbing a transaction.
+		 * Scrub holds a transaction, which means that we can't block
+		 * on either IOLOCK.
+		 */
+		if (xfs_ilock_nowait(dp, XFS_IOLOCK_SHARED)) {
+			if (xfs_ilock_nowait(sc->ip, XFS_IOLOCK_EXCL)) {
+				sc->ilock_flags = XFS_IOLOCK_EXCL;
+				break;
+			}
+			xfs_iunlock(dp, XFS_IOLOCK_SHARED);
+		}
+
+		delay(1);
+	}
+
+	return 0;
+}
+
 /*
  * Given the inode number of the alleged parent of the inode being
  * scrubbed, try to validate that the parent has exactly one directory
@@ -128,23 +170,20 @@ xchk_parent_count_parent_dentries(
 STATIC int
 xchk_parent_validate(
 	struct xfs_scrub	*sc,
-	xfs_ino_t		dnum,
-	bool			*try_again)
+	xfs_ino_t		parent_ino)
 {
 	struct xfs_inode	*dp = NULL;
 	xfs_nlink_t		expected_nlink;
 	xfs_nlink_t		nlink;
 	int			error = 0;
 
-	*try_again = false;
-
 	if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT)
-		goto out;
+		return 0;
 
 	/* '..' must not point to ourselves. */
-	if (sc->ip->i_ino == dnum) {
+	if (sc->ip->i_ino == parent_ino) {
 		xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, 0);
-		goto out;
+		return 0;
 	}
 
 	/*
@@ -154,106 +193,80 @@ xchk_parent_validate(
 	expected_nlink = VFS_I(sc->ip)->i_nlink == 0 ? 0 : 1;
 
 	/*
-	 * Grab this parent inode.  We release the inode before we
-	 * cancel the scrub transaction.  Since we're don't know a
-	 * priori that releasing the inode won't trigger eofblocks
-	 * cleanup (which allocates what would be a nested transaction)
-	 * if the parent pointer erroneously points to a file, we
-	 * can't use DONTCACHE here because DONTCACHE inodes can trigger
-	 * immediate inactive cleanup of the inode.
+	 * Grab the parent directory inode.  This must be released before we
+	 * cancel the scrub transaction.
 	 *
 	 * If _iget returns -EINVAL or -ENOENT then the parent inode number is
 	 * garbage and the directory is corrupt.  If the _iget returns
 	 * -EFSCORRUPTED or -EFSBADCRC then the parent is corrupt which is a
 	 *  cross referencing error.  Any other error is an operational error.
 	 */
-	error = xchk_iget(sc, dnum, &dp);
+	error = xchk_iget(sc, parent_ino, &dp);
 	if (error == -EINVAL || error == -ENOENT) {
 		error = -EFSCORRUPTED;
 		xchk_fblock_process_error(sc, XFS_DATA_FORK, 0, &error);
-		goto out;
+		return error;
 	}
 	if (!xchk_fblock_xref_process_error(sc, XFS_DATA_FORK, 0, &error))
-		goto out;
+		return error;
 	if (dp == sc->ip || !S_ISDIR(VFS_I(dp)->i_mode)) {
 		xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, 0);
 		goto out_rele;
 	}
 
 	/*
-	 * We prefer to keep the inode locked while we lock and search
-	 * its alleged parent for a forward reference.  If we can grab
-	 * the iolock, validate the pointers and we're done.  We must
-	 * use nowait here to avoid an ABBA deadlock on the parent and
-	 * the child inodes.
+	 * We prefer to keep the inode locked while we lock and search its
+	 * alleged parent for a forward reference.  If we can grab the iolock
+	 * of the alleged parent, then we can move ahead to counting dirents
+	 * and checking nlinks.
+	 *
+	 * However, if we fail to iolock the alleged parent while holding the
+	 * child iolock, we have no way to tell if a blocking lock() would
+	 * result in an ABBA deadlock.  Release the lock on the child, then
+	 * try to lock the alleged parent and trylock the child.
 	 */
-	if (xfs_ilock_nowait(dp, XFS_IOLOCK_SHARED)) {
-		error = xchk_parent_count_parent_dentries(sc, dp, &nlink);
-		if (!xchk_fblock_xref_process_error(sc, XFS_DATA_FORK, 0,
-				&error))
+	if (!xfs_ilock_nowait(dp, XFS_IOLOCK_SHARED)) {
+		error = xchk_parent_lock_two_dirs(sc, dp);
+		if (error)
+			goto out_rele;
+
+		/*
+		 * Now that we've locked out updates to the child directory,
+		 * re-sample the expected nlink and the '..' dirent.
+		 */
+		expected_nlink = VFS_I(sc->ip)->i_nlink == 0 ? 0 : 1;
+
+		error = xfs_dir_lookup(sc->tp, sc->ip, &xfs_name_dotdot,
+				&parent_ino, NULL);
+		if (!xchk_fblock_process_error(sc, XFS_DATA_FORK, 0, &error))
+			goto out_unlock;
+
+		/*
+		 * After relocking the child directory, the '..' entry points
+		 * to a different parent than before.  This means someone moved
+		 * the child elsewhere in the directory tree, which means that
+		 * the parent link is now correct and we're done.
+		 */
+		if (parent_ino != dp->i_ino)
 			goto out_unlock;
-		if (nlink != expected_nlink)
-			xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, 0);
-		goto out_unlock;
 	}
 
-	/*
-	 * The game changes if we get here.  We failed to lock the parent,
-	 * so we're going to try to verify both pointers while only holding
-	 * one lock so as to avoid deadlocking with something that's actually
-	 * trying to traverse down the directory tree.
-	 */
-	xfs_iunlock(sc->ip, sc->ilock_flags);
-	sc->ilock_flags = 0;
-	error = xchk_ilock_inverted(dp, XFS_IOLOCK_SHARED);
-	if (error)
-		goto out_rele;
-
-	/* Go looking for our dentry. */
+	/* Look for a directory entry in the parent pointing to the child. */
 	error = xchk_parent_count_parent_dentries(sc, dp, &nlink);
 	if (!xchk_fblock_xref_process_error(sc, XFS_DATA_FORK, 0, &error))
 		goto out_unlock;
 
-	/* Drop the parent lock, relock this inode. */
-	xfs_iunlock(dp, XFS_IOLOCK_SHARED);
-	error = xchk_ilock_inverted(sc->ip, XFS_IOLOCK_EXCL);
-	if (error)
-		goto out_rele;
-	sc->ilock_flags = XFS_IOLOCK_EXCL;
-
 	/*
-	 * If we're an unlinked directory, the parent /won't/ have a link
-	 * to us.  Otherwise, it should have one link.  We have to re-set
-	 * it here because we dropped the lock on sc->ip.
-	 */
-	expected_nlink = VFS_I(sc->ip)->i_nlink == 0 ? 0 : 1;
-
-	/* Look up '..' to see if the inode changed. */
-	error = xfs_dir_lookup(sc->tp, sc->ip, &xfs_name_dotdot, &dnum, NULL);
-	if (!xchk_fblock_process_error(sc, XFS_DATA_FORK, 0, &error))
-		goto out_rele;
-
-	/* Drat, parent changed.  Try again! */
-	if (dnum != dp->i_ino) {
-		xchk_irele(sc, dp);
-		*try_again = true;
-		return 0;
-	}
-	xchk_irele(sc, dp);
-
-	/*
-	 * '..' didn't change, so check that there was only one entry
-	 * for us in the parent.
+	 * Ensure that the parent has as many links to the child as the child
+	 * thinks it has to the parent.
 	 */
 	if (nlink != expected_nlink)
 		xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, 0);
-	return error;
 
 out_unlock:
 	xfs_iunlock(dp, XFS_IOLOCK_SHARED);
 out_rele:
 	xchk_irele(sc, dp);
-out:
 	return error;
 }
 
@@ -263,10 +276,8 @@ xchk_parent(
 	struct xfs_scrub	*sc)
 {
 	struct xfs_mount	*mp = sc->mp;
-	xfs_ino_t		dnum;
-	bool			try_again;
-	int			tries = 0;
-	int			error = 0;
+	xfs_ino_t		parent_ino;
+	int			error;
 
 	/*
 	 * If we're a directory, check that the '..' link points up to
@@ -278,7 +289,7 @@ xchk_parent(
 	/* We're not a special inode, are we? */
 	if (!xfs_verify_dir_ino(mp, sc->ip->i_ino)) {
 		xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, 0);
-		goto out;
+		return 0;
 	}
 
 	/*
@@ -292,42 +303,22 @@ xchk_parent(
 	xfs_iunlock(sc->ip, XFS_ILOCK_EXCL | XFS_MMAPLOCK_EXCL);
 
 	/* Look up '..' */
-	error = xfs_dir_lookup(sc->tp, sc->ip, &xfs_name_dotdot, &dnum, NULL);
+	error = xfs_dir_lookup(sc->tp, sc->ip, &xfs_name_dotdot, &parent_ino,
+			NULL);
 	if (!xchk_fblock_process_error(sc, XFS_DATA_FORK, 0, &error))
-		goto out;
-	if (!xfs_verify_dir_ino(mp, dnum)) {
+		return error;
+	if (!xfs_verify_dir_ino(mp, parent_ino)) {
 		xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, 0);
-		goto out;
+		return 0;
 	}
 
 	/* Is this the root dir?  Then '..' must point to itself. */
 	if (sc->ip == mp->m_rootip) {
 		if (sc->ip->i_ino != mp->m_sb.sb_rootino ||
-		    sc->ip->i_ino != dnum)
+		    sc->ip->i_ino != parent_ino)
 			xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, 0);
-		goto out;
+		return 0;
 	}
 
-	do {
-		error = xchk_parent_validate(sc, dnum, &try_again);
-		if (error)
-			goto out;
-	} while (try_again && ++tries < 20);
-
-	/*
-	 * We gave it our best shot but failed, so mark this scrub
-	 * incomplete.  Userspace can decide if it wants to try again.
-	 */
-	if (try_again && tries == 20)
-		xchk_set_incomplete(sc);
-out:
-	/*
-	 * If we failed to lock the parent inode even after a retry, just mark
-	 * this scrub incomplete and return.
-	 */
-	if ((sc->flags & XCHK_TRY_HARDER) && error == -EDEADLOCK) {
-		error = 0;
-		xchk_set_incomplete(sc);
-	}
-	return error;
+	return xchk_parent_validate(sc, parent_ino);
 }


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 2/3] xfs: xfs_iget in the directory scrubber needs to use UNTRUSTED
  2022-12-30 22:11 ` [PATCHSET v24.0 0/3] xfs: fix iget usage in directory scrub Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 1/3] xfs: make checking directory dotdot entries more reliable Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 3/3] xfs: always check the existence of a dirent's child inode Darrick J. Wong
  2 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

In commit 4b80ac64450f, we tried to strengthen the directory scrubber by
using the iget call to detect directory entries that point to
unallocated inodes.  Unfortunately, that commit neglected to pass
XFS_IGET_UNTRUSTED to xfs_iget, so we don't check the inode btree first.
If the inode number points to something that isn't even an inode
cluster, iget will throw corruption errors and return -EFSCORRUPTED,
which means that we fail to mark the directory corrupt.

Fixes: 4b80ac64450f ("xfs: scrub should mark a directory corrupt if any entries cannot be iget'd")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/dir.c |   10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)


diff --git a/fs/xfs/scrub/dir.c b/fs/xfs/scrub/dir.c
index 677b21c3c865..ec0c73e0eb0c 100644
--- a/fs/xfs/scrub/dir.c
+++ b/fs/xfs/scrub/dir.c
@@ -59,19 +59,15 @@ xchk_dir_check_ftype(
 	}
 
 	/*
-	 * Grab the inode pointed to by the dirent.  We release the
-	 * inode before we cancel the scrub transaction.  Since we're
-	 * don't know a priori that releasing the inode won't trigger
-	 * eofblocks cleanup (which allocates what would be a nested
-	 * transaction), we can't use DONTCACHE here because DONTCACHE
-	 * inodes can trigger immediate inactive cleanup of the inode.
+	 * Grab the inode pointed to by the dirent.  Use UNTRUSTED here to
+	 * check the allocation status of the inode in the inode btrees.
 	 *
 	 * If _iget returns -EINVAL or -ENOENT then the child inode number is
 	 * garbage and the directory is corrupt.  If the _iget returns
 	 * -EFSCORRUPTED or -EFSBADCRC then the child is corrupt which is a
 	 *  cross referencing error.  Any other error is an operational error.
 	 */
-	error = xfs_iget(mp, sdc->sc->tp, inum, 0, 0, &ip);
+	error = xchk_iget(sdc->sc, inum, &ip);
 	if (error == -EINVAL || error == -ENOENT) {
 		error = -EFSCORRUPTED;
 		xchk_fblock_process_error(sdc->sc, XFS_DATA_FORK, 0, &error);


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 3/3] xfs: always check the existence of a dirent's child inode
  2022-12-30 22:11 ` [PATCHSET v24.0 0/3] xfs: fix iget usage in directory scrub Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 2/3] xfs: xfs_iget in the directory scrubber needs to use UNTRUSTED Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 1/3] xfs: make checking directory dotdot entries more reliable Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  2 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

When we're scrubbing directory entries, we always need to iget the child
inode to make sure that the dirent points to a valid inode.  The
original directory scrub code (commit a5c4) only set us up to do this
for ftype=1 filesystems, which is not sufficient; and then commit 4b80
made it worse by exempting the dot and dotdot entries.

Sorta-fixes: a5c46e5e8912 ("xfs: scrub directory metadata")
Sorta-fixes: 4b80ac64450f ("xfs: scrub should mark a directory corrupt if any entries cannot be iget'd")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/dir.c |   75 ++++++++++++++++++++--------------------------------
 1 file changed, 29 insertions(+), 46 deletions(-)


diff --git a/fs/xfs/scrub/dir.c b/fs/xfs/scrub/dir.c
index ec0c73e0eb0c..8076e7620734 100644
--- a/fs/xfs/scrub/dir.c
+++ b/fs/xfs/scrub/dir.c
@@ -39,52 +39,28 @@ struct xchk_dir_ctx {
 };
 
 /* Check that an inode's mode matches a given DT_ type. */
-STATIC int
+STATIC void
 xchk_dir_check_ftype(
 	struct xchk_dir_ctx	*sdc,
 	xfs_fileoff_t		offset,
-	xfs_ino_t		inum,
+	struct xfs_inode	*ip,
 	int			dtype)
 {
 	struct xfs_mount	*mp = sdc->sc->mp;
-	struct xfs_inode	*ip;
 	int			ino_dtype;
-	int			error = 0;
 
 	if (!xfs_has_ftype(mp)) {
 		if (dtype != DT_UNKNOWN && dtype != DT_DIR)
 			xchk_fblock_set_corrupt(sdc->sc, XFS_DATA_FORK,
 					offset);
-		goto out;
+		return;
 	}
 
-	/*
-	 * Grab the inode pointed to by the dirent.  Use UNTRUSTED here to
-	 * check the allocation status of the inode in the inode btrees.
-	 *
-	 * If _iget returns -EINVAL or -ENOENT then the child inode number is
-	 * garbage and the directory is corrupt.  If the _iget returns
-	 * -EFSCORRUPTED or -EFSBADCRC then the child is corrupt which is a
-	 *  cross referencing error.  Any other error is an operational error.
-	 */
-	error = xchk_iget(sdc->sc, inum, &ip);
-	if (error == -EINVAL || error == -ENOENT) {
-		error = -EFSCORRUPTED;
-		xchk_fblock_process_error(sdc->sc, XFS_DATA_FORK, 0, &error);
-		goto out;
-	}
-	if (!xchk_fblock_xref_process_error(sdc->sc, XFS_DATA_FORK, offset,
-			&error))
-		goto out;
-
 	/* Convert mode to the DT_* values that dir_emit uses. */
 	ino_dtype = xfs_dir3_get_dtype(mp,
 			xfs_mode_to_ftype(VFS_I(ip)->i_mode));
 	if (ino_dtype != dtype)
 		xchk_fblock_set_corrupt(sdc->sc, XFS_DATA_FORK, offset);
-	xchk_irele(sdc->sc, ip);
-out:
-	return error;
 }
 
 /*
@@ -105,17 +81,17 @@ xchk_dir_actor(
 	unsigned		type)
 {
 	struct xfs_mount	*mp;
+	struct xfs_inode	*dp;
 	struct xfs_inode	*ip;
 	struct xchk_dir_ctx	*sdc;
 	struct xfs_name		xname;
 	xfs_ino_t		lookup_ino;
 	xfs_dablk_t		offset;
-	bool			checked_ftype = false;
 	int			error = 0;
 
 	sdc = container_of(dir_iter, struct xchk_dir_ctx, dir_iter);
-	ip = sdc->sc->ip;
-	mp = ip->i_mount;
+	dp = sdc->sc->ip;
+	mp = dp->i_mount;
 	offset = xfs_dir2_db_to_da(mp->m_dir_geo,
 			xfs_dir2_dataptr_to_db(mp->m_dir_geo, pos));
 
@@ -136,11 +112,7 @@ xchk_dir_actor(
 
 	if (!strncmp(".", name, namelen)) {
 		/* If this is "." then check that the inum matches the dir. */
-		if (xfs_has_ftype(mp) && type != DT_DIR)
-			xchk_fblock_set_corrupt(sdc->sc, XFS_DATA_FORK,
-					offset);
-		checked_ftype = true;
-		if (ino != ip->i_ino)
+		if (ino != dp->i_ino)
 			xchk_fblock_set_corrupt(sdc->sc, XFS_DATA_FORK,
 					offset);
 	} else if (!strncmp("..", name, namelen)) {
@@ -148,11 +120,7 @@ xchk_dir_actor(
 		 * If this is ".." in the root inode, check that the inum
 		 * matches this dir.
 		 */
-		if (xfs_has_ftype(mp) && type != DT_DIR)
-			xchk_fblock_set_corrupt(sdc->sc, XFS_DATA_FORK,
-					offset);
-		checked_ftype = true;
-		if (ip->i_ino == mp->m_sb.sb_rootino && ino != ip->i_ino)
+		if (dp->i_ino == mp->m_sb.sb_rootino && ino != dp->i_ino)
 			xchk_fblock_set_corrupt(sdc->sc, XFS_DATA_FORK,
 					offset);
 	}
@@ -162,7 +130,7 @@ xchk_dir_actor(
 	xname.len = namelen;
 	xname.type = XFS_DIR3_FT_UNKNOWN;
 
-	error = xfs_dir_lookup(sdc->sc->tp, ip, &xname, &lookup_ino, NULL);
+	error = xfs_dir_lookup(sdc->sc->tp, dp, &xname, &lookup_ino, NULL);
 	/* ENOENT means the hash lookup failed and the dir is corrupt */
 	if (error == -ENOENT)
 		error = -EFSCORRUPTED;
@@ -174,12 +142,27 @@ xchk_dir_actor(
 		goto out;
 	}
 
-	/* Verify the file type.  This function absorbs error codes. */
-	if (!checked_ftype) {
-		error = xchk_dir_check_ftype(sdc, offset, lookup_ino, type);
-		if (error)
-			goto out;
+	/*
+	 * Grab the inode pointed to by the dirent.  Use UNTRUSTED here to
+	 * check the allocation status of the inode in the inode btrees.
+	 *
+	 * If _iget returns -EINVAL or -ENOENT then the child inode number is
+	 * garbage and the directory is corrupt.  If the _iget returns
+	 * -EFSCORRUPTED or -EFSBADCRC then the child is corrupt which is a
+	 *  cross referencing error.  Any other error is an operational error.
+	 */
+	error = xchk_iget(sdc->sc, ino, &ip);
+	if (error == -EINVAL || error == -ENOENT) {
+		error = -EFSCORRUPTED;
+		xchk_fblock_process_error(sdc->sc, XFS_DATA_FORK, 0, &error);
+		goto out;
 	}
+	if (!xchk_fblock_xref_process_error(sdc->sc, XFS_DATA_FORK, offset,
+			&error))
+		goto out;
+
+	xchk_dir_check_ftype(sdc, offset, ip, type);
+	xchk_irele(sdc->sc, ip);
 out:
 	/*
 	 * A negative error code returned here is supposed to cause the


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCHSET v24.0 0/6] xfs: detect mergeable and overlapping btree records
  2022-12-30 21:13 [NYE DELUGE 1/4] xfs: all pending online scrub improvements Darrick J. Wong
                   ` (13 preceding siblings ...)
  2022-12-30 22:11 ` [PATCHSET v24.0 0/3] xfs: fix iget usage in directory scrub Darrick J. Wong
@ 2022-12-30 22:11 ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 1/6] xfs: change bmap scrubber to store the previous mapping Darrick J. Wong
                     ` (5 more replies)
  2022-12-30 22:11 ` [PATCHSET v24.0 00/11] xfs: clean up memory management in xattr scrub Darrick J. Wong
                   ` (7 subsequent siblings)
  22 siblings, 6 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

Hi all,

While I was doing differential fuzz analysis between xfs_scrub and
xfs_repair, I noticed that xfs_repair was only partially effective at
detecting btree records that can be merged, and xfs_scrub didn't notice
them at all.

For every interval btree type except for the bmbt, there should never
exist two adjacent records with adjacent keyspaces because the
blockcount field is always large enough to span the entire keyspace of
the domain.  This is because the free space, rmap, and refcount btrees
have a blockcount field large enough to store the maximum AG length, and
there can never be an allocation larger than an AG.

The bmbt is a different story due to its ondisk encoding where the
blockcount is only 21 bits wide.  Because AGs can span up to 2^31 blocks
and the RT volume can span up to 2^52 blocks, a preallocation of 2^22
blocks will be expressed as two records of 2^21 length.  We don't
opportunistically combine records when doing bmbt operations, which is
why the fsck tools have never complained about this scenario.
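
To make "mergeable" concrete: the per-btree checks added below all follow
the same shape, sketched here with generic arguments rather than the real
record structures (the rmap and refcount versions also require matching
owners, domains, refcounts, and so on).  Two adjacent records could be one
record only if the first ends exactly where the second begins and the
combined length still fits in that btree's ondisk blockcount field:

	/* Sketch only, not patch code. */
	static inline bool
	sketch_recs_mergeable(
		unsigned long long	start1,
		unsigned long long	len1,
		unsigned long long	start2,
		unsigned long long	len2,
		unsigned long long	max_len) /* per-btree blockcount limit */
	{
		if (len1 == 0)			/* no previous record yet */
			return false;
		if (start1 + len1 != start2)	/* records do not abut */
			return false;
		return len1 + len2 <= max_len;	/* merged length must fit */
	}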

Offline repair is partially effective at detecting mergeable records
because I taught it to do that for the rmap and refcount btrees.  This
series enhances the free space, rmap, and refcount scrubbers to detect
mergeable records.  For the bmbt, it will flag the file as being
eligible for an optimization to shrink the size of the data structure.

The last patch in this set also enhances the rmap scrubber to detect
records that overlap incorrectly.  This check is done automatically for
non-overlapping btree types, but we have to do it separately for the
rmapbt because there are constraints on which allocation types are
allowed to overlap.
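
The overlap test itself is just the usual half-open interval check,
sketched below; the rmapbt-specific policy is that any overlap found this
way is tolerated only when both records describe shareable data fork
mappings:

	/* Sketch only: do [s1, s1 + l1) and [s2, s2 + l2) intersect? */
	static inline bool
	sketch_ranges_overlap(
		unsigned long long	s1,
		unsigned long long	l1,
		unsigned long long	s2,
		unsigned long long	l2)
	{
		return s1 < s2 + l2 && s2 < s1 + l1;
	}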

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-mergeable-records
---
 fs/xfs/scrub/alloc.c    |   29 ++++++++++-
 fs/xfs/scrub/bmap.c     |   39 +++++++++++++--
 fs/xfs/scrub/refcount.c |   44 ++++++++++++++++
 fs/xfs/scrub/rmap.c     |  126 ++++++++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 230 insertions(+), 8 deletions(-)


^ permalink raw reply	[flat|nested] 220+ messages in thread

* [PATCH 1/6] xfs: change bmap scrubber to store the previous mapping
  2022-12-30 22:11 ` [PATCHSET v24.0 0/6] xfs: detect mergeable and overlapping btree records Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 2/6] xfs: alert the user about data/attr fork mappings that could be merged Darrick J. Wong
                     ` (4 subsequent siblings)
  5 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Convert the inode data/attr/cow fork scrubber to remember the entire
previous mapping, not just the next expected offset.  No behavior
changes here, but this will enable some better checking in subsequent
patches.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/bmap.c |   12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)


diff --git a/fs/xfs/scrub/bmap.c b/fs/xfs/scrub/bmap.c
index fe13da54e133..14fe461cac4c 100644
--- a/fs/xfs/scrub/bmap.c
+++ b/fs/xfs/scrub/bmap.c
@@ -94,7 +94,8 @@ xchk_setup_inode_bmap(
 struct xchk_bmap_info {
 	struct xfs_scrub	*sc;
 	struct xfs_iext_cursor	icur;
-	xfs_fileoff_t		lastoff;
+	struct xfs_bmbt_irec	prev_rec;
+
 	bool			is_rt;
 	bool			is_shared;
 	bool			was_loaded;
@@ -402,7 +403,8 @@ xchk_bmap_iextent(
 	 * Check for out-of-order extents.  This record could have come
 	 * from the incore list, for which there is no ordering check.
 	 */
-	if (irec->br_startoff < info->lastoff)
+	if (irec->br_startoff < info->prev_rec.br_startoff +
+				info->prev_rec.br_blockcount)
 		xchk_fblock_set_corrupt(info->sc, info->whichfork,
 				irec->br_startoff);
 
@@ -709,7 +711,8 @@ xchk_bmap_iextent_delalloc(
 	 * Check for out-of-order extents.  This record could have come
 	 * from the incore list, for which there is no ordering check.
 	 */
-	if (irec->br_startoff < info->lastoff)
+	if (irec->br_startoff < info->prev_rec.br_startoff +
+				info->prev_rec.br_blockcount)
 		xchk_fblock_set_corrupt(info->sc, info->whichfork,
 				irec->br_startoff);
 
@@ -803,7 +806,6 @@ xchk_bmap(
 		goto out;
 
 	/* Scrub extent records. */
-	info.lastoff = 0;
 	ifp = xfs_ifork_ptr(ip, whichfork);
 	for_each_xfs_iext(ifp, &info.icur, &irec) {
 		if (xchk_should_terminate(sc, &error) ||
@@ -820,7 +822,7 @@ xchk_bmap(
 			xchk_bmap_iextent_delalloc(ip, &info, &irec);
 		else
 			xchk_bmap_iextent(ip, &info, &irec);
-		info.lastoff = irec.br_startoff + irec.br_blockcount;
+		memcpy(&info.prev_rec, &irec, sizeof(struct xfs_bmbt_irec));
 	}
 
 	error = xchk_bmap_check_rmaps(sc, whichfork);


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 2/6] xfs: alert the user about data/attr fork mappings that could be merged
  2022-12-30 22:11 ` [PATCHSET v24.0 0/6] xfs: detect mergeable and overlapping btree records Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 1/6] xfs: change bmap scrubber to store the previous mapping Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 5/6] xfs: check overlapping rmap btree records Darrick J. Wong
                     ` (3 subsequent siblings)
  5 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

If the data or attr forks have mappings that could be merged, let the
user know that the structure could be optimized.  This isn't a
filesystem corruption since the regular filesystem does not try to be
smart about merging bmbt records.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/bmap.c |   27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)


diff --git a/fs/xfs/scrub/bmap.c b/fs/xfs/scrub/bmap.c
index 14fe461cac4c..499e82110f2f 100644
--- a/fs/xfs/scrub/bmap.c
+++ b/fs/xfs/scrub/bmap.c
@@ -390,6 +390,29 @@ xchk_bmap_dirattr_extent(
 		xchk_fblock_set_corrupt(info->sc, info->whichfork, off);
 }
 
+/* Are these two mappings mergeable? */
+static inline bool
+xchk_bmap_mergeable(
+	struct xchk_bmap_info		*info,
+	const struct xfs_bmbt_irec	*b2)
+{
+	const struct xfs_bmbt_irec	*b1 = &info->prev_rec;
+
+	/* Skip uninitialized prev_rec and COW fork extents */
+	if (b1->br_blockcount == 0)
+		return false;
+	if (info->whichfork == XFS_COW_FORK)
+		return false;
+
+	if (b1->br_startoff + b1->br_blockcount != b2->br_startoff)
+		return false;
+	if (b1->br_startblock + b1->br_blockcount != b2->br_startblock)
+		return false;
+	if (b1->br_blockcount + b2->br_blockcount > BMBT_BLOCKCOUNT_MASK)
+		return false;
+	return b1->br_state == b2->br_state;
+}
+
 /* Scrub a single extent record. */
 STATIC void
 xchk_bmap_iextent(
@@ -441,6 +464,10 @@ xchk_bmap_iextent(
 	if (info->sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT)
 		return;
 
+	/* Notify the user of mergeable records in the data/attr forks. */
+	if (xchk_bmap_mergeable(info, irec))
+		xchk_ino_set_preen(info->sc, info->sc->ip->i_ino);
+
 	if (info->is_rt)
 		xchk_bmap_rt_iextent_xref(ip, info, irec);
 	else


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 3/6] xfs: flag free space btree records that could be merged
  2022-12-30 22:11 ` [PATCHSET v24.0 0/6] xfs: detect mergeable and overlapping btree records Darrick J. Wong
                     ` (3 preceding siblings ...)
  2022-12-30 22:11   ` [PATCH 4/6] xfs: flag refcount btree records that could be merged Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 6/6] xfs: check for reverse mapping " Darrick J. Wong
  5 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Complain if we encounter free space btree records that could be merged.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/alloc.c |   29 ++++++++++++++++++++++++++++-
 1 file changed, 28 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/scrub/alloc.c b/fs/xfs/scrub/alloc.c
index e9f8d29544aa..94f4b836c48d 100644
--- a/fs/xfs/scrub/alloc.c
+++ b/fs/xfs/scrub/alloc.c
@@ -31,6 +31,12 @@ xchk_setup_ag_allocbt(
 }
 
 /* Free space btree scrubber. */
+
+struct xchk_alloc {
+	/* Previous free space extent. */
+	struct xfs_alloc_rec_incore	prev;
+};
+
 /*
  * Ensure there's a corresponding cntbt/bnobt record matching this
  * bnobt/cntbt record, respectively.
@@ -93,6 +99,24 @@ xchk_allocbt_xref(
 	xchk_xref_is_not_cow_staging(sc, agbno, len);
 }
 
+/* Flag failures for records that could be merged. */
+STATIC void
+xchk_allocbt_mergeable(
+	struct xchk_btree	*bs,
+	struct xchk_alloc	*ca,
+	const struct xfs_alloc_rec_incore *irec)
+{
+	if (bs->sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT)
+		return;
+
+	if (ca->prev.ar_blockcount > 0 &&
+	    ca->prev.ar_startblock + ca->prev.ar_blockcount == irec->ar_startblock &&
+	    ca->prev.ar_blockcount + irec->ar_blockcount < (uint32_t)~0U)
+		xchk_btree_set_corrupt(bs->sc, bs->cur, 0);
+
+	memcpy(&ca->prev, irec, sizeof(*irec));
+}
+
 /* Scrub a bnobt/cntbt record. */
 STATIC int
 xchk_allocbt_rec(
@@ -100,6 +124,7 @@ xchk_allocbt_rec(
 	const union xfs_btree_rec	*rec)
 {
 	struct xfs_alloc_rec_incore	irec;
+	struct xchk_alloc	*ca = bs->private;
 
 	xfs_alloc_btrec_to_irec(rec, &irec);
 	if (xfs_alloc_check_irec(bs->cur, &irec) != NULL) {
@@ -107,6 +132,7 @@ xchk_allocbt_rec(
 		return 0;
 	}
 
+	xchk_allocbt_mergeable(bs, ca, &irec);
 	xchk_allocbt_xref(bs->sc, &irec);
 
 	return 0;
@@ -118,10 +144,11 @@ xchk_allocbt(
 	struct xfs_scrub	*sc,
 	xfs_btnum_t		which)
 {
+	struct xchk_alloc	ca = { };
 	struct xfs_btree_cur	*cur;
 
 	cur = which == XFS_BTNUM_BNO ? sc->sa.bno_cur : sc->sa.cnt_cur;
-	return xchk_btree(sc, cur, xchk_allocbt_rec, &XFS_RMAP_OINFO_AG, NULL);
+	return xchk_btree(sc, cur, xchk_allocbt_rec, &XFS_RMAP_OINFO_AG, &ca);
 }
 
 int


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 4/6] xfs: flag refcount btree records that could be merged
  2022-12-30 22:11 ` [PATCHSET v24.0 0/6] xfs: detect mergeable and overlapping btree records Darrick J. Wong
                     ` (2 preceding siblings ...)
  2022-12-30 22:11   ` [PATCH 5/6] xfs: check overlapping rmap btree records Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 3/6] xfs: flag free space " Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 6/6] xfs: check for reverse mapping " Darrick J. Wong
  5 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Complain if we encounter refcount btree records that could be merged.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/refcount.c |   44 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 44 insertions(+)


diff --git a/fs/xfs/scrub/refcount.c b/fs/xfs/scrub/refcount.c
index e99c1e1246f8..9d957d2df3e1 100644
--- a/fs/xfs/scrub/refcount.c
+++ b/fs/xfs/scrub/refcount.c
@@ -333,6 +333,9 @@ xchk_refcountbt_xref(
 }
 
 struct xchk_refcbt_records {
+	/* Previous refcount record. */
+	struct xfs_refcount_irec prev_rec;
+
 	/* The next AG block where we aren't expecting shared extents. */
 	xfs_agblock_t		next_unshared_agbno;
 
@@ -390,6 +393,46 @@ xchk_refcountbt_xref_gaps(
 		xchk_should_check_xref(sc, &error, &sc->sa.rmap_cur);
 }
 
+static inline bool
+xchk_refcount_mergeable(
+	struct xchk_refcbt_records	*rrc,
+	const struct xfs_refcount_irec	*r2)
+{
+	const struct xfs_refcount_irec	*r1 = &rrc->prev_rec;
+
+	/* Ignore if prev_rec is not yet initialized. */
+	if (r1->rc_blockcount == 0)
+		return false;
+
+	if (r1->rc_domain != r2->rc_domain)
+		return false;
+	if (r1->rc_startblock + r1->rc_blockcount != r2->rc_startblock)
+		return false;
+	if (r1->rc_refcount != r2->rc_refcount)
+		return false;
+	if ((unsigned long long)r1->rc_blockcount + r2->rc_blockcount >
+			MAXREFCEXTLEN)
+		return false;
+
+	return true;
+}
+
+/* Flag failures for records that could be merged. */
+STATIC void
+xchk_refcountbt_check_mergeable(
+	struct xchk_btree		*bs,
+	struct xchk_refcbt_records	*rrc,
+	const struct xfs_refcount_irec	*irec)
+{
+	if (bs->sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT)
+		return;
+
+	if (xchk_refcount_mergeable(rrc, irec))
+		xchk_btree_set_corrupt(bs->sc, bs->cur, 0);
+
+	memcpy(&rrc->prev_rec, irec, sizeof(struct xfs_refcount_irec));
+}
+
 /* Scrub a refcountbt record. */
 STATIC int
 xchk_refcountbt_rec(
@@ -414,6 +457,7 @@ xchk_refcountbt_rec(
 		xchk_btree_set_corrupt(bs->sc, bs->cur, 0);
 	rrc->prev_domain = irec.rc_domain;
 
+	xchk_refcountbt_check_mergeable(bs, rrc, &irec);
 	xchk_refcountbt_xref(bs->sc, &irec);
 
 	/*


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 5/6] xfs: check overlapping rmap btree records
  2022-12-30 22:11 ` [PATCHSET v24.0 0/6] xfs: detect mergeable and overlapping btree records Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 1/6] xfs: change bmap scrubber to store the previous mapping Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 2/6] xfs: alert the user about data/attr fork mappings that could be merged Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 4/6] xfs: flag refcount btree records that could be merged Darrick J. Wong
                     ` (2 subsequent siblings)
  5 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

The rmap btree scrubber doesn't sufficiently check for records that
overlap even though they must not.  For the other btrees, non-overlap is
enforced by the inorder checks in xchk_btree_rec, but the rmap btree is
special because it allows overlapping records to handle shared data
extents.

Therefore, enhance the rmap btree record check function to compare each
record against the previous one so that we can detect overlapping rmap
records for space allocations that do not allow sharing.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/rmap.c |   74 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 72 insertions(+), 2 deletions(-)


diff --git a/fs/xfs/scrub/rmap.c b/fs/xfs/scrub/rmap.c
index 7b0ad8f846ab..270c4f1e76c9 100644
--- a/fs/xfs/scrub/rmap.c
+++ b/fs/xfs/scrub/rmap.c
@@ -32,6 +32,15 @@ xchk_setup_ag_rmapbt(
 
 /* Reverse-mapping scrubber. */
 
+struct xchk_rmap {
+	/*
+	 * The furthest-reaching of the rmapbt records that we've already
+	 * processed.  This enables us to detect overlapping records for space
+	 * allocations that cannot be shared.
+	 */
+	struct xfs_rmap_irec	overlap_rec;
+};
+
 /* Cross-reference a rmap against the refcount btree. */
 STATIC void
 xchk_rmapbt_xref_refc(
@@ -139,12 +148,63 @@ xchk_rmapbt_check_unwritten_in_keyflags(
 	}
 }
 
+static inline bool
+xchk_rmapbt_is_shareable(
+	struct xfs_scrub		*sc,
+	const struct xfs_rmap_irec	*irec)
+{
+	if (!xfs_has_reflink(sc->mp))
+		return false;
+	if (XFS_RMAP_NON_INODE_OWNER(irec->rm_owner))
+		return false;
+	if (irec->rm_flags & (XFS_RMAP_BMBT_BLOCK | XFS_RMAP_ATTR_FORK |
+			      XFS_RMAP_UNWRITTEN))
+		return false;
+	return true;
+}
+
+/* Flag failures for records that overlap but cannot. */
+STATIC void
+xchk_rmapbt_check_overlapping(
+	struct xchk_btree		*bs,
+	struct xchk_rmap		*cr,
+	const struct xfs_rmap_irec	*irec)
+{
+	xfs_agblock_t			pnext, inext;
+
+	if (bs->sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT)
+		return;
+
+	/* No previous record? */
+	if (cr->overlap_rec.rm_blockcount == 0)
+		goto set_prev;
+
+	/* Do overlap_rec and irec overlap? */
+	pnext = cr->overlap_rec.rm_startblock + cr->overlap_rec.rm_blockcount;
+	if (pnext <= irec->rm_startblock)
+		goto set_prev;
+
+	/* Overlap is only allowed if both records are data fork mappings. */
+	if (!xchk_rmapbt_is_shareable(bs->sc, &cr->overlap_rec) ||
+	    !xchk_rmapbt_is_shareable(bs->sc, irec))
+		xchk_btree_set_corrupt(bs->sc, bs->cur, 0);
+
+	/* Save whichever rmap record extends furthest. */
+	inext = irec->rm_startblock + irec->rm_blockcount;
+	if (pnext > inext)
+		return;
+
+set_prev:
+	memcpy(&cr->overlap_rec, irec, sizeof(struct xfs_rmap_irec));
+}
+
 /* Scrub an rmapbt record. */
 STATIC int
 xchk_rmapbt_rec(
 	struct xchk_btree	*bs,
 	const union xfs_btree_rec *rec)
 {
+	struct xchk_rmap	*cr = bs->private;
 	struct xfs_rmap_irec	irec;
 
 	if (xfs_rmap_btrec_to_irec(rec, &irec) != NULL ||
@@ -154,6 +214,7 @@ xchk_rmapbt_rec(
 	}
 
 	xchk_rmapbt_check_unwritten_in_keyflags(bs);
+	xchk_rmapbt_check_overlapping(bs, cr, &irec);
 	xchk_rmapbt_xref(bs->sc, &irec);
 	return 0;
 }
@@ -163,8 +224,17 @@ int
 xchk_rmapbt(
 	struct xfs_scrub	*sc)
 {
-	return xchk_btree(sc, sc->sa.rmap_cur, xchk_rmapbt_rec,
-			&XFS_RMAP_OINFO_AG, NULL);
+	struct xchk_rmap	*cr;
+	int			error;
+
+	cr = kzalloc(sizeof(struct xchk_rmap), XCHK_GFP_FLAGS);
+	if (!cr)
+		return -ENOMEM;
+
+	error = xchk_btree(sc, sc->sa.rmap_cur, xchk_rmapbt_rec,
+			&XFS_RMAP_OINFO_AG, cr);
+	kfree(cr);
+	return error;
 }
 
 /* xref check that the extent is owned only by a given owner */


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 6/6] xfs: check for reverse mapping records that could be merged
  2022-12-30 22:11 ` [PATCHSET v24.0 0/6] xfs: detect mergeable and overlapping btree records Darrick J. Wong
                     ` (4 preceding siblings ...)
  2022-12-30 22:11   ` [PATCH 3/6] xfs: flag free space " Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  5 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Enhance the rmap scrubber to flag adjacent records that could be merged.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/rmap.c |   52 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 52 insertions(+)


diff --git a/fs/xfs/scrub/rmap.c b/fs/xfs/scrub/rmap.c
index 270c4f1e76c9..3cb92f7ac165 100644
--- a/fs/xfs/scrub/rmap.c
+++ b/fs/xfs/scrub/rmap.c
@@ -39,6 +39,12 @@ struct xchk_rmap {
 	 * allocations that cannot be shared.
 	 */
 	struct xfs_rmap_irec	overlap_rec;
+
+	/*
+	 * The previous rmapbt record, so that we can check for two records
+	 * that could be one.
+	 */
+	struct xfs_rmap_irec	prev_rec;
 };
 
 /* Cross-reference a rmap against the refcount btree. */
@@ -198,6 +204,51 @@ xchk_rmapbt_check_overlapping(
 	memcpy(&cr->overlap_rec, irec, sizeof(struct xfs_rmap_irec));
 }
 
+/* Decide if two reverse-mapping records can be merged. */
+static inline bool
+xchk_rmap_mergeable(
+	struct xchk_rmap		*cr,
+	const struct xfs_rmap_irec	*r2)
+{
+	const struct xfs_rmap_irec	*r1 = &cr->prev_rec;
+
+	/* Ignore if prev_rec is not yet initialized. */
+	if (cr->prev_rec.rm_blockcount == 0)
+		return false;
+
+	if (r1->rm_owner != r2->rm_owner)
+		return false;
+	if (r1->rm_startblock + r1->rm_blockcount != r2->rm_startblock)
+		return false;
+	if ((unsigned long long)r1->rm_blockcount + r2->rm_blockcount >
+	    XFS_RMAP_LEN_MAX)
+		return false;
+	if (XFS_RMAP_NON_INODE_OWNER(r2->rm_owner))
+		return true;
+	/* must be an inode owner below here */
+	if (r1->rm_flags != r2->rm_flags)
+		return false;
+	if (r1->rm_flags & XFS_RMAP_BMBT_BLOCK)
+		return true;
+	return r1->rm_offset + r1->rm_blockcount == r2->rm_offset;
+}
+
+/* Flag failures for records that could be merged. */
+STATIC void
+xchk_rmapbt_check_mergeable(
+	struct xchk_btree		*bs,
+	struct xchk_rmap		*cr,
+	const struct xfs_rmap_irec	*irec)
+{
+	if (bs->sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT)
+		return;
+
+	if (xchk_rmap_mergeable(cr, irec))
+		xchk_btree_set_corrupt(bs->sc, bs->cur, 0);
+
+	memcpy(&cr->prev_rec, irec, sizeof(struct xfs_rmap_irec));
+}
+
 /* Scrub an rmapbt record. */
 STATIC int
 xchk_rmapbt_rec(
@@ -214,6 +265,7 @@ xchk_rmapbt_rec(
 	}
 
 	xchk_rmapbt_check_unwritten_in_keyflags(bs);
+	xchk_rmapbt_check_mergeable(bs, cr, &irec);
 	xchk_rmapbt_check_overlapping(bs, cr, &irec);
 	xchk_rmapbt_xref(bs->sc, &irec);
 	return 0;


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCHSET v24.0 00/11] xfs: clean up memory management in xattr scrub
  2022-12-30 21:13 [NYE DELUGE 1/4] xfs: all pending online scrub improvements Darrick J. Wong
                   ` (14 preceding siblings ...)
  2022-12-30 22:11 ` [PATCHSET v24.0 0/6] xfs: detect mergeable and overlapping btree records Darrick J. Wong
@ 2022-12-30 22:11 ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 04/11] xfs: split freemap from xchk_xattr_buf.buf Darrick J. Wong
                     ` (10 more replies)
  2022-12-30 22:11 ` [PATCHSET v24.0 0/3] xfs: rework online fsck incore bitmap Darrick J. Wong
                   ` (6 subsequent siblings)
  22 siblings, 11 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

Hi all,

Currently, the extended attribute scrubber uses a single VLA to store
all the context information needed in various parts of the scrubber
code.  This includes xattr leaf block space usage bitmaps, and the value
buffer used to check the correctness of remote xattr value block
headers.  We try to minimize the insanity through the use of helper
functions, but this is a memory management nightmare.  Clean this up by
making the bitmap and value pointers explicit members of struct
xchk_xattr_buf.

Second, strengthen the xattr checking by teaching it to look for overlapping
data structures in the shortform attr data.
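
For orientation, the shape this ends up with (a sketch assembled from the
patches below, not a verbatim copy of the final header):

	struct xchk_xattr_buf {
		/* Bitmap of used space in xattr leaf blocks. */
		unsigned long		*usedmap;

		/* Bitmap of free space in xattr leaf blocks. */
		unsigned long		*freemap;

		/* Memory buffer used to extract xattr values. */
		void			*value;
		size_t			value_sz;
	};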

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-fix-xattr-memory-mgmt
---
 fs/xfs/scrub/attr.c  |  298 +++++++++++++++++++++++++++++++++++---------------
 fs/xfs/scrub/attr.h  |   60 +---------
 fs/xfs/scrub/scrub.c |    3 +
 fs/xfs/scrub/scrub.h |   10 ++
 4 files changed, 231 insertions(+), 140 deletions(-)


^ permalink raw reply	[flat|nested] 220+ messages in thread

* [PATCH 01/11] xfs: xattr scrub should ensure one namespace bit per name
  2022-12-30 22:11 ` [PATCHSET v24.0 00/11] xfs: clean up memory management in xattr scrub Darrick J. Wong
                     ` (2 preceding siblings ...)
  2022-12-30 22:11   ` [PATCH 02/11] xfs: don't shadow @leaf in xchk_xattr_block Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 03/11] xfs: remove unnecessary dstmap in xattr scrubber Darrick J. Wong
                     ` (6 subsequent siblings)
  10 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Check that each extended attribute exists in only one namespace.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/attr.c |    8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/scrub/attr.c b/fs/xfs/scrub/attr.c
index 31529b9bf389..95752e300105 100644
--- a/fs/xfs/scrub/attr.c
+++ b/fs/xfs/scrub/attr.c
@@ -128,10 +128,16 @@ xchk_xattr_listent(
 		return;
 	}
 
+	/* Only one namespace bit allowed. */
+	if (hweight32(flags & XFS_ATTR_NSP_ONDISK_MASK) > 1) {
+		xchk_fblock_set_corrupt(sx->sc, XFS_ATTR_FORK, args.blkno);
+		goto fail_xref;
+	}
+
 	/* Does this name make sense? */
 	if (!xfs_attr_namecheck(name, namelen)) {
 		xchk_fblock_set_corrupt(sx->sc, XFS_ATTR_FORK, args.blkno);
-		return;
+		goto fail_xref;
 	}
 
 	/*


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 02/11] xfs: don't shadow @leaf in xchk_xattr_block
  2022-12-30 22:11 ` [PATCHSET v24.0 00/11] xfs: clean up memory management in xattr scrub Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 04/11] xfs: split freemap from xchk_xattr_buf.buf Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 06/11] xfs: split valuebuf " Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 01/11] xfs: xattr scrub should ensure one namespace bit per name Darrick J. Wong
                     ` (7 subsequent siblings)
  10 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Don't shadow the leaf variable here, because it's misleading to have two
variables with different types sharing the same name within the same
function.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/attr.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)


diff --git a/fs/xfs/scrub/attr.c b/fs/xfs/scrub/attr.c
index 95752e300105..3020892b796e 100644
--- a/fs/xfs/scrub/attr.c
+++ b/fs/xfs/scrub/attr.c
@@ -342,10 +342,10 @@ xchk_xattr_block(
 
 	/* Check all the padding. */
 	if (xfs_has_crc(ds->sc->mp)) {
-		struct xfs_attr3_leafblock	*leaf = bp->b_addr;
+		struct xfs_attr3_leafblock	*leaf3 = bp->b_addr;
 
-		if (leaf->hdr.pad1 != 0 || leaf->hdr.pad2 != 0 ||
-		    leaf->hdr.info.hdr.pad != 0)
+		if (leaf3->hdr.pad1 != 0 || leaf3->hdr.pad2 != 0 ||
+		    leaf3->hdr.info.hdr.pad != 0)
 			xchk_da_set_corrupt(ds, level);
 	} else {
 		if (leaf->hdr.pad1 != 0 || leaf->hdr.info.pad != 0)


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 03/11] xfs: remove unnecessary dstmap in xattr scrubber
  2022-12-30 22:11 ` [PATCHSET v24.0 00/11] xfs: clean up memory management in xattr scrub Darrick J. Wong
                     ` (3 preceding siblings ...)
  2022-12-30 22:11   ` [PATCH 01/11] xfs: xattr scrub should ensure one namespace bit per name Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 05/11] xfs: split usedmap from xchk_xattr_buf.buf Darrick J. Wong
                     ` (5 subsequent siblings)
  10 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Replace bitmap_and with bitmap_intersects in the xattr leaf block
scrubber, since we only care if there's overlap between the used space
bitmap and the free space bitmap.  This means we don't need dstmap any
more, and can thus reduce the memory requirements.
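
Both helpers come from <linux/bitmap.h>: bitmap_and() computes the
intersection into a destination bitmap and returns whether any bits remain
set, while bitmap_intersects() answers the same question without needing a
destination at all.  A sketch of the equivalence, with illustrative names:

	#include <linux/bitmap.h>

	static bool
	sketch_freemap_clean(
		const unsigned long	*freemap,
		const unsigned long	*usedmap,
		unsigned int		mapsize)
	{
		/*
		 * Same result as "bitmap_and(dstmap, freemap, usedmap,
		 * mapsize) == 0", minus the third bitmap.
		 */
		return !bitmap_intersects(freemap, usedmap, mapsize);
	}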

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/attr.c |    7 +++----
 fs/xfs/scrub/attr.h |   12 +-----------
 2 files changed, 4 insertions(+), 15 deletions(-)


diff --git a/fs/xfs/scrub/attr.c b/fs/xfs/scrub/attr.c
index 3020892b796e..6cd0ae99c2c5 100644
--- a/fs/xfs/scrub/attr.c
+++ b/fs/xfs/scrub/attr.c
@@ -36,10 +36,10 @@ xchk_setup_xattr_buf(
 
 	/*
 	 * We need enough space to read an xattr value from the file or enough
-	 * space to hold three copies of the xattr free space bitmap.  We don't
+	 * space to hold two copies of the xattr free space bitmap.  We don't
 	 * need the buffer space for both purposes at the same time.
 	 */
-	sz = 3 * sizeof(long) * BITS_TO_LONGS(sc->mp->m_attr_geo->blksize);
+	sz = 2 * sizeof(long) * BITS_TO_LONGS(sc->mp->m_attr_geo->blksize);
 	sz = max_t(size_t, sz, value_size);
 
 	/*
@@ -223,7 +223,6 @@ xchk_xattr_check_freemap(
 	struct xfs_attr3_icleaf_hdr	*leafhdr)
 {
 	unsigned long			*freemap = xchk_xattr_freemap(sc);
-	unsigned long			*dstmap = xchk_xattr_dstmap(sc);
 	unsigned int			mapsize = sc->mp->m_attr_geo->blksize;
 	int				i;
 
@@ -237,7 +236,7 @@ xchk_xattr_check_freemap(
 	}
 
 	/* Look for bits that are set in freemap and are marked in use. */
-	return bitmap_and(dstmap, freemap, map, mapsize) == 0;
+	return !bitmap_intersects(freemap, map, mapsize);
 }
 
 /*
diff --git a/fs/xfs/scrub/attr.h b/fs/xfs/scrub/attr.h
index 3590e10e3e62..be133e0da71b 100644
--- a/fs/xfs/scrub/attr.h
+++ b/fs/xfs/scrub/attr.h
@@ -21,8 +21,7 @@ struct xchk_xattr_buf {
 	 * Each bitmap contains enough bits to track every byte in an attr
 	 * block (rounded up to the size of an unsigned long).  The attr block
 	 * used space bitmap starts at the beginning of the buffer; the free
-	 * space bitmap follows immediately after; and we have a third buffer
-	 * for storing intermediate bitmap results.
+	 * space bitmap follows immediately after.
 	 */
 	uint8_t			buf[];
 };
@@ -56,13 +55,4 @@ xchk_xattr_freemap(
 			BITS_TO_LONGS(sc->mp->m_attr_geo->blksize);
 }
 
-/* A bitmap used to hold temporary results. */
-static inline unsigned long *
-xchk_xattr_dstmap(
-	struct xfs_scrub	*sc)
-{
-	return xchk_xattr_freemap(sc) +
-			BITS_TO_LONGS(sc->mp->m_attr_geo->blksize);
-}
-
 #endif	/* __XFS_SCRUB_ATTR_H__ */


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 04/11] xfs: split freemap from xchk_xattr_buf.buf
  2022-12-30 22:11 ` [PATCHSET v24.0 00/11] xfs: clean up memory management in xattr scrub Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 06/11] xfs: split valuebuf " Darrick J. Wong
                     ` (9 subsequent siblings)
  10 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Move the free space bitmap from somewhere in xchk_xattr_buf.buf[] to an
explicit pointer.  This is the start of removing the complex overloaded
memory buffer that is the source of weird memory misuse bugs.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/attr.c  |   40 ++++++++++++++++++++++++++++++++--------
 fs/xfs/scrub/attr.h  |   15 ++++-----------
 fs/xfs/scrub/scrub.c |    3 +++
 fs/xfs/scrub/scrub.h |   10 ++++++++++
 4 files changed, 49 insertions(+), 19 deletions(-)


diff --git a/fs/xfs/scrub/attr.c b/fs/xfs/scrub/attr.c
index 6cd0ae99c2c5..fed159aba6e2 100644
--- a/fs/xfs/scrub/attr.c
+++ b/fs/xfs/scrub/attr.c
@@ -20,6 +20,17 @@
 #include "scrub/dabtree.h"
 #include "scrub/attr.h"
 
+/* Free the buffers linked from the xattr buffer. */
+static void
+xchk_xattr_buf_cleanup(
+	void			*priv)
+{
+	struct xchk_xattr_buf	*ab = priv;
+
+	kvfree(ab->freemap);
+	ab->freemap = NULL;
+}
+
 /*
  * Allocate enough memory to hold an attr value and attr block bitmaps,
  * reallocating the buffer if necessary.  Buffer contents are not preserved
@@ -32,15 +43,18 @@ xchk_setup_xattr_buf(
 	gfp_t			flags)
 {
 	size_t			sz;
+	size_t			bmp_sz;
 	struct xchk_xattr_buf	*ab = sc->buf;
+	unsigned long		*old_freemap = NULL;
+
+	bmp_sz = sizeof(long) * BITS_TO_LONGS(sc->mp->m_attr_geo->blksize);
 
 	/*
 	 * We need enough space to read an xattr value from the file or enough
-	 * space to hold two copies of the xattr free space bitmap.  We don't
+	 * space to hold one copy of the xattr free space bitmap.  We don't
 	 * need the buffer space for both purposes at the same time.
 	 */
-	sz = 2 * sizeof(long) * BITS_TO_LONGS(sc->mp->m_attr_geo->blksize);
-	sz = max_t(size_t, sz, value_size);
+	sz = max_t(size_t, bmp_sz, value_size);
 
 	/*
 	 * If there's already a buffer, figure out if we need to reallocate it
@@ -49,6 +63,7 @@ xchk_setup_xattr_buf(
 	if (ab) {
 		if (sz <= ab->sz)
 			return 0;
+		old_freemap = ab->freemap;
 		kvfree(ab);
 		sc->buf = NULL;
 	}
@@ -60,9 +75,18 @@ xchk_setup_xattr_buf(
 	ab = kvmalloc(sizeof(*ab) + sz, flags);
 	if (!ab)
 		return -ENOMEM;
-
 	ab->sz = sz;
 	sc->buf = ab;
+	sc->buf_cleanup = xchk_xattr_buf_cleanup;
+
+	if (old_freemap) {
+		ab->freemap = old_freemap;
+	} else {
+		ab->freemap = kvmalloc(bmp_sz, flags);
+		if (!ab->freemap)
+			return -ENOMEM;
+	}
+
 	return 0;
 }
 
@@ -222,21 +246,21 @@ xchk_xattr_check_freemap(
 	unsigned long			*map,
 	struct xfs_attr3_icleaf_hdr	*leafhdr)
 {
-	unsigned long			*freemap = xchk_xattr_freemap(sc);
+	struct xchk_xattr_buf		*ab = sc->buf;
 	unsigned int			mapsize = sc->mp->m_attr_geo->blksize;
 	int				i;
 
 	/* Construct bitmap of freemap contents. */
-	bitmap_zero(freemap, mapsize);
+	bitmap_zero(ab->freemap, mapsize);
 	for (i = 0; i < XFS_ATTR_LEAF_MAPSIZE; i++) {
-		if (!xchk_xattr_set_map(sc, freemap,
+		if (!xchk_xattr_set_map(sc, ab->freemap,
 				leafhdr->freemap[i].base,
 				leafhdr->freemap[i].size))
 			return false;
 	}
 
 	/* Look for bits that are set in freemap and are marked in use. */
-	return !bitmap_intersects(freemap, map, mapsize);
+	return !bitmap_intersects(ab->freemap, map, mapsize);
 }
 
 /*
diff --git a/fs/xfs/scrub/attr.h b/fs/xfs/scrub/attr.h
index be133e0da71b..e6f11d44e84d 100644
--- a/fs/xfs/scrub/attr.h
+++ b/fs/xfs/scrub/attr.h
@@ -10,6 +10,9 @@
  * Temporary storage for online scrub and repair of extended attributes.
  */
 struct xchk_xattr_buf {
+	/* Bitmap of free space in xattr leaf blocks. */
+	unsigned long		*freemap;
+
 	/* Size of @buf, in bytes. */
 	size_t			sz;
 
@@ -20,8 +23,7 @@ struct xchk_xattr_buf {
 	 *
 	 * Each bitmap contains enough bits to track every byte in an attr
 	 * block (rounded up to the size of an unsigned long).  The attr block
-	 * used space bitmap starts at the beginning of the buffer; the free
-	 * space bitmap follows immediately after.
+	 * used space bitmap starts at the beginning of the buffer.
 	 */
 	uint8_t			buf[];
 };
@@ -46,13 +48,4 @@ xchk_xattr_usedmap(
 	return (unsigned long *)ab->buf;
 }
 
-/* A bitmap of free space computed by walking attr leaf block free info. */
-static inline unsigned long *
-xchk_xattr_freemap(
-	struct xfs_scrub	*sc)
-{
-	return xchk_xattr_usedmap(sc) +
-			BITS_TO_LONGS(sc->mp->m_attr_geo->blksize);
-}
-
 #endif	/* __XFS_SCRUB_ATTR_H__ */
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index bc9638c7a379..6697f5f32106 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -189,7 +189,10 @@ xchk_teardown(
 	if (sc->flags & XCHK_REAPING_DISABLED)
 		xchk_start_reaping(sc);
 	if (sc->buf) {
+		if (sc->buf_cleanup)
+			sc->buf_cleanup(sc->buf);
 		kvfree(sc->buf);
+		sc->buf_cleanup = NULL;
 		sc->buf = NULL;
 	}
 
diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h
index 20e74179d8a7..5d6e9a9527c3 100644
--- a/fs/xfs/scrub/scrub.h
+++ b/fs/xfs/scrub/scrub.h
@@ -77,7 +77,17 @@ struct xfs_scrub {
 	 */
 	struct xfs_inode		*ip;
 
+	/* Kernel memory buffer used by scrubbers; freed at teardown. */
 	void				*buf;
+
+	/*
+	 * Clean up resources owned by whatever is in the buffer.  Cleanup can
+	 * be deferred with this hook as a means for scrub functions to pass
+	 * data to repair functions.  This function must not free the buffer
+	 * itself.
+	 */
+	void				(*buf_cleanup)(void *buf);
+
 	uint				ilock_flags;
 
 	/* See the XCHK/XREP state flags below. */


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 05/11] xfs: split usedmap from xchk_xattr_buf.buf
  2022-12-30 22:11 ` [PATCHSET v24.0 00/11] xfs: clean up memory management in xattr scrub Darrick J. Wong
                     ` (4 preceding siblings ...)
  2022-12-30 22:11   ` [PATCH 03/11] xfs: remove unnecessary dstmap in xattr scrubber Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 07/11] xfs: remove flags argument from xchk_setup_xattr_buf Darrick J. Wong
                     ` (4 subsequent siblings)
  10 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Move the used space bitmap from somewhere in xchk_xattr_buf.buf[] to an
explicit pointer.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/attr.c |   39 +++++++++++++++++++++------------------
 fs/xfs/scrub/attr.h |   22 +++++-----------------
 2 files changed, 26 insertions(+), 35 deletions(-)


diff --git a/fs/xfs/scrub/attr.c b/fs/xfs/scrub/attr.c
index fed159aba6e2..c343ae932ae3 100644
--- a/fs/xfs/scrub/attr.c
+++ b/fs/xfs/scrub/attr.c
@@ -29,6 +29,8 @@ xchk_xattr_buf_cleanup(
 
 	kvfree(ab->freemap);
 	ab->freemap = NULL;
+	kvfree(ab->usedmap);
+	ab->usedmap = NULL;
 }
 
 /*
@@ -42,20 +44,14 @@ xchk_setup_xattr_buf(
 	size_t			value_size,
 	gfp_t			flags)
 {
-	size_t			sz;
+	size_t			sz = value_size;
 	size_t			bmp_sz;
 	struct xchk_xattr_buf	*ab = sc->buf;
+	unsigned long		*old_usedmap = NULL;
 	unsigned long		*old_freemap = NULL;
 
 	bmp_sz = sizeof(long) * BITS_TO_LONGS(sc->mp->m_attr_geo->blksize);
 
-	/*
-	 * We need enough space to read an xattr value from the file or enough
-	 * space to hold one copy of the xattr free space bitmap.  We don't
-	 * need the buffer space for both purposes at the same time.
-	 */
-	sz = max_t(size_t, bmp_sz, value_size);
-
 	/*
 	 * If there's already a buffer, figure out if we need to reallocate it
 	 * to accommodate a larger size.
@@ -64,6 +60,7 @@ xchk_setup_xattr_buf(
 		if (sz <= ab->sz)
 			return 0;
 		old_freemap = ab->freemap;
+		old_usedmap = ab->usedmap;
 		kvfree(ab);
 		sc->buf = NULL;
 	}
@@ -79,6 +76,14 @@ xchk_setup_xattr_buf(
 	sc->buf = ab;
 	sc->buf_cleanup = xchk_xattr_buf_cleanup;
 
+	if (old_usedmap) {
+		ab->usedmap = old_usedmap;
+	} else {
+		ab->usedmap = kvmalloc(bmp_sz, flags);
+		if (!ab->usedmap)
+			return -ENOMEM;
+	}
+
 	if (old_freemap) {
 		ab->freemap = old_freemap;
 	} else {
@@ -243,7 +248,6 @@ xchk_xattr_set_map(
 STATIC bool
 xchk_xattr_check_freemap(
 	struct xfs_scrub		*sc,
-	unsigned long			*map,
 	struct xfs_attr3_icleaf_hdr	*leafhdr)
 {
 	struct xchk_xattr_buf		*ab = sc->buf;
@@ -260,7 +264,7 @@ xchk_xattr_check_freemap(
 	}
 
 	/* Look for bits that are set in freemap and are marked in use. */
-	return !bitmap_intersects(ab->freemap, map, mapsize);
+	return !bitmap_intersects(ab->freemap, ab->usedmap, mapsize);
 }
 
 /*
@@ -280,7 +284,7 @@ xchk_xattr_entry(
 	__u32				*last_hashval)
 {
 	struct xfs_mount		*mp = ds->state->mp;
-	unsigned long			*usedmap = xchk_xattr_usedmap(ds->sc);
+	struct xchk_xattr_buf		*ab = ds->sc->buf;
 	char				*name_end;
 	struct xfs_attr_leaf_name_local	*lentry;
 	struct xfs_attr_leaf_name_remote *rentry;
@@ -320,7 +324,7 @@ xchk_xattr_entry(
 	if (name_end > buf_end)
 		xchk_da_set_corrupt(ds, level);
 
-	if (!xchk_xattr_set_map(ds->sc, usedmap, nameidx, namesize))
+	if (!xchk_xattr_set_map(ds->sc, ab->usedmap, nameidx, namesize))
 		xchk_da_set_corrupt(ds, level);
 	if (!(ds->sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT))
 		*usedbytes += namesize;
@@ -340,7 +344,7 @@ xchk_xattr_block(
 	struct xfs_attr_leafblock	*leaf = bp->b_addr;
 	struct xfs_attr_leaf_entry	*ent;
 	struct xfs_attr_leaf_entry	*entries;
-	unsigned long			*usedmap;
+	struct xchk_xattr_buf		*ab = ds->sc->buf;
 	char				*buf_end;
 	size_t				off;
 	__u32				last_hashval = 0;
@@ -358,10 +362,9 @@ xchk_xattr_block(
 		return -EDEADLOCK;
 	if (error)
 		return error;
-	usedmap = xchk_xattr_usedmap(ds->sc);
 
 	*last_checked = blk->blkno;
-	bitmap_zero(usedmap, mp->m_attr_geo->blksize);
+	bitmap_zero(ab->usedmap, mp->m_attr_geo->blksize);
 
 	/* Check all the padding. */
 	if (xfs_has_crc(ds->sc->mp)) {
@@ -385,7 +388,7 @@ xchk_xattr_block(
 		xchk_da_set_corrupt(ds, level);
 	if (leafhdr.firstused < hdrsize)
 		xchk_da_set_corrupt(ds, level);
-	if (!xchk_xattr_set_map(ds->sc, usedmap, 0, hdrsize))
+	if (!xchk_xattr_set_map(ds->sc, ab->usedmap, 0, hdrsize))
 		xchk_da_set_corrupt(ds, level);
 
 	if (ds->sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT)
@@ -399,7 +402,7 @@ xchk_xattr_block(
 	for (i = 0, ent = entries; i < leafhdr.count; ent++, i++) {
 		/* Mark the leaf entry itself. */
 		off = (char *)ent - (char *)leaf;
-		if (!xchk_xattr_set_map(ds->sc, usedmap, off,
+		if (!xchk_xattr_set_map(ds->sc, ab->usedmap, off,
 				sizeof(xfs_attr_leaf_entry_t))) {
 			xchk_da_set_corrupt(ds, level);
 			goto out;
@@ -413,7 +416,7 @@ xchk_xattr_block(
 			goto out;
 	}
 
-	if (!xchk_xattr_check_freemap(ds->sc, usedmap, &leafhdr))
+	if (!xchk_xattr_check_freemap(ds->sc, &leafhdr))
 		xchk_da_set_corrupt(ds, level);
 
 	if (leafhdr.usedbytes != usedbytes)
diff --git a/fs/xfs/scrub/attr.h b/fs/xfs/scrub/attr.h
index e6f11d44e84d..f6f033c19118 100644
--- a/fs/xfs/scrub/attr.h
+++ b/fs/xfs/scrub/attr.h
@@ -10,6 +10,9 @@
  * Temporary storage for online scrub and repair of extended attributes.
  */
 struct xchk_xattr_buf {
+	/* Bitmap of used space in xattr leaf blocks. */
+	unsigned long		*usedmap;
+
 	/* Bitmap of free space in xattr leaf blocks. */
 	unsigned long		*freemap;
 
@@ -17,13 +20,8 @@ struct xchk_xattr_buf {
 	size_t			sz;
 
 	/*
-	 * Memory buffer -- either used for extracting attr values while
-	 * walking the attributes; or for computing attr block bitmaps when
-	 * checking the attribute tree.
-	 *
-	 * Each bitmap contains enough bits to track every byte in an attr
-	 * block (rounded up to the size of an unsigned long).  The attr block
-	 * used space bitmap starts at the beginning of the buffer.
+	 * Memory buffer -- used for extracting attr values while walking the
+	 * attributes.
 	 */
 	uint8_t			buf[];
 };
@@ -38,14 +36,4 @@ xchk_xattr_valuebuf(
 	return ab->buf;
 }
 
-/* A bitmap of space usage computed by walking an attr leaf block. */
-static inline unsigned long *
-xchk_xattr_usedmap(
-	struct xfs_scrub	*sc)
-{
-	struct xchk_xattr_buf	*ab = sc->buf;
-
-	return (unsigned long *)ab->buf;
-}
-
 #endif	/* __XFS_SCRUB_ATTR_H__ */


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 06/11] xfs: split valuebuf from xchk_xattr_buf.buf
  2022-12-30 22:11 ` [PATCHSET v24.0 00/11] xfs: clean up memory management in xattr scrub Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 04/11] xfs: split freemap from xchk_xattr_buf.buf Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 02/11] xfs: don't shadow @leaf in xchk_xattr_block Darrick J. Wong
                     ` (8 subsequent siblings)
  10 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Move the xattr value buffer from somewhere in xchk_xattr_buf.buf[] to an
explicit pointer.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/attr.c |   89 +++++++++++++++++++++++++--------------------------
 fs/xfs/scrub/attr.h |   21 ++----------
 2 files changed, 46 insertions(+), 64 deletions(-)


diff --git a/fs/xfs/scrub/attr.c b/fs/xfs/scrub/attr.c
index c343ae932ae3..98371daa1397 100644
--- a/fs/xfs/scrub/attr.c
+++ b/fs/xfs/scrub/attr.c
@@ -31,6 +31,9 @@ xchk_xattr_buf_cleanup(
 	ab->freemap = NULL;
 	kvfree(ab->usedmap);
 	ab->usedmap = NULL;
+	kvfree(ab->value);
+	ab->value = NULL;
+	ab->value_sz = 0;
 }
 
 /*
@@ -44,54 +47,45 @@ xchk_setup_xattr_buf(
 	size_t			value_size,
 	gfp_t			flags)
 {
-	size_t			sz = value_size;
 	size_t			bmp_sz;
 	struct xchk_xattr_buf	*ab = sc->buf;
-	unsigned long		*old_usedmap = NULL;
-	unsigned long		*old_freemap = NULL;
+	void			*new_val;
 
 	bmp_sz = sizeof(long) * BITS_TO_LONGS(sc->mp->m_attr_geo->blksize);
 
-	/*
-	 * If there's already a buffer, figure out if we need to reallocate it
-	 * to accommodate a larger size.
-	 */
-	if (ab) {
-		if (sz <= ab->sz)
-			return 0;
-		old_freemap = ab->freemap;
-		old_usedmap = ab->usedmap;
-		kvfree(ab);
-		sc->buf = NULL;
-	}
+	if (ab)
+		goto resize_value;
 
-	/*
-	 * Don't zero the buffer upon allocation to avoid runtime overhead.
-	 * All users must be careful never to read uninitialized contents.
-	 */
-	ab = kvmalloc(sizeof(*ab) + sz, flags);
+	ab = kvzalloc(sizeof(struct xchk_xattr_buf), flags);
 	if (!ab)
 		return -ENOMEM;
-	ab->sz = sz;
 	sc->buf = ab;
 	sc->buf_cleanup = xchk_xattr_buf_cleanup;
 
-	if (old_usedmap) {
-		ab->usedmap = old_usedmap;
-	} else {
-		ab->usedmap = kvmalloc(bmp_sz, flags);
-		if (!ab->usedmap)
-			return -ENOMEM;
-	}
+	ab->usedmap = kvmalloc(bmp_sz, flags);
+	if (!ab->usedmap)
+		return -ENOMEM;
+
+	ab->freemap = kvmalloc(bmp_sz, flags);
+	if (!ab->freemap)
+		return -ENOMEM;
 
-	if (old_freemap) {
-		ab->freemap = old_freemap;
-	} else {
-		ab->freemap = kvmalloc(bmp_sz, flags);
-		if (!ab->freemap)
-			return -ENOMEM;
+resize_value:
+	if (ab->value_sz >= value_size)
+		return 0;
+
+	if (ab->value) {
+		kvfree(ab->value);
+		ab->value = NULL;
+		ab->value_sz = 0;
 	}
 
+	new_val = kvmalloc(value_size, flags);
+	if (!new_val)
+		return -ENOMEM;
+
+	ab->value = new_val;
+	ab->value_sz = value_size;
 	return 0;
 }
 
@@ -140,11 +134,24 @@ xchk_xattr_listent(
 	int				namelen,
 	int				valuelen)
 {
+	struct xfs_da_args		args = {
+		.op_flags		= XFS_DA_OP_NOTIME,
+		.attr_filter		= flags & XFS_ATTR_NSP_ONDISK_MASK,
+		.geo			= context->dp->i_mount->m_attr_geo,
+		.whichfork		= XFS_ATTR_FORK,
+		.dp			= context->dp,
+		.name			= name,
+		.namelen		= namelen,
+		.hashval		= xfs_da_hashname(name, namelen),
+		.trans			= context->tp,
+		.valuelen		= valuelen,
+	};
+	struct xchk_xattr_buf		*ab;
 	struct xchk_xattr		*sx;
-	struct xfs_da_args		args = { NULL };
 	int				error = 0;
 
 	sx = container_of(context, struct xchk_xattr, context);
+	ab = sx->sc->buf;
 
 	if (xchk_should_terminate(sx->sc, &error)) {
 		context->seen_enough = error;
@@ -182,17 +189,7 @@ xchk_xattr_listent(
 		return;
 	}
 
-	args.op_flags = XFS_DA_OP_NOTIME;
-	args.attr_filter = flags & XFS_ATTR_NSP_ONDISK_MASK;
-	args.geo = context->dp->i_mount->m_attr_geo;
-	args.whichfork = XFS_ATTR_FORK;
-	args.dp = context->dp;
-	args.name = name;
-	args.namelen = namelen;
-	args.hashval = xfs_da_hashname(args.name, args.namelen);
-	args.trans = context->tp;
-	args.value = xchk_xattr_valuebuf(sx->sc);
-	args.valuelen = valuelen;
+	args.value = ab->value;
 
 	error = xfs_attr_get_ilocked(&args);
 	/* ENODATA means the hash lookup failed and the attr is bad */
diff --git a/fs/xfs/scrub/attr.h b/fs/xfs/scrub/attr.h
index f6f033c19118..18445cc3d33b 100644
--- a/fs/xfs/scrub/attr.h
+++ b/fs/xfs/scrub/attr.h
@@ -16,24 +16,9 @@ struct xchk_xattr_buf {
 	/* Bitmap of free space in xattr leaf blocks. */
 	unsigned long		*freemap;
 
-	/* Size of @buf, in bytes. */
-	size_t			sz;
-
-	/*
-	 * Memory buffer -- used for extracting attr values while walking the
-	 * attributes.
-	 */
-	uint8_t			buf[];
+	/* Memory buffer used to extract xattr values. */
+	void			*value;
+	size_t			value_sz;
 };
 
-/* A place to store attribute values. */
-static inline uint8_t *
-xchk_xattr_valuebuf(
-	struct xfs_scrub	*sc)
-{
-	struct xchk_xattr_buf	*ab = sc->buf;
-
-	return ab->buf;
-}
-
 #endif	/* __XFS_SCRUB_ATTR_H__ */


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 07/11] xfs: remove flags argument from xchk_setup_xattr_buf
  2022-12-30 22:11 ` [PATCHSET v24.0 00/11] xfs: clean up memory management in xattr scrub Darrick J. Wong
                     ` (5 preceding siblings ...)
  2022-12-30 22:11   ` [PATCH 05/11] xfs: split usedmap from xchk_xattr_buf.buf Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 09/11] xfs: check used space of shortform xattr structures Darrick J. Wong
                     ` (3 subsequent siblings)
  10 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

All callers pass XCHK_GFP_FLAGS as the flags argument to
xchk_setup_xattr_buf, so get rid of the argument.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/attr.c |   18 ++++++++----------
 1 file changed, 8 insertions(+), 10 deletions(-)


diff --git a/fs/xfs/scrub/attr.c b/fs/xfs/scrub/attr.c
index 98371daa1397..df2f21296b30 100644
--- a/fs/xfs/scrub/attr.c
+++ b/fs/xfs/scrub/attr.c
@@ -44,8 +44,7 @@ xchk_xattr_buf_cleanup(
 static int
 xchk_setup_xattr_buf(
 	struct xfs_scrub	*sc,
-	size_t			value_size,
-	gfp_t			flags)
+	size_t			value_size)
 {
 	size_t			bmp_sz;
 	struct xchk_xattr_buf	*ab = sc->buf;
@@ -56,17 +55,17 @@ xchk_setup_xattr_buf(
 	if (ab)
 		goto resize_value;
 
-	ab = kvzalloc(sizeof(struct xchk_xattr_buf), flags);
+	ab = kvzalloc(sizeof(struct xchk_xattr_buf), XCHK_GFP_FLAGS);
 	if (!ab)
 		return -ENOMEM;
 	sc->buf = ab;
 	sc->buf_cleanup = xchk_xattr_buf_cleanup;
 
-	ab->usedmap = kvmalloc(bmp_sz, flags);
+	ab->usedmap = kvmalloc(bmp_sz, XCHK_GFP_FLAGS);
 	if (!ab->usedmap)
 		return -ENOMEM;
 
-	ab->freemap = kvmalloc(bmp_sz, flags);
+	ab->freemap = kvmalloc(bmp_sz, XCHK_GFP_FLAGS);
 	if (!ab->freemap)
 		return -ENOMEM;
 
@@ -80,7 +79,7 @@ xchk_setup_xattr_buf(
 		ab->value_sz = 0;
 	}
 
-	new_val = kvmalloc(value_size, flags);
+	new_val = kvmalloc(value_size, XCHK_GFP_FLAGS);
 	if (!new_val)
 		return -ENOMEM;
 
@@ -102,8 +101,7 @@ xchk_setup_xattr(
 	 * without the inode lock held, which means we can sleep.
 	 */
 	if (sc->flags & XCHK_TRY_HARDER) {
-		error = xchk_setup_xattr_buf(sc, XATTR_SIZE_MAX,
-				XCHK_GFP_FLAGS);
+		error = xchk_setup_xattr_buf(sc, XATTR_SIZE_MAX);
 		if (error)
 			return error;
 	}
@@ -181,7 +179,7 @@ xchk_xattr_listent(
 	 * doesn't work, we overload the seen_enough variable to convey
 	 * the error message back to the main scrub function.
 	 */
-	error = xchk_setup_xattr_buf(sx->sc, valuelen, XCHK_GFP_FLAGS);
+	error = xchk_setup_xattr_buf(sx->sc, valuelen);
 	if (error == -ENOMEM)
 		error = -EDEADLOCK;
 	if (error) {
@@ -354,7 +352,7 @@ xchk_xattr_block(
 		return 0;
 
 	/* Allocate memory for block usage checking. */
-	error = xchk_setup_xattr_buf(ds->sc, 0, XCHK_GFP_FLAGS);
+	error = xchk_setup_xattr_buf(ds->sc, 0);
 	if (error == -ENOMEM)
 		return -EDEADLOCK;
 	if (error)


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 08/11] xfs: move xattr scrub buffer allocation to top level function
  2022-12-30 22:11 ` [PATCHSET v24.0 00/11] xfs: clean up memory management in xattr scrub Darrick J. Wong
                     ` (9 preceding siblings ...)
  2022-12-30 22:11   ` [PATCH 11/11] xfs: only allocate free space bitmap for xattr scrub if needed Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  10 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Move the xchk_setup_xattr_buf call from xchk_xattr_block to xchk_xattr,
since we only need to set up the leaf block bitmaps once.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/attr.c |   15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)


diff --git a/fs/xfs/scrub/attr.c b/fs/xfs/scrub/attr.c
index df2f21296b30..a98ea78c41a0 100644
--- a/fs/xfs/scrub/attr.c
+++ b/fs/xfs/scrub/attr.c
@@ -346,18 +346,10 @@ xchk_xattr_block(
 	unsigned int			usedbytes = 0;
 	unsigned int			hdrsize;
 	int				i;
-	int				error;
 
 	if (*last_checked == blk->blkno)
 		return 0;
 
-	/* Allocate memory for block usage checking. */
-	error = xchk_setup_xattr_buf(ds->sc, 0);
-	if (error == -ENOMEM)
-		return -EDEADLOCK;
-	if (error)
-		return error;
-
 	*last_checked = blk->blkno;
 	bitmap_zero(ab->usedmap, mp->m_attr_geo->blksize);
 
@@ -507,6 +499,13 @@ xchk_xattr(
 	if (!xfs_inode_hasattr(sc->ip))
 		return -ENOENT;
 
+	/* Allocate memory for xattr checking. */
+	error = xchk_setup_xattr_buf(sc, 0);
+	if (error == -ENOMEM)
+		return -EDEADLOCK;
+	if (error)
+		return error;
+
 	memset(&sx, 0, sizeof(sx));
 	/* Check attribute tree structure */
 	error = xchk_da_btree(sc, XFS_ATTR_FORK, xchk_xattr_rec,


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 09/11] xfs: check used space of shortform xattr structures
  2022-12-30 22:11 ` [PATCHSET v24.0 00/11] xfs: clean up memory management in xattr scrub Darrick J. Wong
                     ` (6 preceding siblings ...)
  2022-12-30 22:11   ` [PATCH 07/11] xfs: remove flags argument from xchk_setup_xattr_buf Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 10/11] xfs: clean up xattr scrub initialization Darrick J. Wong
                     ` (2 subsequent siblings)
  10 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Make sure that the records used inside a shortform xattr structure do
not overlap.
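
The check claims each record's bytes in a bitmap of the shortform fork
and flags corruption if any byte is claimed twice.  A minimal userspace
sketch of that idea follows; the entry layout and helper names are
invented for illustration and are not the kernel structures:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct sf_entry { uint16_t off; uint16_t len; };	/* hypothetical layout */

/* Returns false if [off, off + len) overlaps a previously seen record. */
static bool mark_used(uint8_t *usedmap, unsigned int off, unsigned int len)
{
	for (unsigned int i = off; i < off + len; i++) {
		if (usedmap[i / 8] & (1 << (i % 8)))
			return false;	/* byte already claimed: overlap */
		usedmap[i / 8] |= 1 << (i % 8);
	}
	return true;
}

int main(void)
{
	struct sf_entry recs[] = { {0, 4}, {4, 8}, {10, 4} };
	uint8_t usedmap[8] = {0};

	for (unsigned int i = 0; i < 3; i++)
		if (!mark_used(usedmap, recs[i].off, recs[i].len))
			printf("record %u overlaps earlier space\n", i);
	return 0;	/* record 2 overlaps bytes set by record 1 */
}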

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/attr.c |   79 ++++++++++++++++++++++++++++++++++++++++++++++++---
 fs/xfs/scrub/attr.h |    2 +
 2 files changed, 76 insertions(+), 5 deletions(-)


diff --git a/fs/xfs/scrub/attr.c b/fs/xfs/scrub/attr.c
index a98ea78c41a0..3e568c78210b 100644
--- a/fs/xfs/scrub/attr.c
+++ b/fs/xfs/scrub/attr.c
@@ -15,6 +15,7 @@
 #include "xfs_da_btree.h"
 #include "xfs_attr.h"
 #include "xfs_attr_leaf.h"
+#include "xfs_attr_sf.h"
 #include "scrub/scrub.h"
 #include "scrub/common.h"
 #include "scrub/dabtree.h"
@@ -487,6 +488,73 @@ xchk_xattr_rec(
 	return error;
 }
 
+/* Check space usage of shortform attrs. */
+STATIC int
+xchk_xattr_check_sf(
+	struct xfs_scrub		*sc)
+{
+	struct xchk_xattr_buf		*ab = sc->buf;
+	struct xfs_attr_shortform	*sf;
+	struct xfs_attr_sf_entry	*sfe;
+	struct xfs_attr_sf_entry	*next;
+	struct xfs_ifork		*ifp;
+	unsigned char			*end;
+	int				i;
+	int				error = 0;
+
+	ifp = xfs_ifork_ptr(sc->ip, XFS_ATTR_FORK);
+
+	bitmap_zero(ab->usedmap, ifp->if_bytes);
+	sf = (struct xfs_attr_shortform *)sc->ip->i_af.if_u1.if_data;
+	end = (unsigned char *)ifp->if_u1.if_data + ifp->if_bytes;
+	xchk_xattr_set_map(sc, ab->usedmap, 0, sizeof(sf->hdr));
+
+	sfe = &sf->list[0];
+	if ((unsigned char *)sfe > end) {
+		xchk_fblock_set_corrupt(sc, XFS_ATTR_FORK, 0);
+		return 0;
+	}
+
+	for (i = 0; i < sf->hdr.count; i++) {
+		unsigned char		*name = sfe->nameval;
+		unsigned char		*value = &sfe->nameval[sfe->namelen];
+
+		if (xchk_should_terminate(sc, &error))
+			return error;
+
+		next = xfs_attr_sf_nextentry(sfe);
+		if ((unsigned char *)next > end) {
+			xchk_fblock_set_corrupt(sc, XFS_ATTR_FORK, 0);
+			break;
+		}
+
+		if (!xchk_xattr_set_map(sc, ab->usedmap,
+				(char *)sfe - (char *)sf,
+				sizeof(struct xfs_attr_sf_entry))) {
+			xchk_fblock_set_corrupt(sc, XFS_ATTR_FORK, 0);
+			break;
+		}
+
+		if (!xchk_xattr_set_map(sc, ab->usedmap,
+				(char *)name - (char *)sf,
+				sfe->namelen)) {
+			xchk_fblock_set_corrupt(sc, XFS_ATTR_FORK, 0);
+			break;
+		}
+
+		if (!xchk_xattr_set_map(sc, ab->usedmap,
+				(char *)value - (char *)sf,
+				sfe->valuelen)) {
+			xchk_fblock_set_corrupt(sc, XFS_ATTR_FORK, 0);
+			break;
+		}
+
+		sfe = next;
+	}
+
+	return 0;
+}
+
 /* Scrub the extended attribute metadata. */
 int
 xchk_xattr(
@@ -506,10 +574,12 @@ xchk_xattr(
 	if (error)
 		return error;
 
-	memset(&sx, 0, sizeof(sx));
-	/* Check attribute tree structure */
-	error = xchk_da_btree(sc, XFS_ATTR_FORK, xchk_xattr_rec,
-			&last_checked);
+	/* Check the physical structure of the xattr. */
+	if (sc->ip->i_af.if_format == XFS_DINODE_FMT_LOCAL)
+		error = xchk_xattr_check_sf(sc);
+	else
+		error = xchk_da_btree(sc, XFS_ATTR_FORK, xchk_xattr_rec,
+				&last_checked);
 	if (error)
 		goto out;
 
@@ -517,6 +587,7 @@ xchk_xattr(
 		goto out;
 
 	/* Check that every attr key can also be looked up by hash. */
+	memset(&sx, 0, sizeof(sx));
 	sx.context.dp = sc->ip;
 	sx.context.resynch = 1;
 	sx.context.put_listent = xchk_xattr_listent;
diff --git a/fs/xfs/scrub/attr.h b/fs/xfs/scrub/attr.h
index 18445cc3d33b..5f6835752738 100644
--- a/fs/xfs/scrub/attr.h
+++ b/fs/xfs/scrub/attr.h
@@ -10,7 +10,7 @@
  * Temporary storage for online scrub and repair of extended attributes.
  */
 struct xchk_xattr_buf {
-	/* Bitmap of used space in xattr leaf blocks. */
+	/* Bitmap of used space in xattr leaf blocks and shortform forks. */
 	unsigned long		*usedmap;
 
 	/* Bitmap of free space in xattr leaf blocks. */


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 10/11] xfs: clean up xattr scrub initialization
  2022-12-30 22:11 ` [PATCHSET v24.0 00/11] xfs: clean up memory management in xattr scrub Darrick J. Wong
                     ` (7 preceding siblings ...)
  2022-12-30 22:11   ` [PATCH 09/11] xfs: check used space of shortform xattr structures Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 11/11] xfs: only allocate free space bitmap for xattr scrub if needed Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 08/11] xfs: move xattr scrub buffer allocation to top level function Darrick J. Wong
  10 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Clean up local variable initialization and error returns in xchk_xattr.
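
The cleanup leans on designated initializers: members that are not named
are zeroed, so the memset() and the string of assignments can go away.
A tiny sketch of the style, with stand-in types rather than the scrub
structures:

#include <stdbool.h>
#include <stdio.h>

struct list_ctx { void *dp; bool resynch; bool allow_incomplete; };
struct walk_state { struct list_ctx context; int errors; };

int main(void)
{
	struct walk_state sx = {
		.context = {
			.resynch		= true,
			.allow_incomplete	= true,
		},
	};

	/* Unnamed members (dp, errors) are implicitly zeroed. */
	printf("resynch=%d incomplete=%d errors=%d\n",
			sx.context.resynch, sx.context.allow_incomplete,
			sx.errors);
	return 0;
}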

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/attr.c |   34 +++++++++++++++++-----------------
 1 file changed, 17 insertions(+), 17 deletions(-)


diff --git a/fs/xfs/scrub/attr.c b/fs/xfs/scrub/attr.c
index 3e568c78210b..ea4a723f175c 100644
--- a/fs/xfs/scrub/attr.c
+++ b/fs/xfs/scrub/attr.c
@@ -560,7 +560,16 @@ int
 xchk_xattr(
 	struct xfs_scrub		*sc)
 {
-	struct xchk_xattr		sx;
+	struct xchk_xattr		sx = {
+		.sc			= sc,
+		.context		= {
+			.dp		= sc->ip,
+			.tp		= sc->tp,
+			.resynch	= 1,
+			.put_listent	= xchk_xattr_listent,
+			.allow_incomplete = true,
+		},
+	};
 	xfs_dablk_t			last_checked = -1U;
 	int				error = 0;
 
@@ -581,22 +590,13 @@ xchk_xattr(
 		error = xchk_da_btree(sc, XFS_ATTR_FORK, xchk_xattr_rec,
 				&last_checked);
 	if (error)
-		goto out;
+		return error;
 
 	if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT)
-		goto out;
-
-	/* Check that every attr key can also be looked up by hash. */
-	memset(&sx, 0, sizeof(sx));
-	sx.context.dp = sc->ip;
-	sx.context.resynch = 1;
-	sx.context.put_listent = xchk_xattr_listent;
-	sx.context.tp = sc->tp;
-	sx.context.allow_incomplete = true;
-	sx.sc = sc;
+		return 0;
 
 	/*
-	 * Look up every xattr in this file by name.
+	 * Look up every xattr in this file by name and hash.
 	 *
 	 * Use the backend implementation of xfs_attr_list to call
 	 * xchk_xattr_listent on every attribute key in this inode.
@@ -613,11 +613,11 @@ xchk_xattr(
 	 */
 	error = xfs_attr_list_ilocked(&sx.context);
 	if (!xchk_fblock_process_error(sc, XFS_ATTR_FORK, 0, &error))
-		goto out;
+		return error;
 
 	/* Did our listent function try to return any errors? */
 	if (sx.context.seen_enough < 0)
-		error = sx.context.seen_enough;
-out:
-	return error;
+		return sx.context.seen_enough;
+
+	return 0;
 }


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 11/11] xfs: only allocate free space bitmap for xattr scrub if needed
  2022-12-30 22:11 ` [PATCHSET v24.0 00/11] xfs: clean up memory management in xattr scrub Darrick J. Wong
                     ` (8 preceding siblings ...)
  2022-12-30 22:11   ` [PATCH 10/11] xfs: clean up xattr scrub initialization Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 08/11] xfs: move xattr scrub buffer allocation to top level function Darrick J. Wong
  10 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

The free space bitmap is only required if we're going to check the
bestfree space at the end of an xattr leaf block.  Therefore, we can
reduce the memory requirements of this scrubber if we can determine that
the xattr is in short format.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/attr.c |   31 ++++++++++++++++++++++++++++---
 1 file changed, 28 insertions(+), 3 deletions(-)


diff --git a/fs/xfs/scrub/attr.c b/fs/xfs/scrub/attr.c
index ea4a723f175c..ea9d0f1a6fd0 100644
--- a/fs/xfs/scrub/attr.c
+++ b/fs/xfs/scrub/attr.c
@@ -37,6 +37,29 @@ xchk_xattr_buf_cleanup(
 	ab->value_sz = 0;
 }
 
+/*
+ * Allocate the free space bitmap if we're trying harder; there are leaf blocks
+ * in the attr fork; or we can't tell if there are leaf blocks.
+ */
+static inline bool
+xchk_xattr_want_freemap(
+	struct xfs_scrub	*sc)
+{
+	struct xfs_ifork	*ifp;
+
+	if (sc->flags & XCHK_TRY_HARDER)
+		return true;
+
+	if (!sc->ip)
+		return true;
+
+	ifp = xfs_ifork_ptr(sc->ip, XFS_ATTR_FORK);
+	if (!ifp)
+		return false;
+
+	return xfs_ifork_has_extents(ifp);
+}
+
 /*
  * Allocate enough memory to hold an attr value and attr block bitmaps,
  * reallocating the buffer if necessary.  Buffer contents are not preserved
@@ -66,9 +89,11 @@ xchk_setup_xattr_buf(
 	if (!ab->usedmap)
 		return -ENOMEM;
 
-	ab->freemap = kvmalloc(bmp_sz, XCHK_GFP_FLAGS);
-	if (!ab->freemap)
-		return -ENOMEM;
+	if (xchk_xattr_want_freemap(sc)) {
+		ab->freemap = kvmalloc(bmp_sz, XCHK_GFP_FLAGS);
+		if (!ab->freemap)
+			return -ENOMEM;
+	}
 
 resize_value:
 	if (ab->value_sz >= value_size)


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCHSET v24.0 0/3] xfs: rework online fsck incore bitmap
  2022-12-30 21:13 [NYE DELUGE 1/4] xfs: all pending online scrub improvements Darrick J. Wong
                   ` (15 preceding siblings ...)
  2022-12-30 22:11 ` [PATCHSET v24.0 00/11] xfs: clean up memory management in xattr scrub Darrick J. Wong
@ 2022-12-30 22:11 ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 2/3] xfs: drop the _safe behavior from the xbitmap foreach macro Darrick J. Wong
                     ` (2 more replies)
  2022-12-30 22:11 ` [PATCHSET v24.0 0/5] xfs: strengthen rmapbt scrubbing Darrick J. Wong
                   ` (5 subsequent siblings)
  22 siblings, 3 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

Hi all,

In this series, we make some changes to the incore bitmap code: First,
we shorten the prefix to 'xbitmap'.  Then, we rework some utility
functions for later use by online repair and clarify how the walk
functions are supposed to be used.

Finally, we use all these new pieces to convert the incore bitmap to use
an interval tree instead of linked lists.  This lifts the limitation
that callers had to be careful not to set a range that was already set,
and gets us ready for the btree rebuilder functions, which need to be
able to set bits in a bitmap and later generate maximal contiguous
extents for the set ranges.
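
To make that end state concrete, here is a userspace sketch of the
semantics we are after (not the kernel implementation): ranges can be
set in any order, overlaps are fine, and walking the bitmap yields
maximal contiguous extents.  The helper names are invented for
illustration:

#include <stdint.h>
#include <stdio.h>

#define NBITS	64

static void set_range(uint8_t *map, unsigned int start, unsigned int len)
{
	for (unsigned int i = start; i < start + len; i++)
		map[i / 8] |= 1 << (i % 8);
}

static int test_bit(const uint8_t *map, unsigned int i)
{
	return map[i / 8] & (1 << (i % 8));
}

int main(void)
{
	uint8_t map[NBITS / 8] = {0};

	set_range(map, 10, 4);	/* out of order and overlapping */
	set_range(map, 2, 3);
	set_range(map, 12, 6);

	for (unsigned int i = 0; i < NBITS; ) {
		unsigned int start;

		if (!test_bit(map, i)) {
			i++;
			continue;
		}
		start = i;
		while (i < NBITS && test_bit(map, i))
			i++;
		printf("extent [%u, %u]\n", start, i - 1);
	}
	return 0;	/* prints [2, 4] and [10, 17] */
}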

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-bitmap-rework
---
 fs/xfs/scrub/agheader_repair.c |   99 ++++++-----
 fs/xfs/scrub/bitmap.c          |  367 +++++++++++++++++++++++++---------------
 fs/xfs/scrub/bitmap.h          |   33 ++--
 fs/xfs/scrub/repair.c          |  104 ++++++-----
 4 files changed, 358 insertions(+), 245 deletions(-)


^ permalink raw reply	[flat|nested] 220+ messages in thread

* [PATCH 1/3] xfs: remove the for_each_xbitmap_ helpers
  2022-12-30 22:11 ` [PATCHSET v24.0 0/3] xfs: rework online fsck incore bitmap Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 2/3] xfs: drop the _safe behavior from the xbitmap foreach macro Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 3/3] xfs: convert xbitmap to interval tree Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  2 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Remove the for_each_xbitmap_ macros in favor of proper iterator
functions.  We'll soon be switching this data structure over to an
interval tree implementation, which means that we can't allow callers to
modify the bitmap during iteration without telling us.
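
A minimal userspace sketch of the iterator pattern being adopted: the
structure walks itself and invokes a caller-supplied callback with an
opaque pointer, so callers never touch the internal list directly.  The
names below are made up for illustration:

#include <stdint.h>
#include <stdio.h>

struct extent { uint64_t start; uint64_t len; };

typedef int (*walk_fn)(uint64_t start, uint64_t len, void *priv);

/* Walk every extent; a nonzero callback return stops the walk. */
static int extents_walk(const struct extent *ext, unsigned int nr,
		walk_fn fn, void *priv)
{
	for (unsigned int i = 0; i < nr; i++) {
		int error = fn(ext[i].start, ext[i].len, priv);

		if (error)
			return error;
	}
	return 0;
}

static int count_blocks(uint64_t start, uint64_t len, void *priv)
{
	*(uint64_t *)priv += len;
	return 0;
}

int main(void)
{
	struct extent ext[] = { {5, 2}, {20, 8} };
	uint64_t total = 0;

	extents_walk(ext, 2, count_blocks, &total);
	printf("%llu blocks\n", (unsigned long long)total);	/* 10 */
	return 0;
}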

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/agheader_repair.c |   89 +++++++++++++++++++---------------
 fs/xfs/scrub/bitmap.c          |   59 +++++++++++++++++++++++
 fs/xfs/scrub/bitmap.h          |   22 ++++++--
 fs/xfs/scrub/repair.c          |  104 ++++++++++++++++++++++------------------
 4 files changed, 180 insertions(+), 94 deletions(-)


diff --git a/fs/xfs/scrub/agheader_repair.c b/fs/xfs/scrub/agheader_repair.c
index d75d82151eeb..26bce2f12b09 100644
--- a/fs/xfs/scrub/agheader_repair.c
+++ b/fs/xfs/scrub/agheader_repair.c
@@ -486,10 +486,11 @@ xrep_agfl_walk_rmap(
 /* Strike out the blocks that are cross-linked according to the rmapbt. */
 STATIC int
 xrep_agfl_check_extent(
-	struct xrep_agfl	*ra,
 	uint64_t		start,
-	uint64_t		len)
+	uint64_t		len,
+	void			*priv)
 {
+	struct xrep_agfl	*ra = priv;
 	xfs_agblock_t		agbno = XFS_FSB_TO_AGBNO(ra->sc->mp, start);
 	xfs_agblock_t		last_agbno = agbno + len - 1;
 	int			error;
@@ -537,7 +538,6 @@ xrep_agfl_collect_blocks(
 	struct xrep_agfl	ra;
 	struct xfs_mount	*mp = sc->mp;
 	struct xfs_btree_cur	*cur;
-	struct xbitmap_range	*br, *n;
 	int			error;
 
 	ra.sc = sc;
@@ -578,11 +578,7 @@ xrep_agfl_collect_blocks(
 
 	/* Strike out the blocks that are cross-linked. */
 	ra.rmap_cur = xfs_rmapbt_init_cursor(mp, sc->tp, agf_bp, sc->sa.pag);
-	for_each_xbitmap_extent(br, n, agfl_extents) {
-		error = xrep_agfl_check_extent(&ra, br->start, br->len);
-		if (error)
-			break;
-	}
+	error = xbitmap_walk(agfl_extents, xrep_agfl_check_extent, &ra);
 	xfs_btree_del_cursor(ra.rmap_cur, error);
 	if (error)
 		goto out_bmp;
@@ -628,6 +624,43 @@ xrep_agfl_update_agf(
 			XFS_AGF_FLFIRST | XFS_AGF_FLLAST | XFS_AGF_FLCOUNT);
 }
 
+struct xrep_agfl_fill {
+	struct xbitmap		used_extents;
+	struct xfs_scrub	*sc;
+	__be32			*agfl_bno;
+	xfs_agblock_t		flcount;
+	unsigned int		fl_off;
+};
+
+/* Fill the AGFL with whatever blocks are in this extent. */
+static int
+xrep_agfl_fill(
+	uint64_t		start,
+	uint64_t		len,
+	void			*priv)
+{
+	struct xrep_agfl_fill	*af = priv;
+	struct xfs_scrub	*sc = af->sc;
+	xfs_fsblock_t		fsbno = start;
+	int			error;
+
+	while (fsbno < start + len && af->fl_off < af->flcount)
+		af->agfl_bno[af->fl_off++] =
+				cpu_to_be32(XFS_FSB_TO_AGBNO(sc->mp, fsbno++));
+
+	trace_xrep_agfl_insert(sc->mp, sc->sa.pag->pag_agno,
+			XFS_FSB_TO_AGBNO(sc->mp, start), len);
+
+	error = xbitmap_set(&af->used_extents, start, fsbno - 1);
+	if (error)
+		return error;
+
+	if (af->fl_off == af->flcount)
+		return -ECANCELED;
+
+	return 0;
+}
+
 /* Write out a totally new AGFL. */
 STATIC void
 xrep_agfl_init_header(
@@ -636,13 +669,12 @@ xrep_agfl_init_header(
 	struct xbitmap		*agfl_extents,
 	xfs_agblock_t		flcount)
 {
+	struct xrep_agfl_fill	af = {
+		.sc		= sc,
+		.flcount	= flcount,
+	};
 	struct xfs_mount	*mp = sc->mp;
-	__be32			*agfl_bno;
-	struct xbitmap_range	*br;
-	struct xbitmap_range	*n;
 	struct xfs_agfl		*agfl;
-	xfs_agblock_t		agbno;
-	unsigned int		fl_off;
 
 	ASSERT(flcount <= xfs_agfl_size(mp));
 
@@ -661,36 +693,15 @@ xrep_agfl_init_header(
 	 * blocks than fit in the AGFL, they will be freed in a subsequent
 	 * step.
 	 */
-	fl_off = 0;
-	agfl_bno = xfs_buf_to_agfl_bno(agfl_bp);
-	for_each_xbitmap_extent(br, n, agfl_extents) {
-		agbno = XFS_FSB_TO_AGBNO(mp, br->start);
-
-		trace_xrep_agfl_insert(mp, sc->sa.pag->pag_agno, agbno,
-				br->len);
-
-		while (br->len > 0 && fl_off < flcount) {
-			agfl_bno[fl_off] = cpu_to_be32(agbno);
-			fl_off++;
-			agbno++;
-
-			/*
-			 * We've now used br->start by putting it in the AGFL,
-			 * so bump br so that we don't reap the block later.
-			 */
-			br->start++;
-			br->len--;
-		}
-
-		if (br->len)
-			break;
-		list_del(&br->list);
-		kfree(br);
-	}
+	xbitmap_init(&af.used_extents);
+	af.agfl_bno = xfs_buf_to_agfl_bno(agfl_bp),
+	xbitmap_walk(agfl_extents, xrep_agfl_fill, &af);
+	xbitmap_disunion(agfl_extents, &af.used_extents);
 
 	/* Write new AGFL to disk. */
 	xfs_trans_buf_set_type(sc->tp, agfl_bp, XFS_BLFT_AGFL_BUF);
 	xfs_trans_log_buf(sc->tp, agfl_bp, 0, BBTOB(agfl_bp->b_length) - 1);
+	xbitmap_destroy(&af.used_extents);
 }
 
 /* Repair the AGFL. */
diff --git a/fs/xfs/scrub/bitmap.c b/fs/xfs/scrub/bitmap.c
index a255f09e9f0a..d32ded56da90 100644
--- a/fs/xfs/scrub/bitmap.c
+++ b/fs/xfs/scrub/bitmap.c
@@ -13,6 +13,9 @@
 #include "scrub/scrub.h"
 #include "scrub/bitmap.h"
 
+#define for_each_xbitmap_extent(bex, n, bitmap) \
+	list_for_each_entry_safe((bex), (n), &(bitmap)->list, list)
+
 /*
  * Set a range of this bitmap.  Caller must ensure the range is not set.
  *
@@ -313,3 +316,59 @@ xbitmap_hweight(
 
 	return ret;
 }
+
+/* Call a function for every run of set bits in this bitmap. */
+int
+xbitmap_walk(
+	struct xbitmap		*bitmap,
+	xbitmap_walk_fn	fn,
+	void			*priv)
+{
+	struct xbitmap_range	*bex, *n;
+	int			error = 0;
+
+	for_each_xbitmap_extent(bex, n, bitmap) {
+		error = fn(bex->start, bex->len, priv);
+		if (error)
+			break;
+	}
+
+	return error;
+}
+
+struct xbitmap_walk_bits {
+	xbitmap_walk_bits_fn	fn;
+	void			*priv;
+};
+
+/* Walk all the bits in a run. */
+static int
+xbitmap_walk_bits_in_run(
+	uint64_t			start,
+	uint64_t			len,
+	void				*priv)
+{
+	struct xbitmap_walk_bits	*wb = priv;
+	uint64_t			i;
+	int				error = 0;
+
+	for (i = start; i < start + len; i++) {
+		error = wb->fn(i, wb->priv);
+		if (error)
+			break;
+	}
+
+	return error;
+}
+
+/* Call a function for every set bit in this bitmap. */
+int
+xbitmap_walk_bits(
+	struct xbitmap			*bitmap,
+	xbitmap_walk_bits_fn		fn,
+	void				*priv)
+{
+	struct xbitmap_walk_bits	wb = {.fn = fn, .priv = priv};
+
+	return xbitmap_walk(bitmap, xbitmap_walk_bits_in_run, &wb);
+}
diff --git a/fs/xfs/scrub/bitmap.h b/fs/xfs/scrub/bitmap.h
index 900646b72de1..53601d281ffb 100644
--- a/fs/xfs/scrub/bitmap.h
+++ b/fs/xfs/scrub/bitmap.h
@@ -19,13 +19,6 @@ struct xbitmap {
 void xbitmap_init(struct xbitmap *bitmap);
 void xbitmap_destroy(struct xbitmap *bitmap);
 
-#define for_each_xbitmap_extent(bex, n, bitmap) \
-	list_for_each_entry_safe((bex), (n), &(bitmap)->list, list)
-
-#define for_each_xbitmap_block(b, bex, n, bitmap) \
-	list_for_each_entry_safe((bex), (n), &(bitmap)->list, list) \
-		for ((b) = (bex)->start; (b) < (bex)->start + (bex)->len; (b)++)
-
 int xbitmap_set(struct xbitmap *bitmap, uint64_t start, uint64_t len);
 int xbitmap_disunion(struct xbitmap *bitmap, struct xbitmap *sub);
 int xbitmap_set_btcur_path(struct xbitmap *bitmap,
@@ -34,4 +27,19 @@ int xbitmap_set_btblocks(struct xbitmap *bitmap,
 		struct xfs_btree_cur *cur);
 uint64_t xbitmap_hweight(struct xbitmap *bitmap);
 
+/*
+ * Return codes for the bitmap iterator functions are 0 to continue iterating,
+ * and non-zero to stop iterating.  Any non-zero value will be passed up to the
+ * iteration caller.  The special value -ECANCELED can be used to stop
+ * iteration, because neither bitmap iterator ever generates that error code on
+ * its own.  Callers must not modify the bitmap while walking it.
+ */
+typedef int (*xbitmap_walk_fn)(uint64_t start, uint64_t len, void *priv);
+int xbitmap_walk(struct xbitmap *bitmap, xbitmap_walk_fn fn,
+		void *priv);
+
+typedef int (*xbitmap_walk_bits_fn)(uint64_t bit, void *priv);
+int xbitmap_walk_bits(struct xbitmap *bitmap, xbitmap_walk_bits_fn fn,
+		void *priv);
+
 #endif	/* __XFS_SCRUB_BITMAP_H__ */
diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
index 446ffe987ca0..074c6f5974d1 100644
--- a/fs/xfs/scrub/repair.c
+++ b/fs/xfs/scrub/repair.c
@@ -446,6 +446,30 @@ xrep_init_btblock(
  * buffers associated with @bitmap.
  */
 
+static int
+xrep_invalidate_block(
+	uint64_t		fsbno,
+	void			*priv)
+{
+	struct xfs_scrub	*sc = priv;
+	struct xfs_buf		*bp;
+	int			error;
+
+	/* Skip AG headers and post-EOFS blocks */
+	if (!xfs_verify_fsbno(sc->mp, fsbno))
+		return 0;
+
+	error = xfs_buf_incore(sc->mp->m_ddev_targp,
+			XFS_FSB_TO_DADDR(sc->mp, fsbno),
+			XFS_FSB_TO_BB(sc->mp, 1), XBF_TRYLOCK, &bp);
+	if (error)
+		return 0;
+
+	xfs_trans_bjoin(sc->tp, bp);
+	xfs_trans_binval(sc->tp, bp);
+	return 0;
+}
+
 /*
  * Invalidate buffers for per-AG btree blocks we're dumping.  This function
  * is not intended for use with file data repairs; we have bunmapi for that.
@@ -455,11 +479,6 @@ xrep_invalidate_blocks(
 	struct xfs_scrub	*sc,
 	struct xbitmap		*bitmap)
 {
-	struct xbitmap_range	*bmr;
-	struct xbitmap_range	*n;
-	struct xfs_buf		*bp;
-	xfs_fsblock_t		fsbno;
-
 	/*
 	 * For each block in each extent, see if there's an incore buffer for
 	 * exactly that block; if so, invalidate it.  The buffer cache only
@@ -468,23 +487,7 @@ xrep_invalidate_blocks(
 	 * because we never own those; and if we can't TRYLOCK the buffer we
 	 * assume it's owned by someone else.
 	 */
-	for_each_xbitmap_block(fsbno, bmr, n, bitmap) {
-		int		error;
-
-		/* Skip AG headers and post-EOFS blocks */
-		if (!xfs_verify_fsbno(sc->mp, fsbno))
-			continue;
-		error = xfs_buf_incore(sc->mp->m_ddev_targp,
-				XFS_FSB_TO_DADDR(sc->mp, fsbno),
-				XFS_FSB_TO_BB(sc->mp, 1), XBF_TRYLOCK, &bp);
-		if (error)
-			continue;
-
-		xfs_trans_bjoin(sc->tp, bp);
-		xfs_trans_binval(sc->tp, bp);
-	}
-
-	return 0;
+	return xbitmap_walk_bits(bitmap, xrep_invalidate_block, sc);
 }
 
 /* Ensure the freelist is the correct size. */
@@ -505,6 +508,15 @@ xrep_fix_freelist(
 			can_shrink ? 0 : XFS_ALLOC_FLAG_NOSHRINK);
 }
 
+/* Information about reaping extents after a repair. */
+struct xrep_reap_state {
+	struct xfs_scrub		*sc;
+
+	/* Reverse mapping owner and metadata reservation type. */
+	const struct xfs_owner_info	*oinfo;
+	enum xfs_ag_resv_type		resv;
+};
+
 /*
  * Put a block back on the AGFL.
  */
@@ -549,17 +561,23 @@ xrep_put_freelist(
 /* Dispose of a single block. */
 STATIC int
 xrep_reap_block(
-	struct xfs_scrub		*sc,
-	xfs_fsblock_t			fsbno,
-	const struct xfs_owner_info	*oinfo,
-	enum xfs_ag_resv_type		resv)
+	uint64_t			fsbno,
+	void				*priv)
 {
+	struct xrep_reap_state		*rs = priv;
+	struct xfs_scrub		*sc = rs->sc;
 	struct xfs_btree_cur		*cur;
 	struct xfs_buf			*agf_bp = NULL;
 	xfs_agblock_t			agbno;
 	bool				has_other_rmap;
 	int				error;
 
+	ASSERT(sc->ip != NULL ||
+	       XFS_FSB_TO_AGNO(sc->mp, fsbno) == sc->sa.pag->pag_agno);
+	trace_xrep_dispose_btree_extent(sc->mp,
+			XFS_FSB_TO_AGNO(sc->mp, fsbno),
+			XFS_FSB_TO_AGBNO(sc->mp, fsbno), 1);
+
 	agbno = XFS_FSB_TO_AGBNO(sc->mp, fsbno);
 	ASSERT(XFS_FSB_TO_AGNO(sc->mp, fsbno) == sc->sa.pag->pag_agno);
 
@@ -578,7 +596,8 @@ xrep_reap_block(
 	cur = xfs_rmapbt_init_cursor(sc->mp, sc->tp, agf_bp, sc->sa.pag);
 
 	/* Can we find any other rmappings? */
-	error = xfs_rmap_has_other_keys(cur, agbno, 1, oinfo, &has_other_rmap);
+	error = xfs_rmap_has_other_keys(cur, agbno, 1, rs->oinfo,
+			&has_other_rmap);
 	xfs_btree_del_cursor(cur, error);
 	if (error)
 		goto out_free;
@@ -598,12 +617,12 @@ xrep_reap_block(
 	 */
 	if (has_other_rmap)
 		error = xfs_rmap_free(sc->tp, agf_bp, sc->sa.pag, agbno,
-					1, oinfo);
-	else if (resv == XFS_AG_RESV_AGFL)
+					1, rs->oinfo);
+	else if (rs->resv == XFS_AG_RESV_AGFL)
 		error = xrep_put_freelist(sc, agbno);
 	else
-		error = xfs_free_extent(sc->tp, sc->sa.pag, agbno, 1, oinfo,
-				resv);
+		error = xfs_free_extent(sc->tp, sc->sa.pag, agbno, 1, rs->oinfo,
+				rs->resv);
 	if (agf_bp != sc->sa.agf_bp)
 		xfs_trans_brelse(sc->tp, agf_bp);
 	if (error)
@@ -627,26 +646,15 @@ xrep_reap_extents(
 	const struct xfs_owner_info	*oinfo,
 	enum xfs_ag_resv_type		type)
 {
-	struct xbitmap_range		*bmr;
-	struct xbitmap_range		*n;
-	xfs_fsblock_t			fsbno;
-	int				error = 0;
+	struct xrep_reap_state		rs = {
+		.sc			= sc,
+		.oinfo			= oinfo,
+		.resv			= type,
+	};
 
 	ASSERT(xfs_has_rmapbt(sc->mp));
 
-	for_each_xbitmap_block(fsbno, bmr, n, bitmap) {
-		ASSERT(sc->ip != NULL ||
-		       XFS_FSB_TO_AGNO(sc->mp, fsbno) == sc->sa.pag->pag_agno);
-		trace_xrep_dispose_btree_extent(sc->mp,
-				XFS_FSB_TO_AGNO(sc->mp, fsbno),
-				XFS_FSB_TO_AGBNO(sc->mp, fsbno), 1);
-
-		error = xrep_reap_block(sc, fsbno, oinfo, type);
-		if (error)
-			break;
-	}
-
-	return error;
+	return xbitmap_walk_bits(bitmap, xrep_reap_block, &rs);
 }
 
 /*


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 2/3] xfs: drop the _safe behavior from the xbitmap foreach macro
  2022-12-30 22:11 ` [PATCHSET v24.0 0/3] xfs: rework online fsck incore bitmap Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 3/3] xfs: convert xbitmap to interval tree Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 1/3] xfs: remove the for_each_xbitmap_ helpers Darrick J. Wong
  2 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

It's not safe to edit bitmap intervals while we're iterating them with
for_each_xbitmap_extent.  None of the existing callers actually need
that ability anyway, so drop the safe variable.
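
For reference, a userspace sketch of what the _safe variant buys you: it
caches the next pointer before the loop body runs, so the body may free
the current node.  The plain variant cannot tolerate that, which is why
dropping it documents that walkers must not modify the bitmap.  The list
code here is a toy, not the kernel list helpers:

#include <stdio.h>
#include <stdlib.h>

struct node { struct node *next; int val; };

int main(void)
{
	struct node *head = NULL, *cur, *next;
	int i;

	for (i = 3; i > 0; i--) {
		cur = malloc(sizeof(*cur));
		if (!cur)
			return 1;
		cur->val = i;
		cur->next = head;
		head = cur;
	}

	/* "safe" style: grab ->next up front so the body may free cur. */
	for (cur = head, next = cur ? cur->next : NULL; cur;
	     cur = next, next = cur ? cur->next : NULL) {
		printf("freeing %d\n", cur->val);
		free(cur);
	}

	/*
	 * The plain style, for (cur = head; cur; cur = cur->next), reads
	 * cur->next after the body ran, which is a use-after-free if the
	 * body freed cur.  Read-only walks are fine with either form.
	 */
	return 0;
}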

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/bitmap.c |   17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)


diff --git a/fs/xfs/scrub/bitmap.c b/fs/xfs/scrub/bitmap.c
index d32ded56da90..f8ebc4d61462 100644
--- a/fs/xfs/scrub/bitmap.c
+++ b/fs/xfs/scrub/bitmap.c
@@ -13,8 +13,9 @@
 #include "scrub/scrub.h"
 #include "scrub/bitmap.h"
 
-#define for_each_xbitmap_extent(bex, n, bitmap) \
-	list_for_each_entry_safe((bex), (n), &(bitmap)->list, list)
+/* Iterate each interval of a bitmap.  Do not change the bitmap. */
+#define for_each_xbitmap_extent(bex, bitmap) \
+	list_for_each_entry((bex), &(bitmap)->list, list)
 
 /*
  * Set a range of this bitmap.  Caller must ensure the range is not set.
@@ -46,10 +47,9 @@ void
 xbitmap_destroy(
 	struct xbitmap		*bitmap)
 {
-	struct xbitmap_range	*bmr;
-	struct xbitmap_range	*n;
+	struct xbitmap_range	*bmr, *n;
 
-	for_each_xbitmap_extent(bmr, n, bitmap) {
+	list_for_each_entry_safe(bmr, n, &bitmap->list, list) {
 		list_del(&bmr->list);
 		kfree(bmr);
 	}
@@ -308,10 +308,9 @@ xbitmap_hweight(
 	struct xbitmap		*bitmap)
 {
 	struct xbitmap_range	*bmr;
-	struct xbitmap_range	*n;
 	uint64_t		ret = 0;
 
-	for_each_xbitmap_extent(bmr, n, bitmap)
+	for_each_xbitmap_extent(bmr, bitmap)
 		ret += bmr->len;
 
 	return ret;
@@ -324,10 +323,10 @@ xbitmap_walk(
 	xbitmap_walk_fn	fn,
 	void			*priv)
 {
-	struct xbitmap_range	*bex, *n;
+	struct xbitmap_range	*bex;
 	int			error = 0;
 
-	for_each_xbitmap_extent(bex, n, bitmap) {
+	for_each_xbitmap_extent(bex, bitmap) {
 		error = fn(bex->start, bex->len, priv);
 		if (error)
 			break;


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 3/3] xfs: convert xbitmap to interval tree
  2022-12-30 22:11 ` [PATCHSET v24.0 0/3] xfs: rework online fsck incore bitmap Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 2/3] xfs: drop the _safe behavior from the xbitmap foreach macro Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 1/3] xfs: remove the for_each_xbitmap_ helpers Darrick J. Wong
  2 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Convert the xbitmap code to use interval trees instead of linked lists.
This reduces the amount of code required to handle the disunion
operation, and in the future will make it easier to set bits in
arbitrary order and later extract maximally sized extents, which we'll
need for rebuilding certain structures.  We define our own interval tree
type so that it can deal with 64-bit indices even on 32-bit machines.
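
The trickiest piece of the new representation is clearing a sub-range
from a stored [start, last] interval: depending on the overlap, the
interval is trimmed on one side, deleted outright, or split in two.  A
userspace toy of just that case analysis, over a plain struct rather
than the rbtree-backed kernel code:

#include <stdint.h>
#include <stdio.h>

struct ival { uint64_t start; uint64_t last; };

/* Clear [cstart, clast] from *iv; returns how many intervals remain. */
static int ival_clear(struct ival *iv, struct ival *extra,
		uint64_t cstart, uint64_t clast)
{
	if (clast < iv->start || cstart > iv->last)
		return 1;			/* no overlap */
	if (cstart <= iv->start && clast >= iv->last)
		return 0;			/* fully covered: delete */
	if (cstart > iv->start && clast < iv->last) {
		extra->start = clast + 1;	/* middle: split in two */
		extra->last = iv->last;
		iv->last = cstart - 1;
		return 2;
	}
	if (cstart <= iv->start)
		iv->start = clast + 1;		/* trim the left side */
	else
		iv->last = cstart - 1;		/* trim the right side */
	return 1;
}

int main(void)
{
	struct ival iv = { 10, 29 }, extra;
	int nr = ival_clear(&iv, &extra, 15, 19);

	printf("%d interval(s): [%llu, %llu]", nr,
			(unsigned long long)iv.start,
			(unsigned long long)iv.last);
	if (nr == 2)
		printf(" [%llu, %llu]", (unsigned long long)extra.start,
				(unsigned long long)extra.last);
	printf("\n");	/* 2 interval(s): [10, 14] [20, 29] */
	return 0;
}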

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/agheader_repair.c |   12 +
 fs/xfs/scrub/bitmap.c          |  323 ++++++++++++++++++++++------------------
 fs/xfs/scrub/bitmap.h          |   11 -
 3 files changed, 187 insertions(+), 159 deletions(-)


diff --git a/fs/xfs/scrub/agheader_repair.c b/fs/xfs/scrub/agheader_repair.c
index 26bce2f12b09..c22dc71fdd82 100644
--- a/fs/xfs/scrub/agheader_repair.c
+++ b/fs/xfs/scrub/agheader_repair.c
@@ -662,7 +662,7 @@ xrep_agfl_fill(
 }
 
 /* Write out a totally new AGFL. */
-STATIC void
+STATIC int
 xrep_agfl_init_header(
 	struct xfs_scrub	*sc,
 	struct xfs_buf		*agfl_bp,
@@ -675,6 +675,7 @@ xrep_agfl_init_header(
 	};
 	struct xfs_mount	*mp = sc->mp;
 	struct xfs_agfl		*agfl;
+	int			error;
 
 	ASSERT(flcount <= xfs_agfl_size(mp));
 
@@ -696,12 +697,15 @@ xrep_agfl_init_header(
 	xbitmap_init(&af.used_extents);
 	af.agfl_bno = xfs_buf_to_agfl_bno(agfl_bp),
 	xbitmap_walk(agfl_extents, xrep_agfl_fill, &af);
-	xbitmap_disunion(agfl_extents, &af.used_extents);
+	error = xbitmap_disunion(agfl_extents, &af.used_extents);
+	if (error)
+		return error;
 
 	/* Write new AGFL to disk. */
 	xfs_trans_buf_set_type(sc->tp, agfl_bp, XFS_BLFT_AGFL_BUF);
 	xfs_trans_log_buf(sc->tp, agfl_bp, 0, BBTOB(agfl_bp->b_length) - 1);
 	xbitmap_destroy(&af.used_extents);
+	return 0;
 }
 
 /* Repair the AGFL. */
@@ -754,7 +758,9 @@ xrep_agfl(
 	 * buffers until we know that part works.
 	 */
 	xrep_agfl_update_agf(sc, agf_bp, flcount);
-	xrep_agfl_init_header(sc, agfl_bp, &agfl_extents, flcount);
+	error = xrep_agfl_init_header(sc, agfl_bp, &agfl_extents, flcount);
+	if (error)
+		goto err;
 
 	/*
 	 * Ok, the AGFL should be ready to go now.  Roll the transaction to
diff --git a/fs/xfs/scrub/bitmap.c b/fs/xfs/scrub/bitmap.c
index f8ebc4d61462..1b04d2ce020a 100644
--- a/fs/xfs/scrub/bitmap.c
+++ b/fs/xfs/scrub/bitmap.c
@@ -13,31 +13,160 @@
 #include "scrub/scrub.h"
 #include "scrub/bitmap.h"
 
+#include <linux/interval_tree_generic.h>
+
+struct xbitmap_node {
+	struct rb_node	bn_rbnode;
+
+	/* First set bit of this interval and subtree. */
+	uint64_t	bn_start;
+
+	/* Last set bit of this interval. */
+	uint64_t	bn_last;
+
+	/* Last set bit of this subtree.  Do not touch this. */
+	uint64_t	__bn_subtree_last;
+};
+
+/* Define our own interval tree type with uint64_t parameters. */
+
+#define START(node) ((node)->bn_start)
+#define LAST(node)  ((node)->bn_last)
+
+/*
+ * These functions are defined by the INTERVAL_TREE_DEFINE macro, but we'll
+ * forward-declare them anyway for clarity.
+ */
+static inline void
+xbitmap_tree_insert(struct xbitmap_node *node, struct rb_root_cached *root);
+
+static inline void
+xbitmap_tree_remove(struct xbitmap_node *node, struct rb_root_cached *root);
+
+static inline struct xbitmap_node *
+xbitmap_tree_iter_first(struct rb_root_cached *root, uint64_t start,
+			uint64_t last);
+
+static inline struct xbitmap_node *
+xbitmap_tree_iter_next(struct xbitmap_node *node, uint64_t start,
+		       uint64_t last);
+
+INTERVAL_TREE_DEFINE(struct xbitmap_node, bn_rbnode, uint64_t,
+		__bn_subtree_last, START, LAST, static inline, xbitmap_tree)
+
 /* Iterate each interval of a bitmap.  Do not change the bitmap. */
-#define for_each_xbitmap_extent(bex, bitmap) \
-	list_for_each_entry((bex), &(bitmap)->list, list)
-
-/*
- * Set a range of this bitmap.  Caller must ensure the range is not set.
- *
- * This is the logical equivalent of bitmap |= mask(start, len).
- */
+#define for_each_xbitmap_extent(bn, bitmap) \
+	for ((bn) = rb_entry_safe(rb_first(&(bitmap)->xb_root.rb_root), \
+				   struct xbitmap_node, bn_rbnode); \
+	     (bn) != NULL; \
+	     (bn) = rb_entry_safe(rb_next(&(bn)->bn_rbnode), \
+				   struct xbitmap_node, bn_rbnode))
+
+/* Clear a range of this bitmap. */
+int
+xbitmap_clear(
+	struct xbitmap		*bitmap,
+	uint64_t		start,
+	uint64_t		len)
+{
+	struct xbitmap_node	*bn;
+	struct xbitmap_node	*new_bn;
+	uint64_t		last = start + len - 1;
+
+	while ((bn = xbitmap_tree_iter_first(&bitmap->xb_root, start, last))) {
+		if (bn->bn_start < start && bn->bn_last > last) {
+			uint64_t	old_last = bn->bn_last;
+
+			/* overlaps with the entire clearing range */
+			xbitmap_tree_remove(bn, &bitmap->xb_root);
+			bn->bn_last = start - 1;
+			xbitmap_tree_insert(bn, &bitmap->xb_root);
+
+			/* add an extent */
+			new_bn = kmalloc(sizeof(struct xbitmap_node),
+					XCHK_GFP_FLAGS);
+			if (!new_bn)
+				return -ENOMEM;
+			new_bn->bn_start = last + 1;
+			new_bn->bn_last = old_last;
+			xbitmap_tree_insert(new_bn, &bitmap->xb_root);
+		} else if (bn->bn_start < start) {
+			/* overlaps with the left side of the clearing range */
+			xbitmap_tree_remove(bn, &bitmap->xb_root);
+			bn->bn_last = start - 1;
+			xbitmap_tree_insert(bn, &bitmap->xb_root);
+		} else if (bn->bn_last > last) {
+			/* overlaps with the right side of the clearing range */
+			xbitmap_tree_remove(bn, &bitmap->xb_root);
+			bn->bn_start = last + 1;
+			xbitmap_tree_insert(bn, &bitmap->xb_root);
+			break;
+		} else {
+			/* in the middle of the clearing range */
+			xbitmap_tree_remove(bn, &bitmap->xb_root);
+			kfree(bn);
+		}
+	}
+
+	return 0;
+}
+
+/* Set a range of this bitmap. */
 int
 xbitmap_set(
 	struct xbitmap		*bitmap,
 	uint64_t		start,
 	uint64_t		len)
 {
-	struct xbitmap_range	*bmr;
+	struct xbitmap_node	*left;
+	struct xbitmap_node	*right;
+	uint64_t		last = start + len - 1;
+	int			error;
 
-	bmr = kmalloc(sizeof(struct xbitmap_range), XCHK_GFP_FLAGS);
-	if (!bmr)
-		return -ENOMEM;
+	/* Is this whole range already set? */
+	left = xbitmap_tree_iter_first(&bitmap->xb_root, start, last);
+	if (left && left->bn_start <= start && left->bn_last >= last)
+		return 0;
 
-	INIT_LIST_HEAD(&bmr->list);
-	bmr->start = start;
-	bmr->len = len;
-	list_add_tail(&bmr->list, &bitmap->list);
+	/* Clear out everything in the range we want to set. */
+	error = xbitmap_clear(bitmap, start, len);
+	if (error)
+		return error;
+
+	/* Do we have a left-adjacent extent? */
+	left = xbitmap_tree_iter_first(&bitmap->xb_root, start - 1, start - 1);
+	ASSERT(!left || left->bn_last + 1 == start);
+
+	/* Do we have a right-adjacent extent? */
+	right = xbitmap_tree_iter_first(&bitmap->xb_root, last + 1, last + 1);
+	ASSERT(!right || right->bn_start == last + 1);
+
+	if (left && right) {
+		/* combine left and right adjacent extent */
+		xbitmap_tree_remove(left, &bitmap->xb_root);
+		xbitmap_tree_remove(right, &bitmap->xb_root);
+		left->bn_last = right->bn_last;
+		xbitmap_tree_insert(left, &bitmap->xb_root);
+		kfree(right);
+	} else if (left) {
+		/* combine with left extent */
+		xbitmap_tree_remove(left, &bitmap->xb_root);
+		left->bn_last = last;
+		xbitmap_tree_insert(left, &bitmap->xb_root);
+	} else if (right) {
+		/* combine with right extent */
+		xbitmap_tree_remove(right, &bitmap->xb_root);
+		right->bn_start = start;
+		xbitmap_tree_insert(right, &bitmap->xb_root);
+	} else {
+		/* add an extent */
+		left = kmalloc(sizeof(struct xbitmap_node), XCHK_GFP_FLAGS);
+		if (!left)
+			return -ENOMEM;
+		left->bn_start = start;
+		left->bn_last = last;
+		xbitmap_tree_insert(left, &bitmap->xb_root);
+	}
 
 	return 0;
 }
@@ -47,11 +176,11 @@ void
 xbitmap_destroy(
 	struct xbitmap		*bitmap)
 {
-	struct xbitmap_range	*bmr, *n;
+	struct xbitmap_node	*bn;
 
-	list_for_each_entry_safe(bmr, n, &bitmap->list, list) {
-		list_del(&bmr->list);
-		kfree(bmr);
+	while ((bn = xbitmap_tree_iter_first(&bitmap->xb_root, 0, -1ULL))) {
+		xbitmap_tree_remove(bn, &bitmap->xb_root);
+		kfree(bn);
 	}
 }
 
@@ -60,27 +189,7 @@ void
 xbitmap_init(
 	struct xbitmap		*bitmap)
 {
-	INIT_LIST_HEAD(&bitmap->list);
-}
-
-/* Compare two btree extents. */
-static int
-xbitmap_range_cmp(
-	void			*priv,
-	const struct list_head	*a,
-	const struct list_head	*b)
-{
-	struct xbitmap_range	*ap;
-	struct xbitmap_range	*bp;
-
-	ap = container_of(a, struct xbitmap_range, list);
-	bp = container_of(b, struct xbitmap_range, list);
-
-	if (ap->start > bp->start)
-		return 1;
-	if (ap->start < bp->start)
-		return -1;
-	return 0;
+	bitmap->xb_root = RB_ROOT_CACHED;
 }
 
 /*
@@ -97,118 +206,26 @@ xbitmap_range_cmp(
  *
  * This is the logical equivalent of bitmap &= ~sub.
  */
-#define LEFT_ALIGNED	(1 << 0)
-#define RIGHT_ALIGNED	(1 << 1)
 int
 xbitmap_disunion(
 	struct xbitmap		*bitmap,
 	struct xbitmap		*sub)
 {
-	struct list_head	*lp;
-	struct xbitmap_range	*br;
-	struct xbitmap_range	*new_br;
-	struct xbitmap_range	*sub_br;
-	uint64_t		sub_start;
-	uint64_t		sub_len;
-	int			state;
-	int			error = 0;
+	struct xbitmap_node	*bn;
+	int			error;
 
-	if (list_empty(&bitmap->list) || list_empty(&sub->list))
+	if (xbitmap_empty(bitmap) || xbitmap_empty(sub))
 		return 0;
-	ASSERT(!list_empty(&sub->list));
 
-	list_sort(NULL, &bitmap->list, xbitmap_range_cmp);
-	list_sort(NULL, &sub->list, xbitmap_range_cmp);
-
-	/*
-	 * Now that we've sorted both lists, we iterate bitmap once, rolling
-	 * forward through sub and/or bitmap as necessary until we find an
-	 * overlap or reach the end of either list.  We do not reset lp to the
-	 * head of bitmap nor do we reset sub_br to the head of sub.  The
-	 * list traversal is similar to merge sort, but we're deleting
-	 * instead.  In this manner we avoid O(n^2) operations.
-	 */
-	sub_br = list_first_entry(&sub->list, struct xbitmap_range,
-			list);
-	lp = bitmap->list.next;
-	while (lp != &bitmap->list) {
-		br = list_entry(lp, struct xbitmap_range, list);
-
-		/*
-		 * Advance sub_br and/or br until we find a pair that
-		 * intersect or we run out of extents.
-		 */
-		while (sub_br->start + sub_br->len <= br->start) {
-			if (list_is_last(&sub_br->list, &sub->list))
-				goto out;
-			sub_br = list_next_entry(sub_br, list);
-		}
-		if (sub_br->start >= br->start + br->len) {
-			lp = lp->next;
-			continue;
-		}
-
-		/* trim sub_br to fit the extent we have */
-		sub_start = sub_br->start;
-		sub_len = sub_br->len;
-		if (sub_br->start < br->start) {
-			sub_len -= br->start - sub_br->start;
-			sub_start = br->start;
-		}
-		if (sub_len > br->len)
-			sub_len = br->len;
-
-		state = 0;
-		if (sub_start == br->start)
-			state |= LEFT_ALIGNED;
-		if (sub_start + sub_len == br->start + br->len)
-			state |= RIGHT_ALIGNED;
-		switch (state) {
-		case LEFT_ALIGNED:
-			/* Coincides with only the left. */
-			br->start += sub_len;
-			br->len -= sub_len;
-			break;
-		case RIGHT_ALIGNED:
-			/* Coincides with only the right. */
-			br->len -= sub_len;
-			lp = lp->next;
-			break;
-		case LEFT_ALIGNED | RIGHT_ALIGNED:
-			/* Total overlap, just delete ex. */
-			lp = lp->next;
-			list_del(&br->list);
-			kfree(br);
-			break;
-		case 0:
-			/*
-			 * Deleting from the middle: add the new right extent
-			 * and then shrink the left extent.
-			 */
-			new_br = kmalloc(sizeof(struct xbitmap_range),
-					XCHK_GFP_FLAGS);
-			if (!new_br) {
-				error = -ENOMEM;
-				goto out;
-			}
-			INIT_LIST_HEAD(&new_br->list);
-			new_br->start = sub_start + sub_len;
-			new_br->len = br->start + br->len - new_br->start;
-			list_add(&new_br->list, &br->list);
-			br->len = sub_start - br->start;
-			lp = lp->next;
-			break;
-		default:
-			ASSERT(0);
-			break;
-		}
+	for_each_xbitmap_extent(bn, sub) {
+		error = xbitmap_clear(bitmap, bn->bn_start,
+				bn->bn_last - bn->bn_start + 1);
+		if (error)
+			return error;
 	}
 
-out:
-	return error;
+	return 0;
 }
-#undef LEFT_ALIGNED
-#undef RIGHT_ALIGNED
 
 /*
  * Record all btree blocks seen while iterating all records of a btree.
@@ -307,11 +324,11 @@ uint64_t
 xbitmap_hweight(
 	struct xbitmap		*bitmap)
 {
-	struct xbitmap_range	*bmr;
+	struct xbitmap_node	*bn;
 	uint64_t		ret = 0;
 
-	for_each_xbitmap_extent(bmr, bitmap)
-		ret += bmr->len;
+	for_each_xbitmap_extent(bn, bitmap)
+		ret += bn->bn_last - bn->bn_start + 1;
 
 	return ret;
 }
@@ -320,14 +337,14 @@ xbitmap_hweight(
 int
 xbitmap_walk(
 	struct xbitmap		*bitmap,
-	xbitmap_walk_fn	fn,
+	xbitmap_walk_fn		fn,
 	void			*priv)
 {
-	struct xbitmap_range	*bex;
+	struct xbitmap_node	*bn;
 	int			error = 0;
 
-	for_each_xbitmap_extent(bex, bitmap) {
-		error = fn(bex->start, bex->len, priv);
+	for_each_xbitmap_extent(bn, bitmap) {
+		error = fn(bn->bn_start, bn->bn_last - bn->bn_start + 1, priv);
 		if (error)
 			break;
 	}
@@ -371,3 +388,11 @@ xbitmap_walk_bits(
 
 	return xbitmap_walk(bitmap, xbitmap_walk_bits_in_run, &wb);
 }
+
+/* Does this bitmap have no bits set at all? */
+bool
+xbitmap_empty(
+	struct xbitmap		*bitmap)
+{
+	return bitmap->xb_root.rb_root.rb_node == NULL;
+}
diff --git a/fs/xfs/scrub/bitmap.h b/fs/xfs/scrub/bitmap.h
index 53601d281ffb..7afd64a318d1 100644
--- a/fs/xfs/scrub/bitmap.h
+++ b/fs/xfs/scrub/bitmap.h
@@ -6,19 +6,14 @@
 #ifndef __XFS_SCRUB_BITMAP_H__
 #define __XFS_SCRUB_BITMAP_H__
 
-struct xbitmap_range {
-	struct list_head	list;
-	uint64_t		start;
-	uint64_t		len;
-};
-
 struct xbitmap {
-	struct list_head	list;
+	struct rb_root_cached	xb_root;
 };
 
 void xbitmap_init(struct xbitmap *bitmap);
 void xbitmap_destroy(struct xbitmap *bitmap);
 
+int xbitmap_clear(struct xbitmap *bitmap, uint64_t start, uint64_t len);
 int xbitmap_set(struct xbitmap *bitmap, uint64_t start, uint64_t len);
 int xbitmap_disunion(struct xbitmap *bitmap, struct xbitmap *sub);
 int xbitmap_set_btcur_path(struct xbitmap *bitmap,
@@ -42,4 +37,6 @@ typedef int (*xbitmap_walk_bits_fn)(uint64_t bit, void *priv);
 int xbitmap_walk_bits(struct xbitmap *bitmap, xbitmap_walk_bits_fn fn,
 		void *priv);
 
+bool xbitmap_empty(struct xbitmap *bitmap);
+
 #endif	/* __XFS_SCRUB_BITMAP_H__ */


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCHSET v24.0 0/5] xfs: strengthen rmapbt scrubbing
  2022-12-30 21:13 [NYE DELUGE 1/4] xfs: all pending online scrub improvements Darrick J. Wong
                   ` (16 preceding siblings ...)
  2022-12-30 22:11 ` [PATCHSET v24.0 0/3] xfs: rework online fsck incore bitmap Darrick J. Wong
@ 2022-12-30 22:11 ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 1/5] xfs: introduce bitmap type for AG blocks Darrick J. Wong
                     ` (4 more replies)
  2022-12-30 22:12 ` [PATCHSET v24.0 0/4] xfs: fix rmap btree key flag handling Darrick J. Wong
                   ` (4 subsequent siblings)
  22 siblings, 5 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

Hi all,

This series strengthens space allocation record cross-referencing by
using AG block bitmaps to compute the difference between the space used
according to the rmap records and the space described by the primary
metadata, and reports cross-referencing errors for any discrepancies.
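
The core of the check, sketched in userspace (this is not the kernel
code; block numbers and helpers are invented): mark every block that the
primary metadata claims in a bitmap, then for each rmap record require
the record's range to be fully set and clear it.  A record that reaches
beyond the set region maps space the metadata walk never found, and any
bits left at the end are metadata blocks with no rmap record:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NBLOCKS	64

static uint8_t expected[NBLOCKS / 8];

static void set_blk(unsigned int b)
{
	expected[b / 8] |= 1 << (b % 8);
}

static void clr_blk(unsigned int b)
{
	expected[b / 8] &= ~(1 << (b % 8));
}

static bool test_blk(unsigned int b)
{
	return expected[b / 8] & (1 << (b % 8));
}

/* Check one rmap record [start, start + len) against the expected set. */
static void check_record(unsigned int start, unsigned int len)
{
	for (unsigned int b = start; b < start + len; b++) {
		if (!test_blk(b)) {
			printf("record [%u, %u) covers unexpected block %u\n",
					start, start + len, b);
			return;
		}
		clr_blk(b);
	}
}

int main(void)
{
	unsigned int b;

	/* Pretend the primary metadata occupies blocks 0-3 and 8-11. */
	for (b = 0; b < 4; b++)
		set_blk(b);
	for (b = 8; b < 12; b++)
		set_blk(b);

	check_record(0, 4);	/* matches exactly */
	check_record(8, 2);	/* too short: leaves blocks 10-11 set */

	for (b = 0; b < NBLOCKS; b++)
		if (test_blk(b))
			printf("block %u has no rmap record\n", b);
	return 0;
}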

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-strengthen-rmap-checking
---
 fs/xfs/Makefile       |    2 
 fs/xfs/scrub/bitmap.c |   55 +++++++++
 fs/xfs/scrub/bitmap.h |   70 ++++++++++++
 fs/xfs/scrub/repair.h |    1 
 fs/xfs/scrub/rmap.c   |  284 +++++++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 410 insertions(+), 2 deletions(-)


^ permalink raw reply	[flat|nested] 220+ messages in thread

* [PATCH 1/5] xfs: introduce bitmap type for AG blocks
  2022-12-30 22:11 ` [PATCHSET v24.0 0/5] xfs: strengthen rmapbt scrubbing Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 5/5] xfs: cross-reference rmap records with refcount btrees Darrick J. Wong
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create a typechecked bitmap for extents within an AG.  Online repair
uses bitmaps to store several different types of numbers, so let's make
it obvious when we're storing xfs_agblock_t (and later xfs_fsblock_t)
versus anything else.

In subsequent patches, we're going to use agblock bitmaps to enhance the
rmapbt checker to look for discrepancies between the rmapbt records and
AG metadata block usage.
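
The type safety comes from wrapping the generic bitmap in a one-member
struct, so the compiler refuses to mix up bitmaps indexed by different
units.  A userspace sketch of the trick with stand-in types (not the
scrub code):

#include <stdint.h>
#include <stdio.h>

typedef uint32_t agblock_t;	/* block offset within one AG */

struct bitmap { uint64_t word; };			/* generic, untyped */
struct agb_bitmap { struct bitmap agbitmap; };		/* AG blocks only */

static void bitmap_set(struct bitmap *b, uint64_t bit)
{
	b->word |= 1ULL << bit;
}

/* Only takes an agb_bitmap; passing a bare struct bitmap is a type error. */
static void agb_bitmap_set(struct agb_bitmap *b, agblock_t agbno)
{
	bitmap_set(&b->agbitmap, agbno);
}

int main(void)
{
	struct agb_bitmap ab = { { 0 } };

	agb_bitmap_set(&ab, 5);
	printf("word=0x%llx\n", (unsigned long long)ab.agbitmap.word);
	return 0;
}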

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/bitmap.h |   48 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/repair.h |    1 +
 2 files changed, 49 insertions(+)


diff --git a/fs/xfs/scrub/bitmap.h b/fs/xfs/scrub/bitmap.h
index 7afd64a318d1..7f538effc196 100644
--- a/fs/xfs/scrub/bitmap.h
+++ b/fs/xfs/scrub/bitmap.h
@@ -39,4 +39,52 @@ int xbitmap_walk_bits(struct xbitmap *bitmap, xbitmap_walk_bits_fn fn,
 
 bool xbitmap_empty(struct xbitmap *bitmap);
 
+/* Bitmaps, but for type-checked for xfs_agblock_t */
+
+struct xagb_bitmap {
+	struct xbitmap	agbitmap;
+};
+
+static inline void xagb_bitmap_init(struct xagb_bitmap *bitmap)
+{
+	xbitmap_init(&bitmap->agbitmap);
+}
+
+static inline void xagb_bitmap_destroy(struct xagb_bitmap *bitmap)
+{
+	xbitmap_destroy(&bitmap->agbitmap);
+}
+
+static inline int xagb_bitmap_clear(struct xagb_bitmap *bitmap,
+		xfs_agblock_t start, xfs_extlen_t len)
+{
+	return xbitmap_clear(&bitmap->agbitmap, start, len);
+}
+static inline int xagb_bitmap_set(struct xagb_bitmap *bitmap,
+		xfs_agblock_t start, xfs_extlen_t len)
+{
+	return xbitmap_set(&bitmap->agbitmap, start, len);
+}
+
+static inline int xagb_bitmap_disunion(struct xagb_bitmap *bitmap,
+		struct xagb_bitmap *sub)
+{
+	return xbitmap_disunion(&bitmap->agbitmap, &sub->agbitmap);
+}
+
+static inline uint32_t xagb_bitmap_hweight(struct xagb_bitmap *bitmap)
+{
+	return xbitmap_hweight(&bitmap->agbitmap);
+}
+static inline bool xagb_bitmap_empty(struct xagb_bitmap *bitmap)
+{
+	return xbitmap_empty(&bitmap->agbitmap);
+}
+
+static inline int xagb_bitmap_walk(struct xagb_bitmap *bitmap,
+		xbitmap_walk_fn fn, void *priv)
+{
+	return xbitmap_walk(&bitmap->agbitmap, fn, priv);
+}
+
 #endif	/* __XFS_SCRUB_BITMAP_H__ */
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index 840f74ec431c..150157ac2489 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -31,6 +31,7 @@ int xrep_init_btblock(struct xfs_scrub *sc, xfs_fsblock_t fsb,
 		const struct xfs_buf_ops *ops);
 
 struct xbitmap;
+struct xagb_bitmap;
 
 int xrep_fix_freelist(struct xfs_scrub *sc, bool can_shrink);
 int xrep_invalidate_blocks(struct xfs_scrub *sc, struct xbitmap *btlist);


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 2/5] xfs: cross-reference rmap records with ag btrees
  2022-12-30 22:11 ` [PATCHSET v24.0 0/5] xfs: strengthen rmapbt scrubbing Darrick J. Wong
                     ` (3 preceding siblings ...)
  2022-12-30 22:11   ` [PATCH 3/5] xfs: cross-reference rmap records with free space btrees Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  4 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Strengthen the rmap btree record checker a little more by comparing
OWN_FS and OWN_LOG reverse mappings against the AG headers and internal
log, respectively.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/Makefile       |    2 -
 fs/xfs/scrub/bitmap.c |   22 +++++++
 fs/xfs/scrub/bitmap.h |   19 ++++++
 fs/xfs/scrub/rmap.c   |  159 +++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 200 insertions(+), 2 deletions(-)


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index ea0725cfb6fb..0b8dfac6d9a3 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -147,6 +147,7 @@ xfs-y				+= $(addprefix scrub/, \
 				   agheader.o \
 				   alloc.o \
 				   attr.o \
+				   bitmap.o \
 				   bmap.o \
 				   btree.o \
 				   common.o \
@@ -170,7 +171,6 @@ xfs-$(CONFIG_XFS_QUOTA)		+= scrub/quota.o
 ifeq ($(CONFIG_XFS_ONLINE_REPAIR),y)
 xfs-y				+= $(addprefix scrub/, \
 				   agheader_repair.o \
-				   bitmap.o \
 				   repair.o \
 				   )
 endif
diff --git a/fs/xfs/scrub/bitmap.c b/fs/xfs/scrub/bitmap.c
index 1b04d2ce020a..14caff0a28ce 100644
--- a/fs/xfs/scrub/bitmap.c
+++ b/fs/xfs/scrub/bitmap.c
@@ -396,3 +396,25 @@ xbitmap_empty(
 {
 	return bitmap->xb_root.rb_root.rb_node == NULL;
 }
+
+/* Is the start of the range set or clear?  And for how long? */
+bool
+xbitmap_test(
+	struct xbitmap		*bitmap,
+	uint64_t		start,
+	uint64_t		*len)
+{
+	struct xbitmap_node	*bn;
+	uint64_t		last = start + *len - 1;
+
+	bn = xbitmap_tree_iter_first(&bitmap->xb_root, start, last);
+	if (!bn)
+		return false;
+	if (bn->bn_start <= start) {
+		if (bn->bn_last < last)
+			*len = bn->bn_last - start + 1;
+		return true;
+	}
+	*len = bn->bn_start - start;
+	return false;
+}
diff --git a/fs/xfs/scrub/bitmap.h b/fs/xfs/scrub/bitmap.h
index 7f538effc196..65a6c5a92c7a 100644
--- a/fs/xfs/scrub/bitmap.h
+++ b/fs/xfs/scrub/bitmap.h
@@ -38,6 +38,7 @@ int xbitmap_walk_bits(struct xbitmap *bitmap, xbitmap_walk_bits_fn fn,
 		void *priv);
 
 bool xbitmap_empty(struct xbitmap *bitmap);
+bool xbitmap_test(struct xbitmap *bitmap, uint64_t start, uint64_t *len);
 
 /* Bitmaps, but for type-checked for xfs_agblock_t */
 
@@ -65,6 +66,24 @@ static inline int xagb_bitmap_set(struct xagb_bitmap *bitmap,
 {
 	return xbitmap_set(&bitmap->agbitmap, start, len);
 }
+static inline bool xagb_bitmap_test(struct xagb_bitmap *bitmap,
+		xfs_agblock_t start, xfs_extlen_t *len)
+{
+	uint64_t	biglen = *len;
+	int		error;
+
+	error = xbitmap_test(&bitmap->agbitmap, start, &biglen);
+	if (error)
+		return error;
+
+	if (biglen >= UINT_MAX) {
+		ASSERT(0);
+		return -EOVERFLOW;
+	}
+
+	*len = biglen;
+	return 0;
+}
 
 static inline int xagb_bitmap_disunion(struct xagb_bitmap *bitmap,
 		struct xagb_bitmap *sub)
diff --git a/fs/xfs/scrub/rmap.c b/fs/xfs/scrub/rmap.c
index 3cb92f7ac165..415d8e9918da 100644
--- a/fs/xfs/scrub/rmap.c
+++ b/fs/xfs/scrub/rmap.c
@@ -12,10 +12,12 @@
 #include "xfs_btree.h"
 #include "xfs_rmap.h"
 #include "xfs_refcount.h"
+#include "xfs_ag.h"
+#include "xfs_bit.h"
 #include "scrub/scrub.h"
 #include "scrub/common.h"
 #include "scrub/btree.h"
-#include "xfs_ag.h"
+#include "scrub/bitmap.h"
 
 /*
  * Set us up to scrub reverse mapping btrees.
@@ -45,6 +47,13 @@ struct xchk_rmap {
 	 * that could be one.
 	 */
 	struct xfs_rmap_irec	prev_rec;
+
+	/* Bitmaps containing all blocks for each type of AG metadata. */
+	struct xagb_bitmap	fs_owned;
+	struct xagb_bitmap	log_owned;
+
+	/* Did we complete the AG space metadata bitmaps? */
+	bool			bitmaps_complete;
 };
 
 /* Cross-reference a rmap against the refcount btree. */
@@ -249,6 +258,68 @@ xchk_rmapbt_check_mergeable(
 	memcpy(&cr->prev_rec, irec, sizeof(struct xfs_rmap_irec));
 }
 
+/* Compare an rmap for AG metadata against the metadata walk. */
+STATIC int
+xchk_rmapbt_mark_bitmap(
+	struct xchk_btree		*bs,
+	struct xchk_rmap		*cr,
+	const struct xfs_rmap_irec	*irec)
+{
+	struct xfs_scrub		*sc = bs->sc;
+	struct xagb_bitmap		*bmp = NULL;
+	xfs_extlen_t			fsbcount = irec->rm_blockcount;
+
+	/*
+	 * Skip corrupt records.  It is essential that we detect records in the
+	 * btree that cannot overlap but do, flag those as CORRUPT, and skip
+	 * the bitmap comparison to avoid generating false XCORRUPT reports.
+	 */
+	if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT)
+		return 0;
+
+	/*
+	 * If the AG metadata walk didn't complete, there's no point in
+	 * comparing against partial results.
+	 */
+	if (!cr->bitmaps_complete)
+		return 0;
+
+	switch (irec->rm_owner) {
+	case XFS_RMAP_OWN_FS:
+		bmp = &cr->fs_owned;
+		break;
+	case XFS_RMAP_OWN_LOG:
+		bmp = &cr->log_owned;
+		break;
+	}
+
+	if (!bmp)
+		return 0;
+
+	if (xagb_bitmap_test(bmp, irec->rm_startblock, &fsbcount)) {
+		/*
+		 * The start of this reverse mapping corresponds to a set
+		 * region in the bitmap.  If the mapping covers more area than
+		 * the set region, then it covers space that wasn't found by
+		 * the AG metadata walk.
+		 */
+		if (fsbcount < irec->rm_blockcount)
+			xchk_btree_xref_set_corrupt(bs->sc,
+					bs->sc->sa.rmap_cur, 0);
+	} else {
+		/*
+		 * The start of this reverse mapping does not correspond to a
+		 * completely set region in the bitmap.  The region wasn't
+		 * fully set by walking the AG metadata, so this is a
+		 * cross-referencing corruption.
+		 */
+		xchk_btree_xref_set_corrupt(bs->sc, bs->sc->sa.rmap_cur, 0);
+	}
+
+	/* Unset the region so that we can detect missing rmap records. */
+	return xagb_bitmap_clear(bmp, irec->rm_startblock, irec->rm_blockcount);
+}
+
 /* Scrub an rmapbt record. */
 STATIC int
 xchk_rmapbt_rec(
@@ -268,9 +339,80 @@ xchk_rmapbt_rec(
 	xchk_rmapbt_check_mergeable(bs, cr, &irec);
 	xchk_rmapbt_check_overlapping(bs, cr, &irec);
 	xchk_rmapbt_xref(bs->sc, &irec);
+
+	return xchk_rmapbt_mark_bitmap(bs, cr, &irec);
+}
+
+/*
+ * Set up bitmaps mapping all the AG metadata to compare with the rmapbt
+ * records.
+ */
+STATIC int
+xchk_rmapbt_walk_ag_metadata(
+	struct xfs_scrub	*sc,
+	struct xchk_rmap	*cr)
+{
+	struct xfs_mount	*mp = sc->mp;
+	int			error;
+
+	/* OWN_FS: AG headers */
+	error = xagb_bitmap_set(&cr->fs_owned, XFS_SB_BLOCK(mp),
+			XFS_AGFL_BLOCK(mp) - XFS_SB_BLOCK(mp) + 1);
+	if (error)
+		goto out;
+
+	/* OWN_LOG: Internal log */
+	if (xfs_ag_contains_log(mp, sc->sa.pag->pag_agno)) {
+		error = xagb_bitmap_set(&cr->log_owned,
+				XFS_FSB_TO_AGBNO(mp, mp->m_sb.sb_logstart),
+				mp->m_sb.sb_logblocks);
+		if (error)
+			goto out;
+	}
+
+out:
+	/*
+	 * If there's an error, set XFAIL and disable the bitmap
+	 * cross-referencing checks, but proceed with the scrub anyway.
+	 */
+	if (error)
+		xchk_btree_xref_process_error(sc, sc->sa.rmap_cur,
+				sc->sa.rmap_cur->bc_nlevels - 1, &error);
+	else
+		cr->bitmaps_complete = true;
 	return 0;
 }
 
+/*
+ * Check for set regions in the bitmaps; if there are any, the rmap records do
+ * not describe all the AG metadata.
+ */
+STATIC void
+xchk_rmapbt_check_bitmaps(
+	struct xfs_scrub	*sc,
+	struct xchk_rmap	*cr)
+{
+	struct xfs_btree_cur	*cur = sc->sa.rmap_cur;
+	unsigned int		level;
+
+	if (sc->sm->sm_flags & (XFS_SCRUB_OFLAG_CORRUPT |
+				XFS_SCRUB_OFLAG_XFAIL))
+		return;
+	if (!cur)
+		return;
+	level = cur->bc_nlevels - 1;
+
+	/*
+	 * Any bitmap with bits still set indicates that the reverse mapping
+	 * doesn't cover the entire primary structure.
+	 */
+	if (xagb_bitmap_hweight(&cr->fs_owned) != 0)
+		xchk_btree_xref_set_corrupt(sc, cur, level);
+
+	if (xagb_bitmap_hweight(&cr->log_owned) != 0)
+		xchk_btree_xref_set_corrupt(sc, cur, level);
+}
+
 /* Scrub the rmap btree for some AG. */
 int
 xchk_rmapbt(
@@ -283,8 +425,23 @@ xchk_rmapbt(
 	if (!cr)
 		return -ENOMEM;
 
+	xagb_bitmap_init(&cr->fs_owned);
+	xagb_bitmap_init(&cr->log_owned);
+
+	error = xchk_rmapbt_walk_ag_metadata(sc, cr);
+	if (error)
+		goto out;
+
 	error = xchk_btree(sc, sc->sa.rmap_cur, xchk_rmapbt_rec,
 			&XFS_RMAP_OINFO_AG, cr);
+	if (error)
+		goto out;
+
+	xchk_rmapbt_check_bitmaps(sc, cr);
+
+out:
+	xagb_bitmap_destroy(&cr->log_owned);
+	xagb_bitmap_destroy(&cr->fs_owned);
 	kfree(cr);
 	return error;
 }


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 3/5] xfs: cross-reference rmap records with free space btrees
  2022-12-30 22:11 ` [PATCHSET v24.0 0/5] xfs: strengthen rmapbt scrubbing Darrick J. Wong
                     ` (2 preceding siblings ...)
  2022-12-30 22:11   ` [PATCH 4/5] xfs: cross-reference rmap records with inode btrees Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 2/5] xfs: cross-reference rmap records with ag btrees Darrick J. Wong
  4 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Strengthen the rmap btree record checker a little more by comparing
OWN_AG reverse mappings against the free space btrees, the rmap btree,
and the AGFL.
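
As in the previous patch, the cross-check follows the same three-step
bitmap pattern.  Roughly sketched (the function below is purely
illustrative and is not part of this patch):

	/*
	 * Sketch of the per-owner bitmap cross-check, shown for OWN_AG.
	 * Step 2 happens in xchk_rmapbt_mark_bitmap() during the rmapbt walk.
	 */
	STATIC int
	xchk_rmapbt_xref_sketch(
		struct xfs_scrub	*sc,
		struct xchk_rmap	*cr)
	{
		int			error;

		/* 1. Set a bit for every block that the primary structures own. */
		error = xagb_bitmap_set_btblocks(&cr->ag_owned, sc->sa.bno_cur);
		if (error)
			return error;

		/*
		 * 2. Each OWN_AG rmap record clears its extent from the bitmap
		 *    (xchk_rmapbt_mark_bitmap); a region that was not fully set
		 *    marks that record corrupt.
		 */

		/* 3. Any bit still set is AG metadata with no rmap record. */
		if (xagb_bitmap_hweight(&cr->ag_owned) != 0)
			xchk_btree_xref_set_corrupt(sc, sc->sa.rmap_cur,
					sc->sa.rmap_cur->bc_nlevels - 1);
		return 0;
	}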

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/bitmap.c |   33 +++++++++++++++++++++++++
 fs/xfs/scrub/bitmap.h |    3 ++
 fs/xfs/scrub/rmap.c   |   66 +++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 102 insertions(+)


diff --git a/fs/xfs/scrub/bitmap.c b/fs/xfs/scrub/bitmap.c
index 14caff0a28ce..72fdb6cd69b4 100644
--- a/fs/xfs/scrub/bitmap.c
+++ b/fs/xfs/scrub/bitmap.c
@@ -6,6 +6,7 @@
 #include "xfs.h"
 #include "xfs_fs.h"
 #include "xfs_shared.h"
+#include "xfs_bit.h"
 #include "xfs_format.h"
 #include "xfs_trans_resv.h"
 #include "xfs_mount.h"
@@ -262,6 +263,38 @@ xbitmap_disunion(
  * For the 300th record we just exit, with the list being [1, 4, 2, 3].
  */
 
+/* Mark a btree block in the agblock bitmap. */
+STATIC int
+xagb_bitmap_visit_btblock(
+	struct xfs_btree_cur	*cur,
+	int			level,
+	void			*priv)
+{
+	struct xagb_bitmap	*bitmap = priv;
+	struct xfs_buf		*bp;
+	xfs_fsblock_t		fsbno;
+	xfs_agblock_t		agbno;
+
+	xfs_btree_get_block(cur, level, &bp);
+	if (!bp)
+		return 0;
+
+	fsbno = XFS_DADDR_TO_FSB(cur->bc_mp, xfs_buf_daddr(bp));
+	agbno = XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno);
+
+	return xagb_bitmap_set(bitmap, agbno, 1);
+}
+
+/* Mark all (per-AG) btree blocks in the agblock bitmap. */
+int
+xagb_bitmap_set_btblocks(
+	struct xagb_bitmap	*bitmap,
+	struct xfs_btree_cur	*cur)
+{
+	return xfs_btree_visit_blocks(cur, xagb_bitmap_visit_btblock,
+			XFS_BTREE_VISIT_ALL, bitmap);
+}
+
 /*
  * Record all the buffers pointed to by the btree cursor.  Callers already
  * engaged in a btree walk should call this function to capture the list of
diff --git a/fs/xfs/scrub/bitmap.h b/fs/xfs/scrub/bitmap.h
index 65a6c5a92c7a..ab67073f4f01 100644
--- a/fs/xfs/scrub/bitmap.h
+++ b/fs/xfs/scrub/bitmap.h
@@ -106,4 +106,7 @@ static inline int xagb_bitmap_walk(struct xagb_bitmap *bitmap,
 	return xbitmap_walk(&bitmap->agbitmap, fn, priv);
 }
 
+int xagb_bitmap_set_btblocks(struct xagb_bitmap *bitmap,
+		struct xfs_btree_cur *cur);
+
 #endif	/* __XFS_SCRUB_BITMAP_H__ */
diff --git a/fs/xfs/scrub/rmap.c b/fs/xfs/scrub/rmap.c
index 415d8e9918da..b8e82f5b84f4 100644
--- a/fs/xfs/scrub/rmap.c
+++ b/fs/xfs/scrub/rmap.c
@@ -7,13 +7,17 @@
 #include "xfs_fs.h"
 #include "xfs_shared.h"
 #include "xfs_format.h"
+#include "xfs_log_format.h"
 #include "xfs_trans_resv.h"
 #include "xfs_mount.h"
+#include "xfs_trans.h"
 #include "xfs_btree.h"
 #include "xfs_rmap.h"
 #include "xfs_refcount.h"
 #include "xfs_ag.h"
 #include "xfs_bit.h"
+#include "xfs_alloc.h"
+#include "xfs_alloc_btree.h"
 #include "scrub/scrub.h"
 #include "scrub/common.h"
 #include "scrub/btree.h"
@@ -51,6 +55,7 @@ struct xchk_rmap {
 	/* Bitmaps containing all blocks for each type of AG metadata. */
 	struct xagb_bitmap	fs_owned;
 	struct xagb_bitmap	log_owned;
+	struct xagb_bitmap	ag_owned;
 
 	/* Did we complete the AG space metadata bitmaps? */
 	bool			bitmaps_complete;
@@ -291,6 +296,9 @@ xchk_rmapbt_mark_bitmap(
 	case XFS_RMAP_OWN_LOG:
 		bmp = &cr->log_owned;
 		break;
+	case XFS_RMAP_OWN_AG:
+		bmp = &cr->ag_owned;
+		break;
 	}
 
 	if (!bmp)
@@ -343,9 +351,26 @@ xchk_rmapbt_rec(
 	return xchk_rmapbt_mark_bitmap(bs, cr, &irec);
 }
 
+/* Mark an AGFL block in the agblock bitmap. */
+STATIC int
+xchk_rmapbt_walk_agfl(
+	struct xfs_mount	*mp,
+	xfs_agblock_t		agbno,
+	void			*priv)
+{
+	struct xagb_bitmap	*bitmap = priv;
+
+	return xagb_bitmap_set(bitmap, agbno, 1);
+}
+
 /*
  * Set up bitmaps mapping all the AG metadata to compare with the rmapbt
  * records.
+ *
+ * Grab our own btree cursors here if the scrub setup function didn't give us a
+ * btree cursor due to reports of poor health.  We need to find out if the
+ * rmapbt disagrees with primary metadata btrees to tag the rmapbt as being
+ * XCORRUPT.
  */
 STATIC int
 xchk_rmapbt_walk_ag_metadata(
@@ -353,6 +378,9 @@ xchk_rmapbt_walk_ag_metadata(
 	struct xchk_rmap	*cr)
 {
 	struct xfs_mount	*mp = sc->mp;
+	struct xfs_buf		*agfl_bp;
+	struct xfs_agf		*agf = sc->sa.agf_bp->b_addr;
+	struct xfs_btree_cur	*cur;
 	int			error;
 
 	/* OWN_FS: AG headers */
@@ -370,6 +398,39 @@ xchk_rmapbt_walk_ag_metadata(
 			goto out;
 	}
 
+	/* OWN_AG: bnobt, cntbt, rmapbt, and AGFL */
+	cur = sc->sa.bno_cur;
+	if (!cur)
+		cur = xfs_allocbt_init_cursor(sc->mp, sc->tp, sc->sa.agf_bp,
+				sc->sa.pag, XFS_BTNUM_BNO);
+	error = xagb_bitmap_set_btblocks(&cr->ag_owned, cur);
+	if (cur != sc->sa.bno_cur)
+		xfs_btree_del_cursor(cur, error);
+	if (error)
+		goto out;
+
+	cur = sc->sa.cnt_cur;
+	if (!cur)
+		cur = xfs_allocbt_init_cursor(sc->mp, sc->tp, sc->sa.agf_bp,
+				sc->sa.pag, XFS_BTNUM_CNT);
+	error = xagb_bitmap_set_btblocks(&cr->ag_owned, cur);
+	if (cur != sc->sa.cnt_cur)
+		xfs_btree_del_cursor(cur, error);
+	if (error)
+		goto out;
+
+	error = xagb_bitmap_set_btblocks(&cr->ag_owned, sc->sa.rmap_cur);
+	if (error)
+		goto out;
+
+	error = xfs_alloc_read_agfl(sc->sa.pag, sc->tp, &agfl_bp);
+	if (error)
+		goto out;
+
+	error = xfs_agfl_walk(sc->mp, agf, agfl_bp, xchk_rmapbt_walk_agfl,
+			&cr->ag_owned);
+	xfs_trans_brelse(sc->tp, agfl_bp);
+
 out:
 	/*
 	 * If there's an error, set XFAIL and disable the bitmap
@@ -411,6 +472,9 @@ xchk_rmapbt_check_bitmaps(
 
 	if (xagb_bitmap_hweight(&cr->log_owned) != 0)
 		xchk_btree_xref_set_corrupt(sc, cur, level);
+
+	if (xagb_bitmap_hweight(&cr->ag_owned) != 0)
+		xchk_btree_xref_set_corrupt(sc, cur, level);
 }
 
 /* Scrub the rmap btree for some AG. */
@@ -427,6 +491,7 @@ xchk_rmapbt(
 
 	xagb_bitmap_init(&cr->fs_owned);
 	xagb_bitmap_init(&cr->log_owned);
+	xagb_bitmap_init(&cr->ag_owned);
 
 	error = xchk_rmapbt_walk_ag_metadata(sc, cr);
 	if (error)
@@ -440,6 +505,7 @@ xchk_rmapbt(
 	xchk_rmapbt_check_bitmaps(sc, cr);
 
 out:
+	xagb_bitmap_destroy(&cr->ag_owned);
 	xagb_bitmap_destroy(&cr->log_owned);
 	xagb_bitmap_destroy(&cr->fs_owned);
 	kfree(cr);


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 4/5] xfs: cross-reference rmap records with inode btrees
  2022-12-30 22:11 ` [PATCHSET v24.0 0/5] xfs: strengthen rmapbt scrubbing Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 1/5] xfs: introduce bitmap type for AG blocks Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 5/5] xfs: cross-reference rmap records with refcount btrees Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 3/5] xfs: cross-reference rmap records with free space btrees Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 2/5] xfs: cross-reference rmap records with ag btrees Darrick J. Wong
  4 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Strengthen the rmap btree record checker a little more by comparing
OWN_INOBT reverse mappings against the inode btrees.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/rmap.c |   36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)


diff --git a/fs/xfs/scrub/rmap.c b/fs/xfs/scrub/rmap.c
index b8e82f5b84f4..f9a05a8c3936 100644
--- a/fs/xfs/scrub/rmap.c
+++ b/fs/xfs/scrub/rmap.c
@@ -18,6 +18,7 @@
 #include "xfs_bit.h"
 #include "xfs_alloc.h"
 #include "xfs_alloc_btree.h"
+#include "xfs_ialloc_btree.h"
 #include "scrub/scrub.h"
 #include "scrub/common.h"
 #include "scrub/btree.h"
@@ -56,6 +57,7 @@ struct xchk_rmap {
 	struct xagb_bitmap	fs_owned;
 	struct xagb_bitmap	log_owned;
 	struct xagb_bitmap	ag_owned;
+	struct xagb_bitmap	inobt_owned;
 
 	/* Did we complete the AG space metadata bitmaps? */
 	bool			bitmaps_complete;
@@ -299,6 +301,9 @@ xchk_rmapbt_mark_bitmap(
 	case XFS_RMAP_OWN_AG:
 		bmp = &cr->ag_owned;
 		break;
+	case XFS_RMAP_OWN_INOBT:
+		bmp = &cr->inobt_owned;
+		break;
 	}
 
 	if (!bmp)
@@ -430,6 +435,32 @@ xchk_rmapbt_walk_ag_metadata(
 	error = xfs_agfl_walk(sc->mp, agf, agfl_bp, xchk_rmapbt_walk_agfl,
 			&cr->ag_owned);
 	xfs_trans_brelse(sc->tp, agfl_bp);
+	if (error)
+		goto out;
+
+	/* OWN_INOBT: inobt, finobt */
+	cur = sc->sa.ino_cur;
+	if (!cur)
+		cur = xfs_inobt_init_cursor(sc->mp, sc->tp, sc->sa.agi_bp,
+				sc->sa.pag, XFS_BTNUM_INO);
+	error = xagb_bitmap_set_btblocks(&cr->inobt_owned, cur);
+	if (cur != sc->sa.ino_cur)
+		xfs_btree_del_cursor(cur, error);
+	if (error)
+		goto out;
+
+	if (xfs_has_finobt(sc->mp)) {
+		cur = sc->sa.fino_cur;
+		if (!cur)
+			cur = xfs_inobt_init_cursor(sc->mp, sc->tp,
+					sc->sa.agi_bp, sc->sa.pag,
+					XFS_BTNUM_FINO);
+		error = xagb_bitmap_set_btblocks(&cr->inobt_owned, cur);
+		if (cur != sc->sa.fino_cur)
+			xfs_btree_del_cursor(cur, error);
+		if (error)
+			goto out;
+	}
 
 out:
 	/*
@@ -475,6 +506,9 @@ xchk_rmapbt_check_bitmaps(
 
 	if (xagb_bitmap_hweight(&cr->ag_owned) != 0)
 		xchk_btree_xref_set_corrupt(sc, cur, level);
+
+	if (xagb_bitmap_hweight(&cr->inobt_owned) != 0)
+		xchk_btree_xref_set_corrupt(sc, cur, level);
 }
 
 /* Scrub the rmap btree for some AG. */
@@ -492,6 +526,7 @@ xchk_rmapbt(
 	xagb_bitmap_init(&cr->fs_owned);
 	xagb_bitmap_init(&cr->log_owned);
 	xagb_bitmap_init(&cr->ag_owned);
+	xagb_bitmap_init(&cr->inobt_owned);
 
 	error = xchk_rmapbt_walk_ag_metadata(sc, cr);
 	if (error)
@@ -505,6 +540,7 @@ xchk_rmapbt(
 	xchk_rmapbt_check_bitmaps(sc, cr);
 
 out:
+	xagb_bitmap_destroy(&cr->inobt_owned);
 	xagb_bitmap_destroy(&cr->ag_owned);
 	xagb_bitmap_destroy(&cr->log_owned);
 	xagb_bitmap_destroy(&cr->fs_owned);


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 5/5] xfs: cross-reference rmap records with refcount btrees
  2022-12-30 22:11 ` [PATCHSET v24.0 0/5] xfs: strengthen rmapbt scrubbing Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 1/5] xfs: introduce bitmap type for AG blocks Darrick J. Wong
@ 2022-12-30 22:11   ` Darrick J. Wong
  2022-12-30 22:11   ` [PATCH 4/5] xfs: cross-reference rmap records with inode btrees Darrick J. Wong
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:11 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Strengthen the rmap btree record checker a little more by comparing
OWN_REFC reverse mappings against the refcount btrees.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/rmap.c |   23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)


diff --git a/fs/xfs/scrub/rmap.c b/fs/xfs/scrub/rmap.c
index f9a05a8c3936..8f1fdae71766 100644
--- a/fs/xfs/scrub/rmap.c
+++ b/fs/xfs/scrub/rmap.c
@@ -19,6 +19,7 @@
 #include "xfs_alloc.h"
 #include "xfs_alloc_btree.h"
 #include "xfs_ialloc_btree.h"
+#include "xfs_refcount_btree.h"
 #include "scrub/scrub.h"
 #include "scrub/common.h"
 #include "scrub/btree.h"
@@ -58,6 +59,7 @@ struct xchk_rmap {
 	struct xagb_bitmap	log_owned;
 	struct xagb_bitmap	ag_owned;
 	struct xagb_bitmap	inobt_owned;
+	struct xagb_bitmap	refcbt_owned;
 
 	/* Did we complete the AG space metadata bitmaps? */
 	bool			bitmaps_complete;
@@ -304,6 +306,9 @@ xchk_rmapbt_mark_bitmap(
 	case XFS_RMAP_OWN_INOBT:
 		bmp = &cr->inobt_owned;
 		break;
+	case XFS_RMAP_OWN_REFC:
+		bmp = &cr->refcbt_owned;
+		break;
 	}
 
 	if (!bmp)
@@ -462,6 +467,19 @@ xchk_rmapbt_walk_ag_metadata(
 			goto out;
 	}
 
+	/* OWN_REFC: refcountbt */
+	if (xfs_has_reflink(sc->mp)) {
+		cur = sc->sa.refc_cur;
+		if (!cur)
+			cur = xfs_refcountbt_init_cursor(sc->mp, sc->tp,
+					sc->sa.agf_bp, sc->sa.pag);
+		error = xagb_bitmap_set_btblocks(&cr->refcbt_owned, cur);
+		if (cur != sc->sa.refc_cur)
+			xfs_btree_del_cursor(cur, error);
+		if (error)
+			goto out;
+	}
+
 out:
 	/*
 	 * If there's an error, set XFAIL and disable the bitmap
@@ -509,6 +527,9 @@ xchk_rmapbt_check_bitmaps(
 
 	if (xagb_bitmap_hweight(&cr->inobt_owned) != 0)
 		xchk_btree_xref_set_corrupt(sc, cur, level);
+
+	if (xagb_bitmap_hweight(&cr->refcbt_owned) != 0)
+		xchk_btree_xref_set_corrupt(sc, cur, level);
 }
 
 /* Scrub the rmap btree for some AG. */
@@ -527,6 +548,7 @@ xchk_rmapbt(
 	xagb_bitmap_init(&cr->log_owned);
 	xagb_bitmap_init(&cr->ag_owned);
 	xagb_bitmap_init(&cr->inobt_owned);
+	xagb_bitmap_init(&cr->refcbt_owned);
 
 	error = xchk_rmapbt_walk_ag_metadata(sc, cr);
 	if (error)
@@ -540,6 +562,7 @@ xchk_rmapbt(
 	xchk_rmapbt_check_bitmaps(sc, cr);
 
 out:
+	xagb_bitmap_destroy(&cr->refcbt_owned);
 	xagb_bitmap_destroy(&cr->inobt_owned);
 	xagb_bitmap_destroy(&cr->ag_owned);
 	xagb_bitmap_destroy(&cr->log_owned);


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCHSET v24.0 0/4] xfs: fix rmap btree key flag handling
  2022-12-30 21:13 [NYE DELUGE 1/4] xfs: all pending online scrub improvements Darrick J. Wong
                   ` (17 preceding siblings ...)
  2022-12-30 22:11 ` [PATCHSET v24.0 0/5] xfs: strengthen rmapbt scrubbing Darrick J. Wong
@ 2022-12-30 22:12 ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 1/4] xfs: fix rm_offset flag handling in rmap keys Darrick J. Wong
                     ` (3 more replies)
  2022-12-30 22:12 ` [PATCHSET v24.0 00/16] fstests: refactor online fsck stress tests Darrick J. Wong
                   ` (3 subsequent siblings)
  22 siblings, 4 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

Hi all,

This series fixes numerous flag handling bugs in the rmapbt key code.
The most serious transgression is that key comparisons completely strip
out all flag bits from rm_offset, including the ones that participate in
record lookups.  The second problem is that for years we've been letting
the unwritten flag (which is an attribute of a specific record and not
part of the record key) escape from leaf records into key records.

The solution to the second problem is to filter attribute flags when
creating keys from records, and the solution to the first problem is to
preserve *only* the flags used for key lookups.  The ATTR and BMBT flags
are a part of the lookup key, and the UNWRITTEN flag is a record
attribute.
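
To make the fix concrete: the rm_offset flag bits (bit positions quoted
from my reading of xfs_format.h; see the header for the authoritative
definitions) and the comparison mask that patch 1 introduces look
roughly like this:

	/* rm_offset flag bits, per xfs_format.h: */
	#define XFS_RMAP_OFF_ATTR_FORK	((uint64_t)1 << 63)	/* lookup key */
	#define XFS_RMAP_OFF_BMBT_BLOCK	((uint64_t)1 << 62)	/* lookup key */
	#define XFS_RMAP_OFF_UNWRITTEN	((uint64_t)1 << 61)	/* record attribute */

	/* Keep the lookup bits, drop only the record attribute bit. */
	static inline uint64_t offset_keymask(uint64_t offset)
	{
		return offset & ~XFS_RMAP_OFF_UNWRITTEN;
	}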

This has worked for years without generating user complaints because
ATTR and BMBT extents cannot be shared, so key comparisons succeed
solely on rm_startblock.  Only file data fork extents can be shared, and
those records never set any of the three flag bits, so comparisons that
dig into rm_owner and rm_offset work just fine.

A filesystem written with an unpatched kernel and mounted on a patched
kernel will work correctly because the ATTR/BMBT flags have been
conveyed into keys correctly all along, and we still ignore the
UNWRITTEN flag in any key record.  This was what doomed my previous
attempt to correct this problem in 2019.

A filesystem written with a patched kernel and mounted on an unpatched
kernel will also work correctly because unpatched kernels ignore all
flags.

With this patchset applied, the scrub code gains the ability to detect
rmap btrees with incorrectly set attr and bmbt flags in the key records.
After three years of testing, I haven't encountered any problems.
Online scrub is amended to recommend rebuilding of rmap btrees with the
unwritten flag set in key records.

The xfsprogs counterpart to this series amends xfs_repair to report key
records with the unwritten flag bit set, just prior to rebuilding the
rmapbt.  It also exposes the bit via xfs_db to enable testing back and
forth.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=rmap-btree-fix-key-handling

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=rmap-btree-fix-key-handling
---
 db/btblock.c            |    4 +++
 libxfs/xfs_rmap_btree.c |   40 ++++++++++++++++++++++++-------
 repair/scan.c           |   60 ++++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 93 insertions(+), 11 deletions(-)


^ permalink raw reply	[flat|nested] 220+ messages in thread

* [PATCH 1/4] xfs: fix rm_offset flag handling in rmap keys
  2022-12-30 22:12 ` [PATCHSET v24.0 0/4] xfs: fix rmap btree key flag handling Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 2/4] xfs_repair: check low keys of rmap btrees Darrick J. Wong
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Keys for extent interval records in the reverse mapping btree are
supposed to be computed as follows:

(physical block, owner, fork, is_btree, offset)

This provides users the ability to look up a reverse mapping from a file
block mapping record -- start with the physical block; then if there are
multiple records for the same block, move on to the owner; then the
inode fork type; and so on to the file offset.

Unfortunately, the code that creates rmap lookup keys from rmap records
forgot to mask off the record attribute flags, leading to ondisk keys
that look like this:

(physical block, owner, fork, is_btree, unwritten state, offset)

Fortunately, this has all worked ok for the past six years because the
key comparison functions incorrectly ignore the fork/bmbt/unwritten
information that's encoded in the on-disk offset.  This means that
lookup comparisons are only done with:

(physical block, owner, offset)

Queries can (theoretically) return incorrect results because of this
omission.  On consistent filesystems this isn't an issue because xattr
and bmbt blocks cannot be shared and hence the comparisons succeed
purely on the contents of the rm_startblock field.  For the one case
where we support sharing (written data fork blocks) all flag bits are
zero, so the omission in the comparison has no ill effects.

Unfortunately, this bug prevents scrub from detecting incorrect fork and
bmbt flag bits in the rmap btree, so we really do need to fix the
compare code.  Old filesystems with the unwritten bit erroneously set in
the rmap key struct will work fine on new kernels since we still ignore
the unwritten bit.  New filesystems on older kernels will work fine
since the old kernels never paid attention to the unwritten bit.

A previous version of this patch forgot to keep the (un)written state
flag masked during the comparison and caused a major regression in
5.9.x since unwritten extent conversion can update an rmap record
without requiring key updates.

Note that blocks cannot go directly from data fork to attr fork without
being deallocated and reallocated, nor can they be added to or removed
from a bmbt without a free/alloc cycle, so this should not cause any
regressions.

Found by fuzzing keys[1].attrfork = ones on xfs/371.

Fixes: 4b8ed67794fe ("xfs: add rmap btree operations")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 libxfs/xfs_rmap_btree.c |   40 ++++++++++++++++++++++++++++++----------
 1 file changed, 30 insertions(+), 10 deletions(-)


diff --git a/libxfs/xfs_rmap_btree.c b/libxfs/xfs_rmap_btree.c
index bb64ab2e25c..f0368383775 100644
--- a/libxfs/xfs_rmap_btree.c
+++ b/libxfs/xfs_rmap_btree.c
@@ -154,6 +154,16 @@ xfs_rmapbt_get_maxrecs(
 	return cur->bc_mp->m_rmap_mxr[level != 0];
 }
 
+/*
+ * Convert the ondisk record's offset field into the ondisk key's offset field.
+ * Fork and bmbt are significant parts of the rmap record key, but written
+ * status is merely a record attribute.
+ */
+static inline __be64 ondisk_rec_offset_to_key(const union xfs_btree_rec *rec)
+{
+	return rec->rmap.rm_offset & ~cpu_to_be64(XFS_RMAP_OFF_UNWRITTEN);
+}
+
 STATIC void
 xfs_rmapbt_init_key_from_rec(
 	union xfs_btree_key		*key,
@@ -161,7 +171,7 @@ xfs_rmapbt_init_key_from_rec(
 {
 	key->rmap.rm_startblock = rec->rmap.rm_startblock;
 	key->rmap.rm_owner = rec->rmap.rm_owner;
-	key->rmap.rm_offset = rec->rmap.rm_offset;
+	key->rmap.rm_offset = ondisk_rec_offset_to_key(rec);
 }
 
 /*
@@ -184,7 +194,7 @@ xfs_rmapbt_init_high_key_from_rec(
 	key->rmap.rm_startblock = rec->rmap.rm_startblock;
 	be32_add_cpu(&key->rmap.rm_startblock, adj);
 	key->rmap.rm_owner = rec->rmap.rm_owner;
-	key->rmap.rm_offset = rec->rmap.rm_offset;
+	key->rmap.rm_offset = ondisk_rec_offset_to_key(rec);
 	if (XFS_RMAP_NON_INODE_OWNER(be64_to_cpu(rec->rmap.rm_owner)) ||
 	    XFS_RMAP_IS_BMBT_BLOCK(be64_to_cpu(rec->rmap.rm_offset)))
 		return;
@@ -217,6 +227,16 @@ xfs_rmapbt_init_ptr_from_cur(
 	ptr->s = agf->agf_roots[cur->bc_btnum];
 }
 
+/*
+ * Mask the appropriate parts of the ondisk key field for a key comparison.
+ * Fork and bmbt are significant parts of the rmap record key, but written
+ * status is merely a record attribute.
+ */
+static inline uint64_t offset_keymask(uint64_t offset)
+{
+	return offset & ~XFS_RMAP_OFF_UNWRITTEN;
+}
+
 STATIC int64_t
 xfs_rmapbt_key_diff(
 	struct xfs_btree_cur		*cur,
@@ -238,8 +258,8 @@ xfs_rmapbt_key_diff(
 	else if (y > x)
 		return -1;
 
-	x = XFS_RMAP_OFF(be64_to_cpu(kp->rm_offset));
-	y = rec->rm_offset;
+	x = offset_keymask(be64_to_cpu(kp->rm_offset));
+	y = offset_keymask(xfs_rmap_irec_offset_pack(rec));
 	if (x > y)
 		return 1;
 	else if (y > x)
@@ -270,8 +290,8 @@ xfs_rmapbt_diff_two_keys(
 	else if (y > x)
 		return -1;
 
-	x = XFS_RMAP_OFF(be64_to_cpu(kp1->rm_offset));
-	y = XFS_RMAP_OFF(be64_to_cpu(kp2->rm_offset));
+	x = offset_keymask(be64_to_cpu(kp1->rm_offset));
+	y = offset_keymask(be64_to_cpu(kp2->rm_offset));
 	if (x > y)
 		return 1;
 	else if (y > x)
@@ -385,8 +405,8 @@ xfs_rmapbt_keys_inorder(
 		return 1;
 	else if (a > b)
 		return 0;
-	a = XFS_RMAP_OFF(be64_to_cpu(k1->rmap.rm_offset));
-	b = XFS_RMAP_OFF(be64_to_cpu(k2->rmap.rm_offset));
+	a = offset_keymask(be64_to_cpu(k1->rmap.rm_offset));
+	b = offset_keymask(be64_to_cpu(k2->rmap.rm_offset));
 	if (a <= b)
 		return 1;
 	return 0;
@@ -415,8 +435,8 @@ xfs_rmapbt_recs_inorder(
 		return 1;
 	else if (a > b)
 		return 0;
-	a = XFS_RMAP_OFF(be64_to_cpu(r1->rmap.rm_offset));
-	b = XFS_RMAP_OFF(be64_to_cpu(r2->rmap.rm_offset));
+	a = offset_keymask(be64_to_cpu(r1->rmap.rm_offset));
+	b = offset_keymask(be64_to_cpu(r2->rmap.rm_offset));
 	if (a <= b)
 		return 1;
 	return 0;


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 2/4] xfs_repair: check low keys of rmap btrees
  2022-12-30 22:12 ` [PATCHSET v24.0 0/4] xfs: fix rmap btree key flag handling Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 1/4] xfs: fix rm_offset flag handling in rmap keys Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 4/4] xfs_db: expose the unwritten flag in rmapbt keys Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 3/4] xfs_repair: warn about unwritten bits set in rmap btree keys Darrick J. Wong
  3 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

For whatever reason, we only check the high keys in an rmapbt node
block.  We should be checking the low keys and the high keys, so fix
this gap.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 repair/scan.c |   32 +++++++++++++++++++++++++++++++-
 1 file changed, 31 insertions(+), 1 deletion(-)


diff --git a/repair/scan.c b/repair/scan.c
index 7b72013153d..d66ce60cbb3 100644
--- a/repair/scan.c
+++ b/repair/scan.c
@@ -992,6 +992,7 @@ scan_rmapbt(
 	uint64_t		lastowner = 0;
 	uint64_t		lastoffset = 0;
 	struct xfs_rmap_key	*kp;
+	struct xfs_rmap_irec	oldkey;
 	struct xfs_rmap_irec	key = {0};
 	struct xfs_perag	*pag;
 
@@ -1211,7 +1212,7 @@ _("%s rmap btree block claimed (state %d), agno %d, bno %d, suspect %d\n"),
 	}
 
 	/* check the node's high keys */
-	for (i = 0; !isroot && i < numrecs; i++) {
+	for (i = 0; i < numrecs; i++) {
 		kp = XFS_RMAP_HIGH_KEY_ADDR(block, i + 1);
 
 		key.rm_flags = 0;
@@ -1231,6 +1232,35 @@ _("%s rmap btree block claimed (state %d), agno %d, bno %d, suspect %d\n"),
 				i, agno, bno, name);
 	}
 
+	/* check for in-order keys */
+	for (i = 0; i < numrecs; i++)  {
+		kp = XFS_RMAP_KEY_ADDR(block, i + 1);
+
+		key.rm_flags = 0;
+		key.rm_startblock = be32_to_cpu(kp->rm_startblock);
+		key.rm_owner = be64_to_cpu(kp->rm_owner);
+		if (libxfs_rmap_irec_offset_unpack(be64_to_cpu(kp->rm_offset),
+				&key)) {
+			/* Look for impossible flags. */
+			do_warn(
+_("invalid flags in key %u of %s btree block %u/%u\n"),
+				i, name, agno, bno);
+			suspect++;
+			continue;
+		}
+		if (i == 0) {
+			oldkey = key;
+			continue;
+		}
+		if (rmap_diffkeys(&oldkey, &key) > 0) {
+			do_warn(
+_("out of order key %u in %s btree block (%u/%u)\n"),
+				i, name, agno, bno);
+			suspect++;
+		}
+		oldkey = key;
+	}
+
 	pag = libxfs_perag_get(mp, agno);
 	for (i = 0; i < numrecs; i++)  {
 		xfs_agblock_t		agbno = be32_to_cpu(pp[i]);


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 3/4] xfs_repair: warn about unwritten bits set in rmap btree keys
  2022-12-30 22:12 ` [PATCHSET v24.0 0/4] xfs: fix rmap btree key flag handling Darrick J. Wong
                     ` (2 preceding siblings ...)
  2022-12-30 22:12   ` [PATCH 4/4] xfs_db: expose the unwritten flag in rmapbt keys Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  3 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Now that we've changed libxfs to handle the rmapbt flags correctly when
creating and comparing rmapbt keys, teach repair to warn about keys that
have the unwritten bit erroneously set.  The old broken behavior never
caused any problems, so we only warn once per filesystem and don't set
the exitcode to 1 if we're running in dry run mode.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 repair/scan.c |   28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)


diff --git a/repair/scan.c b/repair/scan.c
index d66ce60cbb3..008ef65ac75 100644
--- a/repair/scan.c
+++ b/repair/scan.c
@@ -966,6 +966,30 @@ verify_rmap_agbno(
 	return agbno < libxfs_ag_block_count(mp, agno);
 }
 
+static inline void
+warn_rmap_unwritten_key(
+	xfs_agnumber_t		agno)
+{
+	static bool		warned = false;
+	static pthread_mutex_t	lock = PTHREAD_MUTEX_INITIALIZER;
+
+	if (warned)
+		return;
+
+	pthread_mutex_lock(&lock);
+	if (!warned) {
+		if (no_modify)
+			do_log(
+ _("would clear unwritten flag on rmapbt key in agno 0x%x\n"),
+			       agno);
+		else
+			do_warn(
+ _("clearing unwritten flag on rmapbt key in agno 0x%x\n"),
+			       agno);
+		warned = true;
+	}
+	pthread_mutex_unlock(&lock);
+}
 
 static void
 scan_rmapbt(
@@ -1218,6 +1242,8 @@ _("%s rmap btree block claimed (state %d), agno %d, bno %d, suspect %d\n"),
 		key.rm_flags = 0;
 		key.rm_startblock = be32_to_cpu(kp->rm_startblock);
 		key.rm_owner = be64_to_cpu(kp->rm_owner);
+		if (kp->rm_offset & cpu_to_be64(XFS_RMAP_OFF_UNWRITTEN))
+			warn_rmap_unwritten_key(agno);
 		if (libxfs_rmap_irec_offset_unpack(be64_to_cpu(kp->rm_offset),
 				&key)) {
 			/* Look for impossible flags. */
@@ -1239,6 +1265,8 @@ _("%s rmap btree block claimed (state %d), agno %d, bno %d, suspect %d\n"),
 		key.rm_flags = 0;
 		key.rm_startblock = be32_to_cpu(kp->rm_startblock);
 		key.rm_owner = be64_to_cpu(kp->rm_owner);
+		if (kp->rm_offset & cpu_to_be64(XFS_RMAP_OFF_UNWRITTEN))
+			warn_rmap_unwritten_key(agno);
 		if (libxfs_rmap_irec_offset_unpack(be64_to_cpu(kp->rm_offset),
 				&key)) {
 			/* Look for impossible flags. */


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 4/4] xfs_db: expose the unwritten flag in rmapbt keys
  2022-12-30 22:12 ` [PATCHSET v24.0 0/4] xfs: fix rmap btree key flag handling Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 1/4] xfs: fix rm_offset flag handling in rmap keys Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 2/4] xfs_repair: check low keys of rmap btrees Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 3/4] xfs_repair: warn about unwritten bits set in rmap btree keys Darrick J. Wong
  3 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Teach the debugger to expose the "unwritten" flag in rmapbt keys so that
we can simulate an old filesystem writing out bad keys for testing.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 db/btblock.c |    4 ++++
 1 file changed, 4 insertions(+)


diff --git a/db/btblock.c b/db/btblock.c
index 30f7b5ef955..d5be6adb734 100644
--- a/db/btblock.c
+++ b/db/btblock.c
@@ -770,6 +770,8 @@ const field_t	rmapbt_key_flds[] = {
 	{ "startblock", FLDT_AGBLOCK, OI(KOFF(startblock)), C1, 0, TYP_DATA },
 	{ "owner", FLDT_INT64D, OI(KOFF(owner)), C1, 0, TYP_NONE },
 	{ "offset", FLDT_RFILEOFFD, OI(RMAPBK_OFFSET_BITOFF), C1, 0, TYP_NONE },
+	{ "extentflag", FLDT_REXTFLG, OI(RMAPBK_EXNTFLAG_BITOFF), C1, 0,
+	  TYP_NONE },
 	{ "attrfork", FLDT_RATTRFORKFLG, OI(RMAPBK_ATTRFLAG_BITOFF), C1, 0,
 	  TYP_NONE },
 	{ "bmbtblock", FLDT_RBMBTFLG, OI(RMAPBK_BMBTFLAG_BITOFF), C1, 0,
@@ -777,6 +779,8 @@ const field_t	rmapbt_key_flds[] = {
 	{ "startblock_hi", FLDT_AGBLOCK, OI(HI_KOFF(startblock)), C1, 0, TYP_DATA },
 	{ "owner_hi", FLDT_INT64D, OI(HI_KOFF(owner)), C1, 0, TYP_NONE },
 	{ "offset_hi", FLDT_RFILEOFFD, OI(RMAPBK_OFFSETHI_BITOFF), C1, 0, TYP_NONE },
+	{ "extentflag_hi", FLDT_REXTFLG, OI(RMAPBK_EXNTFLAGHI_BITOFF), C1, 0,
+	  TYP_NONE },
 	{ "attrfork_hi", FLDT_RATTRFORKFLG, OI(RMAPBK_ATTRFLAGHI_BITOFF), C1, 0,
 	  TYP_NONE },
 	{ "bmbtblock_hi", FLDT_RBMBTFLG, OI(RMAPBK_BMBTFLAGHI_BITOFF), C1, 0,


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCHSET v24.0 00/16] fstests: refactor online fsck stress tests
  2022-12-30 21:13 [NYE DELUGE 1/4] xfs: all pending online scrub improvements Darrick J. Wong
                   ` (18 preceding siblings ...)
  2022-12-30 22:12 ` [PATCHSET v24.0 0/4] xfs: fix rmap btree key flag handling Darrick J. Wong
@ 2022-12-30 22:12 ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 03/16] xfs/422: rework feature detection so we only test-format scratch once Darrick J. Wong
                     ` (15 more replies)
  2022-12-30 22:12 ` [PATCHSET v24.0 0/3] fstests: refactor GETFSMAP stress tests Darrick J. Wong
                   ` (2 subsequent siblings)
  22 siblings, 16 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

Hi all,

This series prepares us to begin creating stress tests for the XFS
online fsck feature.  We start by hoisting the loop control code out of
the one existing test (xfs/422) into common/fuzzy, and then we commence
rearranging the code to make it easy to generate more and more tests.
Eventually we will race fsstress against online scrub and online repair
to make sure that xfs_scrub running on a correct filesystem cannot take
it down by accident.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=refactor-scrub-stress
---
 common/fuzzy        |  272 +++++++++++++++++++++++++++++++++++++++++++++++++++
 doc/group-names.txt |    1 
 tests/xfs/422       |  109 ++------------------
 tests/xfs/422.out   |    4 -
 4 files changed, 285 insertions(+), 101 deletions(-)


^ permalink raw reply	[flat|nested] 220+ messages in thread

* [PATCH 01/16] xfs/422: create a new test group for fsstress/repair racers
  2022-12-30 22:12 ` [PATCHSET v24.0 00/16] fstests: refactor online fsck stress tests Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 03/16] xfs/422: rework feature detection so we only test-format scratch once Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 07/16] fuzzy: give each test local control over what scrub stress tests get run Darrick J. Wong
                     ` (13 subsequent siblings)
  15 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

Create a new group for tests that race fsstress with online filesystem
repair, and move this test out of the fuzzer groups and into the new
group (and the regular online_repair group).

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 doc/group-names.txt |    1 +
 tests/xfs/422       |    2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)


diff --git a/doc/group-names.txt b/doc/group-names.txt
index 6cc9af7844..ac219e05b3 100644
--- a/doc/group-names.txt
+++ b/doc/group-names.txt
@@ -34,6 +34,7 @@ dangerous_bothrepair	fuzzers to evaluate xfs_scrub + xfs_repair repair
 dangerous_fuzzers	fuzzers that can crash your computer
 dangerous_norepair	fuzzers to evaluate kernel metadata verifiers
 dangerous_online_repair	fuzzers to evaluate xfs_scrub online repair
+dangerous_fsstress_repair	race fsstress and xfs_scrub online repair
 dangerous_repair	fuzzers to evaluate xfs_repair offline repair
 dangerous_scrub		fuzzers to evaluate xfs_scrub checking
 data			data loss checkers
diff --git a/tests/xfs/422 b/tests/xfs/422
index f3c63e8d6a..9ed944ed63 100755
--- a/tests/xfs/422
+++ b/tests/xfs/422
@@ -9,7 +9,7 @@
 # activity, so we can't have userspace wandering in and thawing it.
 #
 . ./common/preamble
-_begin_fstest dangerous_scrub dangerous_online_repair freeze
+_begin_fstest online_repair dangerous_fsstress_repair freeze
 
 _register_cleanup "_cleanup" BUS
 


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 02/16] xfs/422: move the fsstress/freeze/scrub racing logic to common/fuzzy
  2022-12-30 22:12 ` [PATCHSET v24.0 00/16] fstests: refactor online fsck stress tests Darrick J. Wong
                     ` (5 preceding siblings ...)
  2022-12-30 22:12   ` [PATCH 06/16] fuzzy: explicitly check for common/inject in _require_xfs_stress_online_repair Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 12/16] fuzzy: increase operation count for each fsstress invocation Darrick J. Wong
                     ` (8 subsequent siblings)
  15 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

Hoist all this code to common/fuzzy in preparation for making it more
generic so that we can implement a variety of tests that check the
concurrency correctness of online fsck.  Do just enough renaming so that
we don't pollute the test program's namespace; we'll fix the other warts
in subsequent patches.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 common/fuzzy      |  100 +++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/422     |  104 ++++-------------------------------------------------
 tests/xfs/422.out |    4 +-
 3 files changed, 109 insertions(+), 99 deletions(-)


diff --git a/common/fuzzy b/common/fuzzy
index 70213af5db..979fa55515 100644
--- a/common/fuzzy
+++ b/common/fuzzy
@@ -316,3 +316,103 @@ _scratch_xfs_fuzz_metadata() {
 		done
 	done
 }
+
+# Functions to race fsstress, fs freeze, and xfs metadata scrubbing against
+# each other to shake out bugs in xfs online repair.
+
+# Filter freeze and thaw loop output so that we don't tarnish the golden output
+# if the kernel temporarily won't let us freeze.
+__stress_freeze_filter_output() {
+	grep -E -v '(Device or resource busy|Invalid argument)'
+}
+
+# Filter scrub output so that we don't tarnish the golden output if the fs is
+# too busy to scrub.  Note: Tests should _notrun if the scrub type is not
+# supported.
+__stress_scrub_filter_output() {
+	grep -E -v '(Device or resource busy|Invalid argument)'
+}
+
+# Run fs freeze and thaw in a tight loop.
+__stress_scrub_freeze_loop() {
+	local end="$1"
+
+	while [ "$(date +%s)" -lt $end ]; do
+		$XFS_IO_PROG -x -c 'freeze' -c 'thaw' $SCRATCH_MNT 2>&1 | \
+			__stress_freeze_filter_output
+	done
+}
+
+# Run xfs online fsck commands in a tight loop.
+__stress_scrub_loop() {
+	local end="$1"
+
+	while [ "$(date +%s)" -lt $end ]; do
+		$XFS_IO_PROG -x -c 'repair rmapbt 0' -c 'repair rmapbt 1' $SCRATCH_MNT 2>&1 | \
+			__stress_scrub_filter_output
+	done
+}
+
+# Run fsstress while we're testing online fsck.
+__stress_scrub_fsstress_loop() {
+	local end="$1"
+
+	local args=$(_scale_fsstress_args -p 4 -d $SCRATCH_MNT -n 2000 $FSSTRESS_AVOID)
+
+	while [ "$(date +%s)" -lt $end ]; do
+		$FSSTRESS_PROG $args >> $seqres.full
+	done
+}
+
+# Make sure we have everything we need to run stress and scrub
+_require_xfs_stress_scrub() {
+	_require_xfs_io_command "scrub"
+	_require_command "$KILLALL_PROG" killall
+	_require_freeze
+}
+
+# Make sure we have everything we need to run stress and online repair
+_require_xfs_stress_online_repair() {
+	_require_xfs_stress_scrub
+	_require_xfs_io_command "repair"
+	_require_xfs_io_error_injection "force_repair"
+	_require_freeze
+}
+
+# Clean up after the loops in case they didn't do it themselves.
+_scratch_xfs_stress_scrub_cleanup() {
+	$KILLALL_PROG -TERM xfs_io fsstress >> $seqres.full 2>&1
+	$XFS_IO_PROG -x -c 'thaw' $SCRATCH_MNT >> $seqres.full 2>&1
+}
+
+# Start scrub, freeze, and fsstress in background looping processes, and wait
+# for 30*TIME_FACTOR seconds to see if the filesystem goes down.  Callers
+# must call _scratch_xfs_stress_scrub_cleanup from their cleanup functions.
+_scratch_xfs_stress_scrub() {
+	local start="$(date +%s)"
+	local end="$((start + (30 * TIME_FACTOR) ))"
+
+	echo "Loop started at $(date --date="@${start}")," \
+		   "ending at $(date --date="@${end}")" >> $seqres.full
+
+	__stress_scrub_fsstress_loop $end &
+	__stress_scrub_freeze_loop $end &
+	__stress_scrub_loop $end &
+
+	# Wait until 2 seconds after the loops should have finished, then
+	# clean up after ourselves.
+	while [ "$(date +%s)" -lt $((end + 2)) ]; do
+		sleep 1
+	done
+	_scratch_xfs_stress_scrub_cleanup
+
+	echo "Loop finished at $(date)" >> $seqres.full
+}
+
+# Start online repair, freeze, and fsstress in background looping processes,
+# and wait for 30*TIME_FACTOR seconds to see if the filesystem goes down.
+# Same requirements and arguments as _scratch_xfs_stress_scrub.
+_scratch_xfs_stress_online_repair() {
+	$XFS_IO_PROG -x -c 'inject force_repair' $SCRATCH_MNT
+	_scratch_xfs_stress_scrub "$@"
+}
diff --git a/tests/xfs/422 b/tests/xfs/422
index 9ed944ed63..0bf08572f3 100755
--- a/tests/xfs/422
+++ b/tests/xfs/422
@@ -4,40 +4,19 @@
 #
 # FS QA Test No. 422
 #
-# Race freeze and rmapbt repair for a while to see if we crash or livelock.
+# Race fsstress and rmapbt repair for a while to see if we crash or livelock.
 # rmapbt repair requires us to freeze the filesystem to stop all filesystem
 # activity, so we can't have userspace wandering in and thawing it.
 #
 . ./common/preamble
 _begin_fstest online_repair dangerous_fsstress_repair freeze
 
-_register_cleanup "_cleanup" BUS
-
-# First kill and wait the freeze loop so it won't try to freeze fs again
-# Then make sure fs is not frozen
-# Then kill and wait for the rest of the workers
-# Because if fs is frozen a killed writer will never exit
-kill_loops() {
-	local sig=$1
-
-	[ -n "$freeze_pid" ] && kill $sig $freeze_pid
-	wait $freeze_pid
-	unset freeze_pid
-	$XFS_IO_PROG -x -c 'thaw' $SCRATCH_MNT
-	[ -n "$stress_pid" ] && kill $sig $stress_pid
-	[ -n "$repair_pid" ] && kill $sig $repair_pid
-	wait
-	unset stress_pid
-	unset repair_pid
-}
-
-# Override the default cleanup function.
-_cleanup()
-{
-	kill_loops -9 > /dev/null 2>&1
+_cleanup() {
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
 	cd /
-	rm -rf $tmp.*
+	rm -r -f $tmp.*
 }
+_register_cleanup "_cleanup" BUS
 
 # Import common functions.
 . ./common/filter
@@ -47,80 +26,13 @@ _cleanup()
 # real QA test starts here
 _supported_fs xfs
 _require_xfs_scratch_rmapbt
-_require_xfs_io_command "scrub"
-_require_xfs_io_error_injection "force_repair"
-_require_command "$KILLALL_PROG" killall
-_require_freeze
+_require_xfs_stress_online_repair
 
-echo "Format and populate"
 _scratch_mkfs > "$seqres.full" 2>&1
 _scratch_mount
-
-STRESS_DIR="$SCRATCH_MNT/testdir"
-mkdir -p $STRESS_DIR
-
-for i in $(seq 0 9); do
-	mkdir -p $STRESS_DIR/$i
-	for j in $(seq 0 9); do
-		mkdir -p $STRESS_DIR/$i/$j
-		for k in $(seq 0 9); do
-			echo x > $STRESS_DIR/$i/$j/$k
-		done
-	done
-done
-
-cpus=$(( $($here/src/feature -o) * 4 * LOAD_FACTOR))
-
-echo "Concurrent repair"
-filter_output() {
-	grep -E -v '(Device or resource busy|Invalid argument)'
-}
-freeze_loop() {
-	end="$1"
-
-	while [ "$(date +%s)" -lt $end ]; do
-		$XFS_IO_PROG -x -c 'freeze' -c 'thaw' $SCRATCH_MNT 2>&1 | filter_output
-	done
-}
-repair_loop() {
-	end="$1"
-
-	while [ "$(date +%s)" -lt $end ]; do
-		$XFS_IO_PROG -x -c 'repair rmapbt 0' -c 'repair rmapbt 1' $SCRATCH_MNT 2>&1 | filter_output
-	done
-}
-stress_loop() {
-	end="$1"
-
-	FSSTRESS_ARGS=$(_scale_fsstress_args -p 4 -d $SCRATCH_MNT -n 2000 $FSSTRESS_AVOID)
-	while [ "$(date +%s)" -lt $end ]; do
-		$FSSTRESS_PROG $FSSTRESS_ARGS >> $seqres.full
-	done
-}
-$XFS_IO_PROG -x -c 'inject force_repair' $SCRATCH_MNT
-
-start=$(date +%s)
-end=$((start + (30 * TIME_FACTOR) ))
-
-echo "Loop started at $(date --date="@${start}"), ending at $(date --date="@${end}")" >> $seqres.full
-stress_loop $end &
-stress_pid=$!
-freeze_loop $end &
-freeze_pid=$!
-repair_loop $end &
-repair_pid=$!
-
-# Wait until 2 seconds after the loops should have finished...
-while [ "$(date +%s)" -lt $((end + 2)) ]; do
-	sleep 1
-done
-
-# ...and clean up after the loops in case they didn't do it themselves.
-kill_loops >> $seqres.full 2>&1
-
-echo "Loop finished at $(date)" >> $seqres.full
-echo "Test done"
+_scratch_xfs_stress_online_repair
 
 # success, all done
+echo Silence is golden
 status=0
 exit
diff --git a/tests/xfs/422.out b/tests/xfs/422.out
index 3818c48fa8..f70693fde6 100644
--- a/tests/xfs/422.out
+++ b/tests/xfs/422.out
@@ -1,4 +1,2 @@
 QA output created by 422
-Format and populate
-Concurrent repair
-Test done
+Silence is golden


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 03/16] xfs/422: rework feature detection so we only test-format scratch once
  2022-12-30 22:12 ` [PATCHSET v24.0 00/16] fstests: refactor online fsck stress tests Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 01/16] xfs/422: create a new test group for fsstress/repair racers Darrick J. Wong
                     ` (14 subsequent siblings)
  15 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

Rework the feature detection in the one online fsck stress test so that
we only test-format the scratch device once per test run.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 tests/xfs/422 |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)


diff --git a/tests/xfs/422 b/tests/xfs/422
index 0bf08572f3..b3353d2202 100755
--- a/tests/xfs/422
+++ b/tests/xfs/422
@@ -25,11 +25,12 @@ _register_cleanup "_cleanup" BUS
 
 # real QA test starts here
 _supported_fs xfs
-_require_xfs_scratch_rmapbt
+_require_scratch
 _require_xfs_stress_online_repair
 
 _scratch_mkfs > "$seqres.full" 2>&1
 _scratch_mount
+_require_xfs_has_feature "$SCRATCH_MNT" rmapbt
 _scratch_xfs_stress_online_repair
 
 # success, all done


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 04/16] fuzzy: clean up scrub stress programs quietly
  2022-12-30 22:12 ` [PATCHSET v24.0 00/16] fstests: refactor online fsck stress tests Darrick J. Wong
                     ` (2 preceding siblings ...)
  2022-12-30 22:12   ` [PATCH 07/16] fuzzy: give each test local control over what scrub stress tests get run Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 05/16] fuzzy: rework scrub stress output filtering Darrick J. Wong
                     ` (11 subsequent siblings)
  15 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

In the cleanup function for online fsck stress test common code, send
SIGINT instead of SIGTERM to the fsstress and xfs_io processes to kill
them.  bash prints 'Terminated' to the golden output when children die
with SIGTERM, which can make a test fail, and we don't want a regular
cleanup function being the thing that prevents the test from passing.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 common/fuzzy |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)


diff --git a/common/fuzzy b/common/fuzzy
index 979fa55515..e52831560d 100644
--- a/common/fuzzy
+++ b/common/fuzzy
@@ -381,7 +381,9 @@ _require_xfs_stress_online_repair() {
 
 # Clean up after the loops in case they didn't do it themselves.
 _scratch_xfs_stress_scrub_cleanup() {
-	$KILLALL_PROG -TERM xfs_io fsstress >> $seqres.full 2>&1
+	# Send SIGINT so that bash won't print a 'Terminated' message that
+	# distorts the golden output.
+	$KILLALL_PROG -INT xfs_io fsstress >> $seqres.full 2>&1
 	$XFS_IO_PROG -x -c 'thaw' $SCRATCH_MNT >> $seqres.full 2>&1
 }
 


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 05/16] fuzzy: rework scrub stress output filtering
  2022-12-30 22:12 ` [PATCHSET v24.0 00/16] fstests: refactor online fsck stress tests Darrick J. Wong
                     ` (3 preceding siblings ...)
  2022-12-30 22:12   ` [PATCH 04/16] fuzzy: clean up scrub stress programs quietly Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 06/16] fuzzy: explicitly check for common/inject in _require_xfs_stress_online_repair Darrick J. Wong
                     ` (10 subsequent siblings)
  15 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

Rework the output filtering functions for scrub stress tests: first, we
should use _filter_scratch to avoid leaking the scratch fs details to
the output.  Second, for scrub and repair, change the filter elements to
reflect outputs that don't indicate failure (such as busy resources,
preening requests, and insufficient space to do anything).  Finally,
change the _require function to check that filter functions have been
sourced.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 common/fuzzy |   11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)


diff --git a/common/fuzzy b/common/fuzzy
index e52831560d..94a6ce85a3 100644
--- a/common/fuzzy
+++ b/common/fuzzy
@@ -323,14 +323,19 @@ _scratch_xfs_fuzz_metadata() {
 # Filter freeze and thaw loop output so that we don't tarnish the golden output
 # if the kernel temporarily won't let us freeze.
 __stress_freeze_filter_output() {
-	grep -E -v '(Device or resource busy|Invalid argument)'
+	_filter_scratch | \
+		sed -e '/Device or resource busy/d' \
+		    -e '/Invalid argument/d'
 }
 
 # Filter scrub output so that we don't tarnish the golden output if the fs is
 # too busy to scrub.  Note: Tests should _notrun if the scrub type is not
 # supported.
 __stress_scrub_filter_output() {
-	grep -E -v '(Device or resource busy|Invalid argument)'
+	_filter_scratch | \
+		sed -e '/Device or resource busy/d' \
+		    -e '/Optimization possible/d' \
+		    -e '/No space left on device/d'
 }
 
 # Run fs freeze and thaw in a tight loop.
@@ -369,6 +374,8 @@ _require_xfs_stress_scrub() {
 	_require_xfs_io_command "scrub"
 	_require_command "$KILLALL_PROG" killall
 	_require_freeze
+	command -v _filter_scratch &>/dev/null || \
+		_notrun 'xfs scrub stress test requires common/filter'
 }
 
 # Make sure we have everything we need to run stress and online repair


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 06/16] fuzzy: explicitly check for common/inject in _require_xfs_stress_online_repair
  2022-12-30 22:12 ` [PATCHSET v24.0 00/16] fstests: refactor online fsck stress tests Darrick J. Wong
                     ` (4 preceding siblings ...)
  2022-12-30 22:12   ` [PATCH 05/16] fuzzy: rework scrub stress output filtering Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 02/16] xfs/422: move the fsstress/freeze/scrub racing logic to common/fuzzy Darrick J. Wong
                     ` (9 subsequent siblings)
  15 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

In _require_xfs_stress_online_repair, make sure that the test has
sourced common/inject before we try to call its functions.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 common/fuzzy |    2 ++
 1 file changed, 2 insertions(+)


diff --git a/common/fuzzy b/common/fuzzy
index 94a6ce85a3..de9e398984 100644
--- a/common/fuzzy
+++ b/common/fuzzy
@@ -382,6 +382,8 @@ _require_xfs_stress_scrub() {
 _require_xfs_stress_online_repair() {
 	_require_xfs_stress_scrub
 	_require_xfs_io_command "repair"
+	command -v _require_xfs_io_error_injection &>/dev/null || \
+		_notrun 'xfs repair stress test requires common/inject'
 	_require_xfs_io_error_injection "force_repair"
 	_require_freeze
 }


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 07/16] fuzzy: give each test local control over what scrub stress tests get run
  2022-12-30 22:12 ` [PATCHSET v24.0 00/16] fstests: refactor online fsck stress tests Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 03/16] xfs/422: rework feature detection so we only test-format scratch once Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 01/16] xfs/422: create a new test group for fsstress/repair racers Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 04/16] fuzzy: clean up scrub stress programs quietly Darrick J. Wong
                     ` (12 subsequent siblings)
  15 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

Now that we've hoisted the scrub stress code to common/fuzzy, introduce
argument parsing so that each test can specify what it wants to test.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 common/fuzzy  |   39 +++++++++++++++++++++++++++++++++++----
 tests/xfs/422 |    2 +-
 2 files changed, 36 insertions(+), 5 deletions(-)


diff --git a/common/fuzzy b/common/fuzzy
index de9e398984..88ba5fef69 100644
--- a/common/fuzzy
+++ b/common/fuzzy
@@ -348,12 +348,19 @@ __stress_scrub_freeze_loop() {
 	done
 }
 
-# Run xfs online fsck commands in a tight loop.
-__stress_scrub_loop() {
+# Run individual XFS online fsck commands in a tight loop with xfs_io.
+__stress_one_scrub_loop() {
 	local end="$1"
+	local scrub_tgt="$2"
+	shift; shift
+
+	local xfs_io_args=()
+	for arg in "$@"; do
+		xfs_io_args+=('-c' "$arg")
+	done
 
 	while [ "$(date +%s)" -lt $end ]; do
-		$XFS_IO_PROG -x -c 'repair rmapbt 0' -c 'repair rmapbt 1' $SCRATCH_MNT 2>&1 | \
+		$XFS_IO_PROG -x "${xfs_io_args[@]}" "$scrub_tgt" 2>&1 | \
 			__stress_scrub_filter_output
 	done
 }
@@ -390,6 +397,8 @@ _require_xfs_stress_online_repair() {
 
 # Clean up after the loops in case they didn't do it themselves.
 _scratch_xfs_stress_scrub_cleanup() {
+	echo "Cleaning up scrub stress run at $(date)" >> $seqres.full
+
 	# Send SIGINT so that bash won't print a 'Terminated' message that
 	# distorts the golden output.
 	$KILLALL_PROG -INT xfs_io fsstress >> $seqres.full 2>&1
@@ -399,7 +408,25 @@ _scratch_xfs_stress_scrub_cleanup() {
 # Start scrub, freeze, and fsstress in background looping processes, and wait
 # for 30*TIME_FACTOR seconds to see if the filesystem goes down.  Callers
 # must call _scratch_xfs_stress_scrub_cleanup from their cleanup functions.
+#
+# Various options include:
+#
+# -s	Pass this command to xfs_io to test scrub.  If zero -s options are
+#	specified, xfs_io will not be run.
+# -t	Run online scrub against this file; $SCRATCH_MNT is the default.
 _scratch_xfs_stress_scrub() {
+	local one_scrub_args=()
+	local scrub_tgt="$SCRATCH_MNT"
+
+	OPTIND=1
+	while getopts "s:t:" c; do
+		case "$c" in
+			s) one_scrub_args+=("$OPTARG");;
+			t) scrub_tgt="$OPTARG";;
+			*) return 1; ;;
+		esac
+	done
+
 	local start="$(date +%s)"
 	local end="$((start + (30 * TIME_FACTOR) ))"
 
@@ -408,7 +435,11 @@ _scratch_xfs_stress_scrub() {
 
 	__stress_scrub_fsstress_loop $end &
 	__stress_scrub_freeze_loop $end &
-	__stress_scrub_loop $end &
+
+	if [ "${#one_scrub_args[@]}" -gt 0 ]; then
+		__stress_one_scrub_loop "$end" "$scrub_tgt" \
+				"${one_scrub_args[@]}" &
+	fi
 
 	# Wait until 2 seconds after the loops should have finished, then
 	# clean up after ourselves.
diff --git a/tests/xfs/422 b/tests/xfs/422
index b3353d2202..faea5d6792 100755
--- a/tests/xfs/422
+++ b/tests/xfs/422
@@ -31,7 +31,7 @@ _require_xfs_stress_online_repair
 _scratch_mkfs > "$seqres.full" 2>&1
 _scratch_mount
 _require_xfs_has_feature "$SCRATCH_MNT" rmapbt
-_scratch_xfs_stress_online_repair
+_scratch_xfs_stress_online_repair -s "repair rmapbt 0" -s "repair rmapbt 1"
 
 # success, all done
 echo Silence is golden


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 08/16] fuzzy: test the scrub stress subcommands before looping
  2022-12-30 22:12 ` [PATCHSET v24.0 00/16] fstests: refactor online fsck stress tests Darrick J. Wong
                     ` (9 preceding siblings ...)
  2022-12-30 22:12   ` [PATCH 09/16] fuzzy: make scrub stress loop control more robust Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 13/16] fuzzy: clean up frozen fses after scrub stress testing Darrick J. Wong
                     ` (4 subsequent siblings)
  15 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

Before we commit to running fsstress and scrub commands in a loop for
some time, we should check that the provided commands actually work on
the scratch filesystem.  The _require_xfs_io_command predicate only
detects the presence of the scrub ioctl, not any particular subcommand.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 common/fuzzy |   21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)


diff --git a/common/fuzzy b/common/fuzzy
index 88ba5fef69..8d3e30e32b 100644
--- a/common/fuzzy
+++ b/common/fuzzy
@@ -405,6 +405,25 @@ _scratch_xfs_stress_scrub_cleanup() {
 	$XFS_IO_PROG -x -c 'thaw' $SCRATCH_MNT >> $seqres.full 2>&1
 }
 
+# Make sure the provided scrub/repair commands actually work on the scratch
+# filesystem before we start running them in a loop.
+__stress_scrub_check_commands() {
+	local scrub_tgt="$1"
+	shift
+
+	for arg in "$@"; do
+		testio=`$XFS_IO_PROG -x -c "$arg" $scrub_tgt 2>&1`
+		echo $testio | grep -q "Unknown type" && \
+			_notrun "xfs_io scrub subcommand support is missing"
+		echo $testio | grep -q "Inappropriate ioctl" && \
+			_notrun "kernel scrub ioctl is missing"
+		echo $testio | grep -q "No such file or directory" && \
+			_notrun "kernel does not know about: $arg"
+		echo $testio | grep -q "Operation not supported" && \
+			_notrun "kernel does not support: $arg"
+	done
+}
+
 # Start scrub, freeze, and fsstress in background looping processes, and wait
 # for 30*TIME_FACTOR seconds to see if the filesystem goes down.  Callers
 # must call _scratch_xfs_stress_scrub_cleanup from their cleanup functions.
@@ -427,6 +446,8 @@ _scratch_xfs_stress_scrub() {
 		esac
 	done
 
+	__stress_scrub_check_commands "$scrub_tgt" "${one_scrub_args[@]}"
+
 	local start="$(date +%s)"
 	local end="$((start + (30 * TIME_FACTOR) ))"
 


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 09/16] fuzzy: make scrub stress loop control more robust
  2022-12-30 22:12 ` [PATCHSET v24.0 00/16] fstests: refactor online fsck stress tests Darrick J. Wong
                     ` (8 preceding siblings ...)
  2022-12-30 22:12   ` [PATCH 11/16] fuzzy: clear out the scratch filesystem if it's too full Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 08/16] fuzzy: test the scrub stress subcommands before looping Darrick J. Wong
                     ` (5 subsequent siblings)
  15 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

Currently, each of the scrub stress testing background threads
open-codes logic to decide if it should exit the loop.  This decision is
based entirely on TIME_FACTOR*30 seconds having gone by, which means
that we ignore external factors, such as the user pressing ^C, which (in
theory) will invoke cleanup functions to tear everything down.

This is not a great user experience, so refactor the loop exit test into
a helper function and establish a sentinel file that must be present to
continue looping.  If the user presses ^C, the cleanup function will
remove the sentinel file and kill the background thread children, which
should be enough to stop everything more or less immediately.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 common/fuzzy |   39 ++++++++++++++++++++++++++++-----------
 1 file changed, 28 insertions(+), 11 deletions(-)


diff --git a/common/fuzzy b/common/fuzzy
index 8d3e30e32b..6519d5c1e2 100644
--- a/common/fuzzy
+++ b/common/fuzzy
@@ -338,11 +338,18 @@ __stress_scrub_filter_output() {
 		    -e '/No space left on device/d'
 }
 
+# Decide if we want to keep running stress tests.  The first argument is the
+# stop time, and second argument is the path to the sentinel file.
+__stress_scrub_running() {
+	test -e "$2" && test "$(date +%s)" -lt "$1"
+}
+
 # Run fs freeze and thaw in a tight loop.
 __stress_scrub_freeze_loop() {
 	local end="$1"
+	local runningfile="$2"
 
-	while [ "$(date +%s)" -lt $end ]; do
+	while __stress_scrub_running "$end" "$runningfile"; do
 		$XFS_IO_PROG -x -c 'freeze' -c 'thaw' $SCRATCH_MNT 2>&1 | \
 			__stress_freeze_filter_output
 	done
@@ -351,15 +358,16 @@ __stress_scrub_freeze_loop() {
 # Run individual XFS online fsck commands in a tight loop with xfs_io.
 __stress_one_scrub_loop() {
 	local end="$1"
-	local scrub_tgt="$2"
-	shift; shift
+	local runningfile="$2"
+	local scrub_tgt="$3"
+	shift; shift; shift
 
 	local xfs_io_args=()
 	for arg in "$@"; do
 		xfs_io_args+=('-c' "$arg")
 	done
 
-	while [ "$(date +%s)" -lt $end ]; do
+	while __stress_scrub_running "$end" "$runningfile"; do
 		$XFS_IO_PROG -x "${xfs_io_args[@]}" "$scrub_tgt" 2>&1 | \
 			__stress_scrub_filter_output
 	done
@@ -368,12 +376,16 @@ __stress_one_scrub_loop() {
 # Run fsstress while we're testing online fsck.
 __stress_scrub_fsstress_loop() {
 	local end="$1"
+	local runningfile="$2"
 
 	local args=$(_scale_fsstress_args -p 4 -d $SCRATCH_MNT -n 2000 $FSSTRESS_AVOID)
+	echo "Running $FSSTRESS_PROG $args" >> $seqres.full
 
-	while [ "$(date +%s)" -lt $end ]; do
+	while __stress_scrub_running "$end" "$runningfile"; do
 		$FSSTRESS_PROG $args >> $seqres.full
+		echo "fsstress exits with $? at $(date)" >> $seqres.full
 	done
+	rm -f "$runningfile"
 }
 
 # Make sure we have everything we need to run stress and scrub
@@ -397,6 +409,7 @@ _require_xfs_stress_online_repair() {
 
 # Clean up after the loops in case they didn't do it themselves.
 _scratch_xfs_stress_scrub_cleanup() {
+	rm -f "$runningfile"
 	echo "Cleaning up scrub stress run at $(date)" >> $seqres.full
 
 	# Send SIGINT so that bash won't print a 'Terminated' message that
@@ -436,6 +449,10 @@ __stress_scrub_check_commands() {
 _scratch_xfs_stress_scrub() {
 	local one_scrub_args=()
 	local scrub_tgt="$SCRATCH_MNT"
+	local runningfile="$tmp.fsstress"
+
+	rm -f "$runningfile"
+	touch "$runningfile"
 
 	OPTIND=1
 	while getopts "s:t:" c; do
@@ -454,17 +471,17 @@ _scratch_xfs_stress_scrub() {
 	echo "Loop started at $(date --date="@${start}")," \
 		   "ending at $(date --date="@${end}")" >> $seqres.full
 
-	__stress_scrub_fsstress_loop $end &
-	__stress_scrub_freeze_loop $end &
+	__stress_scrub_fsstress_loop "$end" "$runningfile" &
+	__stress_scrub_freeze_loop "$end" "$runningfile" &
 
 	if [ "${#one_scrub_args[@]}" -gt 0 ]; then
-		__stress_one_scrub_loop "$end" "$scrub_tgt" \
+		__stress_one_scrub_loop "$end" "$runningfile" "$scrub_tgt" \
 				"${one_scrub_args[@]}" &
 	fi
 
-	# Wait until 2 seconds after the loops should have finished, then
-	# clean up after ourselves.
-	while [ "$(date +%s)" -lt $((end + 2)) ]; do
+	# Wait until the designated end time or fsstress dies, then kill all of
+	# our background processes.
+	while __stress_scrub_running "$end" "$runningfile"; do
 		sleep 1
 	done
 	_scratch_xfs_stress_scrub_cleanup
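
Distilled out of the helpers above, the sentinel pattern is roughly this
(a minimal sketch; the names and the trap are illustrative, the real code
uses the fstests cleanup hooks instead):

    runningfile="$tmp.running"
    touch "$runningfile"
    # ^C removes the sentinel...
    trap 'rm -f "$runningfile"' INT
    # ...so every background loop notices within one iteration and exits.
    while [ -e "$runningfile" ] && [ "$(date +%s)" -lt "$end" ]; do
        run_one_round    # stand-in for the fsstress/freeze/scrub work
    done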


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 10/16] fuzzy: abort scrub stress testing if the scratch fs went down
  2022-12-30 22:12 ` [PATCHSET v24.0 00/16] fstests: refactor online fsck stress tests Darrick J. Wong
                     ` (11 preceding siblings ...)
  2022-12-30 22:12   ` [PATCH 13/16] fuzzy: clean up frozen fses after scrub stress testing Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 14/16] fuzzy: make freezing optional for scrub stress tests Darrick J. Wong
                     ` (2 subsequent siblings)
  15 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

There's no point in continuing a stress test of online fsck if the
filesystem goes down.  We can't query that kind of state directly, so as
a proxy we try to stat the mountpoint and interpret any error return as
a sign that the fs is down.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 common/fuzzy |   13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)


diff --git a/common/fuzzy b/common/fuzzy
index 6519d5c1e2..f1bc2dc756 100644
--- a/common/fuzzy
+++ b/common/fuzzy
@@ -338,10 +338,17 @@ __stress_scrub_filter_output() {
 		    -e '/No space left on device/d'
 }
 
+# Decide if the scratch filesystem is still alive.
+__stress_scrub_scratch_alive() {
+	# If we can't stat the scratch filesystem, there's a reasonably good
+	# chance that the fs shut down, which is not good.
+	stat "$SCRATCH_MNT" &>/dev/null
+}
+
 # Decide if we want to keep running stress tests.  The first argument is the
 # stop time, and second argument is the path to the sentinel file.
 __stress_scrub_running() {
-	test -e "$2" && test "$(date +%s)" -lt "$1"
+	test -e "$2" && test "$(date +%s)" -lt "$1" && __stress_scrub_scratch_alive
 }
 
 # Run fs freeze and thaw in a tight loop.
@@ -486,6 +493,10 @@ _scratch_xfs_stress_scrub() {
 	done
 	_scratch_xfs_stress_scrub_cleanup
 
+	# Warn the user if we think the scratch filesystem went down.
+	__stress_scrub_scratch_alive || \
+		echo "Did the scratch filesystem die?"
+
 	echo "Loop finished at $(date)" >> $seqres.full
 }
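
A quick way to see the proxy at work, if you can afford to shut down a
test filesystem (a sketch; after a forced shutdown most operations,
including stat of the mountpoint, should fail with EIO):

    $XFS_IO_PROG -x -c 'shutdown -f' $SCRATCH_MNT
    stat $SCRATCH_MNT || echo "scratch fs looks dead"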
 


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 11/16] fuzzy: clear out the scratch filesystem if it's too full
  2022-12-30 22:12 ` [PATCHSET v24.0 00/16] fstests: refactor online fsck stress tests Darrick J. Wong
                     ` (7 preceding siblings ...)
  2022-12-30 22:12   ` [PATCH 12/16] fuzzy: increase operation count for each fsstress invocation Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 09/16] fuzzy: make scrub stress loop control more robust Darrick J. Wong
                     ` (6 subsequent siblings)
  15 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

If the online fsck stress tests run for long enough, they'll fill up the
scratch filesystem completely.  While it is interesting to test repair
functionality on a *nearly* full filesystem undergoing a heavy workload,
a totally full filesystem is really only exercising the ENOSPC handlers
in the kernel.  That's not what we came here to test, so change the
fsstress loop to detect a nearly full filesystem and erase everything
before starting fsstress again.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 common/fuzzy |   16 ++++++++++++++++
 1 file changed, 16 insertions(+)


diff --git a/common/fuzzy b/common/fuzzy
index f1bc2dc756..01cf7f00d8 100644
--- a/common/fuzzy
+++ b/common/fuzzy
@@ -380,6 +380,20 @@ __stress_one_scrub_loop() {
 	done
 }
 
+# Clean the scratch filesystem between rounds of fsstress if there is 2%
+# available space or less because that isn't an interesting stress test.
+#
+# Returns 0 if we cleared anything, and 1 if we did nothing.
+__stress_scrub_clean_scratch() {
+	local used_pct="$(_used $SCRATCH_DEV)"
+
+	test "$used_pct" -lt 98 && return 1
+
+	echo "Clearing scratch fs at $(date)" >> $seqres.full
+	rm -r -f $SCRATCH_MNT/p*
+	return 0
+}
+
 # Run fsstress while we're testing online fsck.
 __stress_scrub_fsstress_loop() {
 	local end="$1"
@@ -389,6 +403,8 @@ __stress_scrub_fsstress_loop() {
 	echo "Running $FSSTRESS_PROG $args" >> $seqres.full
 
 	while __stress_scrub_running "$end" "$runningfile"; do
+		# Need to recheck running conditions if we cleared anything
+		__stress_scrub_clean_scratch && continue
 		$FSSTRESS_PROG $args >> $seqres.full
 		echo "fsstress exits with $? at $(date)" >> $seqres.full
 	done
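
The rm pattern relies on fsstress keeping everything it creates under
per-process directories (p0, p1, ...) below the -d directory, so with
-p 4 the scratch fs looks roughly like the sketch below, and
"rm -r -f $SCRATCH_MNT/p*" clears the generated data without touching
anything else a test may have staged there:

    $SCRATCH_MNT/p0/  $SCRATCH_MNT/p1/  $SCRATCH_MNT/p2/  $SCRATCH_MNT/p3/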


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 12/16] fuzzy: increase operation count for each fsstress invocation
  2022-12-30 22:12 ` [PATCHSET v24.0 00/16] fstests: refactor online fsck stress tests Darrick J. Wong
                     ` (6 preceding siblings ...)
  2022-12-30 22:12   ` [PATCH 02/16] xfs/422: move the fsstress/freeze/scrub racing logic to common/fuzzy Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  2023-01-13 19:55     ` Zorro Lang
  2022-12-30 22:12   ` [PATCH 11/16] fuzzy: clear out the scratch filesystem if it's too full Darrick J. Wong
                     ` (7 subsequent siblings)
  15 siblings, 1 reply; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

For online fsck stress testing, increase the number of filesystem
operations per fsstress run to 2 million, now that we have the ability
to kill fsstress if the user should push ^C to abort the test early.
This should guarantee a couple of hours of continuous stress testing in
between clearing the scratch filesystem.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 common/fuzzy |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)


diff --git a/common/fuzzy b/common/fuzzy
index 01cf7f00d8..3e23edc9e4 100644
--- a/common/fuzzy
+++ b/common/fuzzy
@@ -399,7 +399,9 @@ __stress_scrub_fsstress_loop() {
 	local end="$1"
 	local runningfile="$2"
 
-	local args=$(_scale_fsstress_args -p 4 -d $SCRATCH_MNT -n 2000 $FSSTRESS_AVOID)
+	# As of March 2022, 2 million fsstress ops should be enough to keep
+	# any filesystem busy for a couple of hours.
+	local args=$(_scale_fsstress_args -p 4 -d $SCRATCH_MNT -n 2000000 $FSSTRESS_AVOID)
 	echo "Running $FSSTRESS_PROG $args" >> $seqres.full
 
 	while __stress_scrub_running "$end" "$runningfile"; do


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 13/16] fuzzy: clean up frozen fses after scrub stress testing
  2022-12-30 22:12 ` [PATCHSET v24.0 00/16] fstests: refactor online fsck stress tests Darrick J. Wong
                     ` (10 preceding siblings ...)
  2022-12-30 22:12   ` [PATCH 08/16] fuzzy: test the scrub stress subcommands before looping Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 10/16] fuzzy: abort scrub stress testing if the scratch fs went down Darrick J. Wong
                     ` (3 subsequent siblings)
  15 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

Some of our scrub stress tests involve racing scrub, fsstress, and a
program that repeatedly freezes and thaws the scratch filesystem.  The
current cleanup code suffers from the deficiency that it doesn't
actually wait for the child processes to exit.  First, change it to do
that.

However, that exposes a second problem: there's a race condition with a
freezer process that leads to the stress test exiting with a frozen fs.
If the freezer process is blocked trying to acquire the unmount or
sb_write locks, the receipt of a signal (even a fatal one) doesn't cause
it to abort the freeze.  This causes further problems with fstests,
since ./check doesn't expect to regain control with the scratch fs
frozen.

Fix both problems by making the cleanup function smarter.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 common/fuzzy |   35 ++++++++++++++++++++++++++++++++++-
 1 file changed, 34 insertions(+), 1 deletion(-)


diff --git a/common/fuzzy b/common/fuzzy
index 3e23edc9e4..0f6fc91b80 100644
--- a/common/fuzzy
+++ b/common/fuzzy
@@ -439,8 +439,39 @@ _scratch_xfs_stress_scrub_cleanup() {
 
 	# Send SIGINT so that bash won't print a 'Terminated' message that
 	# distorts the golden output.
+	echo "Killing stressor processes at $(date)" >> $seqres.full
 	$KILLALL_PROG -INT xfs_io fsstress >> $seqres.full 2>&1
-	$XFS_IO_PROG -x -c 'thaw' $SCRATCH_MNT >> $seqres.full 2>&1
+
+	# Tests are not allowed to exit with the scratch fs frozen.  If we
+	# started a fs freeze/thaw background loop, wait for that loop to exit
+	# and then thaw the filesystem.  Cleanup for the freeze loop must be
+	# performed prior to waiting for the other children to avoid triggering
+	# a race condition that can hang fstests.
+	#
+	# If the xfs_io -c freeze process is asleep waiting for a write lock on
+	# s_umount or sb_write when the killall signal is delivered, it will
+	# not check for pending signals until after it has frozen the fs.  If
+	# even one thread of the stress test processes (xfs_io, fsstress, etc.)
+	# is waiting for read locks on sb_write when the killall signals are
+	# delivered, they will block in the kernel until someone thaws the fs,
+	# and the `wait' below will wait forever.
+	#
+	# Hence we issue the killall, wait for the freezer loop to exit, thaw
+	# the filesystem, and wait for the rest of the children.
+	if [ -n "$__SCRUB_STRESS_FREEZE_PID" ]; then
+		echo "Waiting for fs freezer $__SCRUB_STRESS_FREEZE_PID to exit at $(date)" >> $seqres.full
+		wait "$__SCRUB_STRESS_FREEZE_PID"
+
+		echo "Thawing filesystem at $(date)" >> $seqres.full
+		$XFS_IO_PROG -x -c 'thaw' $SCRATCH_MNT >> $seqres.full 2>&1
+		__SCRUB_STRESS_FREEZE_PID=""
+	fi
+
+	# Wait for the remaining children to exit.
+	echo "Waiting for children to exit at $(date)" >> $seqres.full
+	wait
+
+	echo "Cleanup finished at $(date)" >> $seqres.full
 }
 
 # Make sure the provided scrub/repair commands actually work on the scratch
@@ -476,6 +507,7 @@ _scratch_xfs_stress_scrub() {
 	local scrub_tgt="$SCRATCH_MNT"
 	local runningfile="$tmp.fsstress"
 
+	__SCRUB_STRESS_FREEZE_PID=""
 	rm -f "$runningfile"
 	touch "$runningfile"
 
@@ -498,6 +530,7 @@ _scratch_xfs_stress_scrub() {
 
 	__stress_scrub_fsstress_loop "$end" "$runningfile" &
 	__stress_scrub_freeze_loop "$end" "$runningfile" &
+	__SCRUB_STRESS_FREEZE_PID="$!"
 
 	if [ "${#one_scrub_args[@]}" -gt 0 ]; then
 		__stress_one_scrub_loop "$end" "$runningfile" "$scrub_tgt" \
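
Stripped of the logging, the ordering that the comment above describes
boils down to this sketch:

    killall -INT xfs_io fsstress       # ask everyone to stop
    wait "$freeze_pid"                 # reap only the freezer loop
    xfs_io -x -c 'thaw' $SCRATCH_MNT   # unblock anyone stuck on sb_write
    wait                               # now the rest can exit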


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 14/16] fuzzy: make freezing optional for scrub stress tests
  2022-12-30 22:12 ` [PATCHSET v24.0 00/16] fstests: refactor online fsck stress tests Darrick J. Wong
                     ` (12 preceding siblings ...)
  2022-12-30 22:12   ` [PATCH 10/16] fuzzy: abort scrub stress testing if the scratch fs went down Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 15/16] fuzzy: allow substitution of AG numbers when configuring scrub stress test Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 16/16] fuzzy: delay the start of the scrub loop when stress-testing scrub Darrick J. Wong
  15 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

Make the freeze/thaw loop optional, since that's a significant change in
behavior if it's enabled.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 common/fuzzy  |   13 ++++++++++---
 tests/xfs/422 |    2 +-
 2 files changed, 11 insertions(+), 4 deletions(-)


diff --git a/common/fuzzy b/common/fuzzy
index 0f6fc91b80..219dd3bb0a 100644
--- a/common/fuzzy
+++ b/common/fuzzy
@@ -499,6 +499,8 @@ __stress_scrub_check_commands() {
 #
 # Various options include:
 #
+# -f	Run a freeze/thaw loop while we're doing other things.  Defaults to
+#	disabled, unless XFS_SCRUB_STRESS_FREEZE is set.
 # -s	Pass this command to xfs_io to test scrub.  If zero -s options are
 #	specified, xfs_io will not be run.
 # -t	Run online scrub against this file; $SCRATCH_MNT is the default.
@@ -506,14 +508,16 @@ _scratch_xfs_stress_scrub() {
 	local one_scrub_args=()
 	local scrub_tgt="$SCRATCH_MNT"
 	local runningfile="$tmp.fsstress"
+	local freeze="${XFS_SCRUB_STRESS_FREEZE}"
 
 	__SCRUB_STRESS_FREEZE_PID=""
 	rm -f "$runningfile"
 	touch "$runningfile"
 
 	OPTIND=1
-	while getopts "s:t:" c; do
+	while getopts "fs:t:" c; do
 		case "$c" in
+			f) freeze=yes;;
 			s) one_scrub_args+=("$OPTARG");;
 			t) scrub_tgt="$OPTARG";;
 			*) return 1; ;;
@@ -529,8 +533,11 @@ _scratch_xfs_stress_scrub() {
 		   "ending at $(date --date="@${end}")" >> $seqres.full
 
 	__stress_scrub_fsstress_loop "$end" "$runningfile" &
-	__stress_scrub_freeze_loop "$end" "$runningfile" &
-	__SCRUB_STRESS_FREEZE_PID="$!"
+
+	if [ -n "$freeze" ]; then
+		__stress_scrub_freeze_loop "$end" "$runningfile" &
+		__SCRUB_STRESS_FREEZE_PID="$!"
+	fi
 
 	if [ "${#one_scrub_args[@]}" -gt 0 ]; then
 		__stress_one_scrub_loop "$end" "$runningfile" "$scrub_tgt" \
diff --git a/tests/xfs/422 b/tests/xfs/422
index faea5d6792..ac88713257 100755
--- a/tests/xfs/422
+++ b/tests/xfs/422
@@ -31,7 +31,7 @@ _require_xfs_stress_online_repair
 _scratch_mkfs > "$seqres.full" 2>&1
 _scratch_mount
 _require_xfs_has_feature "$SCRATCH_MNT" rmapbt
-_scratch_xfs_stress_online_repair -s "repair rmapbt 0" -s "repair rmapbt 1"
+_scratch_xfs_stress_online_repair -f -s "repair rmapbt 0" -s "repair rmapbt 1"
 
 # success, all done
 echo Silence is golden
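
Tests can opt in per invocation, or freezing can be forced through the
environment for tests that rely on the default (a sketch; the first line
is what xfs/422 now does):

    _scratch_xfs_stress_online_repair -f -s "repair rmapbt 0" \
            -s "repair rmapbt 1"
    XFS_SCRUB_STRESS_FREEZE=yes ./check xfs/422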


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 15/16] fuzzy: allow substitution of AG numbers when configuring scrub stress test
  2022-12-30 22:12 ` [PATCHSET v24.0 00/16] fstests: refactor online fsck stress tests Darrick J. Wong
                     ` (13 preceding siblings ...)
  2022-12-30 22:12   ` [PATCH 14/16] fuzzy: make freezing optional for scrub stress tests Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 16/16] fuzzy: delay the start of the scrub loop when stress-testing scrub Darrick J. Wong
  15 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

Allow the test program to use the metavariable '%agno%' when passing
scrub commands to the scrub stress loop.  This makes it easier for tests
to scrub or repair every AG in the filesystem without a lot of work.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 common/fuzzy  |   14 ++++++++++++--
 tests/xfs/422 |    2 +-
 2 files changed, 13 insertions(+), 3 deletions(-)


diff --git a/common/fuzzy b/common/fuzzy
index 219dd3bb0a..e42e2ccec1 100644
--- a/common/fuzzy
+++ b/common/fuzzy
@@ -368,10 +368,19 @@ __stress_one_scrub_loop() {
 	local runningfile="$2"
 	local scrub_tgt="$3"
 	shift; shift; shift
+	local agcount="$(_xfs_mount_agcount $SCRATCH_MNT)"
 
 	local xfs_io_args=()
 	for arg in "$@"; do
-		xfs_io_args+=('-c' "$arg")
+		if echo "$arg" | grep -q -w '%agno%'; then
+			# Substitute the AG number
+			for ((agno = 0; agno < agcount; agno++)); do
+				local ag_arg="$(echo "$arg" | sed -e "s|%agno%|$agno|g")"
+				xfs_io_args+=('-c' "$ag_arg")
+			done
+		else
+			xfs_io_args+=('-c' "$arg")
+		fi
 	done
 
 	while __stress_scrub_running "$end" "$runningfile"; do
@@ -481,7 +490,8 @@ __stress_scrub_check_commands() {
 	shift
 
 	for arg in "$@"; do
-		testio=`$XFS_IO_PROG -x -c "$arg" $scrub_tgt 2>&1`
+		local cooked_arg="$(echo "$arg" | sed -e "s/%agno%/0/g")"
+		testio=`$XFS_IO_PROG -x -c "$cooked_arg" $scrub_tgt 2>&1`
 		echo $testio | grep -q "Unknown type" && \
 			_notrun "xfs_io scrub subcommand support is missing"
 		echo $testio | grep -q "Inappropriate ioctl" && \
diff --git a/tests/xfs/422 b/tests/xfs/422
index ac88713257..995f612166 100755
--- a/tests/xfs/422
+++ b/tests/xfs/422
@@ -31,7 +31,7 @@ _require_xfs_stress_online_repair
 _scratch_mkfs > "$seqres.full" 2>&1
 _scratch_mount
 _require_xfs_has_feature "$SCRATCH_MNT" rmapbt
-_scratch_xfs_stress_online_repair -f -s "repair rmapbt 0" -s "repair rmapbt 1"
+_scratch_xfs_stress_online_repair -f -s "repair rmapbt %agno%"
 
 # success, all done
 echo Silence is golden
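
On a scratch filesystem with four AGs, the single -s option above expands
to roughly this xfs_io invocation (a sketch), while the pre-flight check
substitutes AG 0 for %agno% so that it only probes one AG:

    $XFS_IO_PROG -x -c 'repair rmapbt 0' -c 'repair rmapbt 1' \
        -c 'repair rmapbt 2' -c 'repair rmapbt 3' $SCRATCH_MNT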


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 16/16] fuzzy: delay the start of the scrub loop when stress-testing scrub
  2022-12-30 22:12 ` [PATCHSET v24.0 00/16] fstests: refactor online fsck stress tests Darrick J. Wong
                     ` (14 preceding siblings ...)
  2022-12-30 22:12   ` [PATCH 15/16] fuzzy: allow substitution of AG numbers when configuring scrub stress test Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  15 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

By default, online fsck stress testing kicks off the loops for fsstress
and online fsck at the same time.  However, in certain debugging
scenarios it can help if we let fsstress get a head-start in filling up
the filesystem.  Plumb in a means to delay the start of the scrub loop.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 common/fuzzy |   19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)


diff --git a/common/fuzzy b/common/fuzzy
index e42e2ccec1..1df51a6dd8 100644
--- a/common/fuzzy
+++ b/common/fuzzy
@@ -367,7 +367,8 @@ __stress_one_scrub_loop() {
 	local end="$1"
 	local runningfile="$2"
 	local scrub_tgt="$3"
-	shift; shift; shift
+	local scrub_startat="$4"
+	shift; shift; shift; shift
 	local agcount="$(_xfs_mount_agcount $SCRATCH_MNT)"
 
 	local xfs_io_args=()
@@ -383,6 +384,10 @@ __stress_one_scrub_loop() {
 		fi
 	done
 
+	while __stress_scrub_running "$scrub_startat" "$runningfile"; do
+		sleep 1
+	done
+
 	while __stress_scrub_running "$end" "$runningfile"; do
 		$XFS_IO_PROG -x "${xfs_io_args[@]}" "$scrub_tgt" 2>&1 | \
 			__stress_scrub_filter_output
@@ -514,22 +519,27 @@ __stress_scrub_check_commands() {
 # -s	Pass this command to xfs_io to test scrub.  If zero -s options are
 #	specified, xfs_io will not be run.
 # -t	Run online scrub against this file; $SCRATCH_MNT is the default.
+# -w	Delay the start of the scrub/repair loop by this number of seconds.
+#	Defaults to no delay unless XFS_SCRUB_STRESS_DELAY is set.  This value
+#	will be clamped to ten seconds before the end time.
 _scratch_xfs_stress_scrub() {
 	local one_scrub_args=()
 	local scrub_tgt="$SCRATCH_MNT"
 	local runningfile="$tmp.fsstress"
 	local freeze="${XFS_SCRUB_STRESS_FREEZE}"
+	local scrub_delay="${XFS_SCRUB_STRESS_DELAY:--1}"
 
 	__SCRUB_STRESS_FREEZE_PID=""
 	rm -f "$runningfile"
 	touch "$runningfile"
 
 	OPTIND=1
-	while getopts "fs:t:" c; do
+	while getopts "fs:t:w:" c; do
 		case "$c" in
 			f) freeze=yes;;
 			s) one_scrub_args+=("$OPTARG");;
 			t) scrub_tgt="$OPTARG";;
+			w) scrub_delay="$OPTARG";;
 			*) return 1; ;;
 		esac
 	done
@@ -538,6 +548,9 @@ _scratch_xfs_stress_scrub() {
 
 	local start="$(date +%s)"
 	local end="$((start + (30 * TIME_FACTOR) ))"
+	local scrub_startat="$((start + scrub_delay))"
+	test "$scrub_startat" -gt "$((end - 10))" &&
+		scrub_startat="$((end - 10))"
 
 	echo "Loop started at $(date --date="@${start}")," \
 		   "ending at $(date --date="@${end}")" >> $seqres.full
@@ -551,7 +564,7 @@ _scratch_xfs_stress_scrub() {
 
 	if [ "${#one_scrub_args[@]}" -gt 0 ]; then
 		__stress_one_scrub_loop "$end" "$runningfile" "$scrub_tgt" \
-				"${one_scrub_args[@]}" &
+				"$scrub_startat" "${one_scrub_args[@]}" &
 	fi
 
 	# Wait until the designated end time or fsstress dies, then kill all of
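
Usage looks like this (a sketch); note that the clamp above guarantees at
least ten seconds of overlap between the scrub loop and the exerciser:

    # give fsstress a 10 second head start before the scrub loop begins
    _scratch_xfs_stress_scrub -w 10 -s "repair rmapbt %agno%"
    # or set a default delay for tests that don't pass -w themselves
    XFS_SCRUB_STRESS_DELAY=10 ./check xfs/422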


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCHSET v24.0 0/3] fstests: refactor GETFSMAP stress tests
  2022-12-30 21:13 [NYE DELUGE 1/4] xfs: all pending online scrub improvements Darrick J. Wong
                   ` (19 preceding siblings ...)
  2022-12-30 22:12 ` [PATCHSET v24.0 00/16] fstests: refactor online fsck stress tests Darrick J. Wong
@ 2022-12-30 22:12 ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 1/3] fuzzy: enhance scrub stress testing to use fsx Darrick J. Wong
                     ` (2 more replies)
  2022-12-30 22:13 ` [PATCHSET v24.0 0/2] fstests: race online scrub with mount state changes Darrick J. Wong
  2023-01-13 20:10 ` [NYE DELUGE 1/4] xfs: all pending online scrub improvements Zorro Lang
  22 siblings, 3 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

Hi all,

Refactor the fsmap racing tests to use the general scrub stress loop
infrastructure that we've now created, and then add a bit more
functionality so that we can also race against remounting the filesystem
readonly and readwrite.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=refactor-fsmap-stress
---
 common/fuzzy      |  161 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 ltp/fsstress.c    |   18 ++++++
 tests/xfs/517     |   91 +-----------------------------
 tests/xfs/517.out |    4 -
 tests/xfs/732     |   38 +++++++++++++
 tests/xfs/732.out |    2 +
 tests/xfs/847     |   38 +++++++++++++
 tests/xfs/847.out |    2 +
 tests/xfs/848     |   38 +++++++++++++
 tests/xfs/848.out |    2 +
 10 files changed, 300 insertions(+), 94 deletions(-)
 create mode 100755 tests/xfs/732
 create mode 100644 tests/xfs/732.out
 create mode 100755 tests/xfs/847
 create mode 100644 tests/xfs/847.out
 create mode 100755 tests/xfs/848
 create mode 100644 tests/xfs/848.out


^ permalink raw reply	[flat|nested] 220+ messages in thread

* [PATCH 1/3] fuzzy: enhance scrub stress testing to use fsx
  2022-12-30 22:12 ` [PATCHSET v24.0 0/3] fstests: refactor GETFSMAP stress tests Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  2023-01-05  5:49     ` Zorro Lang
  2023-01-05 18:28     ` [PATCH v24.1 " Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 3/3] xfs: race fsmap with readonly remounts to detect crash or livelock Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 2/3] fuzzy: refactor fsmap stress test to use our helper functions Darrick J. Wong
  2 siblings, 2 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

Add a couple of new online fsck stress tests that race fsx against
online fsck.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 common/fuzzy      |   39 ++++++++++++++++++++++++++++++++++++---
 tests/xfs/847     |   38 ++++++++++++++++++++++++++++++++++++++
 tests/xfs/847.out |    2 ++
 tests/xfs/848     |   38 ++++++++++++++++++++++++++++++++++++++
 tests/xfs/848.out |    2 ++
 5 files changed, 116 insertions(+), 3 deletions(-)
 create mode 100755 tests/xfs/847
 create mode 100644 tests/xfs/847.out
 create mode 100755 tests/xfs/848
 create mode 100644 tests/xfs/848.out


diff --git a/common/fuzzy b/common/fuzzy
index 1df51a6dd8..3512e95e02 100644
--- a/common/fuzzy
+++ b/common/fuzzy
@@ -408,6 +408,30 @@ __stress_scrub_clean_scratch() {
 	return 0
 }
 
+# Run fsx while we're testing online fsck.
+__stress_scrub_fsx_loop() {
+	local end="$1"
+	local runningfile="$2"
+	local focus=(-q -X)	# quiet, validate file contents
+
+	# As of November 2022, 2 million fsx ops should be enough to keep
+	# any filesystem busy for a couple of hours.
+	focus+=(-N 2000000)
+	focus+=(-o $((128000 * LOAD_FACTOR)) )
+	focus+=(-l $((600000 * LOAD_FACTOR)) )
+
+	local args="$FSX_AVOID ${focus[@]} ${SCRATCH_MNT}/fsx.$seq"
+	echo "Running $here/ltp/fsx $args" >> $seqres.full
+
+	while __stress_scrub_running "$end" "$runningfile"; do
+		# Need to recheck running conditions if we cleared anything
+		__stress_scrub_clean_scratch && continue
+		$here/ltp/fsx $args >> $seqres.full
+		echo "fsx exits with $? at $(date)" >> $seqres.full
+	done
+	rm -f "$runningfile"
+}
+
 # Run fsstress while we're testing online fsck.
 __stress_scrub_fsstress_loop() {
 	local end="$1"
@@ -454,7 +478,7 @@ _scratch_xfs_stress_scrub_cleanup() {
 	# Send SIGINT so that bash won't print a 'Terminated' message that
 	# distorts the golden output.
 	echo "Killing stressor processes at $(date)" >> $seqres.full
-	$KILLALL_PROG -INT xfs_io fsstress >> $seqres.full 2>&1
+	$KILLALL_PROG -INT xfs_io fsstress fsx >> $seqres.full 2>&1
 
 	# Tests are not allowed to exit with the scratch fs frozen.  If we
 	# started a fs freeze/thaw background loop, wait for that loop to exit
@@ -522,30 +546,39 @@ __stress_scrub_check_commands() {
 # -w	Delay the start of the scrub/repair loop by this number of seconds.
 #	Defaults to no delay unless XFS_SCRUB_STRESS_DELAY is set.  This value
 #	will be clamped to ten seconds before the end time.
+# -X	Run this program to exercise the filesystem.  Currently supported
+#       options are 'fsx' and 'fsstress'.  The default is 'fsstress'.
 _scratch_xfs_stress_scrub() {
 	local one_scrub_args=()
 	local scrub_tgt="$SCRATCH_MNT"
 	local runningfile="$tmp.fsstress"
 	local freeze="${XFS_SCRUB_STRESS_FREEZE}"
 	local scrub_delay="${XFS_SCRUB_STRESS_DELAY:--1}"
+	local exerciser="fsstress"
 
 	__SCRUB_STRESS_FREEZE_PID=""
 	rm -f "$runningfile"
 	touch "$runningfile"
 
 	OPTIND=1
-	while getopts "fs:t:w:" c; do
+	while getopts "fs:t:w:X:" c; do
 		case "$c" in
 			f) freeze=yes;;
 			s) one_scrub_args+=("$OPTARG");;
 			t) scrub_tgt="$OPTARG";;
 			w) scrub_delay="$OPTARG";;
+			X) exerciser="$OPTARG";;
 			*) return 1; ;;
 		esac
 	done
 
 	__stress_scrub_check_commands "$scrub_tgt" "${one_scrub_args[@]}"
 
+	if ! command -v "__stress_scrub_${exerciser}_loop" &>/dev/null; then
+		echo "${exerciser}: Unknown fs exercise program."
+		return 1
+	fi
+
 	local start="$(date +%s)"
 	local end="$((start + (30 * TIME_FACTOR) ))"
 	local scrub_startat="$((start + scrub_delay))"
@@ -555,7 +588,7 @@ _scratch_xfs_stress_scrub() {
 	echo "Loop started at $(date --date="@${start}")," \
 		   "ending at $(date --date="@${end}")" >> $seqres.full
 
-	__stress_scrub_fsstress_loop "$end" "$runningfile" &
+	"__stress_scrub_${exerciser}_loop" "$end" "$runningfile" &
 
 	if [ -n "$freeze" ]; then
 		__stress_scrub_freeze_loop "$end" "$runningfile" &
diff --git a/tests/xfs/847 b/tests/xfs/847
new file mode 100755
index 0000000000..856e9a6c26
--- /dev/null
+++ b/tests/xfs/847
@@ -0,0 +1,38 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle, Inc.  All Rights Reserved.
+#
+# FS QA Test No. 847
+#
+# Race fsx and xfs_scrub in read-only mode for a while to see if we crash
+# or livelock.
+#
+. ./common/preamble
+_begin_fstest scrub dangerous_fsstress_scrub
+
+_cleanup() {
+	cd /
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_stress_scrub
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_scratch_xfs_stress_scrub -S '-n' -X 'fsx'
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/847.out b/tests/xfs/847.out
new file mode 100644
index 0000000000..b7041db159
--- /dev/null
+++ b/tests/xfs/847.out
@@ -0,0 +1,2 @@
+QA output created by 847
+Silence is golden
diff --git a/tests/xfs/848 b/tests/xfs/848
new file mode 100755
index 0000000000..ab32020624
--- /dev/null
+++ b/tests/xfs/848
@@ -0,0 +1,38 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle, Inc.  All Rights Reserved.
+#
+# FS QA Test No. 848
+#
+# Race fsx and xfs_scrub in force-repair mode for a while to see if we
+# crash or livelock.
+#
+. ./common/preamble
+_begin_fstest online_repair dangerous_fsstress_repair
+
+_cleanup() {
+	cd /
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_stress_online_repair
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_scratch_xfs_stress_online_repair -S '-k' -X 'fsx'
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/848.out b/tests/xfs/848.out
new file mode 100644
index 0000000000..23f674045c
--- /dev/null
+++ b/tests/xfs/848.out
@@ -0,0 +1,2 @@
+QA output created by 848
+Silence is golden
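
With LOAD_FACTOR=1 and no FSX_AVOID set, each pass of the fsx loop
therefore runs roughly the following (the test number is illustrative);
-q keeps fsx quiet and -X makes it validate file contents, per the
comment in the hunk above:

    $here/ltp/fsx -q -X -N 2000000 -o 128000 -l 600000 $SCRATCH_MNT/fsx.847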


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 2/3] fuzzy: refactor fsmap stress test to use our helper functions
  2022-12-30 22:12 ` [PATCHSET v24.0 0/3] fstests: refactor GETFSMAP stress tests Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 1/3] fuzzy: enhance scrub stress testing to use fsx Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 3/3] xfs: race fsmap with readonly remounts to detect crash or livelock Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  2 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

Refactor xfs/517 (which races fsstress with fsmap) to use our new
control loop functions instead of open-coding everything.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 common/fuzzy      |   30 +++++++++++++++++
 tests/xfs/517     |   91 ++---------------------------------------------------
 tests/xfs/517.out |    4 +-
 3 files changed, 34 insertions(+), 91 deletions(-)


diff --git a/common/fuzzy b/common/fuzzy
index 3512e95e02..58e299d34b 100644
--- a/common/fuzzy
+++ b/common/fuzzy
@@ -362,6 +362,23 @@ __stress_scrub_freeze_loop() {
 	done
 }
 
+# Run individual xfs_io commands in a tight loop.
+__stress_xfs_io_loop() {
+	local end="$1"
+	local runningfile="$2"
+	shift; shift
+
+	local xfs_io_args=()
+	for arg in "$@"; do
+		xfs_io_args+=('-c' "$arg")
+	done
+
+	while __stress_scrub_running "$end" "$runningfile"; do
+		$XFS_IO_PROG -x "${xfs_io_args[@]}" "$SCRATCH_MNT" \
+				> /dev/null 2>> $seqres.full
+	done
+}
+
 # Run individual XFS online fsck commands in a tight loop with xfs_io.
 __stress_one_scrub_loop() {
 	local end="$1"
@@ -540,6 +557,10 @@ __stress_scrub_check_commands() {
 #
 # -f	Run a freeze/thaw loop while we're doing other things.  Defaults to
 #	disabled, unless XFS_SCRUB_STRESS_FREEZE is set.
+# -i	Pass this command to xfs_io to exercise something that is not scrub
+#	in a separate loop.  If zero -i options are specified, do not run.
+#	Callers must check each of these commands (via _require_xfs_io_command)
+#	before calling here.
 # -s	Pass this command to xfs_io to test scrub.  If zero -s options are
 #	specified, xfs_io will not be run.
 # -t	Run online scrub against this file; $SCRATCH_MNT is the default.
@@ -555,15 +576,17 @@ _scratch_xfs_stress_scrub() {
 	local freeze="${XFS_SCRUB_STRESS_FREEZE}"
 	local scrub_delay="${XFS_SCRUB_STRESS_DELAY:--1}"
 	local exerciser="fsstress"
+	local io_args=()
 
 	__SCRUB_STRESS_FREEZE_PID=""
 	rm -f "$runningfile"
 	touch "$runningfile"
 
 	OPTIND=1
-	while getopts "fs:t:w:X:" c; do
+	while getopts "fi:s:t:w:X:" c; do
 		case "$c" in
 			f) freeze=yes;;
+			i) io_args+=("$OPTARG");;
 			s) one_scrub_args+=("$OPTARG");;
 			t) scrub_tgt="$OPTARG";;
 			w) scrub_delay="$OPTARG";;
@@ -595,6 +618,11 @@ _scratch_xfs_stress_scrub() {
 		__SCRUB_STRESS_FREEZE_PID="$!"
 	fi
 
+	if [ "${#io_args[@]}" -gt 0 ]; then
+		__stress_xfs_io_loop "$end" "$runningfile" \
+				"${io_args[@]}" &
+	fi
+
 	if [ "${#one_scrub_args[@]}" -gt 0 ]; then
 		__stress_one_scrub_loop "$end" "$runningfile" "$scrub_tgt" \
 				"$scrub_startat" "${one_scrub_args[@]}" &
diff --git a/tests/xfs/517 b/tests/xfs/517
index 99fc89b05f..4481ba41da 100755
--- a/tests/xfs/517
+++ b/tests/xfs/517
@@ -11,29 +11,11 @@ _begin_fstest auto quick fsmap freeze
 
 _register_cleanup "_cleanup" BUS
 
-# First kill and wait the freeze loop so it won't try to freeze fs again
-# Then make sure fs is not frozen
-# Then kill and wait for the rest of the workers
-# Because if fs is frozen a killed writer will never exit
-kill_loops() {
-	local sig=$1
-
-	[ -n "$freeze_pid" ] && kill $sig $freeze_pid
-	wait $freeze_pid
-	unset freeze_pid
-	$XFS_IO_PROG -x -c 'thaw' $SCRATCH_MNT
-	[ -n "$stress_pid" ] && kill $sig $stress_pid
-	[ -n "$fsmap_pid" ] && kill $sig $fsmap_pid
-	wait
-	unset stress_pid
-	unset fsmap_pid
-}
-
 # Override the default cleanup function.
 _cleanup()
 {
-	kill_loops -9 > /dev/null 2>&1
 	cd /
+	_scratch_xfs_stress_scrub_cleanup
 	rm -rf $tmp.*
 }
 
@@ -46,78 +28,13 @@ _cleanup()
 _supported_fs xfs
 _require_xfs_scratch_rmapbt
 _require_xfs_io_command "fsmap"
-_require_command "$KILLALL_PROG" killall
-_require_freeze
+_require_xfs_stress_scrub
 
-echo "Format and populate"
 _scratch_mkfs > "$seqres.full" 2>&1
 _scratch_mount
-
-STRESS_DIR="$SCRATCH_MNT/testdir"
-mkdir -p $STRESS_DIR
-
-for i in $(seq 0 9); do
-	mkdir -p $STRESS_DIR/$i
-	for j in $(seq 0 9); do
-		mkdir -p $STRESS_DIR/$i/$j
-		for k in $(seq 0 9); do
-			echo x > $STRESS_DIR/$i/$j/$k
-		done
-	done
-done
-
-cpus=$(( $(src/feature -o) * 4 * LOAD_FACTOR))
-
-echo "Concurrent fsmap and freeze"
-filter_output() {
-	grep -E -v '(Device or resource busy|Invalid argument)'
-}
-freeze_loop() {
-	end="$1"
-
-	while [ "$(date +%s)" -lt $end ]; do
-		$XFS_IO_PROG -x -c 'freeze' $SCRATCH_MNT 2>&1 | filter_output
-		$XFS_IO_PROG -x -c 'thaw' $SCRATCH_MNT 2>&1 | filter_output
-	done
-}
-fsmap_loop() {
-	end="$1"
-
-	while [ "$(date +%s)" -lt $end ]; do
-		$XFS_IO_PROG -c 'fsmap -v' $SCRATCH_MNT > /dev/null
-	done
-}
-stress_loop() {
-	end="$1"
-
-	FSSTRESS_ARGS=$(_scale_fsstress_args -p 4 -d $SCRATCH_MNT -n 2000 $FSSTRESS_AVOID)
-	while [ "$(date +%s)" -lt $end ]; do
-		$FSSTRESS_PROG $FSSTRESS_ARGS >> $seqres.full
-	done
-}
-
-start=$(date +%s)
-end=$((start + (30 * TIME_FACTOR) ))
-
-echo "Loop started at $(date --date="@${start}"), ending at $(date --date="@${end}")" >> $seqres.full
-stress_loop $end &
-stress_pid=$!
-freeze_loop $end &
-freeze_pid=$!
-fsmap_loop $end &
-fsmap_pid=$!
-
-# Wait until 2 seconds after the loops should have finished...
-while [ "$(date +%s)" -lt $((end + 2)) ]; do
-	sleep 1
-done
-
-# ...and clean up after the loops in case they didn't do it themselves.
-kill_loops >> $seqres.full 2>&1
-
-echo "Loop finished at $(date)" >> $seqres.full
-echo "Test done"
+_scratch_xfs_stress_scrub -i 'fsmap -v'
 
 # success, all done
+echo "Silence is golden"
 status=0
 exit
diff --git a/tests/xfs/517.out b/tests/xfs/517.out
index da6366e52b..49c53bcaa9 100644
--- a/tests/xfs/517.out
+++ b/tests/xfs/517.out
@@ -1,4 +1,2 @@
 QA output created by 517
-Format and populate
-Concurrent fsmap and freeze
-Test done
+Silence is golden
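
The -i loop does no validation of its own, so callers gate on the xfs_io
command first, which is all that remains of xfs/517's setup (a sketch;
add -f if the freeze/thaw race is also wanted):

    _require_xfs_io_command "fsmap"
    _scratch_xfs_stress_scrub -i 'fsmap -v'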


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 3/3] xfs: race fsmap with readonly remounts to detect crash or livelock
  2022-12-30 22:12 ` [PATCHSET v24.0 0/3] fstests: refactor GETFSMAP stress tests Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 1/3] fuzzy: enhance scrub stress testing to use fsx Darrick J. Wong
@ 2022-12-30 22:12   ` Darrick J. Wong
  2022-12-30 22:12   ` [PATCH 2/3] fuzzy: refactor fsmap stress test to use our helper functions Darrick J. Wong
  2 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:12 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

Add a new test that races the GETFSMAP ioctl with ro/rw remounting to
make sure we don't livelock on the empty transaction that fsmap uses to
avoid deadlocking on rmap btree cycles.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 common/fuzzy      |   98 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 ltp/fsstress.c    |   18 +++++++++-
 tests/xfs/732     |   38 +++++++++++++++++++++
 tests/xfs/732.out |    2 +
 4 files changed, 153 insertions(+), 3 deletions(-)
 create mode 100755 tests/xfs/732
 create mode 100644 tests/xfs/732.out


diff --git a/common/fuzzy b/common/fuzzy
index 58e299d34b..ee97aa4298 100644
--- a/common/fuzzy
+++ b/common/fuzzy
@@ -429,6 +429,7 @@ __stress_scrub_clean_scratch() {
 __stress_scrub_fsx_loop() {
 	local end="$1"
 	local runningfile="$2"
+	local remount_period="$3"
 	local focus=(-q -X)	# quiet, validate file contents
 
 	# As of November 2022, 2 million fsx ops should be enough to keep
@@ -440,6 +441,43 @@ __stress_scrub_fsx_loop() {
 	local args="$FSX_AVOID ${focus[@]} ${SCRATCH_MNT}/fsx.$seq"
 	echo "Running $here/ltp/fsx $args" >> $seqres.full
 
+	if [ -n "$remount_period" ]; then
+		local mode="rw"
+		local rw_arg=""
+		while __stress_scrub_running "$end" "$runningfile"; do
+			# Need to recheck running conditions if we cleared
+			# anything.
+			test "$mode" = "rw" && __stress_scrub_clean_scratch && continue
+
+			timeout -s TERM "$remount_period" $here/ltp/fsx \
+					$args $rw_arg >> $seqres.full
+			res=$?
+			echo "$mode fsx exits with $res at $(date)" >> $seqres.full
+			if [ "$res" -ne 0 ] && [ "$res" -ne 124 ]; then
+				# Stop if fsx returns error.  Mask off
+				# the magic code 124 because that is how the
+				# timeout(1) program communicates that we ran
+				# out of time.
+				break;
+			fi
+			if [ "$mode" = "rw" ]; then
+				mode="ro"
+				rw_arg="-t 0 -w 0 -FHzCIJBE0"
+			else
+				mode="rw"
+				rw_arg=""
+			fi
+
+			# Try remounting until we get the result we wanted
+			while ! _scratch_remount "$mode" &>/dev/null && \
+			      __stress_scrub_running "$end" "$runningfile"; do
+				sleep 0.2
+			done
+		done
+		rm -f "$runningfile"
+		return 0
+	fi
+
 	while __stress_scrub_running "$end" "$runningfile"; do
 		# Need to recheck running conditions if we cleared anything
 		__stress_scrub_clean_scratch && continue
@@ -453,12 +491,50 @@ __stress_scrub_fsx_loop() {
 __stress_scrub_fsstress_loop() {
 	local end="$1"
 	local runningfile="$2"
+	local remount_period="$3"
 
 	# As of March 2022, 2 million fsstress ops should be enough to keep
 	# any filesystem busy for a couple of hours.
 	local args=$(_scale_fsstress_args -p 4 -d $SCRATCH_MNT -n 2000000 $FSSTRESS_AVOID)
 	echo "Running $FSSTRESS_PROG $args" >> $seqres.full
 
+	if [ -n "$remount_period" ]; then
+		local mode="rw"
+		local rw_arg=""
+		while __stress_scrub_running "$end" "$runningfile"; do
+			# Need to recheck running conditions if we cleared
+			# anything.
+			test "$mode" = "rw" && __stress_scrub_clean_scratch && continue
+
+			timeout -s TERM "$remount_period" $FSSTRESS_PROG \
+					$args $rw_arg >> $seqres.full
+			res=$?
+			echo "$mode fsstress exits with $res at $(date)" >> $seqres.full
+			if [ "$res" -ne 0 ] && [ "$res" -ne 124 ]; then
+				# Stop if fsstress returns error.  Mask off
+				# the magic code 124 because that is how the
+				# timeout(1) program communicates that we ran
+				# out of time.
+				break;
+			fi
+			if [ "$mode" = "rw" ]; then
+				mode="ro"
+				rw_arg="-R"
+			else
+				mode="rw"
+				rw_arg=""
+			fi
+
+			# Try remounting until we get the result we wanted
+			while ! _scratch_remount "$mode" &>/dev/null && \
+			      __stress_scrub_running "$end" "$runningfile"; do
+				sleep 0.2
+			done
+		done
+		rm -f "$runningfile"
+		return 0
+	fi
+
 	while __stress_scrub_running "$end" "$runningfile"; do
 		# Need to recheck running conditions if we cleared anything
 		__stress_scrub_clean_scratch && continue
@@ -526,6 +602,13 @@ _scratch_xfs_stress_scrub_cleanup() {
 	echo "Waiting for children to exit at $(date)" >> $seqres.full
 	wait
 
+	# Ensure the scratch fs is also writable before we exit.
+	if [ -n "$__SCRUB_STRESS_REMOUNT_LOOP" ]; then
+		echo "Remounting rw at $(date)" >> $seqres.full
+		_scratch_remount rw >> $seqres.full 2>&1
+		__SCRUB_STRESS_REMOUNT_LOOP=""
+	fi
+
 	echo "Cleanup finished at $(date)" >> $seqres.full
 }
 
@@ -561,6 +644,9 @@ __stress_scrub_check_commands() {
 #	in a separate loop.  If zero -i options are specified, do not run.
 #	Callers must check each of these commands (via _require_xfs_io_command)
 #	before calling here.
+# -r	Run fsstress for this amount of time, then remount the fs ro or rw.
+#	The default is to run fsstress continuously with no remount, unless
+#	XFS_SCRUB_STRESS_REMOUNT_PERIOD is set.
 # -s	Pass this command to xfs_io to test scrub.  If zero -s options are
 #	specified, xfs_io will not be run.
 # -t	Run online scrub against this file; $SCRATCH_MNT is the default.
@@ -577,16 +663,19 @@ _scratch_xfs_stress_scrub() {
 	local scrub_delay="${XFS_SCRUB_STRESS_DELAY:--1}"
 	local exerciser="fsstress"
 	local io_args=()
+	local remount_period="${XFS_SCRUB_STRESS_REMOUNT_PERIOD}"
 
 	__SCRUB_STRESS_FREEZE_PID=""
+	__SCRUB_STRESS_REMOUNT_LOOP=""
 	rm -f "$runningfile"
 	touch "$runningfile"
 
 	OPTIND=1
-	while getopts "fi:s:t:w:X:" c; do
+	while getopts "fi:r:s:t:w:X:" c; do
 		case "$c" in
 			f) freeze=yes;;
 			i) io_args+=("$OPTARG");;
+			r) remount_period="$OPTARG";;
 			s) one_scrub_args+=("$OPTARG");;
 			t) scrub_tgt="$OPTARG";;
 			w) scrub_delay="$OPTARG";;
@@ -611,7 +700,12 @@ _scratch_xfs_stress_scrub() {
 	echo "Loop started at $(date --date="@${start}")," \
 		   "ending at $(date --date="@${end}")" >> $seqres.full
 
-	"__stress_scrub_${exerciser}_loop" "$end" "$runningfile" &
+	if [ -n "$remount_period" ]; then
+		__SCRUB_STRESS_REMOUNT_LOOP="1"
+	fi
+
+	"__stress_scrub_${exerciser}_loop" "$end" "$runningfile" \
+			"$remount_period" &
 
 	if [ -n "$freeze" ]; then
 		__stress_scrub_freeze_loop "$end" "$runningfile" &
diff --git a/ltp/fsstress.c b/ltp/fsstress.c
index b395bc4da2..10608fb554 100644
--- a/ltp/fsstress.c
+++ b/ltp/fsstress.c
@@ -426,6 +426,7 @@ int	symlink_path(const char *, pathname_t *);
 int	truncate64_path(pathname_t *, off64_t);
 int	unlink_path(pathname_t *);
 void	usage(void);
+void	read_freq(void);
 void	write_freq(void);
 void	zero_freq(void);
 void	non_btrfs_freq(const char *);
@@ -472,7 +473,7 @@ int main(int argc, char **argv)
 	xfs_error_injection_t	        err_inj;
 	struct sigaction action;
 	int		loops = 1;
-	const char	*allopts = "cd:e:f:i:l:m:M:n:o:p:rs:S:vVwx:X:zH";
+	const char	*allopts = "cd:e:f:i:l:m:M:n:o:p:rRs:S:vVwx:X:zH";
 
 	errrange = errtag = 0;
 	umask(0);
@@ -538,6 +539,9 @@ int main(int argc, char **argv)
 		case 'r':
 			namerand = 1;
 			break;
+		case 'R':
+			read_freq();
+			break;
 		case 's':
 			seed = strtoul(optarg, NULL, 0);
 			break;
@@ -1917,6 +1921,7 @@ usage(void)
 	printf("   -o logfile       specifies logfile name\n");
 	printf("   -p nproc         specifies the no. of processes (default 1)\n");
 	printf("   -r               specifies random name padding\n");
+	printf("   -R               zeros frequencies of write operations\n");
 	printf("   -s seed          specifies the seed for the random generator (default random)\n");
 	printf("   -v               specifies verbose mode\n");
 	printf("   -w               zeros frequencies of non-write operations\n");
@@ -1928,6 +1933,17 @@ usage(void)
 	printf("   -H               prints usage and exits\n");
 }
 
+void
+read_freq(void)
+{
+	opdesc_t	*p;
+
+	for (p = ops; p < ops_end; p++) {
+		if (p->iswrite)
+			p->freq = 0;
+	}
+}
+
 void
 write_freq(void)
 {
diff --git a/tests/xfs/732 b/tests/xfs/732
new file mode 100755
index 0000000000..ed6fb3c977
--- /dev/null
+++ b/tests/xfs/732
@@ -0,0 +1,38 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 732
+#
+# Race GETFSMAP and ro remount for a while to see if we crash or livelock.
+#
+. ./common/preamble
+_begin_fstest auto quick fsmap remount
+
+# Override the default cleanup function.
+_cleanup()
+{
+	cd /
+	_scratch_xfs_stress_scrub_cleanup
+	rm -rf $tmp.*
+}
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/xfs
+
+# real QA test starts here
+_supported_fs xfs
+_require_xfs_scratch_rmapbt
+_require_xfs_io_command "fsmap"
+_require_xfs_stress_scrub
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_scratch_xfs_stress_scrub -r 5 -i 'fsmap -v'
+
+# success, all done
+echo "Silence is golden"
+status=0
+exit
diff --git a/tests/xfs/732.out b/tests/xfs/732.out
new file mode 100644
index 0000000000..451f82ce2d
--- /dev/null
+++ b/tests/xfs/732.out
@@ -0,0 +1,2 @@
+QA output created by 732
+Silence is golden
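
During the read-only phase the exerciser must not issue writes, so the
remount support comes in two halves (a sketch; flags are taken from the
hunks above):

    # xfs/732: remount ro/rw every 5 seconds while fsmap runs in a loop
    _scratch_xfs_stress_scrub -r 5 -i 'fsmap -v'
    # in the ro phase the loop reruns fsstress with -R, which zeroes the
    # frequencies of all write operations
    $FSSTRESS_PROG -R -p 4 -d $SCRATCH_MNT -n 1000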


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCHSET v24.0 0/2] fstests: race online scrub with mount state changes
  2022-12-30 21:13 [NYE DELUGE 1/4] xfs: all pending online scrub improvements Darrick J. Wong
                   ` (20 preceding siblings ...)
  2022-12-30 22:12 ` [PATCHSET v24.0 0/3] fstests: refactor GETFSMAP stress tests Darrick J. Wong
@ 2022-12-30 22:13 ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 1/2] xfs: stress test xfs_scrub(8) with fsstress Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 2/2] xfs: stress test xfs_scrub(8) with freeze and ro-remount loops Darrick J. Wong
  2023-01-13 20:10 ` [NYE DELUGE 1/4] xfs: all pending online scrub improvements Zorro Lang
  22 siblings, 2 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

Hi all,

Introduce the ability to run xfs_scrub(8) itself from our online fsck
stress test harness.  Create two new tests to race scrub and repair
against fsstress, and four more tests to do the same but racing against
fs freeze and ro remounts.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=race-scrub-and-mount-state-changes
---
 common/fuzzy      |   63 ++++++++++++++++++++++++++++++++++++++++++++++++++---
 tests/xfs/285     |   44 ++++++++++---------------------------
 tests/xfs/285.out |    4 +--
 tests/xfs/286     |   46 ++++++++++-----------------------------
 tests/xfs/286.out |    4 +--
 tests/xfs/733     |   39 +++++++++++++++++++++++++++++++++
 tests/xfs/733.out |    2 ++
 tests/xfs/771     |   39 +++++++++++++++++++++++++++++++++
 tests/xfs/771.out |    2 ++
 tests/xfs/824     |   40 ++++++++++++++++++++++++++++++++++
 tests/xfs/824.out |    2 ++
 tests/xfs/825     |   40 ++++++++++++++++++++++++++++++++++
 tests/xfs/825.out |    2 ++
 13 files changed, 252 insertions(+), 75 deletions(-)
 create mode 100755 tests/xfs/733
 create mode 100644 tests/xfs/733.out
 create mode 100755 tests/xfs/771
 create mode 100644 tests/xfs/771.out
 create mode 100755 tests/xfs/824
 create mode 100644 tests/xfs/824.out
 create mode 100755 tests/xfs/825
 create mode 100644 tests/xfs/825.out


^ permalink raw reply	[flat|nested] 220+ messages in thread

* [PATCH 1/2] xfs: stress test xfs_scrub(8) with fsstress
  2022-12-30 22:13 ` [PATCHSET v24.0 0/2] fstests: race online scrub with mount state changes Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 2/2] xfs: stress test xfs_scrub(8) with freeze and ro-remount loops Darrick J. Wong
  1 sibling, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

Port the two existing tests that check that xfs_scrub(8) (aka the main
userspace driver program) doesn't clash with fsstress so that they use
our new framework.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 common/fuzzy      |   63 ++++++++++++++++++++++++++++++++++++++++++++++++++---
 tests/xfs/285     |   44 ++++++++++---------------------------
 tests/xfs/285.out |    4 +--
 tests/xfs/286     |   46 ++++++++++-----------------------------
 tests/xfs/286.out |    4 +--
 5 files changed, 86 insertions(+), 75 deletions(-)


diff --git a/common/fuzzy b/common/fuzzy
index ee97aa4298..e39f787e78 100644
--- a/common/fuzzy
+++ b/common/fuzzy
@@ -411,6 +411,42 @@ __stress_one_scrub_loop() {
 	done
 }
 
+# Run xfs_scrub online fsck in a tight loop.
+__stress_xfs_scrub_loop() {
+	local end="$1"
+	local runningfile="$2"
+	local scrub_startat="$3"
+	shift; shift; shift
+	local sigint_ret="$(( $(kill -l SIGINT) + 128 ))"
+	local scrublog="$tmp.scrub"
+
+	while __stress_scrub_running "$scrub_startat" "$runningfile"; do
+		sleep 1
+	done
+
+	while __stress_scrub_running "$end" "$runningfile"; do
+		_scratch_scrub "$@" &> $scrublog
+		res=$?
+		if [ "$res" -eq "$sigint_ret" ]; then
+			# Ignore SIGINT because the cleanup function sends
+			# that to terminate xfs_scrub
+			res=0
+		fi
+		echo "xfs_scrub exits with $res at $(date)" >> $seqres.full
+		if [ "$res" -ge 128 ]; then
+			# Report scrub death due to fatal signals
+			echo "xfs_scrub died with SIG$(kill -l $res)"
+			cat $scrublog >> $seqres.full 2>/dev/null
+		elif [ "$((res & 0x1))" -gt 0 ]; then
+			# Report uncorrected filesystem errors
+			echo "xfs_scrub reports uncorrected errors:"
+			grep -E '(Repair unsuccessful;|Corruption:)' $scrublog
+			cat $scrublog >> $seqres.full 2>/dev/null
+		fi
+		rm -f $scrublog
+	done
+}
+
 # Clean the scratch filesystem between rounds of fsstress if there is 2%
 # available space or less because that isn't an interesting stress test.
 #
@@ -571,7 +607,7 @@ _scratch_xfs_stress_scrub_cleanup() {
 	# Send SIGINT so that bash won't print a 'Terminated' message that
 	# distorts the golden output.
 	echo "Killing stressor processes at $(date)" >> $seqres.full
-	$KILLALL_PROG -INT xfs_io fsstress fsx >> $seqres.full 2>&1
+	$KILLALL_PROG -INT xfs_io fsstress fsx xfs_scrub >> $seqres.full 2>&1
 
 	# Tests are not allowed to exit with the scratch fs frozen.  If we
 	# started a fs freeze/thaw background loop, wait for that loop to exit
@@ -649,6 +685,8 @@ __stress_scrub_check_commands() {
 #	XFS_SCRUB_STRESS_REMOUNT_PERIOD is set.
 # -s	Pass this command to xfs_io to test scrub.  If zero -s options are
 #	specified, xfs_io will not be run.
+# -S	Pass this option to xfs_scrub.  If zero -S options are specified,
+#	xfs_scrub will not be run.  To select repair mode, pass '-k' or '-v'.
 # -t	Run online scrub against this file; $SCRATCH_MNT is the default.
 # -w	Delay the start of the scrub/repair loop by this number of seconds.
 #	Defaults to no delay unless XFS_SCRUB_STRESS_DELAY is set.  This value
@@ -657,6 +695,7 @@ __stress_scrub_check_commands() {
 #       options are 'fsx' and 'fsstress'.  The default is 'fsstress'.
 _scratch_xfs_stress_scrub() {
 	local one_scrub_args=()
+	local xfs_scrub_args=()
 	local scrub_tgt="$SCRATCH_MNT"
 	local runningfile="$tmp.fsstress"
 	local freeze="${XFS_SCRUB_STRESS_FREEZE}"
@@ -671,12 +710,13 @@ _scratch_xfs_stress_scrub() {
 	touch "$runningfile"
 
 	OPTIND=1
-	while getopts "fi:r:s:t:w:X:" c; do
+	while getopts "fi:r:s:S:t:w:X:" c; do
 		case "$c" in
 			f) freeze=yes;;
 			i) io_args+=("$OPTARG");;
 			r) remount_period="$OPTARG";;
 			s) one_scrub_args+=("$OPTARG");;
+			S) xfs_scrub_args+=("$OPTARG");;
 			t) scrub_tgt="$OPTARG";;
 			w) scrub_delay="$OPTARG";;
 			X) exerciser="$OPTARG";;
@@ -691,6 +731,18 @@ _scratch_xfs_stress_scrub() {
 		return 1
 	fi
 
+	if [ "${#xfs_scrub_args[@]}" -gt 0 ]; then
+		_scratch_scrub "${xfs_scrub_args[@]}" &> "$tmp.scrub"
+		res=$?
+		if [ $res -ne 0 ]; then
+			echo "xfs_scrub ${xfs_scrub_args[@]} failed, err $res" >> $seqres.full
+			cat "$tmp.scrub" >> $seqres.full
+			rm -f "$tmp.scrub"
+			_notrun 'scrub not supported on scratch filesystem'
+		fi
+		rm -f "$tmp.scrub"
+	fi
+
 	local start="$(date +%s)"
 	local end="$((start + (30 * TIME_FACTOR) ))"
 	local scrub_startat="$((start + scrub_delay))"
@@ -722,6 +774,11 @@ _scratch_xfs_stress_scrub() {
 				"$scrub_startat" "${one_scrub_args[@]}" &
 	fi
 
+	if [ "${#xfs_scrub_args[@]}" -gt 0 ]; then
+		__stress_xfs_scrub_loop "$end" "$runningfile" "$scrub_startat" \
+				"${xfs_scrub_args[@]}" &
+	fi
+
 	# Wait until the designated end time or fsstress dies, then kill all of
 	# our background processes.
 	while __stress_scrub_running "$end" "$runningfile"; do
@@ -741,5 +798,5 @@ _scratch_xfs_stress_scrub() {
 # Same requirements and arguments as _scratch_xfs_stress_scrub.
 _scratch_xfs_stress_online_repair() {
 	$XFS_IO_PROG -x -c 'inject force_repair' $SCRATCH_MNT
-	_scratch_xfs_stress_scrub "$@"
+	XFS_SCRUB_FORCE_REPAIR=1 _scratch_xfs_stress_scrub "$@"
 }
diff --git a/tests/xfs/285 b/tests/xfs/285
index 711211d412..0056baeb1c 100755
--- a/tests/xfs/285
+++ b/tests/xfs/285
@@ -4,55 +4,35 @@
 #
 # FS QA Test No. 285
 #
-# Race fio and xfs_scrub for a while to see if we crash or livelock.
+# Race fsstress and xfs_scrub in read-only mode for a while to see if we crash
+# or livelock.
 #
 . ./common/preamble
-_begin_fstest dangerous_fuzzers dangerous_scrub
+_begin_fstest scrub dangerous_fsstress_scrub
 
+_cleanup() {
+	cd /
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	rm -r -f $tmp.*
+}
 _register_cleanup "_cleanup" BUS
 
 # Import common functions.
 . ./common/filter
 . ./common/fuzzy
 . ./common/inject
+. ./common/xfs
 
 # real QA test starts here
 _supported_fs xfs
-_require_test_program "feature"
-_require_command "$KILLALL_PROG" killall
-_require_command "$TIMEOUT_PROG" timeout
-_require_scrub
 _require_scratch
+_require_xfs_stress_scrub
 
-echo "Format and populate"
 _scratch_mkfs > "$seqres.full" 2>&1
 _scratch_mount
-
-STRESS_DIR="$SCRATCH_MNT/testdir"
-mkdir -p $STRESS_DIR
-
-cpus=$(( $($here/src/feature -o) * 4 * LOAD_FACTOR))
-$FSSTRESS_PROG -d $STRESS_DIR -p $cpus -n $((cpus * 100000)) $FSSTRESS_AVOID >/dev/null 2>&1 &
-$XFS_SCRUB_PROG -d -T -v -n $SCRATCH_MNT >> $seqres.full
-
-killstress() {
-	sleep $(( 60 * TIME_FACTOR ))
-	$KILLALL_PROG -q $FSSTRESS_PROG
-}
-
-echo "Concurrent scrub"
-start=$(date +%s)
-end=$((start + (60 * TIME_FACTOR) ))
-killstress &
-echo "Scrub started at $(date --date="@${start}"), ending at $(date --date="@${end}")" >> $seqres.full
-while [ "$(date +%s)" -lt "$end" ]; do
-	$TIMEOUT_PROG -s TERM $(( end - $(date +%s) + 2 )) $XFS_SCRUB_PROG -d -T -v -n $SCRATCH_MNT >> $seqres.full 2>&1
-done
-
-echo "Test done"
-echo "Scrub finished at $(date)" >> $seqres.full
-$KILLALL_PROG -q $FSSTRESS_PROG
+_scratch_xfs_stress_scrub -S '-n'
 
 # success, all done
+echo Silence is golden
 status=0
 exit
diff --git a/tests/xfs/285.out b/tests/xfs/285.out
index be6b49a9fb..ab12da9ae7 100644
--- a/tests/xfs/285.out
+++ b/tests/xfs/285.out
@@ -1,4 +1,2 @@
 QA output created by 285
-Format and populate
-Concurrent scrub
-Test done
+Silence is golden
diff --git a/tests/xfs/286 b/tests/xfs/286
index 7edc9c427b..0f61a924db 100755
--- a/tests/xfs/286
+++ b/tests/xfs/286
@@ -4,57 +4,35 @@
 #
 # FS QA Test No. 286
 #
-# Race fio and xfs_scrub for a while to see if we crash or livelock.
+# Race fsstress and xfs_scrub in force-repair mode for a while to see if we
+# crash or livelock.
 #
 . ./common/preamble
-_begin_fstest dangerous_fuzzers dangerous_scrub dangerous_online_repair
+_begin_fstest online_repair dangerous_fsstress_repair
 
+_cleanup() {
+	cd /
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	rm -r -f $tmp.*
+}
 _register_cleanup "_cleanup" BUS
 
 # Import common functions.
 . ./common/filter
 . ./common/fuzzy
 . ./common/inject
+. ./common/xfs
 
 # real QA test starts here
 _supported_fs xfs
-_require_test_program "feature"
-_require_command "$KILLALL_PROG" killall
-_require_command "$TIMEOUT_PROG" timeout
-_require_scrub
 _require_scratch
-# xfs_scrub will turn on error injection itself
-_require_xfs_io_error_injection "force_repair"
+_require_xfs_stress_online_repair
 
-echo "Format and populate"
 _scratch_mkfs > "$seqres.full" 2>&1
 _scratch_mount
-
-STRESS_DIR="$SCRATCH_MNT/testdir"
-mkdir -p $STRESS_DIR
-
-cpus=$(( $($here/src/feature -o) * 4 * LOAD_FACTOR))
-$FSSTRESS_PROG -d $STRESS_DIR -p $cpus -n $((cpus * 100000)) $FSSTRESS_AVOID >/dev/null 2>&1 &
-$XFS_SCRUB_PROG -d -T -v -n $SCRATCH_MNT >> $seqres.full
-
-killstress() {
-	sleep $(( 60 * TIME_FACTOR ))
-	$KILLALL_PROG -q $FSSTRESS_PROG
-}
-
-echo "Concurrent repair"
-start=$(date +%s)
-end=$((start + (60 * TIME_FACTOR) ))
-killstress &
-echo "Repair started at $(date --date="@${start}"), ending at $(date --date="@${end}")" >> $seqres.full
-while [ "$(date +%s)" -lt "$end" ]; do
-	XFS_SCRUB_FORCE_REPAIR=1 $TIMEOUT_PROG -s TERM $(( end - $(date +%s) + 2 )) $XFS_SCRUB_PROG -d -T -v $SCRATCH_MNT >> $seqres.full
-done
-
-echo "Test done"
-echo "Repair finished at $(date)" >> $seqres.full
-$KILLALL_PROG -q $FSSTRESS_PROG
+_scratch_xfs_stress_online_repair -S '-k'
 
 # success, all done
+echo Silence is golden
 status=0
 exit
diff --git a/tests/xfs/286.out b/tests/xfs/286.out
index 80e12b5495..35c4800694 100644
--- a/tests/xfs/286.out
+++ b/tests/xfs/286.out
@@ -1,4 +1,2 @@
 QA output created by 286
-Format and populate
-Concurrent repair
-Test done
+Silence is golden


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 2/2] xfs: stress test xfs_scrub(8) with freeze and ro-remount loops
  2022-12-30 22:13 ` [PATCHSET v24.0 0/2] fstests: race online scrub with mount state changes Darrick J. Wong
  2022-12-30 22:13   ` [PATCH 1/2] xfs: stress test xfs_scrub(8) with fsstress Darrick J. Wong
@ 2022-12-30 22:13   ` Darrick J. Wong
  1 sibling, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-12-30 22:13 UTC (permalink / raw)
  To: zlang, djwong; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

Make sure we don't trip over any asserts or livelock when scrub races
with filesystem freezing and readonly remounts.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
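The four new tests differ only in which mount-state stressor they pair with
the scrub mode.  A condensed sketch of the matrix (invocations lifted from the
test bodies below; as I read common/fuzzy, -f starts the freeze/thaw loop and
-r N sets the period of the ro/rw remount loop in seconds):

    _scratch_xfs_stress_scrub -r 5 -S '-n'           # xfs/733: ro remount vs. check-only
    _scratch_xfs_stress_scrub -f -S '-n'             # xfs/771: freeze vs. check-only
    _scratch_xfs_stress_online_repair -f -S '-k'     # xfs/824: freeze vs. forced repair
    _scratch_xfs_stress_online_repair -r 5 -S '-k'   # xfs/825: ro remount vs. forced repair
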
 tests/xfs/733     |   39 +++++++++++++++++++++++++++++++++++++++
 tests/xfs/733.out |    2 ++
 tests/xfs/771     |   39 +++++++++++++++++++++++++++++++++++++++
 tests/xfs/771.out |    2 ++
 tests/xfs/824     |   40 ++++++++++++++++++++++++++++++++++++++++
 tests/xfs/824.out |    2 ++
 tests/xfs/825     |   40 ++++++++++++++++++++++++++++++++++++++++
 tests/xfs/825.out |    2 ++
 8 files changed, 166 insertions(+)
 create mode 100755 tests/xfs/733
 create mode 100644 tests/xfs/733.out
 create mode 100755 tests/xfs/771
 create mode 100644 tests/xfs/771.out
 create mode 100755 tests/xfs/824
 create mode 100644 tests/xfs/824.out
 create mode 100755 tests/xfs/825
 create mode 100644 tests/xfs/825.out


diff --git a/tests/xfs/733 b/tests/xfs/733
new file mode 100755
index 0000000000..ee9a0a26ee
--- /dev/null
+++ b/tests/xfs/733
@@ -0,0 +1,39 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 733
+#
+# Race xfs_scrub in check-only mode and ro remount for a while to see if we
+# crash or livelock.
+#
+. ./common/preamble
+_begin_fstest scrub dangerous_fsstress_scrub
+
+# Override the default cleanup function.
+_cleanup()
+{
+	cd /
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	_scratch_remount rw
+	rm -rf $tmp.*
+}
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/xfs
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_stress_scrub
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_scratch_xfs_stress_scrub -r 5 -S '-n'
+
+# success, all done
+echo "Silence is golden"
+status=0
+exit
diff --git a/tests/xfs/733.out b/tests/xfs/733.out
new file mode 100644
index 0000000000..7118d5ddf0
--- /dev/null
+++ b/tests/xfs/733.out
@@ -0,0 +1,2 @@
+QA output created by 733
+Silence is golden
diff --git a/tests/xfs/771 b/tests/xfs/771
new file mode 100755
index 0000000000..8c8d124f12
--- /dev/null
+++ b/tests/xfs/771
@@ -0,0 +1,39 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 771
+#
+# Race xfs_scrub in check-only mode and freeze for a while to see if we crash
+# or livelock.
+#
+. ./common/preamble
+_begin_fstest scrub dangerous_fsstress_scrub
+
+# Override the default cleanup function.
+_cleanup()
+{
+	cd /
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	_scratch_remount rw
+	rm -rf $tmp.*
+}
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/xfs
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_stress_scrub
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_scratch_xfs_stress_scrub -f -S '-n'
+
+# success, all done
+echo "Silence is golden"
+status=0
+exit
diff --git a/tests/xfs/771.out b/tests/xfs/771.out
new file mode 100644
index 0000000000..c2345c7be3
--- /dev/null
+++ b/tests/xfs/771.out
@@ -0,0 +1,2 @@
+QA output created by 771
+Silence is golden
diff --git a/tests/xfs/824 b/tests/xfs/824
new file mode 100755
index 0000000000..65eeb3a6c9
--- /dev/null
+++ b/tests/xfs/824
@@ -0,0 +1,40 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 824
+#
+# Race xfs_scrub in force-repair mode and freeze for a while to see if we crash
+# or livelock.
+#
+. ./common/preamble
+_begin_fstest online_repair dangerous_fsstress_repair
+
+# Override the default cleanup function.
+_cleanup()
+{
+	cd /
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	_scratch_remount rw
+	rm -rf $tmp.*
+}
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/xfs
+. ./common/inject
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_stress_online_repair
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_scratch_xfs_stress_online_repair -f -S '-k'
+
+# success, all done
+echo "Silence is golden"
+status=0
+exit
diff --git a/tests/xfs/824.out b/tests/xfs/824.out
new file mode 100644
index 0000000000..6cf432abbd
--- /dev/null
+++ b/tests/xfs/824.out
@@ -0,0 +1,2 @@
+QA output created by 824
+Silence is golden
diff --git a/tests/xfs/825 b/tests/xfs/825
new file mode 100755
index 0000000000..80ce06932d
--- /dev/null
+++ b/tests/xfs/825
@@ -0,0 +1,40 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2022 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 825
+#
+# Race xfs_scrub in force-repair mode and ro remount for a while to see if we
+# crash or livelock.
+#
+. ./common/preamble
+_begin_fstest online_repair dangerous_fsstress_repair
+
+# Override the default cleanup function.
+_cleanup()
+{
+	cd /
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	_scratch_remount rw
+	rm -rf $tmp.*
+}
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/xfs
+. ./common/inject
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_stress_online_repair
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_scratch_xfs_stress_online_repair -r 5 -S '-k'
+
+# success, all done
+echo "Silence is golden"
+status=0
+exit
diff --git a/tests/xfs/825.out b/tests/xfs/825.out
new file mode 100644
index 0000000000..d0e970dfd6
--- /dev/null
+++ b/tests/xfs/825.out
@@ -0,0 +1,2 @@
+QA output created by 825
+Silence is golden


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* Re: [PATCH 1/3] fuzzy: enhance scrub stress testing to use fsx
  2022-12-30 22:12   ` [PATCH 1/3] fuzzy: enhance scrub stress testing to use fsx Darrick J. Wong
@ 2023-01-05  5:49     ` Zorro Lang
  2023-01-05 18:28       ` Darrick J. Wong
  2023-01-05 18:28     ` [PATCH v24.1 " Darrick J. Wong
  1 sibling, 1 reply; 220+ messages in thread
From: Zorro Lang @ 2023-01-05  5:49 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, fstests

On Fri, Dec 30, 2022 at 02:12:57PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Add a couple of new online fsck stress tests that race fsx against
> online fsck.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  common/fuzzy      |   39 ++++++++++++++++++++++++++++++++++++---
>  tests/xfs/847     |   38 ++++++++++++++++++++++++++++++++++++++
>  tests/xfs/847.out |    2 ++
>  tests/xfs/848     |   38 ++++++++++++++++++++++++++++++++++++++
>  tests/xfs/848.out |    2 ++
>  5 files changed, 116 insertions(+), 3 deletions(-)
>  create mode 100755 tests/xfs/847
>  create mode 100644 tests/xfs/847.out
>  create mode 100755 tests/xfs/848
>  create mode 100644 tests/xfs/848.out
> 
> 
> diff --git a/common/fuzzy b/common/fuzzy
> index 1df51a6dd8..3512e95e02 100644
> --- a/common/fuzzy
> +++ b/common/fuzzy
> @@ -408,6 +408,30 @@ __stress_scrub_clean_scratch() {
>  	return 0
>  }
>  
> +# Run fsx while we're testing online fsck.
> +__stress_scrub_fsx_loop() {
> +	local end="$1"
> +	local runningfile="$2"
> +	local focus=(-q -X)	# quiet, validate file contents
> +
> +	# As of November 2022, 2 million fsx ops should be enough to keep
> +	# any filesystem busy for a couple of hours.
> +	focus+=(-N 2000000)
> +	focus+=(-o $((128000 * LOAD_FACTOR)) )
> +	focus+=(-l $((600000 * LOAD_FACTOR)) )
> +
> +	local args="$FSX_AVOID ${focus[@]} ${SCRATCH_MNT}/fsx.$seq"
> +	echo "Running $here/ltp/fsx $args" >> $seqres.full
> +
> +	while __stress_scrub_running "$end" "$runningfile"; do
> +		# Need to recheck running conditions if we cleared anything
> +		__stress_scrub_clean_scratch && continue
> +		$here/ltp/fsx $args >> $seqres.full
> +		echo "fsx exits with $? at $(date)" >> $seqres.full
> +	done
> +	rm -f "$runningfile"
> +}
> +
>  # Run fsstress while we're testing online fsck.
>  __stress_scrub_fsstress_loop() {
>  	local end="$1"
> @@ -454,7 +478,7 @@ _scratch_xfs_stress_scrub_cleanup() {
>  	# Send SIGINT so that bash won't print a 'Terminated' message that
>  	# distorts the golden output.
>  	echo "Killing stressor processes at $(date)" >> $seqres.full
> -	$KILLALL_PROG -INT xfs_io fsstress >> $seqres.full 2>&1
> +	$KILLALL_PROG -INT xfs_io fsstress fsx >> $seqres.full 2>&1
>  
>  	# Tests are not allowed to exit with the scratch fs frozen.  If we
>  	# started a fs freeze/thaw background loop, wait for that loop to exit
> @@ -522,30 +546,39 @@ __stress_scrub_check_commands() {
>  # -w	Delay the start of the scrub/repair loop by this number of seconds.
>  #	Defaults to no delay unless XFS_SCRUB_STRESS_DELAY is set.  This value
>  #	will be clamped to ten seconds before the end time.
> +# -X	Run this program to exercise the filesystem.  Currently supported
> +#       options are 'fsx' and 'fsstress'.  The default is 'fsstress'.
>  _scratch_xfs_stress_scrub() {
>  	local one_scrub_args=()
>  	local scrub_tgt="$SCRATCH_MNT"
>  	local runningfile="$tmp.fsstress"
>  	local freeze="${XFS_SCRUB_STRESS_FREEZE}"
>  	local scrub_delay="${XFS_SCRUB_STRESS_DELAY:--1}"
> +	local exerciser="fsstress"
>  
>  	__SCRUB_STRESS_FREEZE_PID=""
>  	rm -f "$runningfile"
>  	touch "$runningfile"
>  
>  	OPTIND=1
> -	while getopts "fs:t:w:" c; do
> +	while getopts "fs:t:w:X:" c; do
>  		case "$c" in
>  			f) freeze=yes;;
>  			s) one_scrub_args+=("$OPTARG");;
>  			t) scrub_tgt="$OPTARG";;
>  			w) scrub_delay="$OPTARG";;
> +			X) exerciser="$OPTARG";;
>  			*) return 1; ;;
>  		esac
>  	done
>  
>  	__stress_scrub_check_commands "$scrub_tgt" "${one_scrub_args[@]}"
>  
> +	if ! command -v "__stress_scrub_${exerciser}_loop" &>/dev/null; then
> +		echo "${exerciser}: Unknown fs exercise program."
> +		return 1
> +	fi
> +
>  	local start="$(date +%s)"
>  	local end="$((start + (30 * TIME_FACTOR) ))"
>  	local scrub_startat="$((start + scrub_delay))"
> @@ -555,7 +588,7 @@ _scratch_xfs_stress_scrub() {
>  	echo "Loop started at $(date --date="@${start}")," \
>  		   "ending at $(date --date="@${end}")" >> $seqres.full
>  
> -	__stress_scrub_fsstress_loop "$end" "$runningfile" &
> +	"__stress_scrub_${exerciser}_loop" "$end" "$runningfile" &
>  
>  	if [ -n "$freeze" ]; then
>  		__stress_scrub_freeze_loop "$end" "$runningfile" &
> diff --git a/tests/xfs/847 b/tests/xfs/847
> new file mode 100755
> index 0000000000..856e9a6c26
> --- /dev/null
> +++ b/tests/xfs/847
> @@ -0,0 +1,38 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2022 Oracle, Inc.  All Rights Reserved.
> +#
> +# FS QA Test No. 847
> +#
> +# Race fsx and xfs_scrub in read-only mode for a while to see if we crash
> +# or livelock.
> +#
> +. ./common/preamble
> +_begin_fstest scrub dangerous_fsstress_scrub

Hi Darrick,

Such huge patchsets :) I'll try to review them one by one (patchset by
patchset).

Now I'm trying to review "[NYE DELUGE 1/4]", but I can't find the
"dangerous_fsstress_scrub" group anywhere in these patchsets. Is there a
prerequisite patch(set)? Or did you mean to use "dangerous_fsstress_repair"?

P.S.: More test cases use "dangerous_fsstress_scrub" in your new patchsets.

Thanks,
Zorro

> +
> +_cleanup() {
> +	cd /
> +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> +	rm -r -f $tmp.*
> +}
> +_register_cleanup "_cleanup" BUS
> +
> +# Import common functions.
> +. ./common/filter
> +. ./common/fuzzy
> +. ./common/inject
> +. ./common/xfs
> +
> +# real QA test starts here
> +_supported_fs xfs
> +_require_scratch
> +_require_xfs_stress_scrub
> +
> +_scratch_mkfs > "$seqres.full" 2>&1
> +_scratch_mount
> +_scratch_xfs_stress_scrub -S '-n' -X 'fsx'
> +
> +# success, all done
> +echo Silence is golden
> +status=0
> +exit
> diff --git a/tests/xfs/847.out b/tests/xfs/847.out
> new file mode 100644
> index 0000000000..b7041db159
> --- /dev/null
> +++ b/tests/xfs/847.out
> @@ -0,0 +1,2 @@
> +QA output created by 847
> +Silence is golden
> diff --git a/tests/xfs/848 b/tests/xfs/848
> new file mode 100755
> index 0000000000..ab32020624
> --- /dev/null
> +++ b/tests/xfs/848
> @@ -0,0 +1,38 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2022 Oracle, Inc.  All Rights Reserved.
> +#
> +# FS QA Test No. 848
> +#
> +# Race fsx and xfs_scrub in force-repair mode for a while to see if we
> +# crash or livelock.
> +#
> +. ./common/preamble
> +_begin_fstest online_repair dangerous_fsstress_repair
> +
> +_cleanup() {
> +	cd /
> +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> +	rm -r -f $tmp.*
> +}
> +_register_cleanup "_cleanup" BUS
> +
> +# Import common functions.
> +. ./common/filter
> +. ./common/fuzzy
> +. ./common/inject
> +. ./common/xfs
> +
> +# real QA test starts here
> +_supported_fs xfs
> +_require_scratch
> +_require_xfs_stress_online_repair
> +
> +_scratch_mkfs > "$seqres.full" 2>&1
> +_scratch_mount
> +_scratch_xfs_stress_online_repair -S '-k' -X 'fsx'
> +
> +# success, all done
> +echo Silence is golden
> +status=0
> +exit
> diff --git a/tests/xfs/848.out b/tests/xfs/848.out
> new file mode 100644
> index 0000000000..23f674045c
> --- /dev/null
> +++ b/tests/xfs/848.out
> @@ -0,0 +1,2 @@
> +QA output created by 848
> +Silence is golden
> 


^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 06/14] xfs: document how online fsck deals with eventual consistency
  2022-12-30 22:10   ` [PATCH 06/14] xfs: document how online fsck deals with eventual consistency Darrick J. Wong
@ 2023-01-05  9:08     ` Amir Goldstein
  2023-01-05 19:40       ` Darrick J. Wong
  2023-01-31  6:11     ` Allison Henderson
  1 sibling, 1 reply; 220+ messages in thread
From: Amir Goldstein @ 2023-01-05  9:08 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-xfs, willy, chandan.babu, allison.henderson, linux-fsdevel,
	hch, catherine.hoang, david

On Sat, Dec 31, 2022 at 12:32 AM Darrick J. Wong <djwong@kernel.org> wrote:
>
> From: Darrick J. Wong <djwong@kernel.org>
>
> Writes to an XFS filesystem employ an eventual consistency update model
> to break up complex multistep metadata updates into small chained
> transactions.  This is generally good for performance and scalability
> because XFS doesn't need to prepare for enormous transactions, but it
> also means that online fsck must be careful not to attempt a fsck action
> unless it can be shown that there are no other threads processing a
> transaction chain.  This part of the design documentation covers the
> thinking behind the consistency model and how scrub deals with it.
>
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  .../filesystems/xfs-online-fsck-design.rst         |  303 ++++++++++++++++++++
>  1 file changed, 303 insertions(+)
>
>
> diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst
> index f45bf97fa9c4..419eb54ee200 100644
> --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> @@ -1443,3 +1443,306 @@ This step is critical for enabling system administrator to monitor the status
>  of the filesystem and the progress of any repairs.
>  For developers, it is a useful means to judge the efficacy of error detection
>  and correction in the online and offline checking tools.
> +
> +Eventual Consistency vs. Online Fsck
> +------------------------------------
> +
> +Midway through the development of online scrubbing, the fsstress tests
> +uncovered a misinteraction between online fsck and compound transaction chains
> +created by other writer threads that resulted in false reports of metadata
> +inconsistency.
> +The root cause of these reports is the eventual consistency model introduced by
> +the expansion of deferred work items and compound transaction chains when
> +reverse mapping and reflink were introduced.
> +
> +Originally, transaction chains were added to XFS to avoid deadlocks when
> +unmapping space from files.
> +Deadlock avoidance rules require that AGs only be locked in increasing order,
> +which makes it impossible (say) to use a single transaction to free a space
> +extent in AG 7 and then try to free a now superfluous block mapping btree block
> +in AG 3.
> +To avoid these kinds of deadlocks, XFS creates Extent Freeing Intent (EFI) log
> +items to commit to freeing some space in one transaction while deferring the
> +actual metadata updates to a fresh transaction.
> +The transaction sequence looks like this:
> +
> +1. The first transaction contains a physical update to the file's block mapping
> +   structures to remove the mapping from the btree blocks.
> +   It then attaches to the in-memory transaction an action item to schedule
> +   deferred freeing of space.
> +   Concretely, each transaction maintains a list of ``struct
> +   xfs_defer_pending`` objects, each of which maintains a list of ``struct
> +   xfs_extent_free_item`` objects.
> +   Returning to the example above, the action item tracks the freeing of both
> +   the unmapped space from AG 7 and the block mapping btree (BMBT) block from
> +   AG 3.
> +   Deferred frees recorded in this manner are committed in the log by creating
> +   an EFI log item from the ``struct xfs_extent_free_item`` object and
> +   attaching the log item to the transaction.
> +   When the log is persisted to disk, the EFI item is written into the ondisk
> +   transaction record.
> +   EFIs can list up to 16 extents to free, all sorted in AG order.
> +
> +2. The second transaction contains a physical update to the free space btrees
> +   of AG 3 to release the former BMBT block and a second physical update to the
> +   free space btrees of AG 7 to release the unmapped file space.
> +   Observe that the the physical updates are resequenced in the correct order
> +   when possible.
> +   Attached to the transaction is a an extent free done (EFD) log item.
> +   The EFD contains a pointer to the EFI logged in transaction #1 so that log
> +   recovery can tell if the EFI needs to be replayed.
> +
> +If the system goes down after transaction #1 is written back to the filesystem
> +but before #2 is committed, a scan of the filesystem metadata would show
> +inconsistent filesystem metadata because there would not appear to be any owner
> +of the unmapped space.
> +Happily, log recovery corrects this inconsistency for us -- when recovery finds
> +an intent log item but does not find a corresponding intent done item, it will
> +reconstruct the incore state of the intent item and finish it.
> +In the example above, the log must replay both frees described in the recovered
> +EFI to complete the recovery phase.
> +
> +There are two subtleties to XFS' transaction chaining strategy to consider.
> +The first is that log items must be added to a transaction in the correct order
> +to prevent conflicts with principal objects that are not held by the
> +transaction.
> +In other words, all per-AG metadata updates for an unmapped block must be
> +completed before the last update to free the extent, and extents should not
> +be reallocated until that last update commits to the log.
> +The second subtlety comes from the fact that AG header buffers are (usually)
> +released between each transaction in a chain.
> +This means that other threads can observe an AG in an intermediate state,
> +but as long as the first subtlety is handled, this should not affect the
> +correctness of filesystem operations.
> +Unmounting the filesystem flushes all pending work to disk, which means that
> +offline fsck never sees the temporary inconsistencies caused by deferred work
> +item processing.
> +In this manner, XFS employs a form of eventual consistency to avoid deadlocks
> +and increase parallelism.
> +
> +During the design phase of the reverse mapping and reflink features, it was
> +decided that it was impractical to cram all the reverse mapping updates for a
> +single filesystem change into a single transaction because a single file
> +mapping operation can explode into many small updates:
> +
> +* The block mapping update itself
> +* A reverse mapping update for the block mapping update
> +* Fixing the freelist
> +* A reverse mapping update for the freelist fix
> +
> +* A shape change to the block mapping btree
> +* A reverse mapping update for the btree update
> +* Fixing the freelist (again)
> +* A reverse mapping update for the freelist fix
> +
> +* An update to the reference counting information
> +* A reverse mapping update for the refcount update
> +* Fixing the freelist (a third time)
> +* A reverse mapping update for the freelist fix
> +
> +* Freeing any space that was unmapped and not owned by any other file
> +* Fixing the freelist (a fourth time)
> +* A reverse mapping update for the freelist fix
> +
> +* Freeing the space used by the block mapping btree
> +* Fixing the freelist (a fifth time)
> +* A reverse mapping update for the freelist fix
> +
> +Free list fixups are not usually needed more than once per AG per transaction
> +chain, but it is theoretically possible if space is very tight.
> +For copy-on-write updates this is even worse, because this must be done once to
> +remove the space from a staging area and again to map it into the file!
> +
> +To deal with this explosion in a calm manner, XFS expands its use of deferred
> +work items to cover most reverse mapping updates and all refcount updates.
> +This reduces the worst case size of transaction reservations by breaking the
> +work into a long chain of small updates, which increases the degree of eventual
> +consistency in the system.
> +Again, this generally isn't a problem because XFS orders its deferred work
> +items carefully to avoid resource reuse conflicts between unsuspecting threads.
> +
> +However, online fsck changes the rules -- remember that although physical
> +updates to per-AG structures are coordinated by locking the buffers for AG
> +headers, buffer locks are dropped between transactions.
> +Once scrub acquires resources and takes locks for a data structure, it must do
> +all the validation work without releasing the lock.
> +If the main lock for a space btree is an AG header buffer lock, scrub may have
> +interrupted another thread that is midway through finishing a chain.
> +For example, if a thread performing a copy-on-write has completed a reverse
> +mapping update but not the corresponding refcount update, the two AG btrees
> +will appear inconsistent to scrub and an observation of corruption will be
> +recorded.  This observation will not be correct.
> +If a repair is attempted in this state, the results will be catastrophic!
> +
> +Several solutions to this problem were evaluated upon discovery of this flaw:
> +
> +1. Add a higher level lock to allocation groups and require writer threads to
> +   acquire the higher level lock in AG order before making any changes.
> +   This would be very difficult to implement in practice because it is
> +   difficult to determine which locks need to be obtained, and in what order,
> +   without simulating the entire operation.
> +   Performing a dry run of a file operation to discover necessary locks would
> +   make the filesystem very slow.
> +
> +2. Make the deferred work coordinator code aware of consecutive intent items
> +   targeting the same AG and have it hold the AG header buffers locked across
> +   the transaction roll between updates.
> +   This would introduce a lot of complexity into the coordinator since it is
> +   only loosely coupled with the actual deferred work items.
> +   It would also fail to solve the problem because deferred work items can
> +   generate new deferred subtasks, but all subtasks must be complete before
> +   work can start on a new sibling task.
> +
> +3. Teach online fsck to walk all transactions waiting for whichever lock(s)
> +   protect the data structure being scrubbed to look for pending operations.
> +   The checking and repair operations must factor these pending operations into
> +   the evaluations being performed.
> +   This solution is a nonstarter because it is *extremely* invasive to the main
> +   filesystem.
> +
> +4. Recognize that only online fsck has this requirement of total consistency
> +   of AG metadata, and that online fsck should be relatively rare as compared
> +   to filesystem change operations.
> +   For each AG, maintain a count of intent items targetting that AG.
> +   When online fsck wants to examine an AG, it should lock the AG header
> +   buffers to quiesce all transaction chains that want to modify that AG, and
> +   only proceed with the scrub if the count is zero.
> +   In other words, scrub only proceeds if it can lock the AG header buffers and
> +   there can't possibly be any intents in progress.
> +   This may lead to fairness and starvation issues, but regular filesystem
> +   updates take precedence over online fsck activity.
> +

Is there any guarantee that some silly real life regular filesystem workload
won't starve online fsck forever?
IOW, is forward progress of online fsck guaranteed?

Good luck with landing online fsck before the 2024 NYE deluge ;)

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 1/3] fuzzy: enhance scrub stress testing to use fsx
  2023-01-05  5:49     ` Zorro Lang
@ 2023-01-05 18:28       ` Darrick J. Wong
  0 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2023-01-05 18:28 UTC (permalink / raw)
  To: Zorro Lang; +Cc: linux-xfs, fstests

On Thu, Jan 05, 2023 at 01:49:20PM +0800, Zorro Lang wrote:
> On Fri, Dec 30, 2022 at 02:12:57PM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Add a couple of new online fsck stress tests that race fsx against
> > online fsck.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  common/fuzzy      |   39 ++++++++++++++++++++++++++++++++++++---
> >  tests/xfs/847     |   38 ++++++++++++++++++++++++++++++++++++++
> >  tests/xfs/847.out |    2 ++
> >  tests/xfs/848     |   38 ++++++++++++++++++++++++++++++++++++++
> >  tests/xfs/848.out |    2 ++
> >  5 files changed, 116 insertions(+), 3 deletions(-)
> >  create mode 100755 tests/xfs/847
> >  create mode 100644 tests/xfs/847.out
> >  create mode 100755 tests/xfs/848
> >  create mode 100644 tests/xfs/848.out
> > 
> > 
> > diff --git a/common/fuzzy b/common/fuzzy
> > index 1df51a6dd8..3512e95e02 100644
> > --- a/common/fuzzy
> > +++ b/common/fuzzy
> > @@ -408,6 +408,30 @@ __stress_scrub_clean_scratch() {
> >  	return 0
> >  }
> >  
> > +# Run fsx while we're testing online fsck.
> > +__stress_scrub_fsx_loop() {
> > +	local end="$1"
> > +	local runningfile="$2"
> > +	local focus=(-q -X)	# quiet, validate file contents
> > +
> > +	# As of November 2022, 2 million fsx ops should be enough to keep
> > +	# any filesystem busy for a couple of hours.
> > +	focus+=(-N 2000000)
> > +	focus+=(-o $((128000 * LOAD_FACTOR)) )
> > +	focus+=(-l $((600000 * LOAD_FACTOR)) )
> > +
> > +	local args="$FSX_AVOID ${focus[@]} ${SCRATCH_MNT}/fsx.$seq"
> > +	echo "Running $here/ltp/fsx $args" >> $seqres.full
> > +
> > +	while __stress_scrub_running "$end" "$runningfile"; do
> > +		# Need to recheck running conditions if we cleared anything
> > +		__stress_scrub_clean_scratch && continue
> > +		$here/ltp/fsx $args >> $seqres.full
> > +		echo "fsx exits with $? at $(date)" >> $seqres.full
> > +	done
> > +	rm -f "$runningfile"
> > +}
> > +
> >  # Run fsstress while we're testing online fsck.
> >  __stress_scrub_fsstress_loop() {
> >  	local end="$1"
> > @@ -454,7 +478,7 @@ _scratch_xfs_stress_scrub_cleanup() {
> >  	# Send SIGINT so that bash won't print a 'Terminated' message that
> >  	# distorts the golden output.
> >  	echo "Killing stressor processes at $(date)" >> $seqres.full
> > -	$KILLALL_PROG -INT xfs_io fsstress >> $seqres.full 2>&1
> > +	$KILLALL_PROG -INT xfs_io fsstress fsx >> $seqres.full 2>&1
> >  
> >  	# Tests are not allowed to exit with the scratch fs frozen.  If we
> >  	# started a fs freeze/thaw background loop, wait for that loop to exit
> > @@ -522,30 +546,39 @@ __stress_scrub_check_commands() {
> >  # -w	Delay the start of the scrub/repair loop by this number of seconds.
> >  #	Defaults to no delay unless XFS_SCRUB_STRESS_DELAY is set.  This value
> >  #	will be clamped to ten seconds before the end time.
> > +# -X	Run this program to exercise the filesystem.  Currently supported
> > +#       options are 'fsx' and 'fsstress'.  The default is 'fsstress'.
> >  _scratch_xfs_stress_scrub() {
> >  	local one_scrub_args=()
> >  	local scrub_tgt="$SCRATCH_MNT"
> >  	local runningfile="$tmp.fsstress"
> >  	local freeze="${XFS_SCRUB_STRESS_FREEZE}"
> >  	local scrub_delay="${XFS_SCRUB_STRESS_DELAY:--1}"
> > +	local exerciser="fsstress"
> >  
> >  	__SCRUB_STRESS_FREEZE_PID=""
> >  	rm -f "$runningfile"
> >  	touch "$runningfile"
> >  
> >  	OPTIND=1
> > -	while getopts "fs:t:w:" c; do
> > +	while getopts "fs:t:w:X:" c; do
> >  		case "$c" in
> >  			f) freeze=yes;;
> >  			s) one_scrub_args+=("$OPTARG");;
> >  			t) scrub_tgt="$OPTARG";;
> >  			w) scrub_delay="$OPTARG";;
> > +			X) exerciser="$OPTARG";;
> >  			*) return 1; ;;
> >  		esac
> >  	done
> >  
> >  	__stress_scrub_check_commands "$scrub_tgt" "${one_scrub_args[@]}"
> >  
> > +	if ! command -v "__stress_scrub_${exerciser}_loop" &>/dev/null; then
> > +		echo "${exerciser}: Unknown fs exercise program."
> > +		return 1
> > +	fi
> > +
> >  	local start="$(date +%s)"
> >  	local end="$((start + (30 * TIME_FACTOR) ))"
> >  	local scrub_startat="$((start + scrub_delay))"
> > @@ -555,7 +588,7 @@ _scratch_xfs_stress_scrub() {
> >  	echo "Loop started at $(date --date="@${start}")," \
> >  		   "ending at $(date --date="@${end}")" >> $seqres.full
> >  
> > -	__stress_scrub_fsstress_loop "$end" "$runningfile" &
> > +	"__stress_scrub_${exerciser}_loop" "$end" "$runningfile" &
> >  
> >  	if [ -n "$freeze" ]; then
> >  		__stress_scrub_freeze_loop "$end" "$runningfile" &
> > diff --git a/tests/xfs/847 b/tests/xfs/847
> > new file mode 100755
> > index 0000000000..856e9a6c26
> > --- /dev/null
> > +++ b/tests/xfs/847
> > @@ -0,0 +1,38 @@
> > +#! /bin/bash
> > +# SPDX-License-Identifier: GPL-2.0
> > +# Copyright (c) 2022 Oracle, Inc.  All Rights Reserved.
> > +#
> > +# FS QA Test No. 847
> > +#
> > +# Race fsx and xfs_scrub in read-only mode for a while to see if we crash
> > +# or livelock.
> > +#
> > +. ./common/preamble
> > +_begin_fstest scrub dangerous_fsstress_scrub
> 
> Hi Darrick,
> 
> Such huge patchsets :) I'll try to review them one by one (patchset by
> patchset).
> 
> Now I'm trying to review "[NYE DELUGE 1/4]", but I can't find the
> "dangerous_fsstress_scrub" group anywhere in these patchsets. Is there a
> prerequisite patch(set)? Or did you mean to use "dangerous_fsstress_repair"?
> 
> P.S.: More test cases use "dangerous_fsstress_scrub" in your new patchsets.

Oops.  The group was originally added in "xfs: race fsstress with online
scrubbers for AG and fs metadata".  Then I created a few more patches at
the top of my stack, tested that, and then decided that their proper
placement was closer to the bottom than the patch that added the group.

Ok, I'll modify the build system to shellcheck any bash scripts in the
current commit (because running it on the full repo took hours and
produced many hundreds of errors, mostly in tests/btrfs/) and go do a
push-and-build of all three stgit repos.
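
One way to wire that up (purely a sketch; the real hook, path filter, and
severity threshold are up to the build system) would be to shellcheck only
the scripts touched by the commit being built:

    # shellcheck the bash scripts changed in the most recent commit;
    # -S warning keeps the noise down compared to the default severity
    git diff --name-only HEAD~1 HEAD -- common tests | while read -r f; do
        test -f "$f" && head -n1 "$f" | grep -q bash && \
            shellcheck -S warning "$f"
    done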

--D

> Thanks,
> Zorro
> 
> > +
> > +_cleanup() {
> > +	cd /
> > +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> > +	rm -r -f $tmp.*
> > +}
> > +_register_cleanup "_cleanup" BUS
> > +
> > +# Import common functions.
> > +. ./common/filter
> > +. ./common/fuzzy
> > +. ./common/inject
> > +. ./common/xfs
> > +
> > +# real QA test starts here
> > +_supported_fs xfs
> > +_require_scratch
> > +_require_xfs_stress_scrub
> > +
> > +_scratch_mkfs > "$seqres.full" 2>&1
> > +_scratch_mount
> > +_scratch_xfs_stress_scrub -S '-n' -X 'fsx'
> > +
> > +# success, all done
> > +echo Silence is golden
> > +status=0
> > +exit
> > diff --git a/tests/xfs/847.out b/tests/xfs/847.out
> > new file mode 100644
> > index 0000000000..b7041db159
> > --- /dev/null
> > +++ b/tests/xfs/847.out
> > @@ -0,0 +1,2 @@
> > +QA output created by 847
> > +Silence is golden
> > diff --git a/tests/xfs/848 b/tests/xfs/848
> > new file mode 100755
> > index 0000000000..ab32020624
> > --- /dev/null
> > +++ b/tests/xfs/848
> > @@ -0,0 +1,38 @@
> > +#! /bin/bash
> > +# SPDX-License-Identifier: GPL-2.0
> > +# Copyright (c) 2022 Oracle, Inc.  All Rights Reserved.
> > +#
> > +# FS QA Test No. 848
> > +#
> > +# Race fsx and xfs_scrub in force-repair mode for a while to see if we
> > +# crash or livelock.
> > +#
> > +. ./common/preamble
> > +_begin_fstest online_repair dangerous_fsstress_repair
> > +
> > +_cleanup() {
> > +	cd /
> > +	_scratch_xfs_stress_scrub_cleanup &> /dev/null
> > +	rm -r -f $tmp.*
> > +}
> > +_register_cleanup "_cleanup" BUS
> > +
> > +# Import common functions.
> > +. ./common/filter
> > +. ./common/fuzzy
> > +. ./common/inject
> > +. ./common/xfs
> > +
> > +# real QA test starts here
> > +_supported_fs xfs
> > +_require_scratch
> > +_require_xfs_stress_online_repair
> > +
> > +_scratch_mkfs > "$seqres.full" 2>&1
> > +_scratch_mount
> > +_scratch_xfs_stress_online_repair -S '-k' -X 'fsx'
> > +
> > +# success, all done
> > +echo Silence is golden
> > +status=0
> > +exit
> > diff --git a/tests/xfs/848.out b/tests/xfs/848.out
> > new file mode 100644
> > index 0000000000..23f674045c
> > --- /dev/null
> > +++ b/tests/xfs/848.out
> > @@ -0,0 +1,2 @@
> > +QA output created by 848
> > +Silence is golden
> > 
> 

^ permalink raw reply	[flat|nested] 220+ messages in thread

* [PATCH v24.1 1/3] fuzzy: enhance scrub stress testing to use fsx
  2022-12-30 22:12   ` [PATCH 1/3] fuzzy: enhance scrub stress testing to use fsx Darrick J. Wong
  2023-01-05  5:49     ` Zorro Lang
@ 2023-01-05 18:28     ` Darrick J. Wong
  1 sibling, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2023-01-05 18:28 UTC (permalink / raw)
  To: zlang; +Cc: linux-xfs, fstests, guan

From: Darrick J. Wong <djwong@kernel.org>

Add a couple of new online fsck stress tests that race fsx against
online fsck.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
v24.1: move the addition of the group to this patch
---
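A quick sketch of the new -X knob in use (both invocations lifted from the
tests below; fsstress remains the default exerciser when -X is omitted):

    # race fsx against check-only xfs_scrub
    _scratch_xfs_stress_scrub -S '-n' -X 'fsx'

    # race fsx against forced online repairs
    _scratch_xfs_stress_online_repair -S '-k' -X 'fsx'
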
 common/fuzzy        |   39 ++++++++++++++++++++++++++++++++++++---
 doc/group-names.txt |    1 +
 tests/xfs/847       |   38 ++++++++++++++++++++++++++++++++++++++
 tests/xfs/847.out   |    2 ++
 tests/xfs/848       |   38 ++++++++++++++++++++++++++++++++++++++
 tests/xfs/848.out   |    2 ++
 6 files changed, 117 insertions(+), 3 deletions(-)
 create mode 100755 tests/xfs/847
 create mode 100644 tests/xfs/847.out
 create mode 100755 tests/xfs/848
 create mode 100644 tests/xfs/848.out

diff --git a/common/fuzzy b/common/fuzzy
index 7994665ef7..a764de461e 100644
--- a/common/fuzzy
+++ b/common/fuzzy
@@ -417,6 +417,30 @@ __stress_scrub_clean_scratch() {
 	return 0
 }
 
+# Run fsx while we're testing online fsck.
+__stress_scrub_fsx_loop() {
+	local end="$1"
+	local runningfile="$2"
+	local focus=(-q -X)	# quiet, validate file contents
+
+	# As of November 2022, 2 million fsx ops should be enough to keep
+	# any filesystem busy for a couple of hours.
+	focus+=(-N 2000000)
+	focus+=(-o $((128000 * LOAD_FACTOR)) )
+	focus+=(-l $((600000 * LOAD_FACTOR)) )
+
+	local args="$FSX_AVOID ${focus[@]} ${SCRATCH_MNT}/fsx.$seq"
+	echo "Running $here/ltp/fsx $args" >> $seqres.full
+
+	while __stress_scrub_running "$end" "$runningfile"; do
+		# Need to recheck running conditions if we cleared anything
+		__stress_scrub_clean_scratch && continue
+		$here/ltp/fsx $args >> $seqres.full
+		echo "fsx exits with $? at $(date)" >> $seqres.full
+	done
+	rm -f "$runningfile"
+}
+
 # Run fsstress while we're testing online fsck.
 __stress_scrub_fsstress_loop() {
 	local end="$1"
@@ -463,7 +487,7 @@ _scratch_xfs_stress_scrub_cleanup() {
 	# Send SIGINT so that bash won't print a 'Terminated' message that
 	# distorts the golden output.
 	echo "Killing stressor processes at $(date)" >> $seqres.full
-	$KILLALL_PROG -INT xfs_io fsstress >> $seqres.full 2>&1
+	$KILLALL_PROG -INT xfs_io fsstress fsx >> $seqres.full 2>&1
 
 	# Tests are not allowed to exit with the scratch fs frozen.  If we
 	# started a fs freeze/thaw background loop, wait for that loop to exit
@@ -531,30 +555,39 @@ __stress_scrub_check_commands() {
 # -w	Delay the start of the scrub/repair loop by this number of seconds.
 #	Defaults to no delay unless XFS_SCRUB_STRESS_DELAY is set.  This value
 #	will be clamped to ten seconds before the end time.
+# -X	Run this program to exercise the filesystem.  Currently supported
+#       options are 'fsx' and 'fsstress'.  The default is 'fsstress'.
 _scratch_xfs_stress_scrub() {
 	local one_scrub_args=()
 	local scrub_tgt="$SCRATCH_MNT"
 	local runningfile="$tmp.fsstress"
 	local freeze="${XFS_SCRUB_STRESS_FREEZE}"
 	local scrub_delay="${XFS_SCRUB_STRESS_DELAY:--1}"
+	local exerciser="fsstress"
 
 	__SCRUB_STRESS_FREEZE_PID=""
 	rm -f "$runningfile"
 	touch "$runningfile"
 
 	OPTIND=1
-	while getopts "fs:t:w:" c; do
+	while getopts "fs:t:w:X:" c; do
 		case "$c" in
 			f) freeze=yes;;
 			s) one_scrub_args+=("$OPTARG");;
 			t) scrub_tgt="$OPTARG";;
 			w) scrub_delay="$OPTARG";;
+			X) exerciser="$OPTARG";;
 			*) return 1; ;;
 		esac
 	done
 
 	__stress_scrub_check_commands "$scrub_tgt" "${one_scrub_args[@]}"
 
+	if ! command -v "__stress_scrub_${exerciser}_loop" &>/dev/null; then
+		echo "${exerciser}: Unknown fs exercise program."
+		return 1
+	fi
+
 	local start="$(date +%s)"
 	local end="$((start + (30 * TIME_FACTOR) ))"
 	local scrub_startat="$((start + scrub_delay))"
@@ -564,7 +597,7 @@ _scratch_xfs_stress_scrub() {
 	echo "Loop started at $(date --date="@${start}")," \
 		   "ending at $(date --date="@${end}")" >> $seqres.full
 
-	__stress_scrub_fsstress_loop "$end" "$runningfile" &
+	"__stress_scrub_${exerciser}_loop" "$end" "$runningfile" &
 
 	if [ -n "$freeze" ]; then
 		__stress_scrub_freeze_loop "$end" "$runningfile" &
diff --git a/doc/group-names.txt b/doc/group-names.txt
index ac219e05b3..771ce937ae 100644
--- a/doc/group-names.txt
+++ b/doc/group-names.txt
@@ -35,6 +35,7 @@ dangerous_fuzzers	fuzzers that can crash your computer
 dangerous_norepair	fuzzers to evaluate kernel metadata verifiers
 dangerous_online_repair	fuzzers to evaluate xfs_scrub online repair
 dangerous_fsstress_repair	race fsstress and xfs_scrub online repair
+dangerous_fsstress_scrub	race fsstress and xfs_scrub checking
 dangerous_repair	fuzzers to evaluate xfs_repair offline repair
 dangerous_scrub		fuzzers to evaluate xfs_scrub checking
 data			data loss checkers
diff --git a/tests/xfs/847 b/tests/xfs/847
new file mode 100755
index 0000000000..856e9a6c26
--- /dev/null
+++ b/tests/xfs/847
@@ -0,0 +1,38 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle, Inc.  All Rights Reserved.
+#
+# FS QA Test No. 847
+#
+# Race fsx and xfs_scrub in read-only mode for a while to see if we crash
+# or livelock.
+#
+. ./common/preamble
+_begin_fstest scrub dangerous_fsstress_scrub
+
+_cleanup() {
+	cd /
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_stress_scrub
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_scratch_xfs_stress_scrub -S '-n' -X 'fsx'
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/847.out b/tests/xfs/847.out
new file mode 100644
index 0000000000..b7041db159
--- /dev/null
+++ b/tests/xfs/847.out
@@ -0,0 +1,2 @@
+QA output created by 847
+Silence is golden
diff --git a/tests/xfs/848 b/tests/xfs/848
new file mode 100755
index 0000000000..ab32020624
--- /dev/null
+++ b/tests/xfs/848
@@ -0,0 +1,38 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Oracle, Inc.  All Rights Reserved.
+#
+# FS QA Test No. 848
+#
+# Race fsx and xfs_scrub in force-repair mode for a while to see if we
+# crash or livelock.
+#
+. ./common/preamble
+_begin_fstest online_repair dangerous_fsstress_repair
+
+_cleanup() {
+	cd /
+	_scratch_xfs_stress_scrub_cleanup &> /dev/null
+	rm -r -f $tmp.*
+}
+_register_cleanup "_cleanup" BUS
+
+# Import common functions.
+. ./common/filter
+. ./common/fuzzy
+. ./common/inject
+. ./common/xfs
+
+# real QA test starts here
+_supported_fs xfs
+_require_scratch
+_require_xfs_stress_online_repair
+
+_scratch_mkfs > "$seqres.full" 2>&1
+_scratch_mount
+_scratch_xfs_stress_online_repair -S '-k' -X 'fsx'
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/848.out b/tests/xfs/848.out
new file mode 100644
index 0000000000..23f674045c
--- /dev/null
+++ b/tests/xfs/848.out
@@ -0,0 +1,2 @@
+QA output created by 848
+Silence is golden

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* Re: [PATCH 06/14] xfs: document how online fsck deals with eventual consistency
  2023-01-05  9:08     ` Amir Goldstein
@ 2023-01-05 19:40       ` Darrick J. Wong
  2023-01-06  3:33         ` Amir Goldstein
  0 siblings, 1 reply; 220+ messages in thread
From: Darrick J. Wong @ 2023-01-05 19:40 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: linux-xfs, willy, chandan.babu, allison.henderson, linux-fsdevel,
	hch, catherine.hoang, david

On Thu, Jan 05, 2023 at 11:08:51AM +0200, Amir Goldstein wrote:
> On Sat, Dec 31, 2022 at 12:32 AM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > Writes to an XFS filesystem employ an eventual consistency update model
> > to break up complex multistep metadata updates into small chained
> > transactions.  This is generally good for performance and scalability
> > because XFS doesn't need to prepare for enormous transactions, but it
> > also means that online fsck must be careful not to attempt a fsck action
> > unless it can be shown that there are no other threads processing a
> > transaction chain.  This part of the design documentation covers the
> > thinking behind the consistency model and how scrub deals with it.
> >
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  .../filesystems/xfs-online-fsck-design.rst         |  303 ++++++++++++++++++++
> >  1 file changed, 303 insertions(+)
> >
> >
> > diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst
> > index f45bf97fa9c4..419eb54ee200 100644
> > --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> > +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> > @@ -1443,3 +1443,306 @@ This step is critical for enabling system administrator to monitor the status
> >  of the filesystem and the progress of any repairs.
> >  For developers, it is a useful means to judge the efficacy of error detection
> >  and correction in the online and offline checking tools.
> > +
> > +Eventual Consistency vs. Online Fsck
> > +------------------------------------
> > +
> > +Midway through the development of online scrubbing, the fsstress tests
> > +uncovered a misinteraction between online fsck and compound transaction chains
> > +created by other writer threads that resulted in false reports of metadata
> > +inconsistency.
> > +The root cause of these reports is the eventual consistency model introduced by
> > +the expansion of deferred work items and compound transaction chains when
> > +reverse mapping and reflink were introduced.
> > +
> > +Originally, transaction chains were added to XFS to avoid deadlocks when
> > +unmapping space from files.
> > +Deadlock avoidance rules require that AGs only be locked in increasing order,
> > +which makes it impossible (say) to use a single transaction to free a space
> > +extent in AG 7 and then try to free a now superfluous block mapping btree block
> > +in AG 3.
> > +To avoid these kinds of deadlocks, XFS creates Extent Freeing Intent (EFI) log
> > +items to commit to freeing some space in one transaction while deferring the
> > +actual metadata updates to a fresh transaction.
> > +The transaction sequence looks like this:
> > +
> > +1. The first transaction contains a physical update to the file's block mapping
> > +   structures to remove the mapping from the btree blocks.
> > +   It then attaches to the in-memory transaction an action item to schedule
> > +   deferred freeing of space.
> > +   Concretely, each transaction maintains a list of ``struct
> > +   xfs_defer_pending`` objects, each of which maintains a list of ``struct
> > +   xfs_extent_free_item`` objects.
> > +   Returning to the example above, the action item tracks the freeing of both
> > +   the unmapped space from AG 7 and the block mapping btree (BMBT) block from
> > +   AG 3.
> > +   Deferred frees recorded in this manner are committed in the log by creating
> > +   an EFI log item from the ``struct xfs_extent_free_item`` object and
> > +   attaching the log item to the transaction.
> > +   When the log is persisted to disk, the EFI item is written into the ondisk
> > +   transaction record.
> > +   EFIs can list up to 16 extents to free, all sorted in AG order.
> > +
> > +2. The second transaction contains a physical update to the free space btrees
> > +   of AG 3 to release the former BMBT block and a second physical update to the
> > +   free space btrees of AG 7 to release the unmapped file space.
> > +   Observe that the the physical updates are resequenced in the correct order
> > +   when possible.
> > +   Attached to the transaction is a an extent free done (EFD) log item.
> > +   The EFD contains a pointer to the EFI logged in transaction #1 so that log
> > +   recovery can tell if the EFI needs to be replayed.
> > +
> > +If the system goes down after transaction #1 is written back to the filesystem
> > +but before #2 is committed, a scan of the filesystem metadata would show
> > +inconsistent filesystem metadata because there would not appear to be any owner
> > +of the unmapped space.
> > +Happily, log recovery corrects this inconsistency for us -- when recovery finds
> > +an intent log item but does not find a corresponding intent done item, it will
> > +reconstruct the incore state of the intent item and finish it.
> > +In the example above, the log must replay both frees described in the recovered
> > +EFI to complete the recovery phase.
> > +
> > +There are two subtleties to XFS' transaction chaining strategy to consider.
> > +The first is that log items must be added to a transaction in the correct order
> > +to prevent conflicts with principal objects that are not held by the
> > +transaction.
> > +In other words, all per-AG metadata updates for an unmapped block must be
> > +completed before the last update to free the extent, and extents should not
> > +be reallocated until that last update commits to the log.
> > +The second subtlety comes from the fact that AG header buffers are (usually)
> > +released between each transaction in a chain.
> > +This means that other threads can observe an AG in an intermediate state,
> > +but as long as the first subtlety is handled, this should not affect the
> > +correctness of filesystem operations.
> > +Unmounting the filesystem flushes all pending work to disk, which means that
> > +offline fsck never sees the temporary inconsistencies caused by deferred work
> > +item processing.
> > +In this manner, XFS employs a form of eventual consistency to avoid deadlocks
> > +and increase parallelism.
> > +
> > +During the design phase of the reverse mapping and reflink features, it was
> > +decided that it was impractical to cram all the reverse mapping updates for a
> > +single filesystem change into a single transaction because a single file
> > +mapping operation can explode into many small updates:
> > +
> > +* The block mapping update itself
> > +* A reverse mapping update for the block mapping update
> > +* Fixing the freelist
> > +* A reverse mapping update for the freelist fix
> > +
> > +* A shape change to the block mapping btree
> > +* A reverse mapping update for the btree update
> > +* Fixing the freelist (again)
> > +* A reverse mapping update for the freelist fix
> > +
> > +* An update to the reference counting information
> > +* A reverse mapping update for the refcount update
> > +* Fixing the freelist (a third time)
> > +* A reverse mapping update for the freelist fix
> > +
> > +* Freeing any space that was unmapped and not owned by any other file
> > +* Fixing the freelist (a fourth time)
> > +* A reverse mapping update for the freelist fix
> > +
> > +* Freeing the space used by the block mapping btree
> > +* Fixing the freelist (a fifth time)
> > +* A reverse mapping update for the freelist fix
> > +
> > +Free list fixups are not usually needed more than once per AG per transaction
> > +chain, but it is theoretically possible if space is very tight.
> > +For copy-on-write updates this is even worse, because this must be done once to
> > +remove the space from a staging area and again to map it into the file!
> > +
> > +To deal with this explosion in a calm manner, XFS expands its use of deferred
> > +work items to cover most reverse mapping updates and all refcount updates.
> > +This reduces the worst case size of transaction reservations by breaking the
> > +work into a long chain of small updates, which increases the degree of eventual
> > +consistency in the system.
> > +Again, this generally isn't a problem because XFS orders its deferred work
> > +items carefully to avoid resource reuse conflicts between unsuspecting threads.
> > +
> > +However, online fsck changes the rules -- remember that although physical
> > +updates to per-AG structures are coordinated by locking the buffers for AG
> > +headers, buffer locks are dropped between transactions.
> > +Once scrub acquires resources and takes locks for a data structure, it must do
> > +all the validation work without releasing the lock.
> > +If the main lock for a space btree is an AG header buffer lock, scrub may have
> > +interrupted another thread that is midway through finishing a chain.
> > +For example, if a thread performing a copy-on-write has completed a reverse
> > +mapping update but not the corresponding refcount update, the two AG btrees
> > +will appear inconsistent to scrub and an observation of corruption will be
> > +recorded.  This observation will not be correct.
> > +If a repair is attempted in this state, the results will be catastrophic!
> > +
> > +Several solutions to this problem were evaluated upon discovery of this flaw:
> > +
> > +1. Add a higher level lock to allocation groups and require writer threads to
> > +   acquire the higher level lock in AG order before making any changes.
> > +   This would be very difficult to implement in practice because it is
> > +   difficult to determine which locks need to be obtained, and in what order,
> > +   without simulating the entire operation.
> > +   Performing a dry run of a file operation to discover necessary locks would
> > +   make the filesystem very slow.
> > +
> > +2. Make the deferred work coordinator code aware of consecutive intent items
> > +   targeting the same AG and have it hold the AG header buffers locked across
> > +   the transaction roll between updates.
> > +   This would introduce a lot of complexity into the coordinator since it is
> > +   only loosely coupled with the actual deferred work items.
> > +   It would also fail to solve the problem because deferred work items can
> > +   generate new deferred subtasks, but all subtasks must be complete before
> > +   work can start on a new sibling task.
> > +
> > +3. Teach online fsck to walk all transactions waiting for whichever lock(s)
> > +   protect the data structure being scrubbed to look for pending operations.
> > +   The checking and repair operations must factor these pending operations into
> > +   the evaluations being performed.
> > +   This solution is a nonstarter because it is *extremely* invasive to the main
> > +   filesystem.
> > +
> > +4. Recognize that only online fsck has this requirement of total consistency
> > +   of AG metadata, and that online fsck should be relatively rare as compared
> > +   to filesystem change operations.
> > +   For each AG, maintain a count of intent items targeting that AG.
> > +   When online fsck wants to examine an AG, it should lock the AG header
> > +   buffers to quiesce all transaction chains that want to modify that AG, and
> > +   only proceed with the scrub if the count is zero.
> > +   In other words, scrub only proceeds if it can lock the AG header buffers and
> > +   there can't possibly be any intents in progress.
> > +   This may lead to fairness and starvation issues, but regular filesystem
> > +   updates take precedence over online fsck activity.
> > +
> 
> Is there any guarantee that some silly real life regular filesystem workload
> won't starve online fsck forever?
> IOW, is forward progress of online fsck guaranteed?

Nope, forward progress isn't guaranteed.  The kernel checks for fatal
signals every time it backs off a scrub so at least we don't end up with
unkillable processes.  At one point I added a timeout field to the ioctl
interface so that the kernel could time out an operation if it took too
long to acquire the necessary resources.  So far, the "race fsstress and
xfs_scrub" tests have not shown scrub failing to make any forward
progress.
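
For illustration, the backoff described above is roughly the loop below.
Every name except fatal_signal_pending() is a hypothetical placeholder
for the per-AG intent drain machinery in the patchset, so treat it as a
sketch rather than the actual code:

/* Wait for an AG to quiesce before scrubbing it; sketch only. */
static int scrub_wait_for_quiet_ag(struct scrub_ctx *sc, xfs_agnumber_t agno)
{
	for (;;) {
		scrub_lock_ag_headers(sc, agno);

		/* No intent chains in flight?  Safe to scrub this AG. */
		if (scrub_ag_intent_count(sc, agno) == 0)
			return 0;

		/* Drop the locks so the in-progress chains can finish. */
		scrub_unlock_ag_headers(sc, agno);

		/* Checked on every backoff so scrub stays killable. */
		if (fatal_signal_pending(current))
			return -EINTR;

		/* Sleep until the AG's intent count drops to zero. */
		scrub_wait_for_intent_drain(sc, agno);
	}
}

Note that regular writers never wait on scrub in this scheme, which is
exactly why forward progress for scrub isn't guaranteed.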

That said, I have /not/ yet had a chance to try it out on any of these
massive 1000-core systems with a correspondingly heavy workload.

> Good luck with landing online fsck before the 2024 NYE deluge ;)

Thank *you* for reading this chapter of the design document!! :)

--D

> Thanks,
> Amir.

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 06/14] xfs: document how online fsck deals with eventual consistency
  2023-01-05 19:40       ` Darrick J. Wong
@ 2023-01-06  3:33         ` Amir Goldstein
  2023-01-11 17:54           ` Darrick J. Wong
  0 siblings, 1 reply; 220+ messages in thread
From: Amir Goldstein @ 2023-01-06  3:33 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-xfs, willy, chandan.babu, allison.henderson, linux-fsdevel,
	hch, catherine.hoang, david

On Thu, Jan 5, 2023 at 9:40 PM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Thu, Jan 05, 2023 at 11:08:51AM +0200, Amir Goldstein wrote:
> > On Sat, Dec 31, 2022 at 12:32 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > >
> > > From: Darrick J. Wong <djwong@kernel.org>
> > >
> > > Writes to an XFS filesystem employ an eventual consistency update model
> > > to break up complex multistep metadata updates into small chained
> > > transactions.  This is generally good for performance and scalability
> > > because XFS doesn't need to prepare for enormous transactions, but it
> > > also means that online fsck must be careful not to attempt a fsck action
> > > unless it can be shown that there are no other threads processing a
> > > transaction chain.  This part of the design documentation covers the
> > > thinking behind the consistency model and how scrub deals with it.
> > >
> > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > ---
> > >  .../filesystems/xfs-online-fsck-design.rst         |  303 ++++++++++++++++++++
> > >  1 file changed, 303 insertions(+)
> > >
> > >
> > > diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > index f45bf97fa9c4..419eb54ee200 100644
> > > --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> > > +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > @@ -1443,3 +1443,306 @@ This step is critical for enabling system administrator to monitor the status
> > >  of the filesystem and the progress of any repairs.
> > >  For developers, it is a useful means to judge the efficacy of error detection
> > >  and correction in the online and offline checking tools.
> > > +
> > > +Eventual Consistency vs. Online Fsck
> > > +------------------------------------
> > > +
> > > +Midway through the development of online scrubbing, the fsstress tests
> > > +uncovered a misinteraction between online fsck and compound transaction chains
> > > +created by other writer threads that resulted in false reports of metadata
> > > +inconsistency.
> > > +The root cause of these reports is the eventual consistency model introduced by
> > > +the expansion of deferred work items and compound transaction chains when
> > > +reverse mapping and reflink were introduced.
> > > +
> > > +Originally, transaction chains were added to XFS to avoid deadlocks when
> > > +unmapping space from files.
> > > +Deadlock avoidance rules require that AGs only be locked in increasing order,
> > > +which makes it impossible (say) to use a single transaction to free a space
> > > +extent in AG 7 and then try to free a now superfluous block mapping btree block
> > > +in AG 3.
> > > +To avoid these kinds of deadlocks, XFS creates Extent Freeing Intent (EFI) log
> > > +items to commit to freeing some space in one transaction while deferring the
> > > +actual metadata updates to a fresh transaction.
> > > +The transaction sequence looks like this:
> > > +
> > > +1. The first transaction contains a physical update to the file's block mapping
> > > +   structures to remove the mapping from the btree blocks.
> > > +   It then attaches to the in-memory transaction an action item to schedule
> > > +   deferred freeing of space.
> > > +   Concretely, each transaction maintains a list of ``struct
> > > +   xfs_defer_pending`` objects, each of which maintains a list of ``struct
> > > +   xfs_extent_free_item`` objects.
> > > +   Returning to the example above, the action item tracks the freeing of both
> > > +   the unmapped space from AG 7 and the block mapping btree (BMBT) block from
> > > +   AG 3.
> > > +   Deferred frees recorded in this manner are committed in the log by creating
> > > +   an EFI log item from the ``struct xfs_extent_free_item`` object and
> > > +   attaching the log item to the transaction.
> > > +   When the log is persisted to disk, the EFI item is written into the ondisk
> > > +   transaction record.
> > > +   EFIs can list up to 16 extents to free, all sorted in AG order.
> > > +
> > > +2. The second transaction contains a physical update to the free space btrees
> > > +   of AG 3 to release the former BMBT block and a second physical update to the
> > > +   free space btrees of AG 7 to release the unmapped file space.
> > > +   Observe that the physical updates are resequenced in the correct order
> > > +   when possible.
> > > +   Attached to the transaction is an extent free done (EFD) log item.
> > > +   The EFD contains a pointer to the EFI logged in transaction #1 so that log
> > > +   recovery can tell if the EFI needs to be replayed.
> > > +
> > > +If the system goes down after transaction #1 is written back to the filesystem
> > > +but before #2 is committed, a scan of the filesystem metadata would show
> > > +inconsistent filesystem metadata because there would not appear to be any owner
> > > +of the unmapped space.
> > > +Happily, log recovery corrects this inconsistency for us -- when recovery finds
> > > +an intent log item but does not find a corresponding intent done item, it will
> > > +reconstruct the incore state of the intent item and finish it.
> > > +In the example above, the log must replay both frees described in the recovered
> > > +EFI to complete the recovery phase.
> > > +
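
For illustration, the two-transaction chain above follows roughly this
pattern; the helpers below are hypothetical stand-ins for the real
xfs_trans/xfs_defer machinery, not the actual function names:

	/* Transaction 1: unmap the extent and queue the deferred frees. */
	tp = start_transaction(mp);
	unmap_file_extent(tp, ip, &unmapped);         /* BMBT update */
	defer_free_extent(tp, unmapped_fsbno, len);   /* logged as an EFI */
	defer_free_extent(tp, old_bmbt_fsbno, 1);     /* same EFI, AG order */
	roll_transaction(&tp);                        /* EFI reaches the log */

	/* Transaction 2: update the free space btrees, then log the EFD. */
	finish_deferred_frees(tp);                    /* AG 3 first, then AG 7 */
	commit_transaction(tp);                       /* EFD references the EFI */

If the log contains the EFI but no matching EFD, recovery redoes the
frees, as described above.
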
> > > +There are two subtleties to XFS' transaction chaining strategy to consider.
> > > +The first is that log items must be added to a transaction in the correct order
> > > +to prevent conflicts with principal objects that are not held by the
> > > +transaction.
> > > +In other words, all per-AG metadata updates for an unmapped block must be
> > > +completed before the last update to free the extent, and extents should not
> > > +be reallocated until that last update commits to the log.
> > > +The second subtlety comes from the fact that AG header buffers are (usually)
> > > +released between each transaction in a chain.
> > > +This means that other threads can observe an AG in an intermediate state,
> > > +but as long as the first subtlety is handled, this should not affect the
> > > +correctness of filesystem operations.
> > > +Unmounting the filesystem flushes all pending work to disk, which means that
> > > +offline fsck never sees the temporary inconsistencies caused by deferred work
> > > +item processing.
> > > +In this manner, XFS employs a form of eventual consistency to avoid deadlocks
> > > +and increase parallelism.
> > > +
> > > +During the design phase of the reverse mapping and reflink features, it was
> > > +decided that it was impractical to cram all the reverse mapping updates for a
> > > +single filesystem change into a single transaction because a single file
> > > +mapping operation can explode into many small updates:
> > > +
> > > +* The block mapping update itself
> > > +* A reverse mapping update for the block mapping update
> > > +* Fixing the freelist
> > > +* A reverse mapping update for the freelist fix
> > > +
> > > +* A shape change to the block mapping btree
> > > +* A reverse mapping update for the btree update
> > > +* Fixing the freelist (again)
> > > +* A reverse mapping update for the freelist fix
> > > +
> > > +* An update to the reference counting information
> > > +* A reverse mapping update for the refcount update
> > > +* Fixing the freelist (a third time)
> > > +* A reverse mapping update for the freelist fix
> > > +
> > > +* Freeing any space that was unmapped and not owned by any other file
> > > +* Fixing the freelist (a fourth time)
> > > +* A reverse mapping update for the freelist fix
> > > +
> > > +* Freeing the space used by the block mapping btree
> > > +* Fixing the freelist (a fifth time)
> > > +* A reverse mapping update for the freelist fix
> > > +
> > > +Free list fixups are not usually needed more than once per AG per transaction
> > > +chain, but it is theoretically possible if space is very tight.
> > > +For copy-on-write updates this is even worse, because this must be done once to
> > > +remove the space from a staging area and again to map it into the file!
> > > +
> > > +To deal with this explosion in a calm manner, XFS expands its use of deferred
> > > +work items to cover most reverse mapping updates and all refcount updates.
> > > +This reduces the worst case size of transaction reservations by breaking the
> > > +work into a long chain of small updates, which increases the degree of eventual
> > > +consistency in the system.
> > > +Again, this generally isn't a problem because XFS orders its deferred work
> > > +items carefully to avoid resource reuse conflicts between unsuspecting threads.
> > > +
> > > +However, online fsck changes the rules -- remember that although physical
> > > +updates to per-AG structures are coordinated by locking the buffers for AG
> > > +headers, buffer locks are dropped between transactions.
> > > +Once scrub acquires resources and takes locks for a data structure, it must do
> > > +all the validation work without releasing the lock.
> > > +If the main lock for a space btree is an AG header buffer lock, scrub may have
> > > +interrupted another thread that is midway through finishing a chain.
> > > +For example, if a thread performing a copy-on-write has completed a reverse
> > > +mapping update but not the corresponding refcount update, the two AG btrees
> > > +will appear inconsistent to scrub and an observation of corruption will be
> > > +recorded.  This observation will not be correct.
> > > +If a repair is attempted in this state, the results will be catastrophic!
> > > +
> > > +Several solutions to this problem were evaluated upon discovery of this flaw:
> > > +
> > > +1. Add a higher level lock to allocation groups and require writer threads to
> > > +   acquire the higher level lock in AG order before making any changes.
> > > +   This would be very difficult to implement in practice because it is
> > > +   difficult to determine which locks need to be obtained, and in what order,
> > > +   without simulating the entire operation.
> > > +   Performing a dry run of a file operation to discover necessary locks would
> > > +   make the filesystem very slow.
> > > +
> > > +2. Make the deferred work coordinator code aware of consecutive intent items
> > > +   targeting the same AG and have it hold the AG header buffers locked across
> > > +   the transaction roll between updates.
> > > +   This would introduce a lot of complexity into the coordinator since it is
> > > +   only loosely coupled with the actual deferred work items.
> > > +   It would also fail to solve the problem because deferred work items can
> > > +   generate new deferred subtasks, but all subtasks must be complete before
> > > +   work can start on a new sibling task.
> > > +
> > > +3. Teach online fsck to walk all transactions waiting for whichever lock(s)
> > > +   protect the data structure being scrubbed to look for pending operations.
> > > +   The checking and repair operations must factor these pending operations into
> > > +   the evaluations being performed.
> > > +   This solution is a nonstarter because it is *extremely* invasive to the main
> > > +   filesystem.
> > > +
> > > +4. Recognize that only online fsck has this requirement of total consistency
> > > +   of AG metadata, and that online fsck should be relatively rare as compared
> > > +   to filesystem change operations.
> > > +   For each AG, maintain a count of intent items targeting that AG.
> > > +   When online fsck wants to examine an AG, it should lock the AG header
> > > +   buffers to quiesce all transaction chains that want to modify that AG, and
> > > +   only proceed with the scrub if the count is zero.
> > > +   In other words, scrub only proceeds if it can lock the AG header buffers and
> > > +   there can't possibly be any intents in progress.
> > > +   This may lead to fairness and starvation issues, but regular filesystem
> > > +   updates take precedence over online fsck activity.
> > > +
> >
> > Is there any guarantee that some silly real life regular filesystem workload
> > won't starve online fsck forever?
> > IOW, is forward progress of online fsck guaranteed?
>
> Nope, forward progress isn't guaranteed.

That sounds like a problem.

> The kernel checks for fatal
> signals every time it backs off a scrub so at least we don't end up with
> unkillable processes.  At one point I added a timeout field to the ioctl
> interface so that the kernel could time out an operation if it took too
> long to acquire the necessary resources.  So far, the "race fsstress and
> xfs_scrub" tests have not shown scrub failing to make any forward
> progress.
>
> That said, I have /not/ yet had a chance to try it out on any of these
> massive 1000-core systems with a correspondingly heavy workload.
>

Don't know if fsstress is the best way to check the worst case scenario.

Can you think of a workload, say several threads creating and deleting
temp files, with deferred parent pointer items preventing the queue from
ever draining?

Considering that a "full journal" scenario is always going to be a possible
worst case incident, how bad would it be to block new transactions
rather than risk starving scrub consistency checks forever?

Wouldn't the consistency checks be much faster than freeing journal
space would be in a "full journal" situation?

I don't know if there is a "mission statement" for online fsck, but
I think it would say "minimal user interference" not "no user interference".
It sounds like the interference we are trying to avoid is light years away
from the downtime of offline fsck, so online fsck would still be a huge win.
online fsck that never ends OTOH... maybe less so.

> > Good luck with landing online fsck before the 2024 NYE deluge ;)
>
> Thank *you* for reading this chapter of the design document!! :)
>

Oh, I read them all during the summer submission, but it took me so long
that I forgot to follow up.

My other question was regarding memory usage control.
I have horrid memories of e2fsck's unpredictable memory usage
and unpredictable runtime due to swapping.

xfs_repair -m was a huge improvement compared to e2fsck.
I don't remember reading about memory usage limits for online repair,
so I was concerned about unpredictable memory usage and swapping.
Can you say something to ease those concerns?

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 01/14] xfs: document the motivation for online fsck design
  2022-12-30 22:10   ` [PATCH 01/14] xfs: document the motivation for " Darrick J. Wong
@ 2023-01-07  5:01     ` Allison Henderson
  2023-01-11 19:10       ` Darrick J. Wong
  2023-01-12  0:10       ` Darrick J. Wong
  0 siblings, 2 replies; 220+ messages in thread
From: Allison Henderson @ 2023-01-07  5:01 UTC (permalink / raw)
  To: djwong
  Cc: Catherine Hoang, david, willy, linux-xfs, Chandan Babu,
	linux-fsdevel, hch

On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Start the first chapter of the online fsck design documentation.
> This covers the motivations for creating this in the first place.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  Documentation/filesystems/index.rst                |    1 
>  .../filesystems/xfs-online-fsck-design.rst         |  199
> ++++++++++++++++++++
>  2 files changed, 200 insertions(+)
>  create mode 100644 Documentation/filesystems/xfs-online-fsck-
> design.rst
> 
> 
> diff --git a/Documentation/filesystems/index.rst
> b/Documentation/filesystems/index.rst
> index bee63d42e5ec..fbb2b5ada95b 100644
> --- a/Documentation/filesystems/index.rst
> +++ b/Documentation/filesystems/index.rst
> @@ -123,4 +123,5 @@ Documentation for filesystem implementations.
>     vfat
>     xfs-delayed-logging-design
>     xfs-self-describing-metadata
> +   xfs-online-fsck-design
>     zonefs
> diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst
> b/Documentation/filesystems/xfs-online-fsck-design.rst
> new file mode 100644
> index 000000000000..25717ebb5f80
> --- /dev/null
> +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> @@ -0,0 +1,199 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +.. _xfs_online_fsck_design:
> +
> +..
> +        Mapping of heading styles within this document:
> +        Heading 1 uses "====" above and below
> +        Heading 2 uses "===="
> +        Heading 3 uses "----"
> +        Heading 4 uses "````"
> +        Heading 5 uses "^^^^"
> +        Heading 6 uses "~~~~"
> +        Heading 7 uses "...."
> +
> +        Sections are manually numbered because apparently that's
> what everyone
> +        does in the kernel.
> +
> +======================
> +XFS Online Fsck Design
> +======================
> +
> +This document captures the design of the online filesystem check
> feature for
> +XFS.
> +The purpose of this document is threefold:
> +
> +- To help kernel distributors understand exactly what the XFS online
> fsck
> +  feature is, and issues about which they should be aware.
> +
> +- To help people reading the code to familiarize themselves with the
> relevant
> +  concepts and design points before they start digging into the
> code.
> +
> +- To help developers maintaining the system by capturing the reasons
> +  supporting higher level decisionmaking.
nit: decision making

> +
> +As the online fsck code is merged, the links in this document to
> topic branches
> +will be replaced with links to code.
> +
> +This document is licensed under the terms of the GNU Public License,
> v2.
> +The primary author is Darrick J. Wong.
> +
> +This design document is split into seven parts.
> +Part 1 defines what fsck tools are and the motivations for writing a
> new one.
> +Parts 2 and 3 present a high level overview of how online fsck
> process works
> +and how it is tested to ensure correct functionality.
> +Part 4 discusses the user interface and the intended usage modes of
> the new
> +program.
> +Parts 5 and 6 show off the high level components and how they fit
> together, and
> +then present case studies of how each repair function actually
> works.
> +Part 7 sums up what has been discussed so far and speculates about
> what else
> +might be built atop online fsck.
> +
> +.. contents:: Table of Contents
> +   :local:
> +

Something that I've noticed in my training sessions is that oftentimes,
less is more.  People really only absorb so much over a particular
duration of time, so sometimes having too much detail in the context is
not as helpful as you might think.  A lot of times, paraphrasing
excerpts to reflect the same info in a more compact format will help
you keep the audience on track (a little longer at least).

> +1. What is a Filesystem Check?
> +==============================
> +
> +A Unix filesystem has three main jobs: to provide a hierarchy of
> names through
> +which application programs can associate arbitrary blobs of data for
> any
> +length of time, to virtualize physical storage media across those
> names, and
> +to retrieve the named data blobs at any time.
Consider the following paraphrase:

A Unix filesystem has three main jobs:
 * Provide a hierarchy of names by which applications access data for a
length of time.
 * Store or retrieve that data at any time.
 * Virtualize physical storage media across those names

Also... I don't think it would be inappropriate to just skip the above,
and jump right into fsck.  That's a very limited view of a filesystem;
a reader seeking an fsck doc probably has some idea of what a fs is
otherwise supposed to be doing.
   

> +The filesystem check (fsck) tool examines all the metadata in a
> filesystem
> +to look for errors.
> +Simple tools only check for obvious corruptions, but the more
> sophisticated
> +ones cross-reference metadata records to look for inconsistencies.
> +People do not like losing data, so most fsck tools also contains
> some ability
> +to deal with any problems found.

While simple tools can detect data corruptions, a filesystem check
(fsck) uses metadata records as a cross-reference to find and correct
more inconsistencies.

?

> +As a word of caution -- the primary goal of most Linux fsck tools is
> to restore
> +the filesystem metadata to a consistent state, not to maximize the
> data
> +recovered.
> +That precedent will not be challenged here.
> +
> +Filesystems of the 20th century generally lacked any redundancy in
> the ondisk
> +format, which means that fsck can only respond to errors by erasing
> files until
> +errors are no longer detected.
> +System administrators avoid data loss by increasing the number of
> separate
> +storage systems through the creation of backups; 


> and they avoid downtime by
> +increasing the redundancy of each storage system through the
> creation of RAID.
Mmm, raids help more for hardware failures right?  They don't really
have a notion of when the fs is corrupted.  While an fsck can help
navigate around a corruption possibly caused by a hardware failure, I
think it's really a different kind of redundancy. I think I'd probably
drop the last line and keep the selling point focused on online repair.

> +More recent filesystem designs contain enough redundancy in their
> metadata that
> +it is now possible to regenerate data structures when non-
> catastrophic errors
> +occur; 


> this capability aids both strategies.
> +Over the past few years, XFS has added a storage space reverse
> mapping index to
> +make it easy to find which files or metadata objects think they own
> a
> +particular range of storage.
> +Efforts are under way to develop a similar reverse mapping index for
> the naming
> +hierarchy, which will involve storing directory parent pointers in
> each file.
> +With these two pieces in place, XFS uses secondary information to
> perform more
> +sophisticated repairs.
This part here I think I would either let go or relocate.  The topic of
this section is supposed to discuss roughly what a filesystem check is.
Ideally so we can start talking about how ofsck is different.  It feels
like a bit of a jump to suddenly hop into rmap and pptrs, and for
"sophisticated repairs" that we havn't really gotten into the details
of yet.  So I think it would read easier if we saved this part until we
start talking about how they are used later.  

> +
> +TLDR; Show Me the Code!
> +-----------------------
> +
> +Code is posted to the kernel.org git trees as follows:
> +`kernel changes
> <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git
> /log/?h=repair-symlink>`_,
> +`userspace changes
> <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.
> git/log/?h=scrub-media-scan-service>`_, and
> +`QA test changes
> <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.
> git/log/?h=repair-dirs>`_.
> +Each kernel patchset adding an online repair function will use the
> same branch
> +name across the kernel, xfsprogs, and fstests git repos.
> +
> +Existing Tools
> +--------------
> +
> +The online fsck tool described here will be the third tool in the
> history of
> +XFS (on Linux) to check and repair filesystems.
> +Two programs precede it:
> +
> +The first program, ``xfs_check``, was created as part of the XFS
> debugger
> +(``xfs_db``) and can only be used with unmounted filesystems.
> +It walks all metadata in the filesystem looking for inconsistencies
> in the
> +metadata, though it lacks any ability to repair what it finds.
> +Due to its high memory requirements and inability to repair things,
> this
> +program is now deprecated and will not be discussed further.
> +
> +The second program, ``xfs_repair``, was created to be faster and
> more robust
> +than the first program.
> +Like its predecessor, it can only be used with unmounted
> filesystems.
> +It uses extent-based in-memory data structures to reduce memory
> consumption,
> +and tries to schedule readahead IO appropriately to reduce I/O
> waiting time
> +while it scans the metadata of the entire filesystem.
> +The most important feature of this tool is its ability to respond to
> +inconsistencies in file metadata and directory tree by erasing
> things as needed
> +to eliminate problems.
> +Space usage metadata are rebuilt from the observed file metadata.
> +
> +Problem Statement
> +-----------------
> +
> +The current XFS tools leave several problems unsolved:
> +
> +1. **User programs** suddenly **lose access** to information in the
> computer
> +   when unexpected shutdowns occur as a result of silent corruptions
> in the
> +   filesystem metadata.
> +   These occur **unpredictably** and often without warning.


1. **User programs** suddenly **lose access** to the filesystem
   when unexpected shutdowns occur as a result of silent corruptions
that could have otherwise been avoided with an online repair

While some of these issues are not untrue, I think it makes sense to
limit them to the issue you plan to solve, and therefore discuss.

> +
> +2. **Users** experience a **total loss of service** during the
> recovery period
> +   after an **unexpected shutdown** occurs.
> +
> +3. **Users** experience a **total loss of service** if the
> filesystem is taken
> +   offline to **look for problems** proactively.
> +
> +4. **Data owners** cannot **check the integrity** of their stored
> data without
> +   reading all of it.

> +   This may expose them to substantial billing costs when a linear
> media scan
> +   might suffice.
Ok, I had to re-read this one a few times, but I think this reads a
little cleaner:

    Customers that are billed for data egress may incur unnecessary
cost when a background media scan on the host may have sufficed

?

> +
> +5. **System administrators** cannot **schedule** a maintenance
> window to deal
> +   with corruptions if they **lack the means** to assess filesystem
> health
> +   while the filesystem is online.
> +
> +6. **Fleet monitoring tools** cannot **automate periodic checks** of
> filesystem
> +   health when doing so requires **manual intervention** and
> downtime.
> +
> +7. **Users** can be tricked into **doing things they do not desire**
> when
> +   malicious actors **exploit quirks of Unicode** to place
> misleading names
> +   in directories.
hrmm, I guess I'm not immediately extrapolating what things users are
being tricked into doing, or how ofsck solves this?  Otherwise I might
drop the last one here, I think the rest of the bullets are plenty of
motivation.


> +
> +Given this definition of the problems to be solved and the actors
> who would
> +benefit, the proposed solution is a third fsck tool that acts on a
> running
> +filesystem.
> +
> +This new third program has three components: an in-kernel facility
> to check
> +metadata, an in-kernel facility to repair metadata, and a userspace
> driver
> +program to drive fsck activity on a live filesystem.
> +``xfs_scrub`` is the name of the driver program.
> +The rest of this document presents the goals and use cases of the
> new fsck
> +tool, describes its major design points in connection to those
> goals, and
> +discusses the similarities and differences with existing tools.
> +
> ++--------------------------------------------------------------------------+
> +| **Note**:                                                                |
> ++--------------------------------------------------------------------------+
> +| Throughout this document, the existing offline fsck tool can also be     |
> +| referred to by its current name "``xfs_repair``".                        |
> +| The userspace driver program for the new online fsck tool can be         |
> +| referred to as "``xfs_scrub``".                                          |
> +| The kernel portion of online fsck that validates metadata is called      |
> +| "online scrub", and portion of the kernel that fixes metadata is called  |
> +| "online repair".                                                         |
> ++--------------------------------------------------------------------------+
> 

Hmm, maybe this would be a good spot to move rmap and pptrs?  It's not
otherwise clear to me what "secondary metadata" is.  If that is what it
is meant to refer to, I think the reader will more intuitively make the
connection if those two blurbs appear in the same context.
> +
> +Secondary metadata indices enable the reconstruction of parts of a
> damaged
> +primary metadata object from secondary information.

I would take out this blurb...
> +XFS filesystems shard themselves into multiple primary objects to
> enable better
> +performance on highly threaded systems and to contain the blast
> radius when
> +problems happen.


> +The naming hierarchy is broken up into objects known as directories
> and files;
> +and the physical space is split into pieces known as allocation
> groups.
And add here:

"This enables better performance on highly threaded systems and helps
to contain corruptions when they occur."

I think that reads cleaner

> +The division of the filesystem into principal objects (allocation
> groups and
> +inodes) means that there are ample opportunities to perform targeted
> checks and
> +repairs on a subset of the filesystem.
> +While this is going on, other parts continue processing IO requests.
> +Even if a piece of filesystem metadata can only be regenerated by
> scanning the
> +entire system, the scan can still be done in the background while
> other file
> +operations continue.
> +
> +In summary, online fsck takes advantage of resource sharding and
> redundant
> +metadata to enable targeted checking and repair operations while the
> system
> +is running.
> +This capability will be coupled to automatic system management so
> that
> +autonomous self-healing of XFS maximizes service availability.
> 

Nits and paraphrases aside, I think this looks pretty good?

Allison


^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 02/14] xfs: document the general theory underlying online fsck design
  2022-12-30 22:10   ` [PATCH 02/14] xfs: document the general theory underlying online fsck design Darrick J. Wong
@ 2023-01-11  1:25     ` Allison Henderson
  2023-01-11 23:39       ` Darrick J. Wong
  0 siblings, 1 reply; 220+ messages in thread
From: Allison Henderson @ 2023-01-11  1:25 UTC (permalink / raw)
  To: djwong
  Cc: Catherine Hoang, david, willy, linux-xfs, Chandan Babu,
	linux-fsdevel, hch

On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Start the second chapter of the online fsck design documentation.
> This covers the general theory underlying how online fsck works.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  .../filesystems/xfs-online-fsck-design.rst         |  366
> ++++++++++++++++++++
>  1 file changed, 366 insertions(+)
> 
> 
> diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst
> b/Documentation/filesystems/xfs-online-fsck-design.rst
> index 25717ebb5f80..a03a7b9f0250 100644
> --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> @@ -197,3 +197,369 @@ metadata to enable targeted checking and repair
> operations while the system
>  is running.
>  This capability will be coupled to automatic system management so
> that
>  autonomous self-healing of XFS maximizes service availability.
> +
> +2. Theory of Operation
> +======================
> +
> +Because it is necessary for online fsck to lock and scan live
> metadata objects,
> +online fsck consists of three separate code components.
> +The first is the userspace driver program ``xfs_scrub``, which is
> responsible
> +for identifying individual metadata items, scheduling work items for
> them,
> +reacting to the outcomes appropriately, and reporting results to the
> system
> +administrator.
> +The second and third are in the kernel, which implements functions
> to check
> +and repair each type of online fsck work item.
> +
> ++------------------------------------------------------------------+
> +| **Note**:                                                        |
> ++------------------------------------------------------------------+
> +| For brevity, this document shortens the phrase "online fsck work |
> +| item" to "scrub item".                                           |
> ++------------------------------------------------------------------+
> +
> +Scrub item types are delineated in a manner consistent with the Unix
> design
> +philosophy, which is to say that each item should handle one aspect
> of a
> +metadata structure, and handle it well.
> +
> +Scope
> +-----
> +
> +In principle, online fsck should be able to check and to repair
> everything that
> +the offline fsck program can handle.
> +However, the adjective *online* brings with it the limitation that
> online fsck
> +cannot deal with anything that prevents the filesystem from going on
> line, i.e.
> +mounting.
Are there really any other operations that do that other than mount?  I
think this reads cleaner:

By definition, online fsck can only check and repair an online
filesystem.  It cannot check mounting operations which start from an
offline state.


> +This limitation means that maintenance of the offline fsck tool will
> continue.
> +A second limitation of online fsck is that it must follow the same
> resource
> +sharing and lock acquisition rules as the regular filesystem.
> +This means that scrub cannot take *any* shortcuts to save time,
> because doing
> +so could lead to concurrency problems.
> +In other words, online fsck will never be able to fix 100% of the
> +inconsistencies that offline fsck can repair, 
Hmm, what inconsistencies cannot be repaired as a result of the "no
shortcut" rule?  I'm all for keeping things short and to the point, but
since this section is about scope, I'd give it at least a brief bullet
list

> and a complete run of online fsck
> +may take longer.
> +However, both of these limitations are acceptable tradeoffs to
> satisfy the
> +different motivations of online fsck, which are to **minimize system
> downtime**
> +and to **increase predictability of operation**.
> +
> +.. _scrubphases:
> +
> +Phases of Work
> +--------------
> +
> +The userspace driver program ``xfs_scrub`` splits the work of
> checking and
> +repairing an entire filesystem into seven phases.
> +Each phase concentrates on checking specific types of scrub items
> and depends
> +on the success of all previous phases.
> +The seven phases are as follows:
> +
> +1. Collect geometry information about the mounted filesystem and
> computer,
> +   discover the online fsck capabilities of the kernel, and open the
> +   underlying storage devices.
> +
> +2. Check allocation group metadata, all realtime volume metadata,
> and all quota
> +   files.
> +   Each metadata structure is scheduled as a separate scrub item.
Like an intent item?

> +   If corruption is found in the inode header or inode btree and
> ``xfs_scrub``
> +   is permitted to perform repairs, then those scrub items are
> repaired to
> +   prepare for phase 3.
> +   Repairs are implemented by resubmitting the scrub item to the
> kernel with
If I'm understanding this correctly:
Repairs are implemented as intent items that are queued and committed
just as any filesystem operation.

?

> +   the repair flag enabled; this is discussed in the next section.
> +   Optimizations and all other repairs are deferred to phase 4.
I guess I'll come back to it. 

> +
> +3. Check all metadata of every file in the filesystem.
> +   Each metadata structure is also scheduled as a separate scrub
> item.
> +   If repairs are needed, ``xfs_scrub`` is permitted to perform
> repairs,
If repairs are needed and ``xfs_scrub`` is permitted

?
> +   and there were no problems detected during phase 2, then those
> scrub items
> +   are repaired.
> +   Optimizations and unsuccessful repairs are deferred to phase 4.
> +
> +4. All remaining repairs and scheduled optimizations are performed
> during this
> +   phase, if the caller permits them.
> +   Before starting repairs, the summary counters are checked and any
Did we talk about summary counters yet?  Maybe worth a blurb. Otherwise
this may not make sense without skipping ahead or into the code.


> necessary
> +   repairs are performed so that subsequent repairs will not fail
> the resource
> +   reservation step due to wildly incorrect summary counters.
> +   Unsuccessful repairs are requeued as long as forward progress on
> repairs is
> +   made somewhere in the filesystem.
> +   Free space in the filesystem is trimmed at the end of phase 4 if
> the
> +   filesystem is clean.
> +
> +5. By the start of this phase, all primary and secondary filesystem
> metadata
> +   must be correct.
I think maybe the definitions of primary and secondary metadata should
move up before the phases section.  Otherwise the reader has to skip
ahead to know what that means.

> +   Summary counters such as the free space counts and quota resource
> counts
> +   are checked and corrected.
> +   Directory entry names and extended attribute names are checked
> for
> +   suspicious entries such as control characters or confusing
> Unicode sequences
> +   appearing in names.
> +
> +6. If the caller asks for a media scan, read all allocated and
> written data
> +   file extents in the filesystem.
> +   The ability to use hardware-assisted data file integrity checking
> is new
> +   to online fsck; neither of the previous tools have this
> capability.
> +   If media errors occur, they will be mapped to the owning files
> and reported.
> +
> +7. Re-check the summary counters and present the caller with a
> summary of
> +   space usage and file counts.
> +
> +Steps for Each Scrub Item
> +-------------------------
> +
> +The kernel scrub code uses a three-step strategy for checking and
> repairing
> +the one aspect of a metadata object represented by a scrub item:
> +
> +1. The scrub item of interest is checked for corruptions; opportunities for
> +   optimization; and for values that are directly controlled by the
> system
> +   administrator but look suspicious.
> +   If the item is not corrupt or does not need optimization,
> resources are
> +   released and the positive scan results are returned to userspace.
> +   If the item is corrupt or could be optimized but the caller does
> not permit
> +   this, resources are released and the negative scan results are
> returned to
> +   userspace.
> +   Otherwise, the kernel moves on to the second step.
> +
> +2. The repair function is called to rebuild the data structure.
> +   Repair functions generally choose to rebuild a structure from other
> metadata
> +   rather than try to salvage the existing structure.
> +   If the repair fails, the scan results from the first step are
> returned to
> +   userspace.
> +   Otherwise, the kernel moves on to the third step.
> +
> +3. In the third step, the kernel runs the same checks over the new
> metadata
> +   item to assess the efficacy of the repairs.
> +   The results of the reassessment are returned to userspace.
> +
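
For reference, from userspace the three steps above map onto the
existing scrub ioctl roughly as follows.  The type and flag names are
quoted from memory and should be checked against the xfs_fs.h UAPI
header rather than treated as authoritative:

#include <string.h>
#include <sys/ioctl.h>
#include <xfs/xfs.h>	/* assumed to pull in the scrub ioctl UAPI */

/* Returns 0 if this AG's free space btree is (now) clean, -1 otherwise. */
static int check_and_repair_bnobt(int fsfd, __u32 agno)
{
	struct xfs_scrub_metadata sm;

	memset(&sm, 0, sizeof(sm));
	sm.sm_type = XFS_SCRUB_TYPE_BNOBT;	/* one scrub item */
	sm.sm_agno = agno;

	/* Step 1: check only. */
	if (ioctl(fsfd, XFS_IOC_SCRUB_METADATA, &sm) < 0)
		return -1;
	if (!(sm.sm_flags & (XFS_SCRUB_OFLAG_CORRUPT | XFS_SCRUB_OFLAG_PREEN)))
		return 0;	/* clean; nothing more to do */

	/* Steps 2 and 3: resubmit with the repair flag; the kernel re-checks. */
	sm.sm_flags = XFS_SCRUB_IFLAG_REPAIR;
	if (ioctl(fsfd, XFS_IOC_SCRUB_METADATA, &sm) < 0)
		return -1;
	return (sm.sm_flags & XFS_SCRUB_OFLAG_CORRUPT) ? -1 : 0;
}
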
> +Classification of Metadata
> +--------------------------
> +
> +Each type of metadata object (and therefore each type of scrub item)
> is
> +classified as follows:
> +
> +Primary Metadata
> +````````````````
> +
> +Metadata structures in this category should be most familiar to
> filesystem
> +users either because they are directly created by the user or they
> index
> +objects created by the user
I think I would just jump straight into a brief list.  The above is a
bit vague, and documentation that tells you you should already know
what it is, doesn't add much.  Again, I think too much poetry might be
why you're having a hard time getting responses.

> +Most filesystem objects fall into this class.
Most filesystem objects created by users fall into this class, such as
inodes, directories, allocation groups and so on.
> +Resource and lock acquisition for scrub code follows the same order
> as regular
> +filesystem accesses.

Lock acquisition for these resources will follow the same order for
scrub as a regular filesystem access.

> +
> +Primary metadata objects are the simplest for scrub to process.
> +The principal filesystem object (either an allocation group or an
> inode) that
> +owns the item being scrubbed is locked to guard against concurrent
> updates.
> +The check function examines every record associated with the type
> for obvious
> +errors and cross-references healthy records against other metadata
> to look for
> +inconsistencies.
> +Repairs for this class of scrub item are simple, since the repair
> function
> +starts by holding all the resources acquired in the previous step.
> +The repair function scans available metadata as needed to record all
> the
> +observations needed to complete the structure.
> +Next, it stages the observations in a new ondisk structure and
> commits it
> +atomically to complete the repair.
> +Finally, the storage from the old data structure are carefully
> reaped.
> +
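
A minimal sketch of that sequence, with hypothetical names standing in
for the staging and bulk-loading infrastructure described later:

/* Rebuild one AG btree while the AG headers stay locked; sketch only. */
static int rebuild_ag_btree(struct repair_ctx *rc)
{
	int error;

	/* Gather records by scanning the other (healthy) metadata. */
	error = gather_records_into_array(rc);
	if (error)
		return error;

	/* Write the records into a new btree and commit the new root. */
	error = bulk_load_new_btree(rc);
	if (error)
		return error;

	/* Carefully free the blocks of the old, damaged structure. */
	return reap_old_btree_blocks(rc);
}
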
> +Because ``xfs_scrub`` locks a primary object for the duration of the
> repair,
> +this is effectively an offline repair operation performed on a
> subset of the
> +filesystem.
> +This minimizes the complexity of the repair code because it is not
> necessary to
> +handle concurrent updates from other threads, nor is it necessary to
> access
> +any other part of the filesystem.
> +As a result, indexed structures can be rebuilt very quickly, and
> programs
> +trying to access the damaged structure will be blocked until repairs
> complete.
> +The only infrastructure needed by the repair code are the staging
> area for
> +observations and a means to write new structures to disk.
> +Despite these limitations, the advantage that online repair holds is
> clear:
> +targeted work on individual shards of the filesystem avoids total
> loss of
> +service.
> +
> +This mechanism is described in section 2.1 ("Off-Line Algorithm") of
> +V. Srinivasan and M. J. Carey, `"Performance of On-Line Index
> Construction
> +Algorithms" <https://dl.acm.org/doi/10.5555/645336.649870>`_,
Hmm, this article is not displaying for me.  If the link is abandoned,
probably there's not much need to keep it around

> +*Extending Database Technology*, pp. 293-309, 1992.
> +
> +Most primary metadata repair functions stage their intermediate
> results in an
> +in-memory array prior to formatting the new ondisk structure, which
> is very
> +similar to the list-based algorithm discussed in section 2.3 ("List-
> Based
> +Algorithms") of Srinivasan.
> +However, any data structure builder that maintains a resource lock
> for the
> +duration of the repair is *always* an offline algorithm.
> +
> +Secondary Metadata
> +``````````````````
> +
> +Metadata structures in this category reflect records found in
> primary metadata,

such as rmap and parent pointer attributes.  But they are only
needed...

?

> +but are only needed for online fsck or for reorganization of the
> filesystem.
> +Resource and lock acquisition for scrub code do not follow the same
> order as
> +regular filesystem accesses, and may involve full filesystem scans.
> +
> +Secondary metadata objects are difficult for scrub to process,
> because scrub
> +attaches to the secondary object but needs to check primary
> metadata, which
> +runs counter to the usual order of resource acquisition.
bummer :-(

> +Check functions can be limited in scope to reduce runtime.
> +Repairs, however, require a full scan of primary metadata, which can
> take a
> +long time to complete.
> +Under these conditions, ``xfs_scrub`` cannot lock resources for the
> entire
> +duration of the repair.
> +
> +Instead, repair functions set up an in-memory staging structure to
> store
> +observations.
> +Depending on the requirements of the specific repair function, the
> staging


> +index can have the same format as the ondisk structure, or it can
> have a design
> +specific to that repair function.
...will have either the same format as the ondisk structure or a
structure specific to the repair function.

> +The next step is to release all locks and start the filesystem scan.
> +When the repair scanner needs to record an observation, the staging
> data are
> +locked long enough to apply the update.
> +Simultaneously, the repair function hooks relevant parts of the
> filesystem to
> +apply updates to the staging data if the the update pertains to an
> object that
> +has already been scanned by the index builder.
While a scan is in progress, function hooks are used to apply
filesystem updates to both the object and the staging data if the
object has already been scanned.

?

> +Once the scan is done, the owning object is re-locked, the live data
> is used to
> +write a new ondisk structure, and the repairs are committed
> atomically.
> +The hooks are disabled and the staging area is freed.
> +Finally, the storage from the old data structure are carefully
> reaped.
> +
> +Introducing concurrency helps online repair avoid various locking
> problems, but
> +comes at a high cost to code complexity.
> +Live filesystem code has to be hooked so that the repair function
> can observe
> +updates in progress.
> +The staging area has to become a fully functional parallel structure
> so that
> +updates can be merged from the hooks.
> +Finally, the hook, the filesystem scan, and the inode locking model
> must be
> +sufficiently well integrated that a hook event can decide if a given
> update
> +should be applied to the staging structure.
> +
> +In theory, the scrub implementation could apply these same
> techniques for
> +primary metadata, but doing so would make it massively more complex
> and less
> +performant.
> +Programs attempting to access the damaged structures are not blocked
> from
> +operation, which may cause application failure or an unplanned
> filesystem
> +shutdown.
> +
> +Inspiration for the secondary metadata repair strategy was drawn
> from section
> +2.4 of Srinivasan above, and sections 2 ("NSF: Index Build Without
> Side-File")
> +and 3.1.1 ("Duplicate Key Insert Problem") in C. Mohan, `"Algorithms
> for
> +Creating Indexes for Very Large Tables Without Quiescing Updates"
> +<https://dl.acm.org/doi/10.1145/130283.130337>`_, 1992.
This one works

> +
> +The sidecar index mentioned above bears some resemblance to the side
> file
> +method mentioned in Srinivasan and Mohan.
> +Their method consists of an index builder that extracts relevant
> record data to
> +build the new structure as quickly as possible; and an auxiliary
> structure that
> +captures all updates that would be committed to the index by other
> threads were
> +the new index already online.
> +After the index building scan finishes, the updates recorded in the
> side file
> +are applied to the new index.
> +To avoid conflicts between the index builder and other writer
> threads, the
> +builder maintains a publicly visible cursor that tracks the progress
> of the
> +scan through the record space.
> +To avoid duplication of work between the side file and the index
> builder, side
> +file updates are elided when the record ID for the update is greater
> than the
> +cursor position within the record ID space.
> +
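
A minimal sketch of that cursor rule, with hypothetical names; the real
repair code has its own hook and in-memory staging infrastructure:

struct rebuild_scan {
	spinlock_t	lock;
	uint64_t	cursor;		/* highest record id already scanned */
	struct staging	*staging;	/* in-memory shadow of the new index */
};

/* Live-update hook: another thread just changed a record. */
static void rebuild_live_update(struct rebuild_scan *rs, uint64_t record_id,
				const void *new_rec)
{
	spin_lock(&rs->lock);
	/*
	 * Records beyond the cursor will be visited by the scan itself,
	 * so only updates to already-scanned records are applied to the
	 * staging structure; the rest are elided.
	 */
	if (record_id <= rs->cursor)
		staging_apply_update(rs->staging, record_id, new_rec);
	spin_unlock(&rs->lock);
}
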
> +To minimize changes to the rest of the codebase, XFS online repair
> keeps the
> +replacement index hidden until it's completely ready to go.
> +In other words, there is no attempt to expose the keyspace of the
> new index
> +while repair is running.
> +The complexity of such an approach would be very high and perhaps
> more
> +appropriate to building *new* indices.
> +
> +**Question**: Can the full scan and live update code used to
> facilitate a
> +repair also be used to implement a comprehensive check?
> +
> +*Answer*: Probably, though this has not been yet been studied.
I kinda feel like discussion Q&As need to be wrapped up before we can
call things done.  If this is all there was to the answer, then let's
clean out the discussion notes.

> +
> +Summary Information
> +```````````````````
> +
Oh, perhaps this section could move up with the other metadata
definitions.  That way the reader already has an idea of what these
terms are referring to before we get into how they are used during the
phases.

> +Metadata structures in this last category summarize the contents of
> primary
> +metadata records.
> +These are often used to speed up resource usage queries, and are
> many times
> +smaller than the primary metadata which they represent.
> +Check and repair both require full filesystem scans, but resource
> and lock
> +acquisition follow the same paths as regular filesystem accesses.
> +
> +The superblock summary counters have special requirements due to the
> underlying
> +implementation of the incore counters, and will be treated
> separately.
> +Check and repair of the other types of summary counters (quota
> resource counts
> +and file link counts) employ the same filesystem scanning and
> hooking
> +techniques as outlined above, but because the underlying data are
> sets of
> +integer counters, the staging data need not be a fully functional
> mirror of the
> +ondisk structure.
> +
> +Inspiration for quota and file link count repair strategies were
> drawn from
> +sections 2.12 ("Online Index Operations") through 2.14 ("Incremental
> View
> +Maintenace") of G.  Graefe, `"Concurrent Queries and Updates in
> Summary Views
> +and Their Indexes"
> +<
> http://www.odbms.org/wp-content/uploads/2014/06/Increment-locks.pdf>`
> _, 2011.
I wonder if these citations would do better as footnotes?  Just to
kinda keep the body of the document tidy and flowing well.

> +
> +Since quotas are non-negative integer counts of resource usage,
> online
> +quotacheck can use the incremental view deltas described in section
> 2.14 to
> +track pending changes to the block and inode usage counts in each
> transaction,
> +and commit those changes to a dquot side file when the transaction
> commits.
> +Delta tracking is necessary for dquots because the index builder
> scans inodes,
> +whereas the data structure being rebuilt is an index of dquots.
> +Link count checking combines the view deltas and commit step into
> one because
> +it sets attributes of the objects being scanned instead of writing
> them to a
> +separate data structure.
> +Each online fsck function will be discussed as case studies later in
> this
> +document.
> +
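
As a rough sketch of that delta-tracking idea (hypothetical names; the
real code hooks the inode scan and transaction commit separately):

struct shadow_dquot {
	spinlock_t	lock;
	int64_t		bcount;		/* observed block usage */
	int64_t		icount;		/* observed inode usage */
};

/*
 * Commit hook: fold one transaction's pending usage changes into the
 * dquot side file so the scan and concurrent updates never double-count.
 */
static void shadow_dquot_apply_delta(struct shadow_dquot *sdq,
				     int64_t bcount_delta, int64_t icount_delta)
{
	spin_lock(&sdq->lock);
	sdq->bcount += bcount_delta;
	sdq->icount += icount_delta;
	spin_unlock(&sdq->lock);
}
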
> +Risk Management
> +---------------
> +
> +During the development of online fsck, several risk factors were
> identified
> +that may make the feature unsuitable for certain distributors and
> users.
> +Steps can be taken to mitigate or eliminate those risks, though at a
> cost to
> +functionality.
> +
> +- **Decreased performance**: Adding metadata indices to the
> filesystem
> +  increases the time cost of persisting changes to disk, and the
> reverse space
> +  mapping and directory parent pointers are no exception.
> +  System administrators who require the maximum performance can
> disable the
> +  reverse mapping features at format time, though this choice
> dramatically
> +  reduces the ability of online fsck to find inconsistencies and
> repair them.
> +
> +- **Incorrect repairs**: As with all software, there might be
> defects in the
> +  software that result in incorrect repairs being written to the
> filesystem.
> +  Systematic fuzz testing (detailed in the next section) is employed
> by the
> +  authors to find bugs early, but it might not catch everything.
> +  The kernel build system provides Kconfig options
> (``CONFIG_XFS_ONLINE_SCRUB``
> +  and ``CONFIG_XFS_ONLINE_REPAIR``) to enable distributors to choose
> not to
> +  accept this risk.
> +  The xfsprogs build system has a configure option (``--enable-
> scrub=no``) that
> +  disables building of the ``xfs_scrub`` binary, though this is not
> a risk
> +  mitigation if the kernel functionality remains enabled.
> +
> +- **Inability to repair**: Sometimes, a filesystem is too badly
> damaged to be
> +  repairable.
> +  If the keyspaces of several metadata indices overlap in some
> manner but a
> +  coherent narrative cannot be formed from records collected, then
> the repair
> +  fails.
> +  To reduce the chance that a repair will fail with a dirty
> transaction and
> +  render the filesystem unusable, the online repair functions have
> been
> +  designed to stage and validate all new records before committing
> the new
> +  structure.
> +
> +- **Misbehavior**: Online fsck requires many privileges -- raw IO to
> +  block devices, opening files by handle, ignoring Unix discretionary
> +  access control, and the ability to perform administrative changes.
> +  Running this automatically in the background scares people, so the
> +  systemd background service is configured to run with only the
> +  privileges required.
> +  Obviously, this cannot address certain problems like the kernel
> +  crashing or deadlocking, but it should be sufficient to prevent the
> +  scrub process from escaping and reconfiguring the system.
> +  The cron job does not have this protection.
> +

I think the fuzz part is one I would consider letting go.  All features
need to go through a period of stabilizing, and we can't really control
how some people respond to it, so I don't think this part adds much.  I
think the document would do well to be trimmed where it can so as to
stay more focused.
> +- **Fuzz Kiddiez**: There are many people now who seem to think that
> +  running automated fuzz testing of ondisk artifacts to find
> +  mischievous behavior and spraying exploit code onto the public
> +  mailing list for instant zero-day disclosure is somehow of some
> +  social benefit.
> +  In the view of this author, the benefit is realized only when the
> +  fuzz operators help to **fix** the flaws, but this opinion
> +  apparently is not widely shared among security "researchers".
> +  The XFS maintainers' continuing ability to manage these events
> +  presents an ongoing risk to the stability of the development process.
> +  Automated testing should front-load some of the risk while the
> +  feature is considered EXPERIMENTAL.
> +
> +Many of these risks are inherent to software programming.
> +Despite this, it is hoped that this new functionality will prove
> +useful in reducing unexpected downtime.
> 

Paraphrasing and reorganizing suggestions aside, I think it looks
pretty good.

Allison

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 06/14] xfs: document how online fsck deals with eventual consistency
  2023-01-06  3:33         ` Amir Goldstein
@ 2023-01-11 17:54           ` Darrick J. Wong
  0 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2023-01-11 17:54 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: linux-xfs, willy, chandan.babu, allison.henderson, linux-fsdevel,
	hch, catherine.hoang, david

On Fri, Jan 06, 2023 at 05:33:00AM +0200, Amir Goldstein wrote:
> On Thu, Jan 5, 2023 at 9:40 PM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > On Thu, Jan 05, 2023 at 11:08:51AM +0200, Amir Goldstein wrote:
> > > On Sat, Dec 31, 2022 at 12:32 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > > >
> > > > From: Darrick J. Wong <djwong@kernel.org>
> > > >
> > > > Writes to an XFS filesystem employ an eventual consistency update model
> > > > to break up complex multistep metadata updates into small chained
> > > > transactions.  This is generally good for performance and scalability
> > > > because XFS doesn't need to prepare for enormous transactions, but it
> > > > also means that online fsck must be careful not to attempt a fsck action
> > > > unless it can be shown that there are no other threads processing a
> > > > transaction chain.  This part of the design documentation covers the
> > > > thinking behind the consistency model and how scrub deals with it.
> > > >
> > > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > > ---
> > > >  .../filesystems/xfs-online-fsck-design.rst         |  303 ++++++++++++++++++++
> > > >  1 file changed, 303 insertions(+)
> > > >
> > > >
> > > > diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > > index f45bf97fa9c4..419eb54ee200 100644
> > > > --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> > > > +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > > @@ -1443,3 +1443,306 @@ This step is critical for enabling system administrator to monitor the status
> > > >  of the filesystem and the progress of any repairs.
> > > >  For developers, it is a useful means to judge the efficacy of error detection
> > > >  and correction in the online and offline checking tools.
> > > > +
> > > > +Eventual Consistency vs. Online Fsck
> > > > +------------------------------------
> > > > +
> > > > +Midway through the development of online scrubbing, the fsstress tests
> > > > +uncovered a misinteraction between online fsck and compound transaction chains
> > > > +created by other writer threads that resulted in false reports of metadata
> > > > +inconsistency.
> > > > +The root cause of these reports is the eventual consistency model introduced by
> > > > +the expansion of deferred work items and compound transaction chains when
> > > > +reverse mapping and reflink were introduced.
> > > > +
> > > > +Originally, transaction chains were added to XFS to avoid deadlocks when
> > > > +unmapping space from files.
> > > > +Deadlock avoidance rules require that AGs only be locked in increasing order,
> > > > +which makes it impossible (say) to use a single transaction to free a space
> > > > +extent in AG 7 and then try to free a now superfluous block mapping btree block
> > > > +in AG 3.
> > > > +To avoid these kinds of deadlocks, XFS creates Extent Freeing Intent (EFI) log
> > > > +items to commit to freeing some space in one transaction while deferring the
> > > > +actual metadata updates to a fresh transaction.
> > > > +The transaction sequence looks like this:
> > > > +
> > > > +1. The first transaction contains a physical update to the file's block mapping
> > > > +   structures to remove the mapping from the btree blocks.
> > > > +   It then attaches to the in-memory transaction an action item to schedule
> > > > +   deferred freeing of space.
> > > > +   Concretely, each transaction maintains a list of ``struct
> > > > +   xfs_defer_pending`` objects, each of which maintains a list of ``struct
> > > > +   xfs_extent_free_item`` objects.
> > > > +   Returning to the example above, the action item tracks the freeing of both
> > > > +   the unmapped space from AG 7 and the block mapping btree (BMBT) block from
> > > > +   AG 3.
> > > > +   Deferred frees recorded in this manner are committed in the log by creating
> > > > +   an EFI log item from the ``struct xfs_extent_free_item`` object and
> > > > +   attaching the log item to the transaction.
> > > > +   When the log is persisted to disk, the EFI item is written into the ondisk
> > > > +   transaction record.
> > > > +   EFIs can list up to 16 extents to free, all sorted in AG order.
> > > > +
> > > > +2. The second transaction contains a physical update to the free space btrees
> > > > +   of AG 3 to release the former BMBT block and a second physical update to the
> > > > +   free space btrees of AG 7 to release the unmapped file space.
> > > > +   Observe that the physical updates are resequenced in the correct order
> > > > +   when possible.
> > > > +   Attached to the transaction is an extent free done (EFD) log item.
> > > > +   The EFD contains a pointer to the EFI logged in transaction #1 so that log
> > > > +   recovery can tell if the EFI needs to be replayed.
> > > > +
> > > > +If the system goes down after transaction #1 is written back to the filesystem
> > > > +but before #2 is committed, a scan of the filesystem metadata would show
> > > > +inconsistent filesystem metadata because there would not appear to be any owner
> > > > +of the unmapped space.
> > > > +Happily, log recovery corrects this inconsistency for us -- when recovery finds
> > > > +an intent log item but does not find a corresponding intent done item, it will
> > > > +reconstruct the incore state of the intent item and finish it.
> > > > +In the example above, the log must replay both frees described in the recovered
> > > > +EFI to complete the recovery phase.
> > > > +
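A loose, runnable illustration of the two-transaction chain described
above, with invented names; the real machinery is the kernel's extent
free intent items and deferred-ops code, which this does not reproduce:

/* Toy model of an EFI/EFD style deferred free, not the kernel code. */
#include <stdio.h>

struct extent_free_intent {
	unsigned int	agno;	/* allocation group holding the extent */
	unsigned long	bno;	/* AG block number */
	unsigned long	len;	/* length in blocks */
};

/* Transaction 1: log the intent to free; no free space btree updates yet. */
static void log_efi(const struct extent_free_intent *efi, int nr)
{
	for (int i = 0; i < nr; i++)
		printf("EFI: will free AG %u [%lu, %lu)\n",
		       efi[i].agno, efi[i].bno, efi[i].bno + efi[i].len);
}

/* Transaction 2: perform the frees in AG order, then log the done item. */
static void finish_efi(const struct extent_free_intent *efi, int nr)
{
	for (int i = 0; i < nr; i++)
		printf("free: updating AG %u free space btrees\n", efi[i].agno);
	printf("EFD: intent complete\n");
}

int main(void)
{
	/* Example from the text: a BMBT block in AG 3 and file space in AG 7. */
	struct extent_free_intent efi[] = {
		{ .agno = 3, .bno = 100,  .len = 1 },
		{ .agno = 7, .bno = 5000, .len = 16 },
	};

	log_efi(efi, 2);	/* committed with transaction #1 */
	finish_efi(efi, 2);	/* committed with transaction #2 */
	return 0;
}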
> > > > +There are two subtleties to XFS' transaction chaining strategy to consider.
> > > > +The first is that log items must be added to a transaction in the correct order
> > > > +to prevent conflicts with principal objects that are not held by the
> > > > +transaction.
> > > > +In other words, all per-AG metadata updates for an unmapped block must be
> > > > +completed before the last update to free the extent, and extents should not
> > > > +be reallocated until that last update commits to the log.
> > > > +The second subtlety comes from the fact that AG header buffers are (usually)
> > > > +released between each transaction in a chain.
> > > > +This means that other threads can observe an AG in an intermediate state,
> > > > +but as long as the first subtlety is handled, this should not affect the
> > > > +correctness of filesystem operations.
> > > > +Unmounting the filesystem flushes all pending work to disk, which means that
> > > > +offline fsck never sees the temporary inconsistencies caused by deferred work
> > > > +item processing.
> > > > +In this manner, XFS employs a form of eventual consistency to avoid deadlocks
> > > > +and increase parallelism.
> > > > +
> > > > +During the design phase of the reverse mapping and reflink features, it was
> > > > +decided that it was impractical to cram all the reverse mapping updates for a
> > > > +single filesystem change into a single transaction because a single file
> > > > +mapping operation can explode into many small updates:
> > > > +
> > > > +* The block mapping update itself
> > > > +* A reverse mapping update for the block mapping update
> > > > +* Fixing the freelist
> > > > +* A reverse mapping update for the freelist fix
> > > > +
> > > > +* A shape change to the block mapping btree
> > > > +* A reverse mapping update for the btree update
> > > > +* Fixing the freelist (again)
> > > > +* A reverse mapping update for the freelist fix
> > > > +
> > > > +* An update to the reference counting information
> > > > +* A reverse mapping update for the refcount update
> > > > +* Fixing the freelist (a third time)
> > > > +* A reverse mapping update for the freelist fix
> > > > +
> > > > +* Freeing any space that was unmapped and not owned by any other file
> > > > +* Fixing the freelist (a fourth time)
> > > > +* A reverse mapping update for the freelist fix
> > > > +
> > > > +* Freeing the space used by the block mapping btree
> > > > +* Fixing the freelist (a fifth time)
> > > > +* A reverse mapping update for the freelist fix
> > > > +
> > > > +Free list fixups are not usually needed more than once per AG per transaction
> > > > +chain, but it is theoretically possible if space is very tight.
> > > > +For copy-on-write updates this is even worse, because this must be done once to
> > > > +remove the space from a staging area and again to map it into the file!
> > > > +
> > > > +To deal with this explosion in a calm manner, XFS expands its use of deferred
> > > > +work items to cover most reverse mapping updates and all refcount updates.
> > > > +This reduces the worst case size of transaction reservations by breaking the
> > > > +work into a long chain of small updates, which increases the degree of eventual
> > > > +consistency in the system.
> > > > +Again, this generally isn't a problem because XFS orders its deferred work
> > > > +items carefully to avoid resource reuse conflicts between unsuspecting threads.
> > > > +
> > > > +However, online fsck changes the rules -- remember that although physical
> > > > +updates to per-AG structures are coordinated by locking the buffers for AG
> > > > +headers, buffer locks are dropped between transactions.
> > > > +Once scrub acquires resources and takes locks for a data structure, it must do
> > > > +all the validation work without releasing the lock.
> > > > +If the main lock for a space btree is an AG header buffer lock, scrub may have
> > > > +interrupted another thread that is midway through finishing a chain.
> > > > +For example, if a thread performing a copy-on-write has completed a reverse
> > > > +mapping update but not the corresponding refcount update, the two AG btrees
> > > > +will appear inconsistent to scrub and an observation of corruption will be
> > > > +recorded.  This observation will not be correct.
> > > > +If a repair is attempted in this state, the results will be catastrophic!
> > > > +
> > > > +Several solutions to this problem were evaluated upon discovery of this flaw:
> > > > +
> > > > +1. Add a higher level lock to allocation groups and require writer threads to
> > > > +   acquire the higher level lock in AG order before making any changes.
> > > > +   This would be very difficult to implement in practice because it is
> > > > +   difficult to determine which locks need to be obtained, and in what order,
> > > > +   without simulating the entire operation.
> > > > +   Performing a dry run of a file operation to discover necessary locks would
> > > > +   make the filesystem very slow.
> > > > +
> > > > +2. Make the deferred work coordinator code aware of consecutive intent items
> > > > +   targeting the same AG and have it hold the AG header buffers locked across
> > > > +   the transaction roll between updates.
> > > > +   This would introduce a lot of complexity into the coordinator since it is
> > > > +   only loosely coupled with the actual deferred work items.
> > > > +   It would also fail to solve the problem because deferred work items can
> > > > +   generate new deferred subtasks, but all subtasks must be complete before
> > > > +   work can start on a new sibling task.
> > > > +
> > > > +3. Teach online fsck to walk all transactions waiting for whichever lock(s)
> > > > +   protect the data structure being scrubbed to look for pending operations.
> > > > +   The checking and repair operations must factor these pending operations into
> > > > +   the evaluations being performed.
> > > > +   This solution is a nonstarter because it is *extremely* invasive to the main
> > > > +   filesystem.
> > > > +
> > > > +4. Recognize that only online fsck has this requirement of total consistency
> > > > +   of AG metadata, and that online fsck should be relatively rare as compared
> > > > +   to filesystem change operations.
> > > > +   For each AG, maintain a count of intent items targeting that AG.
> > > > +   When online fsck wants to examine an AG, it should lock the AG header
> > > > +   buffers to quiesce all transaction chains that want to modify that AG, and
> > > > +   only proceed with the scrub if the count is zero.
> > > > +   In other words, scrub only proceeds if it can lock the AG header buffers and
> > > > +   there can't possibly be any intents in progress.
> > > > +   This may lead to fairness and starvation issues, but regular filesystem
> > > > +   updates take precedence over online fsck activity.
> > > > +
> > >
> > > Is there any guarantee that some silly real life regular filesystem workload
> > > won't starve online fsck forever?
> > > IOW, is forward progress of online fsck guaranteed?
> >
> > Nope, forward progress isn't guaranteed.
> 
> That sounds like a problem.

So far it hasn't been.  I prefer to sacrifice performance of the
background fsck service for the sake of foreground tasks.  The fsstress
and fsx fstests haven't shown any particularly serious issues.  I've
also kicked off xfs_scrub on the same VM hosts that are running the fuzz
test suite (~52 VMs per host) and scrub can still finish the filesystem
in a couple of hours.

Things get markedly worse on spinning rust with a lot of parallel
unwritten extent conversions and allocations going on (aka the disk
backup systems).  Normally a backup from flash to rust takes about an
hour; with scrub and backup contending for the head actuator, it'll go
up to about 2-3 hours, but both tasks can make (verrrry slow) forward
progress.

That said -- the backup program spends a lot of iowait time waiting for
file data blocks to read in or get written back, so the contention is on
the storage hardware, not the filesystem locks.

> > The kernel checks for fatal
> > signals every time it backs off a scrub so at least we don't end up with
> > unkillable processes.  At one point I added a timeout field to the ioctl
> > interface so that the kernel could time out an operation if it took too
> > long to acquire the necessary resources.  So far, the "race fsstress and
> > xfs_scrub" tests have not shown scrub failing to make any forward
> > progress.
> >
> > That said, I have /not/ yet had a chance to try it out any of these
> > massive 1000-core systems with an according workload.
> >
> 
> Don't know if fsstress is the best way to check the worst case scenario.
> 
> Can you think of a workload, say several threads creating and deleting
> temp files, with deferred parent pointer items preventing the queue from
> ever draining?

The worst workload would be one that is entirely metadata based -- a
giant directory tree full of empty files with all information being
stored as extended attributes.

> Considering that a "full journal" scenario is always going to be a possible
> worst case incident, how bad would it be to block new transactions
> instead of the possibility of starving scrub consistency checks forever?

First of all, scrub has already allocated a transaction by the time it
gets to the intent drain step.  There's no good way to block new
transactions once we've reached this stage, nor should there be.
Blocking transactions stalls xfs garbage collection and memory reclaim.
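
A tiny userspace model of the per-AG intent count ("drain") idea from
option 4 of the quoted text: writer threads bump a counter for each
deferred work item targeting an AG, and scrub, having locked the AG
headers, waits for the counter to hit zero.  All of the names here are
made up; this is not the kernel implementation.

/* Toy per-AG intent drain: counter plus waitqueue, modeled with pthreads. */
#include <pthread.h>
#include <stdio.h>

struct intent_drain {
	pthread_mutex_t	lock;
	pthread_cond_t	empty;
	unsigned int	count;	/* intent items in flight for this AG */
};

static void drain_init(struct intent_drain *dr)
{
	pthread_mutex_init(&dr->lock, NULL);
	pthread_cond_init(&dr->empty, NULL);
	dr->count = 0;
}

/* Called when a deferred work item targeting this AG is created. */
static void drain_grab(struct intent_drain *dr)
{
	pthread_mutex_lock(&dr->lock);
	dr->count++;
	pthread_mutex_unlock(&dr->lock);
}

/* Called when that deferred work item is finished. */
static void drain_release(struct intent_drain *dr)
{
	pthread_mutex_lock(&dr->lock);
	if (--dr->count == 0)
		pthread_cond_broadcast(&dr->empty);
	pthread_mutex_unlock(&dr->lock);
}

/* Scrub: after locking the AG headers, wait for pending chains to finish. */
static void drain_wait(struct intent_drain *dr)
{
	pthread_mutex_lock(&dr->lock);
	while (dr->count > 0)
		pthread_cond_wait(&dr->empty, &dr->lock);
	pthread_mutex_unlock(&dr->lock);
}

int main(void)
{
	struct intent_drain dr;

	drain_init(&dr);
	drain_grab(&dr);	/* writer starts a chain against this AG */
	drain_release(&dr);	/* writer finishes the chain */
	drain_wait(&dr);	/* returns once no intents are outstanding */
	printf("AG quiesced, scrub may proceed\n");
	return 0;
}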

> Wouldn't the consistency checks be much faster than freeing journal
> space would be in a "full journal" situation?

I haven't investigated this in depth, but yes, scrub should be faster
than forcing the log and checkpointing the log to move the log tail
forward to empty out the journal.

> I don't know if there is a "mission statement" for online fsck, but
> I think it would say "minimal user interference" not "no user interference".

Yes.  The section about eventual consistency states that "...regular
filesystem updates take precedence over online fsck activity".

> It sounds like the interference we are trying to avoid is light years away
> from the downtime of offline fsck, so online fsck would still be a huge win.
> online fsck that never ends OTOH... maybe less so.

Well you /can/ just kill the xfs_scrub processes if they are taking too
much time.  One of the nastier papercuts of the background scrub is that
the fs cannot be unmounted while it's running, and systemd doesn't have
a good mechanism for "kill this service before stopping this mount".  Or
maybe it does and I haven't yet found it?

(The cronjob variant definitely suffers from that...)

> > > Good luck with landing online fsck before the 2024 NYE deluge ;)
> >
> > Thank *you* for reading this chapter of the design document!! :)
> >
> 
> Oh I read them all at the summer submission, but it took me so long
> that I forgot to follow up..

Yeah, that seems to be a common problem with large new features. :/

> My other question was regarding memory usage control.
> I have horrid memories from e2fsck unpredictable memory usage
> and unpredictable runtime due to swapping.
> 
> xfs_repair -m was a huge improvement compared to e2fsck.
> I don't remember reading about memory usage limits for online repair,
> so I was concerned about unpredictable memory usage and swapping.
> Can you say something to ease those concerns?

Both e2fsck and xfs_repair have to be capable of repairing the entire
filesystem all at once, which means that they allocate a great many
incore objects from which all of the ondisk space metadata (ag btrees in
the case of xfs, bitmaps for e2fsck) is regenerated.  Since the fs is
offline, it's considered advantageous to perform *one* scan and rebuild
everything all at once, even if the memory cost is high.

xfs_scrub scans and repairs each metadata object individually, which
means that it only needs to allocate as much (kernel/xfile) memory as
needed to scan a single btree/inode record/quota record/bitmap.  For
scans the memory requirements are usually minimal since it creates a
bunch of btree cursors and cross-references records.

For repairs, the memory requirements are on the order of the size of the
new data structure that will be written out.  We scan the fs to build
the new recordset in memory, compute the size of the new btree, allocate
some blocks, and format the records into the blocks before committing
the btree root.

For summary data (e.g. link counts, dquots) we build a shadow copy in
memory, so the memory requirements are on the order of the number of
files in the fs and the number of uid/gid/projid in the filesystem,
respectively.

Most of the intermediate structures are stuffed into a tmpfs file, which
means they can be paged out to disk.  If there's really no memory
available, scrub can abort all the way out to userspace provided it
hasn't committed anything to disk yet.

IOWs, online fsck generally only requires enough memory to build a new
copy of whichever objects it happens to be scanning at any given moment.
The background service runs single-threaded to avoid consuming a lot of
CPU or memory.
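
To illustrate the pageable staging idea, here is a hedged userspace
sketch that stages records in a memfd, which is tmpfs-backed and
therefore swappable.  The record type is made up, and the kernel's
xfile abstraction is considerably more involved than this.

/* Sketch: staging repair records in a tmpfs-backed (pageable) file. */
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Made-up record type standing in for whatever scrub is rebuilding. */
struct stage_rec {
	uint64_t	startblock;
	uint64_t	blockcount;
};

int main(void)
{
	int fd = memfd_create("scrub-staging", 0);

	if (fd < 0) {
		perror("memfd_create");
		return 1;
	}

	/* Append observations as the scan finds them. */
	for (uint64_t i = 0; i < 4; i++) {
		struct stage_rec rec = { .startblock = i * 16, .blockcount = 8 };

		if (pwrite(fd, &rec, sizeof(rec), i * sizeof(rec)) != sizeof(rec)) {
			perror("pwrite");
			return 1;
		}
	}

	/* Later, read the records back to format the new structure. */
	struct stage_rec rec;
	if (pread(fd, &rec, sizeof(rec), 2 * sizeof(rec)) == sizeof(rec))
		printf("record 2: start %llu len %llu\n",
		       (unsigned long long)rec.startblock,
		       (unsigned long long)rec.blockcount);

	close(fd);
	return 0;
}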

--D

> Thanks,
> Amir.

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 01/14] xfs: document the motivation for online fsck design
  2023-01-07  5:01     ` Allison Henderson
@ 2023-01-11 19:10       ` Darrick J. Wong
  2023-01-18  0:03         ` Allison Henderson
  2023-01-12  0:10       ` Darrick J. Wong
  1 sibling, 1 reply; 220+ messages in thread
From: Darrick J. Wong @ 2023-01-11 19:10 UTC (permalink / raw)
  To: Allison Henderson
  Cc: Catherine Hoang, david, willy, linux-xfs, Chandan Babu,
	linux-fsdevel, hch

On Sat, Jan 07, 2023 at 05:01:54AM +0000, Allison Henderson wrote:
> On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Start the first chapter of the online fsck design documentation.
> > This covers the motivations for creating this in the first place.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  Documentation/filesystems/index.rst                |    1 
> >  .../filesystems/xfs-online-fsck-design.rst         |  199 ++++++++++++++++++++
> >  2 files changed, 200 insertions(+)
> >  create mode 100644 Documentation/filesystems/xfs-online-fsck-design.rst
> > 
> > 
> > diff --git a/Documentation/filesystems/index.rst
> > b/Documentation/filesystems/index.rst
> > index bee63d42e5ec..fbb2b5ada95b 100644
> > --- a/Documentation/filesystems/index.rst
> > +++ b/Documentation/filesystems/index.rst
> > @@ -123,4 +123,5 @@ Documentation for filesystem implementations.
> >     vfat
> >     xfs-delayed-logging-design
> >     xfs-self-describing-metadata
> > +   xfs-online-fsck-design
> >     zonefs
> > diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst
> > b/Documentation/filesystems/xfs-online-fsck-design.rst
> > new file mode 100644
> > index 000000000000..25717ebb5f80
> > --- /dev/null
> > +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> > @@ -0,0 +1,199 @@
> > +.. SPDX-License-Identifier: GPL-2.0
> > +.. _xfs_online_fsck_design:
> > +
> > +..
> > +        Mapping of heading styles within this document:
> > +        Heading 1 uses "====" above and below
> > +        Heading 2 uses "===="
> > +        Heading 3 uses "----"
> > +        Heading 4 uses "````"
> > +        Heading 5 uses "^^^^"
> > +        Heading 6 uses "~~~~"
> > +        Heading 7 uses "...."
> > +
> > +        Sections are manually numbered because apparently that's
> > what everyone
> > +        does in the kernel.
> > +
> > +======================
> > +XFS Online Fsck Design
> > +======================
> > +
> > +This document captures the design of the online filesystem check
> > feature for
> > +XFS.
> > +The purpose of this document is threefold:
> > +
> > +- To help kernel distributors understand exactly what the XFS online
> > fsck
> > +  feature is, and issues about which they should be aware.
> > +
> > +- To help people reading the code to familiarize themselves with the
> > relevant
> > +  concepts and design points before they start digging into the
> > code.
> > +
> > +- To help developers maintaining the system by capturing the reasons
> > +  supporting higher level decisionmaking.
> nit: decision making

Fixed.

> > +
> > +As the online fsck code is merged, the links in this document to
> > topic branches
> > +will be replaced with links to code.
> > +
> > +This document is licensed under the terms of the GNU Public License,
> > v2.
> > +The primary author is Darrick J. Wong.
> > +
> > +This design document is split into seven parts.
> > +Part 1 defines what fsck tools are and the motivations for writing a
> > new one.
> > +Parts 2 and 3 present a high level overview of how online fsck
> > process works
> > +and how it is tested to ensure correct functionality.
> > +Part 4 discusses the user interface and the intended usage modes of
> > the new
> > +program.
> > +Parts 5 and 6 show off the high level components and how they fit
> > together, and
> > +then present case studies of how each repair function actually
> > works.
> > +Part 7 sums up what has been discussed so far and speculates about
> > what else
> > +might be built atop online fsck.
> > +
> > +.. contents:: Table of Contents
> > +   :local:
> > +
> 
> Something that I've noticed in my training sessions is that often
> times, less is more.  People really only absorb so much over a
> particular duration of time, so sometimes having too much detail in the
> context is not as helpful as you might think.  A lot of times,
> paraphrasing excerpts to reflect the same info in a more compact format
> will help you keep the audience on track (a little longer at least).
> 
> > +1. What is a Filesystem Check?
> > +==============================
> > +
> > +A Unix filesystem has three main jobs: to provide a hierarchy of
> > names through
> > +which application programs can associate arbitrary blobs of data for
> > any
> > +length of time, to virtualize physical storage media across those
> > names, and
> > +to retrieve the named data blobs at any time.
> Consider the following paraphrase:
> 
> A Unix filesystem has three main jobs:
>  * Provide a hierarchy of names by which applications access data for a
> length of time.
>  * Store or retrieve that data at any time.
>  * Virtualize physical storage media across those names

Ooh, listifying.  I did quite a bit of that to break up the walls of
text in earlier revisions, but apparently I missed this one.

> Also... I don't think it would be inappropriate to just skip the above,
> and jump right into fsck.  That's a very limited view of a filesystem,
> likely a reader seeking an fsck doc probably has some idea of what a fs
> is otherwise supposed to be doing.  

This will become part of the general kernel documentation, so we can't
assume that all readers are going to know what a fs really does.

"A Unix filesystem has four main responsibilities:

- Provide a hierarchy of names through which application programs can
  associate arbitrary blobs of data for any length of time,

- Virtualize physical storage media across those names, and

- Retrieve the named data blobs at any time.

- Examine resource usage.

"Metadata directly supporting these functions (e.g. files, directories,
space mappings) are sometimes called primary metadata.
Secondary metadata (e.g. reverse mapping and directory parent pointers)
support operations internal to the filesystem, such as internal
consistency checking and reorganization."

(I added those last two sentences in response to a point you made
below.)

> > +The filesystem check (fsck) tool examines all the metadata in a
> > filesystem
> > +to look for errors.
> > +Simple tools only check for obvious corruptions, but the more
> > sophisticated
> > +ones cross-reference metadata records to look for inconsistencies.
> > +People do not like losing data, so most fsck tools also contains
> > some ability
> > +to deal with any problems found.
> 
> While simple tools can detect data corruptions, a filesystem check
> (fsck) uses metadata records as a cross-reference to find and correct
> more inconsistencies.
> 
> ?

Let's be careful with the term 'data corruption' here -- a lot of people
(well ok me) will see that as *user* data corruption, whereas we're
talking about *metadata* corruption.

I think I'll rework that second sentence further:

"In addition to looking for obvious metadata corruptions, fsck also
cross-references different types of metadata records with each other to
look for inconsistencies."

Since the really dumb fscks of the 1970s are a long ways past now.

> > +As a word of caution -- the primary goal of most Linux fsck tools is
> > to restore
> > +the filesystem metadata to a consistent state, not to maximize the
> > data
> > +recovered.
> > +That precedent will not be challenged here.
> > +
> > +Filesystems of the 20th century generally lacked any redundancy in
> > the ondisk
> > +format, which means that fsck can only respond to errors by erasing
> > files until
> > +errors are no longer detected.
> > +System administrators avoid data loss by increasing the number of
> > separate
> > +storage systems through the creation of backups; 
> 
> 
> > and they avoid downtime by
> > +increasing the redundancy of each storage system through the
> > creation of RAID.
> Mmm, raids help more for hardware failures right?  They don't really
> have a notion of when the fs is corrupted.

Right.

> While an fsck can help
> navigate around a corruption possibly caused by a hardware failure, I
> think it's really a different kind of redundancy. I think I'd probably
> drop the last line and keep the selling point focused on online repair.

Yes, RAIDs provide a totally different type of redundancy.  I decided to
make this point specifically to counter the people who argue that RAID
makes them impervious to corruption problems, etc.

This attitude seemed rather prevalent in the early days of btrfs and a
certain other filesystem that Shall Not Be Named, even though the btrfs
developers themselves acknowledge this distinction, given the existence
of `btrfs scrub' and `btrfs check'.

However you do have a good point that this sentence doesn't add much
where it is.  I think I'll add it as a sidebar at the end of the
paragraph.

> > +More recent filesystem designs contain enough redundancy in their
> > metadata that
> > +it is now possible to regenerate data structures when non-
> > catastrophic errors
> > +occur; 
> 
> 
> > this capability aids both strategies.
> > +Over the past few years, XFS has added a storage space reverse
> > mapping index to
> > +make it easy to find which files or metadata objects think they own
> > a
> > +particular range of storage.
> > +Efforts are under way to develop a similar reverse mapping index for
> > the naming
> > +hierarchy, which will involve storing directory parent pointers in
> > each file.
> > +With these two pieces in place, XFS uses secondary information to
> > perform more
> > +sophisticated repairs.
> This part here I think I would either let go or relocate.  The topic of
> this section is supposed to discuss roughly what a filesystem check is.
> Ideally so we can start talking about how ofsck is different.  It feels
> like a bit of a jump to suddenly hop into rmap and pptrs, and for
> "sophisticated repairs" that we haven't really gotten into the details
> of yet.  So I think it would read easier if we saved this part until we
> start talking about how they are used later.  

Agreed.

> > +
> > +TLDR; Show Me the Code!
> > +-----------------------
> > +
> > +Code is posted to the kernel.org git trees as follows:
> > +`kernel changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-symlink>`_,
> > +`userspace changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-scan-service>`_, and
> > +`QA test changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-dirs>`_.
> > +Each kernel patchset adding an online repair function will use the
> > same branch
> > +name across the kernel, xfsprogs, and fstests git repos.
> > +
> > +Existing Tools
> > +--------------
> > +
> > +The online fsck tool described here will be the third tool in the
> > history of
> > +XFS (on Linux) to check and repair filesystems.
> > +Two programs precede it:
> > +
> > +The first program, ``xfs_check``, was created as part of the XFS
> > debugger
> > +(``xfs_db``) and can only be used with unmounted filesystems.
> > +It walks all metadata in the filesystem looking for inconsistencies
> > in the
> > +metadata, though it lacks any ability to repair what it finds.
> > +Due to its high memory requirements and inability to repair things,
> > this
> > +program is now deprecated and will not be discussed further.
> > +
> > +The second program, ``xfs_repair``, was created to be faster and
> > more robust
> > +than the first program.
> > +Like its predecessor, it can only be used with unmounted
> > filesystems.
> > +It uses extent-based in-memory data structures to reduce memory
> > consumption,
> > +and tries to schedule readahead IO appropriately to reduce I/O
> > waiting time
> > +while it scans the metadata of the entire filesystem.
> > +The most important feature of this tool is its ability to respond to
> > +inconsistencies in file metadata and directory tree by erasing
> > things as needed
> > +to eliminate problems.
> > +Space usage metadata are rebuilt from the observed file metadata.
> > +
> > +Problem Statement
> > +-----------------
> > +
> > +The current XFS tools leave several problems unsolved:
> > +
> > +1. **User programs** suddenly **lose access** to information in the
> > computer
> > +   when unexpected shutdowns occur as a result of silent corruptions
> > in the
> > +   filesystem metadata.
> > +   These occur **unpredictably** and often without warning.
> 
> 
> 1. **User programs** suddenly **lose access** to the filesystem
>    when unexpected shutdowns occur as a result of silent corruptions
> that could have otherwise been avoided with an online repair
> 
> While some of these issues are not untrue, I think it makes sense to
> limit them to the issue you plan to solve, and therefore discuss.

Fair enough, it's not like one loses /all/ the data in the computer.

That said, we're still in the problem definition phase, so I don't want
to mention online repair just yet.

> > +2. **Users** experience a **total loss of service** during the
> > recovery period
> > +   after an **unexpected shutdown** occurs.
> > +
> > +3. **Users** experience a **total loss of service** if the
> > filesystem is taken
> > +   offline to **look for problems** proactively.
> > +
> > +4. **Data owners** cannot **check the integrity** of their stored
> > data without
> > +   reading all of it.
> 
> > +   This may expose them to substantial billing costs when a linear
> > media scan
> > +   might suffice.
> Ok, I had to re-read this one a few times, but I think this reads a
> little cleaner:
> 
>     Customers that are billed for data egress may incur unnecessary
> cost when a background media scan on the host may have sufficed
> 
> ?

"...when a linear media scan performed by the storage system
administrator would suffice."

I was tempted to say "storage owner" instead of "storage system
administrator" but that sounded a little too IBM.

> > +5. **System administrators** cannot **schedule** a maintenance
> > window to deal
> > +   with corruptions if they **lack the means** to assess filesystem
> > health
> > +   while the filesystem is online.
> > +
> > +6. **Fleet monitoring tools** cannot **automate periodic checks** of
> > filesystem
> > +   health when doing so requires **manual intervention** and
> > downtime.
> > +
> > +7. **Users** can be tricked into **doing things they do not desire**
> > when
> > +   malicious actors **exploit quirks of Unicode** to place
> > misleading names
> > +   in directories.
> hrmm, I guess I'm not immediately extrapolating what things users are
> being tricked into doing, or how ofsck solves this?  Otherwise I might
> drop the last one here, I think the rest of the bullets are plenty of
> motivation.

The doc gets into this later[1], but it's possible to create two entries
within the same directory that have different byte sequences in the name
but render identically in file choosers.  These pathnames:

/home/djwong/Downloads/rustup.sh
/home/djwong/Downloads/rus<zero width space>tup.sh

refer to different files, but a naïve file open dialog will render them
identically as "rustup.sh".  If the first is the Rust installer and the
second name is actually a ransomware payload, I can victimize you by
tricking you into opening the wrong one.

Firefox had a whole CVE over this in 2018:
https://bugzilla.mozilla.org/show_bug.cgi?id=1438025

xfs_scrub is (so far) the only linux filesystem fsck tool that will warn
system administrators about this kind of thing.

See generic/453 and generic/454.

[1] https://djwong.org/docs/xfs-online-fsck-design/#id108
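
For anyone who wants to see the trick in action, this standalone C
snippet creates both names in the current directory; the zero width
space is U+200B, encoded as 0xe2 0x80 0x8b in UTF-8, and nothing here
is XFS-specific.

/* Create two directory entries that render identically in many UIs. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* "rustup.sh" and "rus<U+200B>tup.sh" are different byte sequences. */
	const char *plain = "rustup.sh";
	const char *sneaky = "rus\xe2\x80\x8btup.sh";

	int fd1 = open(plain, O_CREAT | O_WRONLY, 0644);
	int fd2 = open(sneaky, O_CREAT | O_WRONLY, 0644);

	if (fd1 < 0 || fd2 < 0) {
		perror("open");
		return 1;
	}

	/* Both creations succeed: the filesystem sees two distinct names. */
	printf("created %s and a lookalike containing a zero width space\n",
	       plain);
	close(fd1);
	close(fd2);
	return 0;
}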

> > +
> > +Given this definition of the problems to be solved and the actors
> > who would
> > +benefit, the proposed solution is a third fsck tool that acts on a
> > running
> > +filesystem.
> > +
> > +This new third program has three components: an in-kernel facility
> > to check
> > +metadata, an in-kernel facility to repair metadata, and a userspace
> > driver
> > +program to drive fsck activity on a live filesystem.
> > +``xfs_scrub`` is the name of the driver program.
> > +The rest of this document presents the goals and use cases of the
> > new fsck
> > +tool, describes its major design points in connection to those
> > goals, and
> > +discusses the similarities and differences with existing tools.
> > +
> > ++--------------------------------------------------------------------------+
> > +| **Note**:                                                                |
> > ++--------------------------------------------------------------------------+
> > +| Throughout this document, the existing offline fsck tool can also be    |
> > +| referred to by its current name "``xfs_repair``".                       |
> > +| The userspace driver program for the new online fsck tool can be        |
> > +| referred to as "``xfs_scrub``".                                         |
> > +| The kernel portion of online fsck that validates metadata is called     |
> > +| "online scrub", and the portion of the kernel that fixes metadata is    |
> > +| called "online repair".                                                 |
> > ++--------------------------------------------------------------------------+

Errr ^^^^ is Evolution doing line wrapping here?

> Hmm, maybe here might be a good spot to move rmap and pptrs?  It's not
> otherwise clear to me what "secondary metadata" is.  If that is what it
> is meant to refer to, I think the reader will more intuitively make the
> connection if those two blurbs appear in the same context.

Ooh, you found a significant gap-- nowhere in this chapter do I actually
define what is primary metadata.  Or secondary metadata.

> > +
> > +Secondary metadata indices enable the reconstruction of parts of a
> > damaged
> > +primary metadata object from secondary information.
> 
> I would take out this blurb...
> > +XFS filesystems shard themselves into multiple primary objects to
> > enable better
> > +performance on highly threaded systems and to contain the blast
> > radius when
> > +problems happen.
> 
> 
> > +The naming hierarchy is broken up into objects known as directories
> > and files;
> > +and the physical space is split into pieces known as allocation
> > groups.
> And add here:
> 
> "This enables better performance on highly threaded systems and helps
> to contain corruptions when they occur."
> 
> I think that reads cleaner

Ok.  Mind if I reword this slightly?  The entire paragraph now reads
like this:

"The naming hierarchy is broken up into objects known as directories and
files and the physical space is split into pieces known as allocation
groups.  Sharding enables better performance on highly parallel systems
and helps to contain the damage when corruptions occur.  The division of
the filesystem into principal objects (allocation groups and inodes)
means that there are ample opportunities to perform targeted checks and
repairs on a subset of the filesystem."

> > +The division of the filesystem into principal objects (allocation
> > groups and
> > +inodes) means that there are ample opportunities to perform targeted
> > checks and
> > +repairs on a subset of the filesystem.
> > +While this is going on, other parts continue processing IO requests.
> > +Even if a piece of filesystem metadata can only be regenerated by
> > scanning the
> > +entire system, the scan can still be done in the background while
> > other file
> > +operations continue.
> > +
> > +In summary, online fsck takes advantage of resource sharding and
> > redundant
> > +metadata to enable targeted checking and repair operations while the
> > system
> > +is running.
> > +This capability will be coupled to automatic system management so
> > that
> > +autonomous self-healing of XFS maximizes service availability.
> > 
> 
> Nits and paraphrases aside, I think this looks pretty good?

Woot.  Thanks for digging in! :)

> Allison
> 

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 02/14] xfs: document the general theory underlying online fsck design
  2023-01-11  1:25     ` Allison Henderson
@ 2023-01-11 23:39       ` Darrick J. Wong
  2023-01-12  0:29         ` Dave Chinner
  2023-01-18  0:03         ` Allison Henderson
  0 siblings, 2 replies; 220+ messages in thread
From: Darrick J. Wong @ 2023-01-11 23:39 UTC (permalink / raw)
  To: Allison Henderson
  Cc: Catherine Hoang, david, willy, linux-xfs, Chandan Babu,
	linux-fsdevel, hch

On Wed, Jan 11, 2023 at 01:25:12AM +0000, Allison Henderson wrote:
> On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Start the second chapter of the online fsck design documentation.
> > This covers the general theory underlying how online fsck works.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  .../filesystems/xfs-online-fsck-design.rst         |  366 ++++++++++++++++++++
> >  1 file changed, 366 insertions(+)
> > 
> > 
> > diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst
> > b/Documentation/filesystems/xfs-online-fsck-design.rst
> > index 25717ebb5f80..a03a7b9f0250 100644
> > --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> > +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> > @@ -197,3 +197,369 @@ metadata to enable targeted checking and repair
> > operations while the system
> >  is running.
> >  This capability will be coupled to automatic system management so
> > that
> >  autonomous self-healing of XFS maximizes service availability.
> > +
> > +2. Theory of Operation
> > +======================
> > +
> > +Because it is necessary for online fsck to lock and scan live
> > metadata objects,
> > +online fsck consists of three separate code components.
> > +The first is the userspace driver program ``xfs_scrub``, which is
> > responsible
> > +for identifying individual metadata items, scheduling work items for
> > them,
> > +reacting to the outcomes appropriately, and reporting results to the
> > system
> > +administrator.
> > +The second and third are in the kernel, which implements functions
> > to check
> > +and repair each type of online fsck work item.
> > +
> > ++------------------------------------------------------------------+
> > +| **Note**:                                                        |
> > ++------------------------------------------------------------------+
> > +| For brevity, this document shortens the phrase "online fsck work |
> > +| item" to "scrub item".                                           |
> > ++------------------------------------------------------------------+
> > +
> > +Scrub item types are delineated in a manner consistent with the Unix
> > design
> > +philosophy, which is to say that each item should handle one aspect
> > of a
> > +metadata structure, and handle it well.
> > +
> > +Scope
> > +-----
> > +
> > +In principle, online fsck should be able to check and to repair
> > everything that
> > +the offline fsck program can handle.
> > +However, the adjective *online* brings with it the limitation that
> > online fsck
> > +cannot deal with anything that prevents the filesystem from going on
> > line, i.e.
> > +mounting.
> Are there really any other operations that do that other than mount?

No.

> I think this reads cleaner:
> 
> By definition, online fsck can only check and repair an online
> filesystem.  It cannot check mounting operations which start from an
> offline state.

Now that I think about this some more, this whole sentence doesn't make
sense.  xfs_scrub can *definitely* detect and fix latent errors that
would prevent the /next/ mount from succeeding.  It's only the fuzz test
suite that stumbles over this, and only because xfs_db cannot fuzz
mounted filesystems.

"However, online fsck cannot be running 100% of the time, which means
that latent errors may creep in after a scrub completes.
If these errors cause the next mount to fail, offline fsck is the only
solution."

> > +This limitation means that maintenance of the offline fsck tool will
> > continue.
> > +A second limitation of online fsck is that it must follow the same
> > resource
> > +sharing and lock acquisition rules as the regular filesystem.
> > +This means that scrub cannot take *any* shortcuts to save time,
> > because doing
> > +so could lead to concurrency problems.
> > +In other words, online fsck will never be able to fix 100% of the
> > +inconsistencies that offline fsck can repair, 
> Hmm, what inconsistencies cannot be repaired as a result of the "no
> shortcut" rule?  I'm all for keeping things short and to the point, but
> since this section is about scope, I'd give it at least a brief bullet
> list

Hmm.  I can't think of any off the top of my head.  Given the rewording
earlier, I think it's more accurate to say:

"In other words, online fsck is not a complete replacement for offline
fsck, and a complete run of online fsck may take longer than offline
fsck."

> > and a complete run of online fsck
> > +may take longer.
> > +However, both of these limitations are acceptable tradeoffs to
> > satisfy the
> > +different motivations of online fsck, which are to **minimize system
> > downtime**
> > +and to **increase predictability of operation**.
> > +
> > +.. _scrubphases:
> > +
> > +Phases of Work
> > +--------------
> > +
> > +The userspace driver program ``xfs_scrub`` splits the work of
> > checking and
> > +repairing an entire filesystem into seven phases.
> > +Each phase concentrates on checking specific types of scrub items
> > and depends
> > +on the success of all previous phases.
> > +The seven phases are as follows:
> > +
> > +1. Collect geometry information about the mounted filesystem and
> > computer,
> > +   discover the online fsck capabilities of the kernel, and open the
> > +   underlying storage devices.
> > +
> > +2. Check allocation group metadata, all realtime volume metadata,
> > and all quota
> > +   files.
> > +   Each metadata structure is scheduled as a separate scrub item.
> Like an intent item?

No, these scrub items are struct scrub_item objects that exist solely
within the userspace program code.

> > +   If corruption is found in the inode header or inode btree and
> > ``xfs_scrub``
> > +   is permitted to perform repairs, then those scrub items are
> > repaired to
> > +   prepare for phase 3.
> > +   Repairs are implemented by resubmitting the scrub item to the
> > kernel with
> If I'm understanding this correctly:
> Repairs are implemented as intent items that are queued and committed
> just as any filesystem operation.
> 
> ?

I don't want to go too deep into this prematurely, but...

xfs_scrub (the userspace program) needs to track which metadata objects
have been checked and which ones need repairs.  The current codebase
(ab)uses struct xfs_scrub_metadata, but it's very memory inefficient.
I replaced it with a new struct scrub_item that stores (a) all the
handle information to identify the inode/AG/rt group/whatever; and (b)
the state of all the checks that can be applied to that item:

struct scrub_item {
	/*
	 * Information we need to call the scrub and repair ioctls.
	 * Per-AG items should set the ino/gen fields to -1; per-inode
	 * items should set sri_agno to -1; and per-fs items should set
	 * all three fields to -1.  Or use the macros below.
	 */
	__u64			sri_ino;
	__u32			sri_gen;
	__u32			sri_agno;

	/* Bitmask of scrub types that were scheduled here. */
	__u32			sri_selected;

	/* Scrub item state flags, one for each XFS_SCRUB_TYPE. */
	__u8			sri_state[XFS_SCRUB_TYPE_NR];

	/* Track scrub and repair call retries for each scrub type. */
	__u8			sri_tries[XFS_SCRUB_TYPE_NR];

	/* Were there any corruption repairs needed? */
	bool			sri_inconsistent:1;

	/* Are we revalidating after repairs? */
	bool			sri_revalidate:1;
};

The first three fields are passed to the kernel via scrub ioctl and
describe a particular xfs domain (files, AGs, etc).  The rest of the
structure store state for each type of repair that can be performed
against that domain.

IOWs, xfs_scrub uses struct scrub_item objects to generate ioctl calls
to the kernel to check and repair things.  The kernel reads the ioctl
information, figures out what needs to be done, and then does the usual
get transaction -> lock things -> make updates -> commit dance to make
corrections to the fs.  Those corrections include log intent items, but
there's no tight coupling between log intent items and scrub_items.

Side note: The kernel repair code used to use intents to rebuild a
structure, but nowadays it uses the btree bulk loader code to replace
btrees wholesale and in a single atomic commit.  Now we use intents
primarily to free preallocated space if the repair fails.
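
To make the phase 2/3 flow concrete, here is a rough sketch of how one
field set from the scrub_item above might be turned into
XFS_IOC_SCRUB_METADATA calls.  The scrub_one() helper, the header path,
and the mount point are assumptions for illustration; the real
xfs_scrub handles retries, cross-referencing, and error reporting.

/* Hedged sketch of the check -> repair -> recheck loop for one AG scrub. */
#include <sys/ioctl.h>
#include <fcntl.h>
#include <linux/types.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <xfs/xfs.h>	/* XFS_IOC_SCRUB_METADATA, from the xfsprogs headers */

static int scrub_one(int fsfd, __u32 type, __u32 agno, bool allow_repair)
{
	struct xfs_scrub_metadata sm;

	memset(&sm, 0, sizeof(sm));
	sm.sm_type = type;
	sm.sm_agno = agno;

	/* Step 1: check only. */
	if (ioctl(fsfd, XFS_IOC_SCRUB_METADATA, &sm) < 0)
		return -1;
	if (!(sm.sm_flags & XFS_SCRUB_OFLAG_CORRUPT))
		return 0;
	if (!allow_repair)
		return 1;

	/* Steps 2 and 3: resubmit with the repair flag; the kernel rechecks. */
	sm.sm_flags = XFS_SCRUB_IFLAG_REPAIR;
	if (ioctl(fsfd, XFS_IOC_SCRUB_METADATA, &sm) < 0)
		return -1;
	return (sm.sm_flags & XFS_SCRUB_OFLAG_CORRUPT) ? 1 : 0;
}

int main(void)
{
	int fsfd = open("/mnt", O_RDONLY);	/* assumed XFS mount point */

	if (fsfd < 0) {
		perror("open");
		return 1;
	}
	/* Check (and maybe repair) the free space by-block btree of AG 0. */
	printf("scrub result: %d\n",
	       scrub_one(fsfd, XFS_SCRUB_TYPE_BNOBT, 0, true));
	close(fsfd);
	return 0;
}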

> > +   the repair flag enabled; this is discussed in the next section.
> > +   Optimizations and all other repairs are deferred to phase 4.
> I guess I'll come back to it. 
> 
> > +
> > +3. Check all metadata of every file in the filesystem.
> > +   Each metadata structure is also scheduled as a separate scrub
> > item.
> > +   If repairs are needed, ``xfs_scrub`` is permitted to perform
> > repairs,
> If repairs are needed and ``xfs_scrub`` is permitted

Fixed.

> ?
> > +   and there were no problems detected during phase 2, then those
> > scrub items
> > +   are repaired.
> > +   Optimizations and unsuccessful repairs are deferred to phase 4.
> > +
> > +4. All remaining repairs and scheduled optimizations are performed
> > during this
> > +   phase, if the caller permits them.
> > +   Before starting repairs, the summary counters are checked and any
> Did we talk about summary counters yet?  Maybe worth a blub. Otherwise
> this may not make sense with out skipping ahead or into the code

Nope.  I'll add that to the previous patch when I introduce primary and
secondary metadata.  Good catch!

"Summary metadata, as the name implies, condense information contained
in primary metadata for performance reasons."

> > necessary
> > +   repairs are performed so that subsequent repairs will not fail
> > the resource
> > +   reservation step due to wildly incorrect summary counters.
> > +   Unsuccessful repairs are requeued as long as forward progress on
> > repairs is
> > +   made somewhere in the filesystem.
> > +   Free space in the filesystem is trimmed at the end of phase 4 if
> > the
> > +   filesystem is clean.
> > +
> > +5. By the start of this phase, all primary and secondary filesystem
> > metadata
> > +   must be correct.
> I think maybe the definitions of primary and secondary metadata should
> move up before the phases section.  Otherwise the reader has to skip
> ahead to know what that means.

Yep, now primary, secondary, and summary metadata are defined in section
1.  Very good comment.

> > +   Summary counters such as the free space counts and quota resource
> > counts
> > +   are checked and corrected.
> > +   Directory entry names and extended attribute names are checked
> > for
> > +   suspicious entries such as control characters or confusing
> > Unicode sequences
> > +   appearing in names.
> > +
> > +6. If the caller asks for a media scan, read all allocated and
> > written data
> > +   file extents in the filesystem.
> > +   The ability to use hardware-assisted data file integrity checking
> > is new
> > +   to online fsck; neither of the previous tools have this
> > capability.
> > +   If media errors occur, they will be mapped to the owning files
> > and reported.
> > +
> > +7. Re-check the summary counters and presents the caller with a
> > summary of
> > +   space usage and file counts.
> > +
> > +Steps for Each Scrub Item
> > +-------------------------
> > +
> > +The kernel scrub code uses a three-step strategy for checking and
> > repairing
> > +the one aspect of a metadata object represented by a scrub item:
> > +
> > +1. The scrub item of interest is checked for corruptions; opportunities for
> > +   optimization; and for values that are directly controlled by the
> > system
> > +   administrator but look suspicious.
> > +   If the item is not corrupt or does not need optimization,
> > resource are
> > +   released and the positive scan results are returned to userspace.
> > +   If the item is corrupt or could be optimized but the caller does
> > not permit
> > +   this, resources are released and the negative scan results are
> > returned to
> > +   userspace.
> > +   Otherwise, the kernel moves on to the second step.
> > +
> > +2. The repair function is called to rebuild the data structure.
> > +   Repair functions generally choose to rebuild a structure from other metadata
> > +   rather than try to salvage the existing structure.
> > +   If the repair fails, the scan results from the first step are
> > returned to
> > +   userspace.
> > +   Otherwise, the kernel moves on to the third step.
> > +
> > +3. In the third step, the kernel runs the same checks over the new
> > metadata
> > +   item to assess the efficacy of the repairs.
> > +   The results of the reassessment are returned to userspace.
> > +
> > +Classification of Metadata
> > +--------------------------
> > +
> > +Each type of metadata object (and therefore each type of scrub item)
> > is
> > +classified as follows:
> > +
> > +Primary Metadata
> > +````````````````
> > +
> > +Metadata structures in this category should be most familiar to
> > filesystem
> > +users either because they are directly created by the user or they
> > index
> > +objects created by the user
> I think I would just jump straight into a brief list.  The above is a
> bit vague, and documentation that tells you you should already know
> what it is, doesn't add much.  Again, I think too much poetry might be
> why you're having a hard time getting responses.

Done:

- Free space and reference count information

- Inode records and indexes

- Storage mapping information for file data

- Directories

- Extended attributes

- Symbolic links

- Quota limits

- Link counts


> > +Most filesystem objects fall into this class.
> Most filesystem objects created by users fall into this class, such as
> inode, directories, allocation groups and so on.
> > +Resource and lock acquisition for scrub code follows the same order
> > as regular
> > +filesystem accesses.
> 
> Lock acquisition for these resources will follow the same order for
> scrub as a regular filesystem access.

Yes, that is clearer.  I think I'll phrase this more actively:

"Scrub obeys the same rules as regular filesystem accesses for resource
and lock acquisition."

> > +
> > +Primary metadata objects are the simplest for scrub to process.
> > +The principal filesystem object (either an allocation group or an
> > inode) that
> > +owns the item being scrubbed is locked to guard against concurrent
> > updates.
> > +The check function examines every record associated with the type
> > for obvious
> > +errors and cross-references healthy records against other metadata
> > to look for
> > +inconsistencies.
> > +Repairs for this class of scrub item are simple, since the repair
> > function
> > +starts by holding all the resources acquired in the previous step.
> > +The repair function scans available metadata as needed to record all
> > the
> > +observations needed to complete the structure.
> > +Next, it stages the observations in a new ondisk structure and
> > commits it
> > +atomically to complete the repair.
> > +Finally, the storage from the old data structure are carefully
> > reaped.
> > +
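The stage-then-commit pattern described above can be sketched in plain
C with invented names: build the complete replacement structure off to
the side, then publish it with a single pointer swap and reap the old
one.  The kernel's btree bulk loader does the real version of this
against ondisk blocks rather than heap memory.

/* Toy illustration of staging a rebuilt index and committing it atomically. */
#include <stdlib.h>
#include <string.h>
#include <stdio.h>

struct index {
	unsigned long	*keys;
	size_t		nr;
};

/* The "live" structure that readers follow; swapping it is the commit. */
static struct index *live_index;

static int cmp_key(const void *a, const void *b)
{
	unsigned long ka = *(const unsigned long *)a;
	unsigned long kb = *(const unsigned long *)b;

	return (ka > kb) - (ka < kb);
}

/* Stage a new index from observed records without touching the live one. */
static struct index *stage_index(const unsigned long *obs, size_t nr)
{
	struct index *idx = malloc(sizeof(*idx));

	idx->keys = malloc(nr * sizeof(*idx->keys));
	memcpy(idx->keys, obs, nr * sizeof(*idx->keys));
	qsort(idx->keys, nr, sizeof(*idx->keys), cmp_key);
	idx->nr = nr;
	return idx;
}

int main(void)
{
	unsigned long observations[] = { 42, 7, 19, 3 };
	struct index *new_idx = stage_index(observations, 4);

	/* Single-step commit: readers now see the fully built structure. */
	struct index *old_idx = live_index;
	live_index = new_idx;

	/* Reap whatever backed the old structure. */
	free(old_idx);

	printf("committed index with %zu keys\n", live_index->nr);
	return 0;
}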
> > +Because ``xfs_scrub`` locks a primary object for the duration of the
> > repair,
> > +this is effectively an offline repair operation performed on a
> > subset of the
> > +filesystem.
> > +This minimizes the complexity of the repair code because it is not
> > necessary to
> > +handle concurrent updates from other threads, nor is it necessary to
> > access
> > +any other part of the filesystem.
> > +As a result, indexed structures can be rebuilt very quickly, and
> > programs
> > +trying to access the damaged structure will be blocked until repairs
> > complete.
> > +The only infrastructure needed by the repair code are the staging
> > area for
> > +observations and a means to write new structures to disk.
> > +Despite these limitations, the advantage that online repair holds is
> > clear:
> > +targeted work on individual shards of the filesystem avoids total
> > loss of
> > +service.
> > +
> > +This mechanism is described in section 2.1 ("Off-Line Algorithm") of
> > +V. Srinivasan and M. J. Carey, `"Performance of On-Line Index
> > Construction
> > +Algorithms" <https://dl.acm.org/doi/10.5555/645336.649870>`_,
> Hmm, this article is not displaying for me.  If the link is abandoned,
> probably there's not much need to keep it around

The actual paper is not directly available through that ACM link, but
the DOI is what I used to track down a paper copy(!) of that paper as
published in a journal.

(In turn, that journal is "Advances in Database Technology - EDBT 1992";
I found it in the NYU library.  Amazingly, they sold it to me.)

> > +*Extending Database Technology*, pp. 293-309, 1992.
> > +
> > +Most primary metadata repair functions stage their intermediate
> > results in an
> > +in-memory array prior to formatting the new ondisk structure, which
> > is very
> > +similar to the list-based algorithm discussed in section 2.3 ("List-
> > Based
> > +Algorithms") of Srinivasan.
> > +However, any data structure builder that maintains a resource lock
> > for the
> > +duration of the repair is *always* an offline algorithm.
> > +
> > +Secondary Metadata
> > +``````````````````
> > +
> > +Metadata structures in this category reflect records found in
> > primary metadata,
> 
> such as rmap and parent pointer attributes.  But they are only
> needed...
> 
> ?

Euugh, this section needs some restructuring to get rid of redundant
sentences.  How about:

"Metadata structures in this category reflect records found in primary
metadata, but are only needed for online fsck or for reorganization of
the filesystem.

"Secondary metadata include:

- Reverse mapping information

- Directory parent pointers

"This class of metadata is difficult for scrub to process because scrub
attaches to the secondary object but needs to check primary metadata,
which runs counter to the usual order of resource acquisition.
Frequently, this means that full filesystem scans are necessary to
rebuild the metadata.
Check functions..."

> > +but are only needed for online fsck or for reorganization of the
> > filesystem.
> > +Resource and lock acquisition for scrub code do not follow the same
> > order as
> > +regular filesystem accesses, and may involve full filesystem scans.
> > +
> > +Secondary metadata objects are difficult for scrub to process,
> > because scrub
> > +attaches to the secondary object but needs to check primary
> > metadata, which
> > +runs counter to the usual order of resource acquisition.
> bummer :-(

Yup.

> > +Check functions can be limited in scope to reduce runtime.
> > +Repairs, however, require a full scan of primary metadata, which can
> > take a
> > +long time to complete.
> > +Under these conditions, ``xfs_scrub`` cannot lock resources for the
> > entire
> > +duration of the repair.
> > +
> > +Instead, repair functions set up an in-memory staging structure to
> > store
> > +observations.
> > +Depending on the requirements of the specific repair function, the
> > staging
> 
> 
> > +index can have the same format as the ondisk structure, or it can
> > have a design
> > +specific to that repair function.
> ...will have either the same format as the ondisk structure or a
> structure specific to the repair function.

Fixed.

> > +The next step is to release all locks and start the filesystem scan.
> > +When the repair scanner needs to record an observation, the staging
> > data are
> > +locked long enough to apply the update.
> > +Simultaneously, the repair function hooks relevant parts of the
> > filesystem to
> > +apply updates to the staging data if the update pertains to an
> > object that
> > +has already been scanned by the index builder.
> While a scan is in progress, function hooks are used to apply
> filesystem updates to both the object and the staging data if the
> object has already been scanned.
> 
> ?

The hooks are used to apply updates to the repair staging data, but they
don't apply regular filesystem updates.

The usual process runs something like this:

  Lock -> update -> update -> commit

With a scan in progress, say we hook the second update.  The instruction
flow becomes:

  Lock -> update -> update -> hook -> update staging data -> commit

Maybe something along the following would be better?

"While the filesystem scan is in progress, the repair function hooks the
filesystem so that it can apply pending filesystem updates to the
staging information."
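
To make that flow concrete, here is a rough sketch of what such a hook
might look like.  All of the names below are hypothetical and only
illustrate the idea; they are not the actual scrub/repair interfaces:

  /* needs <linux/types.h> and <linux/mutex.h>; hypothetical sketch only */
  struct xrep_scan {
      u64             scan_cursor;    /* last object visited by the scan */
      struct mutex    staging_lock;   /* protects the staging index */
      /* in-memory staging index of observations lives here */
  };

  /* called from the regular update path while the hook is armed */
  static void
  xrep_hook_event(struct xrep_scan *rs, u64 object_id, const void *delta)
  {
      /*
       * Only mirror the update into the staging data if the scanner has
       * already visited this object; otherwise the scan itself will pick
       * up the new state when it gets there.
       */
      if (object_id > rs->scan_cursor)
          return;

      mutex_lock(&rs->staging_lock);
      /* apply @delta to the in-memory staging index here */
      mutex_unlock(&rs->staging_lock);
  }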

> > +Once the scan is done, the owning object is re-locked, the live data
> > is used to
> > +write a new ondisk structure, and the repairs are committed
> > atomically.
> > +The hooks are disabled and the staging area is freed.
> > +Finally, the storage from the old data structure are carefully
> > reaped.
> > +
> > +Introducing concurrency helps online repair avoid various locking
> > problems, but
> > +comes at a high cost to code complexity.
> > +Live filesystem code has to be hooked so that the repair function
> > can observe
> > +updates in progress.
> > +The staging area has to become a fully functional parallel structure
> > so that
> > +updates can be merged from the hooks.
> > +Finally, the hook, the filesystem scan, and the inode locking model
> > must be
> > +sufficiently well integrated that a hook event can decide if a given
> > update
> > +should be applied to the staging structure.
> > +
> > +In theory, the scrub implementation could apply these same
> > techniques for
> > +primary metadata, but doing so would make it massively more complex
> > and less
> > +performant.
> > +Programs attempting to access the damaged structures are not blocked
> > from
> > +operation, which may cause application failure or an unplanned
> > filesystem
> > +shutdown.
> > +
> > +Inspiration for the secondary metadata repair strategy was drawn
> > from section
> > +2.4 of Srinivasan above, and sections 2 ("NSF: Index Build Without
> > Side-File")
> > +and 3.1.1 ("Duplicate Key Insert Problem") in C. Mohan, `"Algorithms
> > for
> > +Creating Indexes for Very Large Tables Without Quiescing Updates"
> > +<https://dl.acm.org/doi/10.1145/130283.130337>`_, 1992.
> This one works
> 
> > +
> > +The sidecar index mentioned above bears some resemblance to the side
> > file
> > +method mentioned in Srinivasan and Mohan.
> > +Their method consists of an index builder that extracts relevant
> > record data to
> > +build the new structure as quickly as possible; and an auxiliary
> > structure that
> > +captures all updates that would be committed to the index by other
> > threads were
> > +the new index already online.
> > +After the index building scan finishes, the updates recorded in the
> > side file
> > +are applied to the new index.
> > +To avoid conflicts between the index builder and other writer
> > threads, the
> > +builder maintains a publicly visible cursor that tracks the progress
> > of the
> > +scan through the record space.
> > +To avoid duplication of work between the side file and the index
> > builder, side
> > +file updates are elided when the record ID for the update is greater
> > than the
> > +cursor position within the record ID space.
> > +
> > +To minimize changes to the rest of the codebase, XFS online repair
> > keeps the
> > +replacement index hidden until it's completely ready to go.
> > +In other words, there is no attempt to expose the keyspace of the
> > new index
> > +while repair is running.
> > +The complexity of such an approach would be very high and perhaps
> > more
> > +appropriate to building *new* indices.
> > +
> > +**Question**: Can the full scan and live update code used to
> > facilitate a
> > +repair also be used to implement a comprehensive check?
> > +
> > +*Answer*: Probably, though this has not yet been studied.
> I kinda feel like discussion Q&As need to be wrapped up before we can
> call things done.  If this is all there was to the answer, then let's
> clean out the discussion notes.

Oh, the situation here is worse than that -- in theory, check would be
much stronger if each scrub function employed these live scans to build
a shadow copy of the metadata and then compared the records of both.

However, that would increase the amount of work each scrubber has to do
considerably, and the runtime of those scrubbers would go up.  The other issue
is that live scan hooks would have to proliferate through much more of
the filesystem.  That's rather more invasive to the codebase than most
of fsck, so I want people to look at the usage models for the handful of
scrubbers that really require it before I spread it around elsewhere.
Making that kind of change isn't that difficult, but I want to merge
this stuff before moving on to experimenting with improvements of that
scale.
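
If that experiment ever happens, the checking side would boil down to
walking the shadow index and comparing records, roughly like this
(hypothetical names and types, not real scrub interfaces):

  /* assumes <linux/string.h> for memcmp(); everything else is made up */
  static int
  xchk_compare_shadow(struct xchk_shadow *shadow, struct xchk_cursor *cur)
  {
      struct xchk_rec    srec, orec;
      int                error;

      while ((error = xchk_shadow_next(shadow, &srec)) == 0) {
          error = xchk_ondisk_lookup(cur, &srec.key, &orec);
          if (error)
              return error;
          if (memcmp(&srec, &orec, sizeof(srec)))
              return -EFSCORRUPTED;    /* shadow and ondisk disagree */
      }

      /* a full check would also flag ondisk records missing from the shadow */
      return error == -ENODATA ? 0 : error;
  }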

> > +
> > +Summary Information
> > +```````````````````
> > +
> Oh, perhaps this section could move up with the other metadata
> definitions.  That way the reader already has an idea of what these
> terms are referring to before we get into how they are used during the
> phases.

Yeah, I think/hope this will be less of a problem now that section 1
defines all three types of metadata.  The start of this section now
reads:

"Metadata structures in this last category summarize the contents of
primary metadata records.
These are often used to speed up resource usage queries, and are many
times smaller than the primary metadata which they represent.

Examples of summary information include:

- Summary counts of free space and inodes

- File link counts from directories

- Quota resource usage counts

"Check and repair require full filesystem scans, but resource and lock
acquisition follow the same paths as regular filesystem accesses."
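
Since quota usage and file link counts are plain integer counters, the
staging data for these repairs can stay very simple -- a shadow count
built up by the scanner plus a delta fed in by the live update hooks, as
described below.  Roughly, and with hypothetical names:

  /* needs <linux/types.h> and <linux/spinlock.h>; illustrative only */
  struct xrep_shadow_dquot {
      spinlock_t    lock;
      u64           bcount;    /* block usage observed by the scan */
      u64           icount;    /* inode usage observed by the scan */
      s64           bdelta;    /* pending block deltas from live txns */
      s64           idelta;    /* pending inode deltas from live txns */
  };

  /* hooked from transaction commit while the quotacheck scan runs */
  static void
  xrep_shadow_dquot_delta(struct xrep_shadow_dquot *sd, s64 bd, s64 id)
  {
      spin_lock(&sd->lock);
      sd->bdelta += bd;
      sd->idelta += id;
      spin_unlock(&sd->lock);
  }

  /* the value written back to the dquot once the scan completes */
  static inline u64 xrep_shadow_dquot_bcount(struct xrep_shadow_dquot *sd)
  {
      return sd->bcount + sd->bdelta;
  }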

> > +Metadata structures in this last category summarize the contents of
> > primary
> > +metadata records.
> > +These are often used to speed up resource usage queries, and are
> > many times
> > +smaller than the primary metadata which they represent.
> > +Check and repair both require full filesystem scans, but resource
> > and lock
> > +acquisition follow the same paths as regular filesystem accesses.
> > +
> > +The superblock summary counters have special requirements due to the
> > underlying
> > +implementation of the incore counters, and will be treated
> > separately.
> > +Check and repair of the other types of summary counters (quota
> > resource counts
> > +and file link counts) employ the same filesystem scanning and
> > hooking
> > +techniques as outlined above, but because the underlying data are
> > sets of
> > +integer counters, the staging data need not be a fully functional
> > mirror of the
> > +ondisk structure.
> > +
> > +Inspiration for quota and file link count repair strategies was
> > drawn from
> > +sections 2.12 ("Online Index Operations") through 2.14 ("Incremental
> > View
> > +Maintenace") of G.  Graefe, `"Concurrent Queries and Updates in
> > Summary Views
> > +and Their Indexes"
> > +<
> > http://www.odbms.org/wp-content/uploads/2014/06/Increment-locks.pdf>`
> > _, 2011.
> I wonder if these citations would do better as footnotes?  Just to
> kinda keep the body of the document tidy and flowing well.

Yes, if this were a paginated document.

> > +
> > +Since quotas are non-negative integer counts of resource usage,
> > online
> > +quotacheck can use the incremental view deltas described in section
> > 2.14 to
> > +track pending changes to the block and inode usage counts in each
> > transaction,
> > +and commit those changes to a dquot side file when the transaction
> > commits.
> > +Delta tracking is necessary for dquots because the index builder
> > scans inodes,
> > +whereas the data structure being rebuilt is an index of dquots.
> > +Link count checking combines the view deltas and commit step into
> > one because
> > +it sets attributes of the objects being scanned instead of writing
> > them to a
> > +separate data structure.
> > +Each online fsck function will be discussed as case studies later in
> > this
> > +document.
> > +
> > +Risk Management
> > +---------------
> > +
> > +During the development of online fsck, several risk factors were
> > identified
> > +that may make the feature unsuitable for certain distributors and
> > users.
> > +Steps can be taken to mitigate or eliminate those risks, though at a
> > cost to
> > +functionality.
> > +
> > +- **Decreased performance**: Adding metadata indices to the
> > filesystem
> > +  increases the time cost of persisting changes to disk, and the
> > reverse space
> > +  mapping and directory parent pointers are no exception.
> > +  System administrators who require the maximum performance can
> > disable the
> > +  reverse mapping features at format time, though this choice
> > dramatically
> > +  reduces the ability of online fsck to find inconsistencies and
> > repair them.
> > +
> > +- **Incorrect repairs**: As with all software, there might be
> > defects in the
> > +  software that result in incorrect repairs being written to the
> > filesystem.
> > +  Systematic fuzz testing (detailed in the next section) is employed
> > by the
> > +  authors to find bugs early, but it might not catch everything.
> > +  The kernel build system provides Kconfig options
> > (``CONFIG_XFS_ONLINE_SCRUB``
> > +  and ``CONFIG_XFS_ONLINE_REPAIR``) to enable distributors to choose
> > not to
> > +  accept this risk.
> > +  The xfsprogs build system has a configure option (``--enable-
> > scrub=no``) that
> > +  disables building of the ``xfs_scrub`` binary, though this is not
> > a risk
> > +  mitigation if the kernel functionality remains enabled.
> > +
> > +- **Inability to repair**: Sometimes, a filesystem is too badly
> > damaged to be
> > +  repairable.
> > +  If the keyspaces of several metadata indices overlap in some
> > manner but a
> > +  coherent narrative cannot be formed from records collected, then
> > the repair
> > +  fails.
> > +  To reduce the chance that a repair will fail with a dirty
> > transaction and
> > +  render the filesystem unusable, the online repair functions have
> > been
> > +  designed to stage and validate all new records before committing
> > the new
> > +  structure.
> > +
> > +- **Misbehavior**: Online fsck requires many privileges -- raw IO to
> > block
> > +  devices, opening files by handle, ignoring Unix discretionary
> > access control,
> > +  and the ability to perform administrative changes.
> > +  Running this automatically in the background scares people, so the
> > systemd
> > +  background service is configured to run with only the privileges
> > required.
> > +  Obviously, this cannot address certain problems like the kernel
> > crashing or
> > +  deadlocking, but it should be sufficient to prevent the scrub
> > process from
> > +  escaping and reconfiguring the system.
> > +  The cron job does not have this protection.
> > +
> 
> I think the fuzz part is one I would consider letting go.  All features
> need to go through a period of stabilizing, and we can't really control
> how some people respond to it, so I don't think this part adds much.  I
> think the document would do well to be trimmed where it can so as to
> stay more focused.

It took me a minute to realize that this comment applies to the text
below it.  Right?

> > +- **Fuzz Kiddiez**: There are many people now who seem to think that
> > running
> > +  automated fuzz testing of ondisk artifacts to find mischievous
> > behavior and
> > +  spraying exploit code onto the public mailing list for instant
> > zero-day
> > +  disclosure is somehow of some social benefit.

I want to keep this bit because it keeps happening[2].  Some folks
(huawei/alibaba?) have started to try to fix the bugs that their robots
find, and kudos to them!

You might have noticed that Googlers turned their firehose back on and
once again aren't doing anything to fix the problems they find.  How
very Googley of them.

[2] https://lwn.net/Articles/904293/

> > +  In the view of this author, the benefit is realized only when the
> > fuzz
> > +  operators help to **fix** the flaws, but this opinion apparently
> > is not
> > +  widely shared among security "researchers".
> > +  The XFS maintainers' continuing ability to manage these events
> > presents an
> > +  ongoing risk to the stability of the development process.
> > +  Automated testing should front-load some of the risk while the
> > feature is
> > +  considered EXPERIMENTAL.
> > +
> > +Many of these risks are inherent to software programming.
> > +Despite this, it is hoped that this new functionality will prove
> > useful in
> > +reducing unexpected downtime.
> > 
> 
> Paraphrasing and reorganizing suggestions aside, I think it looks
> pretty good

Ok, thank you!

--D

> Allison

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 01/14] xfs: document the motivation for online fsck design
  2023-01-07  5:01     ` Allison Henderson
  2023-01-11 19:10       ` Darrick J. Wong
@ 2023-01-12  0:10       ` Darrick J. Wong
  1 sibling, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2023-01-12  0:10 UTC (permalink / raw)
  To: Allison Henderson
  Cc: Catherine Hoang, david, willy, linux-xfs, Chandan Babu,
	linux-fsdevel, hch

On Sat, Jan 07, 2023 at 05:01:54AM +0000, Allison Henderson wrote:

<snip> There was one part of your reply that I wanted to handle
separately:

> Something that I've noticed in my training sessions is that often
> times, less is more.  People really only absorb so much over a
> particular duration of time, so sometimes having too much detail in the
> context is not as helpful as you might think.

I'm very worried about this ^^^ exact problem making it more difficult
to merge online fsck.

As the online fsck patchset grew and grew and grew, I decided that it
was absolutely necessary to write a design document to condense the
information from 1200 patches, for this is the diffstat for the code
changes themselves:

225 files changed, 41244 insertions(+), 4388 deletions(-)
205 files changed, 16802 insertions(+), 3405 deletions(-)
438 files changed, 20123 insertions(+), 446 deletions(-)

That's 78169 insertions and 8239 deletions, or about ~70k new LOC, and
that doesn't include the scrub code that's already upstream (~60000).
It's wild that online fsck is larger than the filesystem.

You might recall that I sent it out for review twice last year, and the
feedback I got from the experienced folk was that I needed to write in
much more detail about the design -- specifically, what I was doing with
the fs hooks, and all the data structures that I was layering atop tmpfs
files to support rebuilds.

Before I even got to /that/ point, the design documentation had reached
4500 lines (or 90 pages) long, at which point I decided that it was
necessary to write a summary to condense the 4500 lines down to a single
chapter.

Hence part 1 about what is a filesystem check.  It's supposed to
introduce the very very broad concepts to a reader before they dive into
successively higher levels of detail in the later parts.

My guess is that the audience for the code deluges and this design doc
fall into roughly these categories:

* Experienced people who have been around XFS and Linux for a very long
  time.  These people, I think, would benefit from scanning parts 2 and
  3 as a refresher.  Then they can scan parts 5 and 6 before moving on
  to the code.

* Intermediate people, who probably need to read parts 2 - 6 and
  understand them thoroughly before reading the code.  The case studies
  in part 5 should be used as a guide to the patchsets.

* People who have no idea what filesystems and fsck are, want to know
  about them, but don't have any pre-existing knowledge.

> A lot of times, paraphrasing excerpts to reflect the same info in a
> more compact format will help you keep audience on track (a little
> longer at least).

Yes, thank you for your help in spotting these kinds of problems.  I've
been too close to the code for years, which means I have severe myopia
about things like "Am I confusing everyone?". :/

Speaking of which, am I confusing everyone?

--D

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 02/14] xfs: document the general theory underlying online fsck design
  2023-01-11 23:39       ` Darrick J. Wong
@ 2023-01-12  0:29         ` Dave Chinner
  2023-01-18  0:03         ` Allison Henderson
  1 sibling, 0 replies; 220+ messages in thread
From: Dave Chinner @ 2023-01-12  0:29 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Allison Henderson, Catherine Hoang, willy, linux-xfs,
	Chandan Babu, linux-fsdevel, hch

On Wed, Jan 11, 2023 at 03:39:08PM -0800, Darrick J. Wong wrote:
> On Wed, Jan 11, 2023 at 01:25:12AM +0000, Allison Henderson wrote:
> > On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote:
> > > +Primary metadata objects are the simplest for scrub to process.
> > > +The principal filesystem object (either an allocation group or an
> > > inode) that
> > > +owns the item being scrubbed is locked to guard against concurrent
> > > updates.
> > > +The check function examines every record associated with the type
> > > for obvious
> > > +errors and cross-references healthy records against other metadata
> > > to look for
> > > +inconsistencies.
> > > +Repairs for this class of scrub item are simple, since the repair
> > > function
> > > +starts by holding all the resources acquired in the previous step.
> > > +The repair function scans available metadata as needed to record all
> > > the
> > > +observations needed to complete the structure.
> > > +Next, it stages the observations in a new ondisk structure and
> > > commits it
> > > +atomically to complete the repair.
> > > +Finally, the storage from the old data structure are carefully
> > > reaped.
> > > +
> > > +Because ``xfs_scrub`` locks a primary object for the duration of the
> > > repair,
> > > +this is effectively an offline repair operation performed on a
> > > subset of the
> > > +filesystem.
> > > +This minimizes the complexity of the repair code because it is not
> > > necessary to
> > > +handle concurrent updates from other threads, nor is it necessary to
> > > access
> > > +any other part of the filesystem.
> > > +As a result, indexed structures can be rebuilt very quickly, and
> > > programs
> > > +trying to access the damaged structure will be blocked until repairs
> > > complete.
> > > +The only infrastructure needed by the repair code are the staging
> > > area for
> > > +observations and a means to write new structures to disk.
> > > +Despite these limitations, the advantage that online repair holds is
> > > clear:
> > > +targeted work on individual shards of the filesystem avoids total
> > > loss of
> > > +service.
> > > +
> > > +This mechanism is described in section 2.1 ("Off-Line Algorithm") of
> > > +V. Srinivasan and M. J. Carey, `"Performance of On-Line Index
> > > Construction
> > > +Algorithms" <https://dl.acm.org/doi/10.5555/645336.649870>`_,
> > Hmm, this article is not displaying for me.  If the link is abandoned,
> > probably there's not much need to keep it around
> 
> The actual paper is not directly available through that ACM link, but
> the DOI is what I used to track down a paper copy(!) of that paper as
> published in a journal.

PDF version here:

https://minds.wisconsin.edu/bitstream/handle/1793/59524/TR1047.pdf?sequence=1

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 12/16] fuzzy: increase operation count for each fsstress invocation
  2022-12-30 22:12   ` [PATCH 12/16] fuzzy: increase operation count for each fsstress invocation Darrick J. Wong
@ 2023-01-13 19:55     ` Zorro Lang
  2023-01-13 21:28       ` Darrick J. Wong
  0 siblings, 1 reply; 220+ messages in thread
From: Zorro Lang @ 2023-01-13 19:55 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, fstests

On Fri, Dec 30, 2022 at 02:12:54PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> For online fsck stress testing, increase the number of filesystem
> operations per fsstress run to 2 million, now that we have the ability
> to kill fsstress if the user should push ^C to abort the test early.
> This should guarantee a couple of hours of continuous stress testing in
> between clearing the scratch filesystem.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  common/fuzzy |    4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> 
> diff --git a/common/fuzzy b/common/fuzzy
> index 01cf7f00d8..3e23edc9e4 100644
> --- a/common/fuzzy
> +++ b/common/fuzzy
> @@ -399,7 +399,9 @@ __stress_scrub_fsstress_loop() {
>  	local end="$1"
>  	local runningfile="$2"
>  
> -	local args=$(_scale_fsstress_args -p 4 -d $SCRATCH_MNT -n 2000 $FSSTRESS_AVOID)
> +	# As of March 2022, 2 million fsstress ops should be enough to keep
> +	# any filesystem busy for a couple of hours.
> +	local args=$(_scale_fsstress_args -p 4 -d $SCRATCH_MNT -n 2000000 $FSSTRESS_AVOID)

Can fsstress "-l 0" option help?

>  	echo "Running $FSSTRESS_PROG $args" >> $seqres.full
>  
>  	while __stress_scrub_running "$end" "$runningfile"; do
> 


^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [NYE DELUGE 1/4] xfs: all pending online scrub improvements
  2022-12-30 21:13 [NYE DELUGE 1/4] xfs: all pending online scrub improvements Darrick J. Wong
                   ` (21 preceding siblings ...)
  2022-12-30 22:13 ` [PATCHSET v24.0 0/2] fstests: race online scrub with mount state changes Darrick J. Wong
@ 2023-01-13 20:10 ` Zorro Lang
  2023-01-13 21:28   ` Darrick J. Wong
  22 siblings, 1 reply; 220+ messages in thread
From: Zorro Lang @ 2023-01-13 20:10 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: xfs, fstests

On Fri, Dec 30, 2022 at 01:13:21PM -0800, Darrick J. Wong wrote:
> Hi everyone,
> 
> As I've mentioned several times throughout 2022, I would like to merge
> the online fsck feature in time for the 2023 LTS kernel.  The first big
> step in this process is to merge all the pending bug fixes, validation
> improvements, and general reorganization of the existing metadata
> scrubbing functionality.
> 
> This first deluge starts with the design document for the entirety of
> the online fsck feature.  The design doc should be familiar to most of
> you, as it's been on the list for review for months already.  It
> outlines in brief the problems we're trying to solve, the use cases and
> testing plan, and the fundamental data structures and algorithms
> underlying the entire feature.
> 
> After that come all the code changes to wrap up the metadata checking
> part of the feature.  The biggest piece here is the scrub drains that
> allow scrub to quiesce deferred ops targeting AGs so that it can
> cross-reference recordsets.  Most of the rest is tweaking the btree code
> so that we can do keyspace scans to look for conflicting records.
> 
> For this review, I would like people to focus the following:
> 
> - Are the major subsystems sufficiently documented that you could figure
>   out what the code does?
> 
> - Do you see any problems that are severe enough to cause long term
>   support hassles? (e.g. bad API design, writing weird metadata to disk)
> 
> - Can you spot mis-interactions between the subsystems?
> 
> - What were my blind spots in devising this feature?
> 
> - Are there missing pieces that you'd like to help build?
> 
> - Can I just merge all of this?
> 
> The one thing that is /not/ in scope for this review are requests for
> more refactoring of existing subsystems.  While there are usually valid
> arguments for performing such cleanups, those are separate tasks to be
> prioritized separately.  I will get to them after merging online fsck.
> 
> I've been running daily online scrubs of every computer I own for the
> last five years, which has helped me iron out real problems in (limited
> scope) production.  All issues observed in that time have been corrected
> in this submission.

The 3 fstests patchsets of the [NYE DELUGE 1/4] look good to me. And I didn't
find more critical issues after Darrick fixed that "group name missing" problem.
After testing it for a whole week, I've decided to merge these 3 patchsets
this weekend, then we can shift to the later patchsets that are waiting for
review and merge.

Reviewed-by: Zorro Lang <zlang@redhat.com>

Thanks,
Zorro

> 
> As a warning, the patches will likely take several days to trickle in.
> All four patch deluges are based off kernel 6.2-rc1, xfsprogs 6.1, and
> fstests 2022-12-25.
> 
> Thank you all for your participation in the XFS community.  Have a safe
> New Years, and I'll see you all next year!
> 
> --D
> 


^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 12/16] fuzzy: increase operation count for each fsstress invocation
  2023-01-13 19:55     ` Zorro Lang
@ 2023-01-13 21:28       ` Darrick J. Wong
  0 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2023-01-13 21:28 UTC (permalink / raw)
  To: Zorro Lang; +Cc: linux-xfs, fstests

On Sat, Jan 14, 2023 at 03:55:25AM +0800, Zorro Lang wrote:
> On Fri, Dec 30, 2022 at 02:12:54PM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > For online fsck stress testing, increase the number of filesystem
> > operations per fsstress run to 2 million, now that we have the ability
> > to kill fsstress if the user should push ^C to abort the test early.
> > This should guarantee a couple of hours of continuous stress testing in
> > between clearing the scratch filesystem.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  common/fuzzy |    4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> > 
> > 
> > diff --git a/common/fuzzy b/common/fuzzy
> > index 01cf7f00d8..3e23edc9e4 100644
> > --- a/common/fuzzy
> > +++ b/common/fuzzy
> > @@ -399,7 +399,9 @@ __stress_scrub_fsstress_loop() {
> >  	local end="$1"
> >  	local runningfile="$2"
> >  
> > -	local args=$(_scale_fsstress_args -p 4 -d $SCRATCH_MNT -n 2000 $FSSTRESS_AVOID)
> > +	# As of March 2022, 2 million fsstress ops should be enough to keep
> > +	# any filesystem busy for a couple of hours.
> > +	local args=$(_scale_fsstress_args -p 4 -d $SCRATCH_MNT -n 2000000 $FSSTRESS_AVOID)
> 
> Can fsstress "-l 0" option help?

No.  -n determines the number of operations per loop, and -l determines
the number of loops:

$ fsstress -d dor/ -n 5 -v -s 1
0/0: mkdir d0 17
0/0: mkdir add id=0,parent=-1
0/1: link - no file
0/2: mkdir d1 17
0/2: mkdir add id=1,parent=-1
0/3: chown . 127/0 0
0/4: rename - no source filename

$ fsstress -d dor/ -n 5 -l 2 -v -s 1
0/0: mkdir d0 17
0/0: mkdir add id=0,parent=-1
0/1: link - no file
0/2: mkdir d1 17
0/2: mkdir add id=1,parent=-1
0/3: chown . 127/0 0
0/4: rename - no source filename
0/0: mkdir d2 0
0/0: mkdir add id=2,parent=-1
0/1: link - no file
0/2: mkdir d2/d3 0
0/2: mkdir add id=3,parent=2
0/3: chown d2 127/0 0
0/4: rename(REXCHANGE) d2/d3 and d2 have ancestor-descendant relationship

--D

> >  	echo "Running $FSSTRESS_PROG $args" >> $seqres.full
> >  
> >  	while __stress_scrub_running "$end" "$runningfile"; do
> > 
> 

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [NYE DELUGE 1/4] xfs: all pending online scrub improvements
  2023-01-13 20:10 ` [NYE DELUGE 1/4] xfs: all pending online scrub improvements Zorro Lang
@ 2023-01-13 21:28   ` Darrick J. Wong
  0 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2023-01-13 21:28 UTC (permalink / raw)
  To: Zorro Lang; +Cc: xfs, fstests

On Sat, Jan 14, 2023 at 04:10:33AM +0800, Zorro Lang wrote:
> On Fri, Dec 30, 2022 at 01:13:21PM -0800, Darrick J. Wong wrote:
> > Hi everyone,
> > 
> > As I've mentioned several times throughout 2022, I would like to merge
> > the online fsck feature in time for the 2023 LTS kernel.  The first big
> > step in this process is to merge all the pending bug fixes, validation
> > improvements, and general reorganization of the existing metadata
> > scrubbing functionality.
> > 
> > This first deluge starts with the design document for the entirety of
> > the online fsck feature.  The design doc should be familiar to most of
> > you, as it's been on the list for review for months already.  It
> > outlines in brief the problems we're trying to solve, the use cases and
> > testing plan, and the fundamental data structures and algorithms
> > underlying the entire feature.
> > 
> > After that come all the code changes to wrap up the metadata checking
> > part of the feature.  The biggest piece here is the scrub drains that
> > allow scrub to quiesce deferred ops targeting AGs so that it can
> > cross-reference recordsets.  Most of the rest is tweaking the btree code
> > so that we can do keyspace scans to look for conflicting records.
> > 
> > For this review, I would like people to focus the following:
> > 
> > - Are the major subsystems sufficiently documented that you could figure
> >   out what the code does?
> > 
> > - Do you see any problems that are severe enough to cause long term
> >   support hassles? (e.g. bad API design, writing weird metadata to disk)
> > 
> > - Can you spot mis-interactions between the subsystems?
> > 
> > - What were my blind spots in devising this feature?
> > 
> > - Are there missing pieces that you'd like to help build?
> > 
> > - Can I just merge all of this?
> > 
> > The one thing that is /not/ in scope for this review are requests for
> > more refactoring of existing subsystems.  While there are usually valid
> > arguments for performing such cleanups, those are separate tasks to be
> > prioritized separately.  I will get to them after merging online fsck.
> > 
> > I've been running daily online scrubs of every computer I own for the
> > last five years, which has helped me iron out real problems in (limited
> > scope) production.  All issues observed in that time have been corrected
> > in this submission.
> 
> The 3 fstests patchsets of the [NYE DELUGE 1/4] look good to me. And I didn't
> find more critical issues after Darrick fixed that "group name missing" problem.
> After testing it for a whole week, I've decided to merge these 3 patchsets
> this weekend, then we can shift to the later patchsets that are waiting for
> review and merge.
> 
> Reviewed-by: Zorro Lang <zlang@redhat.com>

Ok, thanks!

--D

> Thanks,
> Zorro
> 
> > 
> > As a warning, the patches will likely take several days to trickle in.
> > All four patch deluges are based off kernel 6.2-rc1, xfsprogs 6.1, and
> > fstests 2022-12-25.
> > 
> > Thank you all for your participation in the XFS community.  Have a safe
> > New Years, and I'll see you all next year!
> > 
> > --D
> > 
> 

^ permalink raw reply	[flat|nested] 220+ messages in thread

* [PATCH v24.2 12/14] xfs: document directory tree repairs
  2022-12-30 22:10   ` [PATCH 12/14] xfs: document directory tree repairs Darrick J. Wong
@ 2023-01-14  2:32     ` Darrick J. Wong
  2023-02-03  2:12     ` [PATCH v24.3 " Darrick J. Wong
  1 sibling, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2023-01-14  2:32 UTC (permalink / raw)
  To: linux-xfs, willy, chandan.babu, allison.henderson, linux-fsdevel,
	hch, catherine.hoang, david

Directory tree repairs are the least complete part of online fsck, due
to the lack of directory parent pointers.  However, even without that
feature, we can still make some corrections to the directory tree -- we
can salvage as many directory entries as we can from a damaged
directory, and we can reattach orphaned inodes to the lost+found, just
as xfs_repair does now.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
v24.2: updated with my latest thoughts about how to use parent pointers
---
 .../filesystems/xfs-online-fsck-design.rst         |  322 ++++++++++++++++++++
 1 file changed, 322 insertions(+)

diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst
index 163be2847c24..15e3a4acd40a 100644
--- a/Documentation/filesystems/xfs-online-fsck-design.rst
+++ b/Documentation/filesystems/xfs-online-fsck-design.rst
@@ -4319,3 +4319,325 @@ The proposed patchset is the
 `extended attribute repair
 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-xattrs>`_
 series.
+
+Fixing Directories
+------------------
+
+Fixing directories is difficult with currently available filesystem features.
+The offline repair tool scans all inodes to find files with nonzero link count,
+and then it scans all directories to establish parentage of those linked files.
+Damaged files and directories are zapped, and files with no parent are
+moved to the ``/lost+found`` directory.
+It does not try to salvage anything.
+
+The best that online repair can do at this time is to read directory data
+blocks and salvage any dirents that look plausible, correct link counts, and
+move orphans back into the directory tree.
+The salvage process is discussed in the case study at the end of this section.
+The second component to fixing the directory tree online is the :ref:`file link
+count fsck <nlinks>`, since it can scan the entire filesystem to make sure that
+files can neither be deleted while there are still parents nor forgotten after
+all parents sever their links to the child.
+The third part is discussed at the :ref:`end of this section<orphanage>`.
+However, there may be a solution to these deficiencies soon!
+
+Parent Pointers
+```````````````
+
+The lack of secondary directory metadata hinders directory tree reconstruction
+in much the same way that the historic lack of reverse space mapping
+information once hindered reconstruction of filesystem space metadata.
+Specifically, dirents are not redundant, which makes it impossible to construct
+a true replacement for a damaged directory.
+The best that online repair can do currently is to construct a new directory
+from any dirents that are salvageable and use the file link count repair
+function to move orphaned files to the lost and found.
+Offline repair doesn't salvage broken directories.
+The proposed parent pointer feature, however, will make total directory
+reconstruction possible.
+
+Directory parent pointers were first proposed as an XFS feature more than a
+decade ago by SGI.
+In that implementation, each link from a parent directory to a child file was
+augmented by an extended attribute in the child that could be used to identify
+the parent directory.
+Unfortunately, this early implementation had several major shortcomings:
+
+1. The XFS codebase of the late 2000s did not have the infrastructure to
+   enforce strong referential integrity in the directory tree, which is a fancy
+   way to say that it could not guarantee that a change in a forward link would
+   always be followed up with the corresponding change to the reverse links.
+
+2. Referential integrity was not integrated into offline repair.
+   Checking and repairs were performed on mounted filesystems without taking
+   any kernel or inode locks to coordinate access.
+   It is not clear if this actually worked properly.
+
+3. The extended attribute did not record the name of the directory entry in the
+   parent, so the first parent pointer implementation cannot be used to
+   reconnect the directory tree.
+
+4. Extended attribute forks only support 65,536 extents, which means that
+   parent pointer attribute creation is likely to fail at some point before the
+   maximum file link count is achieved.
+
+Allison Henderson, Chandan Babu, and Catherine Hoang are working on a second
+implementation that solves the shortcomings of the first.
+During 2022, Allison introduced log intent items to track physical
+manipulations of the extended attribute structures.
+This solves the referential integrity problem by making it possible to commit
+a dirent update and a parent pointer update in the same transaction.
+Chandan increased the maximum extent counts of both data and attribute forks,
+thereby addressing the fourth problem.
+
+Allison has proposed a second implementation of parent pointers.
+This time around, parent pointer data will also include the dirent name and
+location within the parent.
+In other words, child files use extended attributes to store pointers to
+parents in the form ``(parent_inum, parent_gen, dirent_pos) → (dirent_name)``.
+This solves the third problem.
+
+When the parent pointer feature lands, the directory checking process can be
+strengthened to ensure that the target of each dirent also contains a parent
+pointer pointing back to the dirent.
+Likewise, each parent pointer can be checked by ensuring that the target of
+each parent pointer is a directory and that it contains a dirent matching
+the parent pointer.
+Both online and offline repair can use this strategy.
+
+The quality of directory repairs will improve because online fsck will be able
+to reconstruct a directory in its entirety instead of skipping unsalvageable
+areas.
+This process is imagined to involve a :ref:`coordinated inode scan <iscan>` and
+a :ref:`directory entry live update hook <liveupdate>`, and goes as follows:
+
+1. Visit every file in the entire filesystem.
+
+2. Every time the scan encounters a file with a parent pointer to the directory
+   that is being reconstructed, record this entry in the temporary directory.
+
+3. When the scan is complete, atomically swap the contents of the temporary
+   directory and the directory being repaired.
+
+4. Update the dirent position field of parent pointers as necessary.
+   This may require the queuing of a substantial number of xattr log intent
+   items.
+
+**Question**: How will repair ensure that the ``dirent_pos`` fields match in
+the reconstructed directory?
+
+*Answer*: There are a few ways to solve this problem:
+
+1. The field could be designated advisory, since the other three values are
+   sufficient to find the entry in the parent.
+   However, this makes indexed key lookup impossible while repairs are ongoing.
+
+2. We could allow creating directory entries at specified offsets, which solves
+   the referential integrity problem but runs the risk that dirent creation
+   will fail due to conflicts with the free space in the directory.
+
+   These conflicts could be resolved by appending the directory entry and
+   amending the xattr code to support updating an xattr key and reindexing the
+   dabtree, though this would have to be performed with the parent directory
+   still locked.
+
+3. Same as above, but remove the old parent pointer entry and add a new one
+   atomically.
+
+4. Change the ondisk xattr format to ``(parent_inum, name) → (parent_gen)``,
+   which would provide the key uniqueness that we require, without forcing
+   repair code to update the dirent position.
+
+Online reconstruction of a file's parent pointer information is imagined to
+work similarly to directory reconstruction:
+
+1. Visit every directory in the entire filesystem.
+
+2. Every time the scan encounters a directory with a dirent pointing to the
+   file that is being reconstructed, record this entry in the temporary file's
+   extended attributes.
+
+3. When the scan is complete, copy the file's other extended attributes to the
+   temporary file.
+
+4. Atomically swap the contents of the temporary file's extended attributes and
+   the file being repaired.
+   If the other extended attributes are large compared to the parent pointers,
+   it may be faster to use xattr log items to copy the parent pointers from the
+   temporary file to the file being reconstructed.
+   We lose the atomicity guarantee if we do this.
+
+This code has not yet been constructed, so there is not yet a case study laying
+out exactly how this process works.
+
+Examining parent pointers in offline repair works differently because corrupt
+files are erased long before directory tree connectivity checks are performed.
+Parent pointer checks are therefore a second pass to be added to the existing
+connectivity checks:
+
+1. After the set of surviving files has been established (i.e. phase 6),
+   walk the surviving directories of each AG in the filesystem.
+
+2. For each dirent found, add ``(child_ag_inum, parent_inum, dirent_pos)``
+   tuples to an in-memory index.
+   This may require creation of another type of xfile btree.
+
+3. Walk each file a second time to compare the ondisk parent pointers
+   against the in-memory index.
+   Parent pointers missing in the ondisk structure should be added, and ondisk
+   pointers not found by the scan should be removed.
+
+4. Move on to examining link counts, as we do today.
+
+Rebuilding directories from parent pointers in offline repair is very
+challenging because it currently uses a single-pass scan of the filesystem
+during phase 3 to decide which files are corrupt enough to be zapped.
+This scan would have to be converted into a multi-pass scan:
+
+1. The first pass of the scan zaps corrupt inodes, forks, and attributes
+   much as it does now.
+   Corrupt directories are noted but not zapped.
+
+2. The next pass records parent pointers pointing to the directories noted
+   as being corrupt in the first pass.
+   This second pass may have to happen after the phase 4 scan for duplicate
+   blocks, if phase 4 is also capable of zapping directories.
+
+3. The third pass resets corrupt directories to an empty shortform directory.
+   Free space metadata has not been ensured yet, so repair cannot yet use the
+   directory building code in libxfs.
+
+4. At the start of phase 6, space metadata have been rebuilt.
+   Use the parent pointer information recorded during step 2 to reconstruct
+   the dirents and add them to the now-empty directories.
+
+This code has also not yet been constructed.
+
+Case Study: Salvaging Directories
+`````````````````````````````````
+
+Unlike extended attributes, directory blocks are all the same size, so
+salvaging directories is straightforward:
+
+1. Find the parent of the directory.
+   If the dotdot entry is readable, try to confirm that the alleged
+   parent has a child entry pointing back to the directory being repaired.
+   Otherwise, walk the filesystem to find it.
+
+2. Walk the first partition of the data fork of the directory to find the
+   directory entry data blocks.
+   When one is found,
+
+   a. Walk the directory data block to find candidate entries.
+      When an entry is found:
+
+      i. Check the name for problems, and ignore the name if there are any.
+
+      ii. Retrieve the inumber and grab the inode.
+          If that succeeds, add the name, inode number, and file type to the
+          staging xfarray and xfblob.
+
+3. If the memory usage of the xfarray and xfblob exceed a certain amount of
+   memory or there are no more directory data blocks to examine, unlock the
+   directory and add the staged dirents into the temporary directory.
+   Truncate the staging files.
+
+4. Use atomic extent swapping to exchange the new and old directory structures.
+   The old directory blocks are now attached to the temporary file.
+
+5. Reap the temporary file.
+
+**Question**: Should repair revalidate the dentry cache when rebuilding a
+directory?
+
+*Answer*: Yes, though the current dentry cache code doesn't provide a means
+to walk every dentry of a specific directory.
+If the cache contains an entry that the salvaging code does not find, the
+repair cannot proceed.
+
+**Question**: Can the dentry cache know about a directory entry that cannot be
+salvaged?
+
+*Answer*: In theory, the dentry cache should be a subset of the directory
+entries on disk because there's no way to load a dentry without having
+something to read in the directory.
+However, it is possible for a coherency problem to be introduced if the ondisk
+structures become corrupt *after* the cache loads.
+In theory it is necessary to scan all dentry cache entries for a directory to
+ensure that one of the following apply:
+
+1. The cached dentry reflects an ondisk dirent in the new directory.
+
+2. The cached dentry no longer has a corresponding ondisk dirent in the new
+   directory and the dentry can be purged from the cache.
+
+3. The cached dentry no longer has an ondisk dirent but the dentry cannot be
+   purged.
+   This is bad.
+
+As mentioned above, the dentry cache does not have a means to walk all the
+dentries with a particular directory as a parent.
+This makes detecting situations #2 and #3 impossible, and remains an
+interesting question for research.
+
+The proposed patchset is the
+`directory repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-dirs>`_
+series.
+
+.. _orphanage:
+
+The Orphanage
+-------------
+
+Filesystems present files as a directed, and hopefully acyclic, graph.
+In other words, a tree.
+The root of the filesystem is a directory, and each entry in a directory points
+downwards either to more subdirectories or to non-directory files.
+Unfortunately, a disruption in the directory graph pointers results in a
+disconnected graph, which makes files impossible to access via regular path
+resolution.
+The directory parent pointer online scrub code can detect a dotdot entry
+pointing to a parent directory that doesn't have a link back to the child
+directory, and the file link count checker can detect a file that isn't pointed
+to by any directory in the filesystem.
+If the file in question has a positive link count, the file is an
+orphan.
+
+When orphans are found, they should be reconnected to the directory tree.
+Offline fsck solves the problem by creating a directory ``/lost+found`` to
+serve as an orphanage, and linking orphan files into the orphanage by using the
+inumber as the name.
+Reparenting a file to the orphanage does not reset any of its permissions or
+ACLs.
+
+This process is more involved in the kernel than it is in userspace.
+The directory and file link count repair setup functions must use the regular
+VFS mechanisms to create the orphanage directory with all the necessary
+security attributes and dentry cache entries, just like a regular directory
+tree modification.
+
+Orphaned files are adopted by the orphanage as follows:
+
+1. Call ``xrep_orphanage_try_create`` at the start of the scrub setup function
+   to try to ensure that the lost and found directory actually exists.
+   This also attaches the orphanage directory to the scrub context.
+
+2. If the decision is made to reconnect a file, take the IOLOCK of both the
+   orphanage and the file being reattached.
+   The ``xrep_orphanage_iolock_two`` function follows the inode locking
+   strategy discussed earlier.
+
+3. Call ``xrep_orphanage_compute_blkres`` and ``xrep_orphanage_compute_name``
+   to compute the new name in the orphanage and the block reservation required.
+
+4. Use ``xrep_orphanage_adoption_prep`` to reserve resources to the repair
+   transaction.
+
+5. Call ``xrep_orphanage_adopt`` to reparent the orphaned file into the lost
+   and found, and update the kernel dentry cache.
+
+The proposed patches are in the
+`orphanage adoption
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-orphanage>`_
+series.
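
To make the adoption sequence above concrete, the steps might chain
together roughly as follows.  The xrep_orphanage_* names come from the
patchset, but the signatures and error handling here are illustrative
assumptions, not the real API:

  static int
  xrep_adopt_orphan(struct xfs_scrub *sc, struct xfs_inode *ip)
  {
      int    error;

      /* Step 2: take the IOLOCK of the orphanage and the orphan. */
      error = xrep_orphanage_iolock_two(sc);
      if (error)
          return error;

      /* Step 3: compute the new name and the block reservation. */
      error = xrep_orphanage_compute_name(sc, ip);
      if (!error)
          error = xrep_orphanage_compute_blkres(sc);
      if (error)
          goto out_unlock;

      /* Step 4: attach reservations to the repair transaction. */
      error = xrep_orphanage_adoption_prep(sc);
      if (error)
          goto out_unlock;

      /* Step 5: reparent into lost+found and update the dentry cache. */
      error = xrep_orphanage_adopt(sc);
  out_unlock:
      /* drop the IOLOCKs of both inodes here */
      return error;
  }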

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* Re: [PATCH 01/14] xfs: document the motivation for online fsck design
  2023-01-11 19:10       ` Darrick J. Wong
@ 2023-01-18  0:03         ` Allison Henderson
  2023-01-18  1:29           ` Darrick J. Wong
  0 siblings, 1 reply; 220+ messages in thread
From: Allison Henderson @ 2023-01-18  0:03 UTC (permalink / raw)
  To: djwong
  Cc: Catherine Hoang, david, willy, linux-xfs, Chandan Babu,
	linux-fsdevel, hch

On Wed, 2023-01-11 at 11:10 -0800, Darrick J. Wong wrote:
> On Sat, Jan 07, 2023 at 05:01:54AM +0000, Allison Henderson wrote:
> > On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > Start the first chapter of the online fsck design documentation.
> > > This covers the motivations for creating this in the first place.
> > > 
> > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > ---
> > >  Documentation/filesystems/index.rst                |    1 
> > >  .../filesystems/xfs-online-fsck-design.rst         |  199
> > > ++++++++++++++++++++
> > >  2 files changed, 200 insertions(+)
> > >  create mode 100644 Documentation/filesystems/xfs-online-fsck-
> > > design.rst
> > > 
> > > 
> > > diff --git a/Documentation/filesystems/index.rst
> > > b/Documentation/filesystems/index.rst
> > > index bee63d42e5ec..fbb2b5ada95b 100644
> > > --- a/Documentation/filesystems/index.rst
> > > +++ b/Documentation/filesystems/index.rst
> > > @@ -123,4 +123,5 @@ Documentation for filesystem implementations.
> > >     vfat
> > >     xfs-delayed-logging-design
> > >     xfs-self-describing-metadata
> > > +   xfs-online-fsck-design
> > >     zonefs
> > > diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst
> > > b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > new file mode 100644
> > > index 000000000000..25717ebb5f80
> > > --- /dev/null
> > > +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > @@ -0,0 +1,199 @@
> > > +.. SPDX-License-Identifier: GPL-2.0
> > > +.. _xfs_online_fsck_design:
> > > +
> > > +..
> > > +        Mapping of heading styles within this document:
> > > +        Heading 1 uses "====" above and below
> > > +        Heading 2 uses "===="
> > > +        Heading 3 uses "----"
> > > +        Heading 4 uses "````"
> > > +        Heading 5 uses "^^^^"
> > > +        Heading 6 uses "~~~~"
> > > +        Heading 7 uses "...."
> > > +
> > > +        Sections are manually numbered because apparently that's
> > > what everyone
> > > +        does in the kernel.
> > > +
> > > +======================
> > > +XFS Online Fsck Design
> > > +======================
> > > +
> > > +This document captures the design of the online filesystem check
> > > feature for
> > > +XFS.
> > > +The purpose of this document is threefold:
> > > +
> > > +- To help kernel distributors understand exactly what the XFS
> > > online
> > > fsck
> > > +  feature is, and issues about which they should be aware.
> > > +
> > > +- To help people reading the code to familiarize themselves with
> > > the
> > > relevant
> > > +  concepts and design points before they start digging into the
> > > code.
> > > +
> > > +- To help developers maintaining the system by capturing the
> > > reasons
> > > +  supporting higher level decisionmaking.
> > nit: decision making
> 
> Fixed.
> 
> > > +
> > > +As the online fsck code is merged, the links in this document to
> > > topic branches
> > > +will be replaced with links to code.
> > > +
> > > +This document is licensed under the terms of the GNU Public
> > > License,
> > > v2.
> > > +The primary author is Darrick J. Wong.
> > > +
> > > +This design document is split into seven parts.
> > > +Part 1 defines what fsck tools are and the motivations for
> > > writing a
> > > new one.
> > > +Parts 2 and 3 present a high level overview of how online fsck
> > > process works
> > > +and how it is tested to ensure correct functionality.
> > > +Part 4 discusses the user interface and the intended usage modes
> > > of
> > > the new
> > > +program.
> > > +Parts 5 and 6 show off the high level components and how they
> > > fit
> > > together, and
> > > +then present case studies of how each repair function actually
> > > works.
> > > +Part 7 sums up what has been discussed so far and speculates
> > > about
> > > what else
> > > +might be built atop online fsck.
> > > +
> > > +.. contents:: Table of Contents
> > > +   :local:
> > > +
> > 
> > Something that I've noticed in my training sessions is that often
> > times, less is more.  People really only absorb so much over a
> > particular duration of time, so sometimes having too much detail in
> > the
> > context is not as helpful as you might think.  A lot of times,
> > paraphrasing excerpts to reflect the same info in a more compact
> > format
> > will help you keep audience on track (a little longer at least). 
> > 
> > > +1. What is a Filesystem Check?
> > > +==============================
> > > +
> > > +A Unix filesystem has three main jobs: to provide a hierarchy of
> > > names through
> > > +which application programs can associate arbitrary blobs of data
> > > for
> > > any
> > > +length of time, to virtualize physical storage media across
> > > those
> > > names, and
> > > +to retrieve the named data blobs at any time.
> > Consider the following paraphrase:
> > 
> > A Unix filesystem has three main jobs:
> >  * Provide a hierarchy of names by which applications access data
> > for a
> > length of time.
> >  * Store or retrieve that data at any time.
> >  * Virtualize physical storage media across those names
> 
> Ooh, listifying.  I did quite a bit of that to break up the walls of
> text in earlier revisions, but apparently I missed this one.
> 
> > Also... I don't think it would be inappropriate to just skip the
> > above and jump right into fsck.  That's a very limited view of a
> > filesystem; a reader seeking an fsck doc likely has some idea of
> > what a fs is otherwise supposed to be doing.
> 
> This will become part of the general kernel documentation, so we
> can't
> assume that all readers are going to know what a fs really does.
> 
> "A Unix filesystem has four main responsibilities:
> 
> - Provide a hierarchy of names through which application programs can
>   associate arbitrary blobs of data for any length of time,
> 
> - Virtualize physical storage media across those names, and
> 
> - Retrieve the named data blobs at any time.
> 
> - Examine resource usage.
> 
> "Metadata directly supporting these functions (e.g. files,
> directories,
> space mappings) are sometimes called primary metadata.
> Secondary metadata (e.g. reverse mapping and directory parent
> pointers)
> support operations internal to the filesystem, such as internal
> consistency checking and reorganization."
Sure, I think that sounds good and helps to set up the metadata
concepts that are discussed later.
> 
> (I added those last two sentences in response to a point you made
> below.)
> 
> > > +The filesystem check (fsck) tool examines all the metadata in a
> > > filesystem
> > > +to look for errors.
> > > +Simple tools only check for obvious corruptions, but the more
> > > sophisticated
> > > +ones cross-reference metadata records to look for
> > > inconsistencies.
> > > +People do not like losing data, so most fsck tools also contains
> > > some ability
> > > +to deal with any problems found.
> > 
> > While simple tools can detect data corruptions, a filesystem check
> > (fsck) uses metadata records as a cross-reference to find and
> > correct
> > more inconsistencies.
> > 
> > ?
> 
> Let's be careful with the term 'data corruption' here -- a lot of
> people
> (well ok me) will see that as *user* data corruption, whereas we're
> talking about *metadata* corruption.
> 
> I think I'll rework that second sentence further:
> 
> "In addition to looking for obvious metadata corruptions, fsck also
> cross-references different types of metadata records with each other
> to
> look for inconsistencies."
> 
Alrighty, that sounds good

> Since the really dumb fscks of the 1970s are a long ways past now.
> 
> > > +As a word of caution -- the primary goal of most Linux fsck
> > > tools is
> > > to restore
> > > +the filesystem metadata to a consistent state, not to maximize
> > > the
> > > data
> > > +recovered.
> > > +That precedent will not be challenged here.
> > > +
> > > +Filesystems of the 20th century generally lacked any redundancy
> > > in
> > > the ondisk
> > > +format, which means that fsck can only respond to errors by
> > > erasing
> > > files until
> > > +errors are no longer detected.
> > > +System administrators avoid data loss by increasing the number
> > > of
> > > separate
> > > +storage systems through the creation of backups; 
> > 
> > 
> > > and they avoid downtime by
> > > +increasing the redundancy of each storage system through the
> > > creation of RAID.
> > Mmm, RAIDs help more for hardware failures, right?  They don't really
> > have a notion of when the fs is corrupted.
> 
> Right.
> 
> > While an fsck can help
> > navigate around a corruption possibly caused by a hardware failure,
> > I think it's really a different kind of redundancy.  I think I'd
> > probably drop the last line and keep the selling point focused on
> > online repair.
> 
> Yes, RAIDs provide a totally different type of redundancy.  I decided
> to
> make this point specifically to counter the people who argue that
> RAID
> makes them impervious to corruption problems, etc.
> 
> This attitude seemed rather prevalent in the early days of btrfs and
> a
> certain other filesystem that Shall Not Be Named, even though the
> btrfs
> developers themselves acknowledge this distinction, given the
> existence
> of `btrfs scrub' and `btrfs check'.
> 
> However you do have a good point that this sentence doesn't add much
> where it is.  I think I'll add it as a sidebar at the end of the
> paragraph.
> 
> > > +More recent filesystem designs contain enough redundancy in
> > > their
> > > metadata that
> > > +it is now possible to regenerate data structures when non-
> > > catastrophic errors
> > > +occur; 
> > 
> > 
> > > this capability aids both strategies.
> > > +Over the past few years, XFS has added a storage space reverse
> > > mapping index to
> > > +make it easy to find which files or metadata objects think they
> > > own
> > > a
> > > +particular range of storage.
> > > +Efforts are under way to develop a similar reverse mapping index
> > > for
> > > the naming
> > > +hierarchy, which will involve storing directory parent pointers
> > > in
> > > each file.
> > > +With these two pieces in place, XFS uses secondary information
> > > to
> > > perform more
> > > +sophisticated repairs.
> > This part here I think I would either let go or relocate.  The
> > topic of
> > this section is supposed to discuss roughly what a filesystem check
> > is.
> > Ideally so we can start talking about how ofsck is different.  It
> > feels
> > like a bit of a jump to suddenly hop into rmap and pptrs, and for
> > "sophisticated repairs" that we havn't really gotten into the
> > details
> > of yet.  So I think it would read easier if we saved this part
> > until we
> > start talking about how they are used later.  
> 
> Agreed.
> 
> > > +
> > > +TLDR; Show Me the Code!
> > > +-----------------------
> > > +
> > > +Code is posted to the kernel.org git trees as follows:
> > > +`kernel changes
> > > <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-symlink>`_,
> > > +`userspace changes
> > > <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-scan-service>`_, and
> > > +`QA test changes
> > > <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-dirs>`_.
> > > +Each kernel patchset adding an online repair function will use
> > > the
> > > same branch
> > > +name across the kernel, xfsprogs, and fstests git repos.
> > > +
> > > +Existing Tools
> > > +--------------
> > > +
> > > +The online fsck tool described here will be the third tool in
> > > the
> > > history of
> > > +XFS (on Linux) to check and repair filesystems.
> > > +Two programs precede it:
> > > +
> > > +The first program, ``xfs_check``, was created as part of the XFS
> > > debugger
> > > +(``xfs_db``) and can only be used with unmounted filesystems.
> > > +It walks all metadata in the filesystem looking for
> > > inconsistencies
> > > in the
> > > +metadata, though it lacks any ability to repair what it finds.
> > > +Due to its high memory requirements and inability to repair
> > > things,
> > > this
> > > +program is now deprecated and will not be discussed further.
> > > +
> > > +The second program, ``xfs_repair``, was created to be faster and
> > > more robust
> > > +than the first program.
> > > +Like its predecessor, it can only be used with unmounted
> > > filesystems.
> > > +It uses extent-based in-memory data structures to reduce memory
> > > consumption,
> > > +and tries to schedule readahead IO appropriately to reduce I/O
> > > waiting time
> > > +while it scans the metadata of the entire filesystem.
> > > +The most important feature of this tool is its ability to
> > > respond to
> > > +inconsistencies in file metadata and directory tree by erasing
> > > things as needed
> > > +to eliminate problems.
> > > +Space usage metadata are rebuilt from the observed file
> > > metadata.
> > > +
> > > +Problem Statement
> > > +-----------------
> > > +
> > > +The current XFS tools leave several problems unsolved:
> > > +
> > > +1. **User programs** suddenly **lose access** to information in
> > > the
> > > computer
> > > +   when unexpected shutdowns occur as a result of silent
> > > corruptions
> > > in the
> > > +   filesystem metadata.
> > > +   These occur **unpredictably** and often without warning.
> > 
> > 
> > 1. **User programs** suddenly **lose access** to the filesystem
> >    when unexpected shutdowns occur as a result of silent
> > corruptions
> > that could have otherwise been avoided with an online repair
> > 
> > While some of these issues are not untrue, I think it makes sense
> > to
> > limit them to the issue you plan to solve, and therefore discuss.
> 
> Fair enough, it's not like one loses /all/ the data in the computer.
> 
> That said, we're still in the problem definition phase, so I don't
> want
> to mention online repair just yet.
> 
> > > +2. **Users** experience a **total loss of service** during the
> > > recovery period
> > > +   after an **unexpected shutdown** occurs.
> > > +
> > > +3. **Users** experience a **total loss of service** if the
> > > filesystem is taken
> > > +   offline to **look for problems** proactively.
> > > +
> > > +4. **Data owners** cannot **check the integrity** of their
> > > stored
> > > data without
> > > +   reading all of it.
> > 
> > > +   This may expose them to substantial billing costs when a
> > > linear
> > > media scan
> > > +   might suffice.
> > Ok, I had to re-read this one a few times, but I think this reads a
> > little cleaner:
> > 
> >     Customers that are billed for data egress may incur unnecessary
> > cost when a background media scan on the host may have sufficed
> > 
> > ?
> 
> "...when a linear media scan performed by the storage system
> administrator would suffice."
> 
That sounds fine to me

> I was tempted to say "storage owner" instead of "storage system
> administrator" but that sounded a little too IBM.
> 
> > > +5. **System administrators** cannot **schedule** a maintenance
> > > window to deal
> > > +   with corruptions if they **lack the means** to assess
> > > filesystem
> > > health
> > > +   while the filesystem is online.
> > > +
> > > +6. **Fleet monitoring tools** cannot **automate periodic
> > > checks** of
> > > filesystem
> > > +   health when doing so requires **manual intervention** and
> > > downtime.
> > > +
> > > +7. **Users** can be tricked into **doing things they do not
> > > desire**
> > > when
> > > +   malicious actors **exploit quirks of Unicode** to place
> > > misleading names
> > > +   in directories.
> > hrmm, I guess I'm not immediately extrapolating what things users
> > are
> > being tricked into doing, or how ofsck solves this?  Otherwise I
> > might
> > drop the last one here, I think the rest of the bullets are plenty
> > of
> > motivation.
> 
> The doc gets into this later[1], but it's possible to create two
> entries
> within the same directory that have different byte sequences in the
> name
> but render identically in file choosers.  These pathnames:
> 
> /home/djwong/Downloads/rustup.sh
> /home/djwong/Downloads/rus<zero width space>tup.sh
> 
> refer to different files, but a naïve file open dialog will render
> them
> identically as "rustup.sh".  If the first is the Rust installer and
> the
> second name is actually a ransomware payload, I can victimize you by
> tricking you into opening the wrong one.
> 
> Firefox had a whole CVE over this in 2018:
> https://bugzilla.mozilla.org/show_bug.cgi?id=1438025
> 
> xfs_scrub is (so far) the only linux filesystem fsck tool that will
> warn
> system administrators about this kind of thing.
> 
> See generic/453 and generic/454.
> 
> [1] https://djwong.org/docs/xfs-online-fsck-design/#id108
> 
hmm ok, how about:

7. Malicious attacks may use uncommon Unicode characters to create file
names that resemble those of normal files, which may go undetected until
the filesystem is scanned.


?
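
(Mostly as a note to myself: the kind of name check I picture here is
something like the toy below, which just looks for an invisible code point
such as U+200B in a UTF-8 name.  I assume the real xfs_scrub phase 5
scanner leans on a proper Unicode library rather than hard-coded byte
sequences, so treat this as purely illustrative.)

/* Illustrative only: flag names containing a zero width space (U+200B). */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

static bool name_has_zero_width_space(const char *name)
{
	/* UTF-8 encoding of U+200B is 0xE2 0x80 0x8B. */
	static const char zwsp[] = "\xe2\x80\x8b";

	return strstr(name, zwsp) != NULL;
}

int main(void)
{
	const char *names[] = { "rustup.sh", "rus\xe2\x80\x8btup.sh" };

	for (size_t i = 0; i < sizeof(names) / sizeof(names[0]); i++)
		printf("%s: %s\n", names[i],
		       name_has_zero_width_space(names[i]) ?
		       "suspicious" : "ok");
	return 0;
}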

> > > +
> > > +Given this definition of the problems to be solved and the
> > > actors
> > > who would
> > > +benefit, the proposed solution is a third fsck tool that acts on
> > > a
> > > running
> > > +filesystem.
> > > +
> > > +This new third program has three components: an in-kernel
> > > facility
> > > to check
> > > +metadata, an in-kernel facility to repair metadata, and a
> > > userspace
> > > driver
> > > +program to drive fsck activity on a live filesystem.
> > > +``xfs_scrub`` is the name of the driver program.
> > > +The rest of this document presents the goals and use cases of
> > > the
> > > new fsck
> > > +tool, describes its major design points in connection to those
> > > goals, and
> > > +discusses the similarities and differences with existing tools.
> > > +
> > > ++---------------------------------------------------------------
> > > ----
> > > -------+
> > > +|
> > > **Note**:                                                        
> > >     
> > >     |
> > > ++---------------------------------------------------------------
> > > ----
> > > -------+
> > > +| Throughout this document, the existing offline fsck tool can
> > > also
> > > be     |
> > > +| referred to by its current name
> > > "``xfs_repair``".                        |
> > > +| The userspace driver program for the new online fsck tool can
> > > be         |
> > > +| referred to as
> > > "``xfs_scrub``".                                          |
> > > +| The kernel portion of online fsck that validates metadata is
> > > called      |
> > > +| "online scrub", and portion of the kernel that fixes metadata
> > > is
> > > called  |
> > > +| "online
> > > repair".                                                        
> > > |
> > > ++---------------------------------------------------------------
> > > ----
> > > -------+
> 
> Errr ^^^^ is Evolution doing line wrapping here?
> 
> > Hmm, maybe here might be a good spot to move rmap and pptrs?  It's
> > not
> > otherwise clear to me what "secondary metadata" is.  If that is
> > what it
> > is meant to refer to, I think the reader will more intuitively make
> > the
> > connection if those two blurbs appear in the same context.
> 
> Ooh, you found a significant gap-- nowhere in this chapter do I
> actually
> define what is primary metadata.  Or secondary metadata.
> 
> > > +
> > > +Secondary metadata indices enable the reconstruction of parts of
> > > a
> > > damaged
> > > +primary metadata object from secondary information.
> > 
> > I would take out this blurb...
> > > +XFS filesystems shard themselves into multiple primary objects
> > > to
> > > enable better
> > > +performance on highly threaded systems and to contain the blast
> > > radius when
> > > +problems happen.
> > 
> > 
> > > +The naming hierarchy is broken up into objects known as
> > > directories
> > > and files;
> > > +and the physical space is split into pieces known as allocation
> > > groups.
> > And add here:
> > 
> > "This enables better performance on highly threaded systems and
> > helps
> > to contain corruptions when they occur."
> > 
> > I think that reads cleaner
> 
> Ok.  Mind if I reword this slightly?  The entire paragraph now reads
> like this:
> 
> "The naming hierarchy is broken up into objects known as directories
> and
> files and the physical space is split into pieces known as allocation
> groups.  Sharding enables better performance on highly parallel
> systems
> and helps to contain the damage when corruptions occur.  The division
> of
> the filesystem into principal objects (allocation groups and inodes)
> means that there are ample opportunities to perform targeted checks
> and
> repairs on a subset of the filesystem."
I think that sounds cleaner

> 
> > > +The division of the filesystem into principal objects
> > > (allocation
> > > groups and
> > > +inodes) means that there are ample opportunities to perform
> > > targeted
> > > checks and
> > > +repairs on a subset of the filesystem.
> > > +While this is going on, other parts continue processing IO
> > > requests.
> > > +Even if a piece of filesystem metadata can only be regenerated
> > > by
> > > scanning the
> > > +entire system, the scan can still be done in the background
> > > while
> > > other file
> > > +operations continue.
> > > +
> > > +In summary, online fsck takes advantage of resource sharding and
> > > redundant
> > > +metadata to enable targeted checking and repair operations while
> > > the
> > > system
> > > +is running.
> > > +This capability will be coupled to automatic system management
> > > so
> > > that
> > > +autonomous self-healing of XFS maximizes service availability.
> > > 
> > 
> > Nits and paraphrases aside, I think this looks pretty good?
> 
> Woot.  Thanks for digging in! :)
> 
Sure, no problem!

> > Allison
> > 


^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 02/14] xfs: document the general theory underlying online fsck design
  2023-01-11 23:39       ` Darrick J. Wong
  2023-01-12  0:29         ` Dave Chinner
@ 2023-01-18  0:03         ` Allison Henderson
  2023-01-18  2:35           ` Darrick J. Wong
  1 sibling, 1 reply; 220+ messages in thread
From: Allison Henderson @ 2023-01-18  0:03 UTC (permalink / raw)
  To: djwong
  Cc: Catherine Hoang, david, willy, linux-xfs, Chandan Babu,
	linux-fsdevel, hch

On Wed, 2023-01-11 at 15:39 -0800, Darrick J. Wong wrote:
> On Wed, Jan 11, 2023 at 01:25:12AM +0000, Allison Henderson wrote:
> > On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > Start the second chapter of the online fsck design documentation.
> > > This covers the general theory underlying how online fsck works.
> > > 
> > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > ---
> > >  .../filesystems/xfs-online-fsck-design.rst         |  366
> > > ++++++++++++++++++++
> > >  1 file changed, 366 insertions(+)
> > > 
> > > 
> > > diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst
> > > b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > index 25717ebb5f80..a03a7b9f0250 100644
> > > --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> > > +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > @@ -197,3 +197,369 @@ metadata to enable targeted checking and
> > > repair
> > > operations while the system
> > >  is running.
> > >  This capability will be coupled to automatic system management
> > > so
> > > that
> > >  autonomous self-healing of XFS maximizes service availability.
> > > +
> > > +2. Theory of Operation
> > > +======================
> > > +
> > > +Because it is necessary for online fsck to lock and scan live
> > > metadata objects,
> > > +online fsck consists of three separate code components.
> > > +The first is the userspace driver program ``xfs_scrub``, which
> > > is
> > > responsible
> > > +for identifying individual metadata items, scheduling work items
> > > for
> > > them,
> > > +reacting to the outcomes appropriately, and reporting results to
> > > the
> > > system
> > > +administrator.
> > > +The second and third are in the kernel, which implements
> > > functions
> > > to check
> > > +and repair each type of online fsck work item.
> > > +
> > > ++------------------------------------------------------------------+
> > > +| **Note**:                                                        |
> > > ++------------------------------------------------------------------+
> > > +| For brevity, this document shortens the phrase "online fsck work |
> > > +| item" to "scrub item".                                           |
> > > ++------------------------------------------------------------------+
> > > +
> > > +Scrub item types are delineated in a manner consistent with the
> > > Unix
> > > design
> > > +philosophy, which is to say that each item should handle one
> > > aspect
> > > of a
> > > +metadata structure, and handle it well.
> > > +
> > > +Scope
> > > +-----
> > > +
> > > +In principle, online fsck should be able to check and to repair
> > > everything that
> > > +the offline fsck program can handle.
> > > +However, the adjective *online* brings with it the limitation
> > > that
> > > online fsck
> > > +cannot deal with anything that prevents the filesystem from
> > > going on
> > > line, i.e.
> > > +mounting.
> > Are there really any other operations that do that other than
> > mount?
> 
> No.
> 
> > I think this reads cleaner:
> > 
> > By definition, online fsck can only check and repair an online
> > filesystem.  It cannot check mounting operations which start from
> > an
> > offline state.
> 
> Now that I think about this some more, this whole sentence doesn't
> make
> sense.  xfs_scrub can *definitely* detect and fix latent errors that
> would prevent the /next/ mount from succeeding.  It's only the fuzz
> test
> suite that stumbles over this, and only because xfs_db cannot fuzz
> mounted filesystems.
> 
> "However, online fsck cannot be running 100% of the time, which means
> that latent errors may creep in after a scrub completes.
> If these errors cause the next mount to fail, offline fsck is the
> only
> solution."
Sure, that sounds fair

> 
> > > +This limitation means that maintenance of the offline fsck tool
> > > will
> > > continue.
> > > +A second limitation of online fsck is that it must follow the
> > > same
> > > resource
> > > +sharing and lock acquisition rules as the regular filesystem.
> > > +This means that scrub cannot take *any* shortcuts to save time,
> > > because doing
> > > +so could lead to concurrency problems.
> > > +In other words, online fsck will never be able to fix 100% of
> > > the
> > > +inconsistencies that offline fsck can repair, 
> > Hmm, what inconsistencies cannot be repaired as a result of the "no
> > shortcut" rule?  I'm all for keeping things short and to the point,
> > but
> > since this section is about scope, I'd give it at least a brief
> > bullet
> > list
> 
> Hmm.  I can't think of any off the top of my head.  Given the
> rewording
> earlier, I think it's more accurate to say:
> 
> "In other words, online fsck is not a complete replacement for
> offline
> fsck, and a complete run of online fsck may take longer than offline
> fsck."
That makes sense
> 
> > > and a complete run of online fsck
> > > +may take longer.
> > > +However, both of these limitations are acceptable tradeoffs to
> > > satisfy the
> > > +different motivations of online fsck, which are to **minimize
> > > system
> > > downtime**
> > > +and to **increase predictability of operation**.
> > > +
> > > +.. _scrubphases:
> > > +
> > > +Phases of Work
> > > +--------------
> > > +
> > > +The userspace driver program ``xfs_scrub`` splits the work of
> > > checking and
> > > +repairing an entire filesystem into seven phases.
> > > +Each phase concentrates on checking specific types of scrub
> > > items
> > > and depends
> > > +on the success of all previous phases.
> > > +The seven phases are as follows:
> > > +
> > > +1. Collect geometry information about the mounted filesystem and
> > > computer,
> > > +   discover the online fsck capabilities of the kernel, and open
> > > the
> > > +   underlying storage devices.
> > > +
> > > +2. Check allocation group metadata, all realtime volume
> > > metadata,
> > > and all quota
> > > +   files.
> > > +   Each metadata structure is scheduled as a separate scrub
> > > item.
> > Like an intent item?
> 
> No, these scrub items are struct scrub_item objects that exist solely
> within the userspace program code.
> 
> > > +   If corruption is found in the inode header or inode btree and
> > > ``xfs_scrub``
> > > +   is permitted to perform repairs, then those scrub items are
> > > repaired to
> > > +   prepare for phase 3.
> > > +   Repairs are implemented by resubmitting the scrub item to the
> > > kernel with
> > If I'm understanding this correctly:
> > Repairs are implemented as intent items that are queued and
> > committed
> > just as any filesystem operation.
> > 
> > ?
> 
> I don't want to go too deep into this prematurely, but...
> 
> xfs_scrub (the userspace program) needs to track which metadata
> objects
> have been checked and which ones need repairs.  The current codebase
> (ab)uses struct xfs_scrub_metadata, but it's very memory inefficient.
> I replaced it with a new struct scrub_item that stores (a) all the
> handle information to identify the inode/AG/rt group/whatever; and
> (b)
> the state of all the checks that can be applied to that item:
> 
> struct scrub_item {
>         /*
>          * Information we need to call the scrub and repair ioctls.
>          * Per-AG items should set the ino/gen fields to -1; per-
> inode
>          * items should set sri_agno to -1; and per-fs items should
> set
>          * all three fields to -1.  Or use the macros below.
>          */
>         __u64                   sri_ino;
>         __u32                   sri_gen;
>         __u32                   sri_agno;
> 
>         /* Bitmask of scrub types that were scheduled here. */
>         __u32                   sri_selected;
> 
>         /* Scrub item state flags, one for each XFS_SCRUB_TYPE. */
>         __u8                    sri_state[XFS_SCRUB_TYPE_NR];
> 
>         /* Track scrub and repair call retries for each scrub type.
> */
>         __u8                    sri_tries[XFS_SCRUB_TYPE_NR];
> 
>         /* Were there any corruption repairs needed? */
>         bool                    sri_inconsistent:1;
> 
>         /* Are we revalidating after repairs? */
>         bool                    sri_revalidate:1;
> };
> 
> The first three fields are passed to the kernel via scrub ioctl and
> describe a particular xfs domain (files, AGs, etc).  The rest of the
> structure store state for each type of repair that can be performed
> against that domain.
> 
> IOWs, xfs_scrub uses struct scrub_item objects to generate ioctl
> calls
> to the kernel to check and repair things.  The kernel reads the ioctl
> information, figures out what needs to be done, and then does the
> usual
> get transaction -> lock things -> make updates -> commit dance to
> make
> corrections to the fs.  Those corrections include log intent items,
> but
> there's no tight coupling between log intent items and scrub_items.
> 
> Side note: The kernel repair code used to use intents to rebuild a
> structure, but nowadays it uses the btree bulk loader code to replace
> btrees wholesale and in a single atomic commit.  Now we use them
> primarily to free preallocated space if the repair fails.

Oh ok, well how about just:

"Repairs are implemented by resubmitting the scrub item to the
kernel through a designated ioctl with..."

?
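
(And just to check that I'm picturing the resubmission correctly:
something like the sketch below, using the existing
XFS_IOC_SCRUB_METADATA interface?  The scrub_one() helper, the header
choice, and the lack of a retry policy are all mine, not the real
xfs_scrub code -- error handling is stripped to keep it short.)

/*
 * Sketch of the resubmission path: ask the kernel to repair one
 * metadata item through the scrub ioctl.  Illustrative only.
 */
#include <errno.h>
#include <stdbool.h>
#include <sys/ioctl.h>
#include <xfs/xfs.h>	/* xfsprogs headers: struct xfs_scrub_metadata */

static int scrub_one(int fd, unsigned int type, bool repair)
{
	struct xfs_scrub_metadata sm = {
		.sm_type = type,	/* e.g. XFS_SCRUB_TYPE_INOBT */
	};

	if (repair)
		sm.sm_flags |= XFS_SCRUB_IFLAG_REPAIR;

	if (ioctl(fd, XFS_IOC_SCRUB_METADATA, &sm) < 0)
		return -errno;

	/* Still corrupt after the call?  Leave it for a later phase. */
	if (sm.sm_flags & XFS_SCRUB_OFLAG_CORRUPT)
		return 1;
	return 0;
}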

> 
> > > +   the repair flag enabled; this is discussed in the next
> > > section.
> > > +   Optimizations and all other repairs are deferred to phase 4.
> > I guess I'll come back to it. 
> > 
> > > +
> > > +3. Check all metadata of every file in the filesystem.
> > > +   Each metadata structure is also scheduled as a separate scrub
> > > item.
> > > +   If repairs are needed, ``xfs_scrub`` is permitted to perform
> > > repairs,
> > If repairs are needed and ``xfs_scrub`` is permitted
> 
> Fixed.
> 
> > ?
> > > +   and there were no problems detected during phase 2, then
> > > those
> > > scrub items
> > > +   are repaired.
> > > +   Optimizations and unsuccessful repairs are deferred to phase
> > > 4.
> > > +
> > > +4. All remaining repairs and scheduled optimizations are
> > > performed
> > > during this
> > > +   phase, if the caller permits them.
> > > +   Before starting repairs, the summary counters are checked and
> > > any
> > Did we talk about summary counters yet?  Maybe worth a blurb.
> > Otherwise
> > this may not make sense with out skipping ahead or into the code
> 
> Nope.  I'll add that to the previous patch when I introduce primary
> and
> secondary metadata.  Good catch!
> 
> "Summary metadata, as the name implies, condense information
> contained
> in primary metadata for performance reasons."

Ok, sounds good then
> 
> > > necessary
> > > +   repairs are performed so that subsequent repairs will not
> > > fail
> > > the resource
> > > +   reservation step due to wildly incorrect summary counters.
> > > +   Unsuccessful repairs are requeued as long as forward progress
> > > on
> > > repairs is
> > > +   made somewhere in the filesystem.
> > > +   Free space in the filesystem is trimmed at the end of phase 4
> > > if
> > > the
> > > +   filesystem is clean.
> > > +
> > > +5. By the start of this phase, all primary and secondary
> > > filesystem
> > > metadata
> > > +   must be correct.
> > I think maybe the definitions of primary and secondary metadata
> > should
> > move up before the phases section.  Otherwise the reader has to
> > skip
> > ahead to know what that means.
> 
> Yep, now primary, secondary, and summary metadata are defined in
> section
> 1.  Very good comment.
> 
> > > +   Summary counters such as the free space counts and quota
> > > resource
> > > counts
> > > +   are checked and corrected.
> > > +   Directory entry names and extended attribute names are
> > > checked
> > > for
> > > +   suspicious entries such as control characters or confusing
> > > Unicode sequences
> > > +   appearing in names.
> > > +
> > > +6. If the caller asks for a media scan, read all allocated and
> > > written data
> > > +   file extents in the filesystem.
> > > +   The ability to use hardware-assisted data file integrity
> > > checking
> > > is new
> > > +   to online fsck; neither of the previous tools have this
> > > capability.
> > > +   If media errors occur, they will be mapped to the owning
> > > files
> > > and reported.
> > > +
> > > +7. Re-check the summary counters and present the caller with a
> > > summary of
> > > +   space usage and file counts.
> > > +
> > > +Steps for Each Scrub Item
> > > +-------------------------
> > > +
> > > +The kernel scrub code uses a three-step strategy for checking
> > > and
> > > repairing
> > > +the one aspect of a metadata object represented by a scrub item:
> > > +
> > > +1. The scrub item of interest is checked for corruptions; opportunities for
> > > +   optimization; and for values that are directly controlled by
> > > the
> > > system
> > > +   administrator but look suspicious.
> > > +   If the item is not corrupt or does not need optimization,
> > > resources are
> > > +   released and the positive scan results are returned to
> > > userspace.
> > > +   If the item is corrupt or could be optimized but the caller
> > > does
> > > not permit
> > > +   this, resources are released and the negative scan results
> > > are
> > > returned to
> > > +   userspace.
> > > +   Otherwise, the kernel moves on to the second step.
> > > +
> > > +2. The repair function is called to rebuild the data structure.
> > > +   Repair functions generally choose to rebuild a structure from
> > > other
> > > metadata
> > > +   rather than try to salvage the existing structure.
> > > +   If the repair fails, the scan results from the first step are
> > > returned to
> > > +   userspace.
> > > +   Otherwise, the kernel moves on to the third step.
> > > +
> > > +3. In the third step, the kernel runs the same checks over the
> > > new
> > > metadata
> > > +   item to assess the efficacy of the repairs.
> > > +   The results of the reassessment are returned to userspace.
> > > +
> > > +Classification of Metadata
> > > +--------------------------
> > > +
> > > +Each type of metadata object (and therefore each type of scrub
> > > item)
> > > is
> > > +classified as follows:
> > > +
> > > +Primary Metadata
> > > +````````````````
> > > +
> > > +Metadata structures in this category should be most familiar to
> > > filesystem
> > > +users either because they are directly created by the user or
> > > they
> > > index
> > > +objects created by the user
> > I think I would just jump straight into a brief list.  The above is
> > a
> > bit vague, and documentation that tells you you should already know
> > what it is, doesn't add much.  Again, I think too much poetry might
> > be
> > why you're having a hard time getting responses.
> 
> Done:
> 
> - Free space and reference count information
> 
> - Inode records and indexes
> 
> - Storage mapping information for file data
> 
> - Directories
> 
> - Extended attributes
> 
> - Symbolic links
> 
> - Quota limits
> 
> - Link counts
> 
> 
> > > +Most filesystem objects fall into this class.
> > Most filesystem objects created by users fall into this class, such
> > as
> > inodes, directories, allocation groups and so on.
> > > +Resource and lock acquisition for scrub code follows the same
> > > order
> > > as regular
> > > +filesystem accesses.
> > 
> > Lock acquisition for these resources will follow the same order for
> > scrub as a regular filesystem access.
> 
> Yes, that is clearer.  I think I'll phrase this more actively:
> 
> "Scrub obeys the same rules as regular filesystem accesses for
> resource
> and lock acquisition."

Ok, I think that sounds fine
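
(Going back to the three-step strategy a few paragraphs up while I'm
here: the control flow I have in my head is roughly the toy below.  All
of the names are invented and the helpers are stubs, so it's only meant
to confirm the check -> repair -> re-check ordering, nothing more.)

/*
 * Control-flow-only model of the check / repair / re-check sequence.
 * Both helpers are stubs; the first check "finds" a problem and the
 * re-check after the rebuild comes back clean.
 */
#include <stdbool.h>
#include <stdio.h>

static int checks_run;

static bool metadata_is_bad(void)
{
	return checks_run++ == 0;	/* pretend scrub result */
}

static bool rebuild_succeeded(void)
{
	return true;			/* pretend repair result */
}

enum outcome { SCRUB_CLEAN, SCRUB_CORRUPT, SCRUB_REPAIRED };

static enum outcome process_one_item(bool may_repair)
{
	/* step 1: look for corruption */
	if (!metadata_is_bad())
		return SCRUB_CLEAN;
	if (!may_repair)
		return SCRUB_CORRUPT;

	/* step 2: rebuild from other metadata */
	if (!rebuild_succeeded())
		return SCRUB_CORRUPT;

	/* step 3: run the same checks over the rebuilt item */
	return metadata_is_bad() ? SCRUB_CORRUPT : SCRUB_REPAIRED;
}

int main(void)
{
	printf("outcome: %d\n", (int)process_one_item(true));
	return 0;
}
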
> 
> > > +
> > > +Primary metadata objects are the simplest for scrub to process.
> > > +The principal filesystem object (either an allocation group or
> > > an
> > > inode) that
> > > +owns the item being scrubbed is locked to guard against
> > > concurrent
> > > updates.
> > > +The check function examines every record associated with the
> > > type
> > > for obvious
> > > +errors and cross-references healthy records against other
> > > metadata
> > > to look for
> > > +inconsistencies.
> > > +Repairs for this class of scrub item are simple, since the
> > > repair
> > > function
> > > +starts by holding all the resources acquired in the previous
> > > step.
> > > +The repair function scans available metadata as needed to record
> > > all
> > > the
> > > +observations needed to complete the structure.
> > > +Next, it stages the observations in a new ondisk structure and
> > > commits it
> > > +atomically to complete the repair.
> > > +Finally, the storage from the old data structure is carefully
> > > reaped.
> > > +
> > > +Because ``xfs_scrub`` locks a primary object for the duration of
> > > the
> > > repair,
> > > +this is effectively an offline repair operation performed on a
> > > subset of the
> > > +filesystem.
> > > +This minimizes the complexity of the repair code because it is
> > > not
> > > necessary to
> > > +handle concurrent updates from other threads, nor is it
> > > necessary to
> > > access
> > > +any other part of the filesystem.
> > > +As a result, indexed structures can be rebuilt very quickly, and
> > > programs
> > > +trying to access the damaged structure will be blocked until
> > > repairs
> > > complete.
> > > +The only infrastructure needed by the repair code are the
> > > staging
> > > area for
> > > +observations and a means to write new structures to disk.
> > > +Despite these limitations, the advantage that online repair
> > > holds is
> > > clear:
> > > +targeted work on individual shards of the filesystem avoids
> > > total
> > > loss of
> > > +service.
> > > +
> > > +This mechanism is described in section 2.1 ("Off-Line
> > > Algorithm") of
> > > +V. Srinivasan and M. J. Carey, `"Performance of On-Line Index
> > > Construction
> > > +Algorithms" <https://dl.acm.org/doi/10.5555/645336.649870>`_,
> > Hmm, this article is not displaying for me.  If the link is
> > abandoned,
> > probably there's not much need to keep it around
> 
> The actual paper is not directly available through that ACM link, but
> the DOI is what I used to track down a paper copy(!) of that paper as
> published in a journal.
> 
> (In turn, that journal is "Advances in Database Technology - EDBT
> 1992";
> I found it in the NYU library.  Amazingly, they sold it to me.)
Oh I see.  Dave had replied in a separate thread with a pdf version. 
That might be a better link so that people do not have to buy a paper
copy.

> 
> > > +*Extending Database Technology*, pp. 293-309, 1992.
> > > +
> > > +Most primary metadata repair functions stage their intermediate
> > > results in an
> > > +in-memory array prior to formatting the new ondisk structure,
> > > which
> > > is very
> > > +similar to the list-based algorithm discussed in section 2.3
> > > ("List-
> > > Based
> > > +Algorithms") of Srinivasan.
> > > +However, any data structure builder that maintains a resource
> > > lock
> > > for the
> > > +duration of the repair is *always* an offline algorithm.
> > > +
> > > +Secondary Metadata
> > > +``````````````````
> > > +
> > > +Metadata structures in this category reflect records found in
> > > primary metadata,
> > 
> > such as rmap and parent pointer attributes.  But they are only
> > needed...
> > 
> > ?
> 
> Euugh, this section needs some restructuring to get rid of redundant
> sentences.  How about:
> 
> "Metadata structures in this category reflect records found in
> primary
> metadata, but are only needed for online fsck or for reorganization
> of
> the filesystem.
> 
> "Secondary metadata include:
> 
> - Reverse mapping information
> 
> - Directory parent pointers
> 
> "This class of metadata is difficult for scrub to process because
> scrub
> attaches to the secondary object but needs to check primary metadata,
> which runs counter to the usual order of resource acquisition.
> Frequently, this means that full filesystems scans are necessary to
> rebuild the metadata.
> Check functions..."

Yes I think that's much clearer :-)

> 
> > > +but are only needed for online fsck or for reorganization of the
> > > filesystem.
> > > +Resource and lock acquisition for scrub code do not follow the
> > > same
> > > order as
> > > +regular filesystem accesses, and may involve full filesystem
> > > scans.
> > > +
> > > +Secondary metadata objects are difficult for scrub to process,
> > > because scrub
> > > +attaches to the secondary object but needs to check primary
> > > metadata, which
> > > +runs counter to the usual order of resource acquisition.
> > bummer :-(
> 
> Yup.
> 
> > > +Check functions can be limited in scope to reduce runtime.
> > > +Repairs, however, require a full scan of primary metadata, which
> > > can
> > > take a
> > > +long time to complete.
> > > +Under these conditions, ``xfs_scrub`` cannot lock resources for
> > > the
> > > entire
> > > +duration of the repair.
> > > +
> > > +Instead, repair functions set up an in-memory staging structure
> > > to
> > > store
> > > +observations.
> > > +Depending on the requirements of the specific repair function,
> > > the
> > > staging
> > 
> > 
> > > +index can have the same format as the ondisk structure, or it
> > > can
> > > have a design
> > > +specific to that repair function.
> > ...will have either the same format as the ondisk structure or a
> > structure specific to the repair function.
> 
> Fixed.
> 
> > > +The next step is to release all locks and start the filesystem
> > > scan.
> > > +When the repair scanner needs to record an observation, the
> > > staging
> > > data are
> > > +locked long enough to apply the update.
> > > +Simultaneously, the repair function hooks relevant parts of the
> > > filesystem to
> > > +apply updates to the staging data if the update pertains to
> > > an
> > > object that
> > > +has already been scanned by the index builder.
> > While a scan is in progress, function hooks are used to apply
> > filesystem updates to both the object and the staging data if the
> > object has already been scanned.
> > 
> > ?
> 
> The hooks are used to apply updates to the repair staging data, but
> they
> don't apply regular filesystem updates.
> 
> The usual process runs something like this:
> 
>   Lock -> update -> update -> commit
> 
> With a scan in progress, say we hook the second update.  The
> instruction
> flow becomes:
> 
>   Lock -> update -> update -> hook -> update staging data -> commit
> 
> Maybe something along the following would be better?
> 
> "While the filesystem scan is in progress, the repair function hooks
> the
> filesystem so that it can apply pending filesystem updates to the
> staging information."
Ok, that sounds clearer then
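
(Restating that in code so I can convince myself I have the ordering
right: I imagine the staging data and the hook looking vaguely like the
toy below.  Every name is made up, and a userspace mutex plus a
realloc'd array stand in for whatever the kernel really uses, so please
read it as a shape rather than the implementation.)

/*
 * Toy model of the staging data plus the live update hook.
 */
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

struct shadow_rec {
	uint64_t	ino;	/* object the observation describes */
	uint64_t	data;	/* whatever the scanner observed */
};

struct shadow_index {
	pthread_mutex_t	lock;		/* guards everything below */
	struct shadow_rec *recs;	/* observations collected so far */
	size_t		nr, cap;
	uint64_t	scan_cursor;	/* highest object the scan has visited */
};

static void shadow_add(struct shadow_index *si, uint64_t ino, uint64_t data)
{
	if (si->nr == si->cap) {
		si->cap = si->cap ? si->cap * 2 : 16;
		si->recs = realloc(si->recs, si->cap * sizeof(*si->recs));
		if (!si->recs)
			abort();
	}
	si->recs[si->nr].ino = ino;
	si->recs[si->nr].data = data;
	si->nr++;
}

/* Scanner side: record an observation and advance the cursor. */
static void shadow_scan_one(struct shadow_index *si, uint64_t ino, uint64_t data)
{
	pthread_mutex_lock(&si->lock);
	shadow_add(si, ino, data);
	si->scan_cursor = ino;
	pthread_mutex_unlock(&si->lock);
}

/*
 * Hook side: called from the regular update path.  Only objects the
 * scanner has already visited get folded into the staging data; the
 * scanner will observe everything past the cursor on its own.
 */
static void shadow_hook(struct shadow_index *si, uint64_t ino, uint64_t data)
{
	pthread_mutex_lock(&si->lock);
	if (ino <= si->scan_cursor)
		shadow_add(si, ino, data);
	pthread_mutex_unlock(&si->lock);
}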

> 
> > > +Once the scan is done, the owning object is re-locked, the live
> > > data
> > > is used to
> > > +write a new ondisk structure, and the repairs are committed
> > > atomically.
> > > +The hooks are disabled and the staging area is freed.
> > > +Finally, the storage from the old data structure is carefully
> > > reaped.
> > > +
> > > +Introducing concurrency helps online repair avoid various
> > > locking
> > > problems, but
> > > +comes at a high cost to code complexity.
> > > +Live filesystem code has to be hooked so that the repair
> > > function
> > > can observe
> > > +updates in progress.
> > > +The staging area has to become a fully functional parallel
> > > structure
> > > so that
> > > +updates can be merged from the hooks.
> > > +Finally, the hook, the filesystem scan, and the inode locking
> > > model
> > > must be
> > > +sufficiently well integrated that a hook event can decide if a
> > > given
> > > update
> > > +should be applied to the staging structure.
> > > +
> > > +In theory, the scrub implementation could apply these same
> > > techniques for
> > > +primary metadata, but doing so would make it massively more
> > > complex
> > > and less
> > > +performant.
> > > +Programs attempting to access the damaged structures are not
> > > blocked
> > > from
> > > +operation, which may cause application failure or an unplanned
> > > filesystem
> > > +shutdown.
> > > +
> > > +Inspiration for the secondary metadata repair strategy was drawn
> > > from section
> > > +2.4 of Srinivasan above, and sections 2 ("NSF: Index Build
> > > Without Side-File")
> > > +and 3.1.1 ("Duplicate Key Insert Problem") in C. Mohan,
> > > `"Algorithms
> > > for
> > > +Creating Indexes for Very Large Tables Without Quiescing
> > > Updates"
> > > +<https://dl.acm.org/doi/10.1145/130283.130337>`_, 1992.
> > This one works
> > 
> > > +
> > > +The sidecar index mentioned above bears some resemblance to the
> > > side
> > > file
> > > +method mentioned in Srinivasan and Mohan.
> > > +Their method consists of an index builder that extracts relevant
> > > record data to
> > > +build the new structure as quickly as possible; and an auxiliary
> > > structure that
> > > +captures all updates that would be committed to the index by
> > > other
> > > threads were
> > > +the new index already online.
> > > +After the index building scan finishes, the updates recorded in
> > > the
> > > side file
> > > +are applied to the new index.
> > > +To avoid conflicts between the index builder and other writer
> > > threads, the
> > > +builder maintains a publicly visible cursor that tracks the
> > > progress
> > > of the
> > > +scan through the record space.
> > > +To avoid duplication of work between the side file and the index
> > > builder, side
> > > +file updates are elided when the record ID for the update is
> > > greater
> > > than the
> > > +cursor position within the record ID space.
> > > +
> > > +To minimize changes to the rest of the codebase, XFS online
> > > repair
> > > keeps the
> > > +replacement index hidden until it's completely ready to go.
> > > +In other words, there is no attempt to expose the keyspace of
> > > the
> > > new index
> > > +while repair is running.
> > > +The complexity of such an approach would be very high and
> > > perhaps
> > > more
> > > +appropriate to building *new* indices.
> > > +
> > > +**Question**: Can the full scan and live update code used to
> > > facilitate a
> > > +repair also be used to implement a comprehensive check?
> > > +
> > > +*Answer*: Probably, though this has not yet been studied.
> > I kinda feel like discussion Q&As need to be wrapped up before we
> > can
> > call things done.  If this is all there was to the answer, then
> > let's
> > clean out the discussion notes.
> 
> Oh, the situation here is worse than that -- in theory, check would
> be
> much stronger if each scrub function employed these live scans to
> build
> a shadow copy of the metadata and then compared the records of both.
> 
> However, that increases the amount of work each scrubber has to do
> much
> higher, and the runtime of those scrubbers would go up.  The other
> issue
> is that live scan hooks would have to proliferate through much more
> of
> the filesystem.  That's rather more invasive to the codebase than
> most
> of fsck, so I want people to look at the usage models for the handful
> of
> scrubbers that really require it before I spread it around elsewhere.
> Making that kind of change isn't that difficult, but I want to merge
> this stuff before moving on to experimenting with improvements of
> that
> scale.

I see, well maybe it would be appropriate to just call it a possible
future improvement for now, depending on how the use cases go and if
the demand for it arises.

> 
> > > +
> > > +Summary Information
> > > +```````````````````
> > > +
> > Oh, perhaps this section could move up with the other metadata
> > definitions.  That way the reader already has an idea of what these
> > terms are referring to before we get into how they are used during
> > the
> > phases.
> 
> Yeah, I think/hope this will be less of a problem now that section 1
> defines all three types of metadata.  The start of this section now
> reads:
> 
> "Metadata structures in this last category summarize the contents of
> primary metadata records.
> These are often used to speed up resource usage queries, and are many
> times smaller than the primary metadata which they represent.
> 
> Examples of summary information include:
> 
> - Summary counts of free space and inodes
> 
> - File link counts from directories
> 
> - Quota resource usage counts
> 
> "Check and repair require full filesystem scans, but resource and
> lock
> acquisition follow the same paths as regular filesystem accesses."
Sounds good, I think that will help a lot
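
(Again just for my own notes, the summary-counter flavor of this sounds
like the trivial pattern below: re-derive the value from a scan and
compare it to what's on disk.  The per-AG helper and the numbers are
stand-ins, so illustrative only.)

/*
 * Trivial model of a summary counter check: re-derive the value from a
 * (pretend) scan of primary metadata and compare with the ondisk copy.
 */
#include <stdint.h>
#include <stdio.h>

#define AG_COUNT	4

static uint64_t free_blocks_in_ag(uint32_t agno)
{
	(void)agno;
	return 1000;	/* stand-in for walking the free space btrees */
}

int main(void)
{
	uint64_t ondisk_fdblocks = 3990;	/* pretend superblock value */
	uint64_t observed = 0;

	for (uint32_t agno = 0; agno < AG_COUNT; agno++)
		observed += free_blocks_in_ag(agno);

	if (observed != ondisk_fdblocks)
		printf("free block summary is stale: ondisk %llu, observed %llu\n",
		       (unsigned long long)ondisk_fdblocks,
		       (unsigned long long)observed);
	return 0;
}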

> 
> > > +Metadata structures in this last category summarize the contents
> > > of
> > > primary
> > > +metadata records.
> > > +These are often used to speed up resource usage queries, and are
> > > many times
> > > +smaller than the primary metadata which they represent.
> > > +Check and repair both require full filesystem scans, but
> > > resource
> > > and lock
> > > +acquisition follow the same paths as regular filesystem
> > > accesses.
> > > +
> > > +The superblock summary counters have special requirements due to
> > > the
> > > underlying
> > > +implementation of the incore counters, and will be treated
> > > separately.
> > > +Check and repair of the other types of summary counters (quota
> > > resource counts
> > > +and file link counts) employ the same filesystem scanning and
> > > hooking
> > > +techniques as outlined above, but because the underlying data
> > > are
> > > sets of
> > > +integer counters, the staging data need not be a fully
> > > functional
> > > mirror of the
> > > +ondisk structure.
> > > +
> > > +Inspiration for quota and file link count repair strategies were
> > > drawn from
> > > +sections 2.12 ("Online Index Operations") through 2.14
> > > ("Incremental
> > > View
> > > +Maintenace") of G.  Graefe, `"Concurrent Queries and Updates in
> > > Summary Views
> > > +and Their Indexes"
> > > +<http://www.odbms.org/wp-content/uploads/2014/06/Increment-locks.pdf>`_, 2011.
> > I wonder if these citations would do better as footnotes?  Just to
> > kinda keep the body of the document tidy and flowing well.
> 
> Yes, if this were a paginated document.
> 
> > > +
> > > +Since quotas are non-negative integer counts of resource usage,
> > > online
> > > +quotacheck can use the incremental view deltas described in
> > > section
> > > 2.14 to
> > > +track pending changes to the block and inode usage counts in
> > > each
> > > transaction,
> > > +and commit those changes to a dquot side file when the
> > > transaction
> > > commits.
> > > +Delta tracking is necessary for dquots because the index builder
> > > scans inodes,
> > > +whereas the data structure being rebuilt is an index of dquots.
> > > +Link count checking combines the view deltas and commit step
> > > into
> > > one because
> > > +it sets attributes of the objects being scanned instead of
> > > writing
> > > them to a
> > > +separate data structure.
> > > +Each online fsck function will be discussed as case studies
> > > later in
> > > this
> > > +document.
> > > +
> > > +Risk Management
> > > +---------------
> > > +
> > > +During the development of online fsck, several risk factors were
> > > identified
> > > +that may make the feature unsuitable for certain distributors
> > > and
> > > users.
> > > +Steps can be taken to mitigate or eliminate those risks, though
> > > at a
> > > cost to
> > > +functionality.
> > > +
> > > +- **Decreased performance**: Adding metadata indices to the
> > > filesystem
> > > +  increases the time cost of persisting changes to disk, and the
> > > reverse space
> > > +  mapping and directory parent pointers are no exception.
> > > +  System administrators who require the maximum performance can
> > > disable the
> > > +  reverse mapping features at format time, though this choice
> > > dramatically
> > > +  reduces the ability of online fsck to find inconsistencies and
> > > repair them.
> > > +
> > > +- **Incorrect repairs**: As with all software, there might be
> > > defects in the
> > > +  software that result in incorrect repairs being written to the
> > > filesystem.
> > > +  Systematic fuzz testing (detailed in the next section) is
> > > employed
> > > by the
> > > +  authors to find bugs early, but it might not catch everything.
> > > +  The kernel build system provides Kconfig options
> > > (``CONFIG_XFS_ONLINE_SCRUB``
> > > +  and ``CONFIG_XFS_ONLINE_REPAIR``) to enable distributors to
> > > choose
> > > not to
> > > +  accept this risk.
> > > +  The xfsprogs build system has a configure option (``--enable-
> > > scrub=no``) that
> > > +  disables building of the ``xfs_scrub`` binary, though this is
> > > not
> > > a risk
> > > +  mitigation if the kernel functionality remains enabled.
> > > +
> > > +- **Inability to repair**: Sometimes, a filesystem is too badly
> > > damaged to be
> > > +  repairable.
> > > +  If the keyspaces of several metadata indices overlap in some
> > > manner but a
> > > +  coherent narrative cannot be formed from records collected,
> > > then
> > > the repair
> > > +  fails.
> > > +  To reduce the chance that a repair will fail with a dirty
> > > transaction and
> > > +  render the filesystem unusable, the online repair functions
> > > have
> > > been
> > > +  designed to stage and validate all new records before
> > > committing
> > > the new
> > > +  structure.
> > > +
> > > +- **Misbehavior**: Online fsck requires many privileges -- raw
> > > IO to
> > > block
> > > +  devices, opening files by handle, ignoring Unix discretionary
> > > access control,
> > > +  and the ability to perform administrative changes.
> > > +  Running this automatically in the background scares people, so
> > > the
> > > systemd
> > > +  background service is configured to run with only the
> > > privileges
> > > required.
> > > +  Obviously, this cannot address certain problems like the
> > > kernel
> > > crashing or
> > > +  deadlocking, but it should be sufficient to prevent the scrub
> > > process from
> > > +  escaping and reconfiguring the system.
> > > +  The cron job does not have this protection.
> > > +
> > 
> > I think the fuzz part is one I would consider letting go.  All
> > features
> > need to go through a period of stabilizing, and we can't really
> > control
> > how some people respond to it, so I don't think this part adds
> > much.  I
> > think the document would do well to be trimmed where it can so as
> > to
> > stay more focused.
> 
> It took me a minute to realize that this comment applies to the text
> below it.  Right?
Yes, sorry for confusion :-)

> 
> > > +- **Fuzz Kiddiez**: There are many people now who seem to think
> > > that
> > > running
> > > +  automated fuzz testing of ondisk artifacts to find mischievous
> > > behavior and
> > > +  spraying exploit code onto the public mailing list for instant
> > > zero-day
> > > +  disclosure is somehow of some social benefit.
> 
> I want to keep this bit because it keeps happening[2].  Some folks
> (huawei/alibaba?) have started to try to fix the bugs that their
> robots
> find, and kudos to them!
> 
> You might have noticed that Googlers turned their firehose back on
> and
> once again aren't doing anything to fix the problems they find.  How
> very Googley of them.
> 
> [2] https://lwn.net/Articles/904293/

Alrighty then
> 
> > > +  In the view of this author, the benefit is realized only when
> > > the
> > > fuzz
> > > +  operators help to **fix** the flaws, but this opinion
> > > apparently
> > > is not
> > > +  widely shared among security "researchers".
> > > +  The XFS maintainers' continuing ability to manage these events
> > > presents an
> > > +  ongoing risk to the stability of the development process.
> > > +  Automated testing should front-load some of the risk while the
> > > feature is
> > > +  considered EXPERIMENTAL.
> > > +
> > > +Many of these risks are inherent to software programming.
> > > +Despite this, it is hoped that this new functionality will prove
> > > useful in
> > > +reducing unexpected downtime.
> > > 
> > 
> > Paraphrasing and reorganizing suggestions aside, I think it looks
> > pretty good
> 
> Ok, thank you!
> 
> --D
> 
> > Allison


^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 03/14] xfs: document the testing plan for online fsck
  2022-12-30 22:10   ` [PATCH 03/14] xfs: document the testing plan " Darrick J. Wong
@ 2023-01-18  0:03     ` Allison Henderson
  2023-01-18  2:38       ` Darrick J. Wong
  0 siblings, 1 reply; 220+ messages in thread
From: Allison Henderson @ 2023-01-18  0:03 UTC (permalink / raw)
  To: djwong
  Cc: Catherine Hoang, david, willy, linux-xfs, Chandan Babu,
	linux-fsdevel, hch

On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Start the third chapter of the online fsck design documentation. 
> This
> covers the testing plan to make sure that both online and offline
> fsck
> can detect arbitrary problems and correct them without making things
> worse.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  .../filesystems/xfs-online-fsck-design.rst         |  187
> ++++++++++++++++++++
>  1 file changed, 187 insertions(+)
> 
> 
> diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst
> b/Documentation/filesystems/xfs-online-fsck-design.rst
> index a03a7b9f0250..d630b6bdbe4a 100644
> --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> @@ -563,3 +563,190 @@ functionality.
>  Many of these risks are inherent to software programming.
>  Despite this, it is hoped that this new functionality will prove
> useful in
>  reducing unexpected downtime.
> +
> +3. Testing Plan
> +===============
> +
> +As stated before, fsck tools have three main goals:
> +
> +1. Detect inconsistencies in the metadata;
> +
> +2. Eliminate those inconsistencies; and
> +
> +3. Minimize further loss of data.
> +
> +Demonstrations of correct operation are necessary to build users'
> confidence
> +that the software behaves within expectations.
> +Unfortunately, it was not really feasible to perform regular
> exhaustive testing
> +of every aspect of a fsck tool until the introduction of low-cost
> virtual
> +machines with high-IOPS storage.
> +With ample hardware availability in mind, the testing strategy for
> the online
> +fsck project involves differential analysis against the existing
> fsck tools and
> +systematic testing of every attribute of every type of metadata
> object.
> +Testing can be split into four major categories, as discussed below.
> +
> +Integrated Testing with fstests
> +-------------------------------
> +
> +The primary goal of any free software QA effort is to make testing
> as
> +inexpensive and widespread as possible to maximize the scaling
> advantages of
> +community.
> +In other words, testing should maximize the breadth of filesystem
> configuration
> +scenarios and hardware setups.
> +This improves code quality by enabling the authors of online fsck to
> find and
> +fix bugs early, and helps developers of new features to find
> integration
> +issues earlier in their development effort.
> +
> +The Linux filesystem community shares a common QA testing suite,
> +`fstests
> <https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/>`_, for
> +functional and regression testing.
> +Even before development work began on online fsck, fstests (when run
> on XFS)
> +would run both the ``xfs_check`` and ``xfs_repair -n`` commands on
> the test and
> +scratch filesystems between each test.
> +This provides a level of assurance that the kernel and the fsck
> tools stay in
> +alignment about what constitutes consistent metadata.
> +During development of the online checking code, fstests was modified
> to run
> +``xfs_scrub -n`` between each test to ensure that the new checking
> code
> +produces the same results as the two existing fsck tools.
> +
> +To start development of online repair, fstests was modified to run
> +``xfs_repair`` to rebuild the filesystem's metadata indices between
> tests.
> +This ensures that offline repair does not crash, leave a corrupt
> filesystem
> +after it exits, or trigger complaints from the online check.
> +This also established a baseline for what can and cannot be repaired
> offline.
> +To complete the first phase of development of online repair, fstests
> was
> +modified to be able to run ``xfs_scrub`` in a "force rebuild" mode.
> +This enables a comparison of the effectiveness of online repair as
> compared to
> +the existing offline repair tools.
> +
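Purely for illustration, the between-test checking and rebuilding described
above corresponds roughly to an fstests configuration like the sketch below.
The TEST_DEV/SCRATCH_DEV variables are standard fstests settings; the two
scrub/rebuild knobs are shown with hypothetical names and may not match the
real option names.

# local.config sketch (illustrative only)
TEST_DEV=/dev/sda1
TEST_DIR=/mnt/test
SCRATCH_DEV=/dev/sdb1
SCRATCH_MNT=/mnt/scratch

# hypothetical knobs: also run xfs_scrub -n after each test, and
# rebuild the metadata indices between tests
TEST_XFS_SCRUB=1
TEST_XFS_REPAIR_REBUILD=1
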
> +General Fuzz Testing of Metadata Blocks
> +---------------------------------------
> +
> +XFS benefits greatly from having a very robust debugging tool,
> ``xfs_db``.
> +
> +Before development of online fsck even began, a set of fstests were
> created
> +to test the rather common fault that entire metadata blocks get
> corrupted.
> +This required the creation of fstests library code that can create a
> filesystem
> +containing every possible type of metadata object.
> +Next, individual test cases were created to create a test
> filesystem, identify
> +a single block of a specific type of metadata object, trash it with
> the
> +existing ``blocktrash`` command in ``xfs_db``, and test the reaction
> of a
> +particular metadata validation strategy.
> +
> +This earlier test suite enabled XFS developers to test the ability
> of the
> +in-kernel validation functions and the ability of the offline fsck
> tool to
> +detect and eliminate the inconsistent metadata.
> +This part of the test suite was extended to cover online fsck in
> exactly the
> +same manner.
> +
> +In other words, for a given fstests filesystem configuration:
> +
> +* For each metadata object existing on the filesystem:
> +
> +  * Write garbage to it
> +
> +  * Test the reactions of:
> +
> +    1. The kernel verifiers to stop obviously bad metadata
> +    2. Offline repair (``xfs_repair``) to detect and fix
> +    3. Online repair (``xfs_scrub``) to detect and fix
> +
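To make that loop concrete, a single iteration might look like the minimal
shell sketch below.  The _scratch_* helper names and the xfs_db arguments are
assumptions for the sake of the example, not quotes of the actual fstests
library code.

# Sketch: corrupt one metadata block and test one of the reactions.
# The real tests exercise the kernel verifiers, xfs_repair, and
# xfs_scrub each against a freshly corrupted filesystem.
_scratch_mkfs > /dev/null
_scratch_populate                  # assumed helper: create every metadata type
_scratch_unmount

# Write garbage over one block of the chosen type (here, AGF 0).
_scratch_xfs_db -x -c 'agf 0' -c 'blocktrash'

# Reaction under test: online check and repair.
_scratch_mount
_scratch_scrub
_scratch_unmount

# A clean re-check afterwards confirms that the repair actually stuck.
xfs_repair -n "$SCRATCH_DEV"
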
> +Targeted Fuzz Testing of Metadata Records
> +-----------------------------------------
> +
> +A quick conversation with the other XFS developers revealed that the
> existing
> +test infrastructure could be extended to provide 

"The testing plan for ofsck includes extending the existing test 
infrastructure to provide..."

Took me a moment to notice we're not talking about history any more....

> a much more powerful
> +facility: targeted fuzz testing of every metadata field of every
> metadata
> +object in the filesystem.
> +``xfs_db`` can modify every field of every metadata structure in
> every
> +block in the filesystem to simulate the effects of memory corruption
> and
> +software bugs.
> +Given that fstests already contains the ability to create a
> filesystem
> +containing every metadata format known to the filesystem, ``xfs_db``
> can be
> +used to perform exhaustive fuzz testing!
> +
> +For a given fstests filesystem configuration:
> +
> +* For each metadata object existing on the filesystem...
> +
> +  * For each record inside that metadata object...
> +
> +    * For each field inside that record...
> +
> +      * For each conceivable type of transformation that can be
> applied to a bit field...
> +
> +        1. Clear all bits
> +        2. Set all bits
> +        3. Toggle the most significant bit
> +        4. Toggle the middle bit
> +        5. Toggle the least significant bit
> +        6. Add a small quantity
> +        7. Subtract a small quantity
> +        8. Randomize the contents
> +
> +        * ...test the reactions of:
> +
> +          1. The kernel verifiers to stop obviously bad metadata
> +          2. Offline checking (``xfs_repair -n``)
> +          3. Offline repair (``xfs_repair``)
> +          4. Online checking (``xfs_scrub -n``)
> +          5. Online repair (``xfs_scrub``)
> +          6. Both repair tools (``xfs_scrub`` and then
> ``xfs_repair`` if online repair doesn't succeed)
I like the indented bullet list format tho

> +
> +This is quite the combinatoric explosion!
> +
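A minimal sketch of the innermost part of this nest, assuming xfs_db's field
fuzzing command and an fstests-style _scratch_xfs_db wrapper; the field name
and the exact fuzz syntax are illustrative only:

# Apply each transformation to one field of one structure, then let
# one of the checkers react.  The real tests run every checker listed
# above against its own freshly fuzzed filesystem.
FUZZ_VERBS="zeroes ones firstbit middlebit lastbit add sub random"

for verb in $FUZZ_VERBS; do
	_scratch_unmount
	# Assumed xfs_db usage: fuzz one field of AGF 0.
	_scratch_xfs_db -x -c 'agf 0' -c "fuzz -d refcntroot $verb"

	# Reaction under test: online check, then online repair.
	_scratch_mount
	xfs_scrub -n "$SCRATCH_MNT"
	xfs_scrub "$SCRATCH_MNT"
done
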
> +Fortunately, having this much test coverage makes it easy for XFS
> developers to
> +check the responses of XFS' fsck tools.
> +Since the introduction of the fuzz testing framework, these tests
> have been
> +used to discover incorrect repair code and missing functionality for
> entire
> +classes of metadata objects in ``xfs_repair``.
> +The enhanced testing was used to finalize the deprecation of
> ``xfs_check`` by
> +confirming that ``xfs_repair`` could detect at least as many
> corruptions as
> +the older tool.
> +
> +These tests have been very valuable for ``xfs_scrub`` in the same
> ways -- they
> +allow the online fsck developers to compare online fsck against
> offline fsck,
> +and they enable XFS developers to find deficiencies in the code
> base.
> +
> +Proposed patchsets include
> +`general fuzzer improvements
> +<
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.g
> it/log/?h=fuzzer-improvements>`_,
> +`fuzzing baselines
> +<
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.g
> it/log/?h=fuzz-baseline>`_,
> +and `improvements in fuzz testing comprehensiveness
> +<
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.g
> it/log/?h=more-fuzz-testing>`_.
> +
> +Stress Testing
> +--------------
> +
> +A requirement unique to online fsck is the ability to operate on a
> filesystem
> +concurrently with regular workloads.
> +Although it is of course impossible to run ``xfs_scrub`` with *zero*
> observable
> +impact on the running system, the online repair code should never
> introduce
> +inconsistencies into the filesystem metadata, and regular workloads
> should
> +never notice resource starvation.
> +To verify that these conditions are being met, fstests has been
> enhanced in
> +the following ways:
> +
> +* For each scrub item type, create a test to exercise checking that
> item type
> +  while running ``fsstress``.
> +* For each scrub item type, create a test to exercise repairing that
> item type
> +  while running ``fsstress``.
> +* Race ``fsstress`` and ``xfs_scrub -n`` to ensure that checking the
> whole
> +  filesystem doesn't cause problems.
> +* Race ``fsstress`` and ``xfs_scrub`` in force-rebuild mode to
> ensure that
> +  force-repairing the whole filesystem doesn't cause problems.
> +* Race ``xfs_scrub`` in check and force-repair mode against
> ``fsstress`` while
> +  freezing and thawing the filesystem.
> +* Race ``xfs_scrub`` in check and force-repair mode against
> ``fsstress`` while
> +  remounting the filesystem read-only and read-write.
> +* The same, but running ``fsx`` instead of ``fsstress``.  (Not done
> yet?)
> +
> +Success is defined by the ability to run all of these tests without
> observing
> +any unexpected filesystem shutdowns due to corrupted metadata,
> kernel hang
> +check warnings, or any other sort of mischief.
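
One of the freeze/thaw races above might be sketched as follows; the fsstress
options, loop counts, and the absence of the usual fstests scaffolding are all
simplifications for illustration:

# Race a read-only whole-filesystem scrub against fsstress while
# repeatedly freezing and thawing the filesystem.
$FSSTRESS_PROG -d "$SCRATCH_MNT" -p 8 -n 100000 &
stress_pid=$!

(
	for ((i = 0; i < 30; i++)); do
		xfs_freeze -f "$SCRATCH_MNT"
		sleep 1
		xfs_freeze -u "$SCRATCH_MNT"
	done
) &
freeze_pid=$!

while kill -0 "$stress_pid" 2>/dev/null; do
	xfs_scrub -n "$SCRATCH_MNT" || break
done

wait "$freeze_pid"
kill "$stress_pid" 2>/dev/null
wait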

Seems reasonable.  Other than the one nit, I think this section reads
pretty well.
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>

Allison
> +
> +Proposed patchsets include `general stress testing
> +<
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.g
> it/log/?h=race-scrub-and-mount-state-changes>`_
> +and the `evolution of existing per-function stress testing
> +<
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.g
> it/log/?h=refactor-scrub-stress>`_.
> 


^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 04/14] xfs: document the user interface for online fsck
  2022-12-30 22:10   ` [PATCH 04/14] xfs: document the user interface for online fsck Darrick J. Wong
@ 2023-01-18  0:03     ` Allison Henderson
  2023-01-18  2:42       ` Darrick J. Wong
  0 siblings, 1 reply; 220+ messages in thread
From: Allison Henderson @ 2023-01-18  0:03 UTC (permalink / raw)
  To: djwong
  Cc: Catherine Hoang, david, willy, linux-xfs, Chandan Babu,
	linux-fsdevel, hch

On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Start the fourth chapter of the online fsck design documentation,
> which
> discusses the user interface and the background scrubbing service.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  .../filesystems/xfs-online-fsck-design.rst         |  114
> ++++++++++++++++++++
>  1 file changed, 114 insertions(+)
> 
> 
> diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst
> b/Documentation/filesystems/xfs-online-fsck-design.rst
> index d630b6bdbe4a..42e82971e036 100644
> --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> @@ -750,3 +750,117 @@ Proposed patchsets include `general stress
> testing
>  <
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.g
> it/log/?h=race-scrub-and-mount-state-changes>`_
>  and the `evolution of existing per-function stress testing
>  <
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.g
> it/log/?h=refactor-scrub-stress>`_.
> +
> +4. User Interface
> +=================
> +
> +The primary user of online fsck is the system administrator, just
> like offline
> +repair.
> +Online fsck presents two modes of operation to administrators:
> +A foreground CLI process for online fsck on demand, and a background
> service
> +that performs autonomous checking and repair.
> +
> +Checking on Demand
> +------------------
> +
> +For administrators who want the absolute freshest information about
> the
> +metadata in a filesystem, ``xfs_scrub`` can be run as a foreground
> process on
> +a command line.
> +The program checks every piece of metadata in the filesystem while
> the
> +administrator waits for the results to be reported, just like the
> existing
> +``xfs_repair`` tool.
> +Both tools share a ``-n`` option to perform a read-only scan, and a
> ``-v``
> +option to increase the verbosity of the information reported.
> +
> +A new feature of ``xfs_scrub`` is the ``-x`` option, which employs
> the error
> +correction capabilities of the hardware to check data file contents.
> +The media scan is not enabled by default because it may dramatically
> increase
> +program runtime and consume a lot of bandwidth on older storage
> hardware.
> +
> +The output of a foreground invocation is captured in the system log.
> +
> +The ``xfs_scrub_all`` program walks the list of mounted filesystems
> and
> +initiates ``xfs_scrub`` for each of them in parallel.
> +It serializes scans for any filesystems that resolve to the same top
> level
> +kernel block device to prevent resource overconsumption.
> +
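In practice the foreground invocations described here reduce to commands like
the following (the mount point is only an example):

# Read-only check of a mounted filesystem, with verbose reporting:
xfs_scrub -n -v /mnt

# Also perform the optional media scan of file data contents:
xfs_scrub -x /mnt
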
> +Background Service
> +------------------
> +
I'm assuming the below systemd services are configurable right?
> +To reduce the workload of system administrators, the ``xfs_scrub``
> package
> +provides a suite of `systemd <https://systemd.io/>`_ timers and
> services that
> +run online fsck automatically on weekends.
by default.

> +The background service configures scrub to run with as little
> privilege as
> +possible, the lowest CPU and IO priority, and in a CPU-constrained
> single
> +threaded mode.
"This can be tuned at anytime to best suit the needs of the customer
workload."

Then I think you can drop the below line...
> +It is hoped that this minimizes the amount of load generated on the
> system and
> +avoids starving regular workloads.
> +
> +The output of the background service is also captured in the system
> log.
> +If desired, reports of failures (either due to inconsistencies or
> mere runtime
> +errors) can be emailed automatically by setting the ``EMAIL_ADDR``
> environment
> +variable in the following service files:
> +
> +* ``xfs_scrub_fail@.service``
> +* ``xfs_scrub_media_fail@.service``
> +* ``xfs_scrub_all_fail.service``
> +
> +The decision to enable the background scan is left to the system
> administrator.
> +This can be done by enabling either of the following services:
> +
> +* ``xfs_scrub_all.timer`` on systemd systems
> +* ``xfs_scrub_all.cron`` on non-systemd systems
> +
> +This automatic weekly scan is configured out of the box to perform
> an
> +additional media scan of all file data once per month.
> +This is less foolproof than, say, storing file data block checksums,
> but much
> +more performant if application software provides its own integrity
> checking,
> +redundancy can be provided elsewhere above the filesystem, or the
> storage
> +device's integrity guarantees are deemed sufficient.
> +
> +The systemd unit file definitions have been subjected to a security
> audit
> +(as of systemd 249) to ensure that the xfs_scrub processes have as
> little
> +access to the rest of the system as possible.
> +This was performed via ``systemd-analyze security``, after which
> privileges
> +were restricted to the minimum required, sandboxing was set up to
> the maximal
> +extent possible with system call filtering; and
> access to the
> +filesystem tree was restricted to the minimum needed to start the
> program and
> +access the filesystem being scanned.
> +The service definition files restrict CPU usage to 80% of one CPU
> core, and
> +apply as nice of a priority to IO and CPU scheduling as possible.
> +This measure was taken to minimize delays in the rest of the
> filesystem.
> +No such hardening has been performed for the cron job.
> +
> +Proposed patchset:
> +`Enabling the xfs_scrub background service
> +<
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.g
> it/log/?h=scrub-media-scan-service>`_.
> +
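For example, on a systemd system the weekly scan is switched on by enabling
the timer unit named above; the drop-in shown for EMAIL_ADDR is an assumed
convention rather than something mandated by the patch:

# Enable the weekly background scrub:
systemctl enable --now xfs_scrub_all.timer

# Assumed approach: have failure reports emailed by adding a drop-in
# for the failure service, e.g. via `systemctl edit xfs_scrub_fail@.service`:
#   [Service]
#   Environment=EMAIL_ADDR=admin@example.com
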
> +Health Reporting
> +----------------
> +
> +XFS caches a summary of each filesystem's health status in memory.
> +The information is updated whenever ``xfs_scrub`` is run, or
> whenever
> +inconsistencies are detected in the filesystem metadata during
> regular
> +operations.
> +System administrators should use the ``health`` command of
> ``xfs_spaceman`` to
> +download this information into a human-readable format.
> +If problems have been observed, the administrator can schedule a
> reduced
> +service window to run the online repair tool to correct the problem.
> +Failing that, the administrator can decide to schedule a maintenance
> window to
> +run the traditional offline repair tool to correct the problem.
> +
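As a concrete example of this workflow (xfs_spaceman takes xfs_io-style -c
commands; the mount point and the follow-up repair step are only
illustrative):

# Dump the kernel's cached health report for a mounted filesystem:
xfs_spaceman -c 'health' /mnt

# If problems are reported, run the online repair tool during a
# reduced service window:
xfs_scrub /mnt
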
> +**Question**: Should the health reporting integrate with the new
> inotify fs
> +error notification system?
> +
> +**Question**: Would it be helpful for sysadmins to have a daemon to
> listen for
> +corruption notifications and initiate a repair?
> +
> +*Answer*: These questions remain unanswered, but should be a part of
> the
> +conversation with early adopters and potential downstream users of
> XFS.
I think if there's been no commentary at this point then likely they
can't be answered at this time.  Perhaps for now it is reasonable to
just let them be a potential improvement in the future if the demand for
it arises. In any case, I think we should probably clean out the Q&A
discussion prompts.

Rest looks good tho
Allison

> +
> +Proposed patchsets include
> +`wiring up health reports to correction returns
> +<
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> log/?h=corruption-health-reports>`_
> +and
> +`preservation of sickness info during memory reclaim
> +<
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> log/?h=indirect-health-reporting>`_.
> 


^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 01/14] xfs: document the motivation for online fsck design
  2023-01-18  0:03         ` Allison Henderson
@ 2023-01-18  1:29           ` Darrick J. Wong
  0 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2023-01-18  1:29 UTC (permalink / raw)
  To: Allison Henderson
  Cc: Catherine Hoang, david, willy, linux-xfs, Chandan Babu,
	linux-fsdevel, hch

On Wed, Jan 18, 2023 at 12:03:09AM +0000, Allison Henderson wrote:
> On Wed, 2023-01-11 at 11:10 -0800, Darrick J. Wong wrote:
> > On Sat, Jan 07, 2023 at 05:01:54AM +0000, Allison Henderson wrote:
> > > On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <djwong@kernel.org>
> > > > 
> > > > Start the first chapter of the online fsck design documentation.
> > > > This covers the motivations for creating this in the first place.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > > ---
> > > >  Documentation/filesystems/index.rst                |    1 
> > > >  .../filesystems/xfs-online-fsck-design.rst         |  199
> > > > ++++++++++++++++++++
> > > >  2 files changed, 200 insertions(+)
> > > >  create mode 100644 Documentation/filesystems/xfs-online-fsck-
> > > > design.rst
> > > > 
> > > > 
> > > > diff --git a/Documentation/filesystems/index.rst
> > > > b/Documentation/filesystems/index.rst
> > > > index bee63d42e5ec..fbb2b5ada95b 100644
> > > > --- a/Documentation/filesystems/index.rst
> > > > +++ b/Documentation/filesystems/index.rst
> > > > @@ -123,4 +123,5 @@ Documentation for filesystem implementations.
> > > >     vfat
> > > >     xfs-delayed-logging-design
> > > >     xfs-self-describing-metadata
> > > > +   xfs-online-fsck-design
> > > >     zonefs
> > > > diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst
> > > > b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > > new file mode 100644
> > > > index 000000000000..25717ebb5f80
> > > > --- /dev/null
> > > > +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > > @@ -0,0 +1,199 @@
> > > > +.. SPDX-License-Identifier: GPL-2.0
> > > > +.. _xfs_online_fsck_design:
> > > > +
> > > > +..
> > > > +        Mapping of heading styles within this document:
> > > > +        Heading 1 uses "====" above and below
> > > > +        Heading 2 uses "===="
> > > > +        Heading 3 uses "----"
> > > > +        Heading 4 uses "````"
> > > > +        Heading 5 uses "^^^^"
> > > > +        Heading 6 uses "~~~~"
> > > > +        Heading 7 uses "...."
> > > > +
> > > > +        Sections are manually numbered because apparently that's
> > > > what everyone
> > > > +        does in the kernel.
> > > > +
> > > > +======================
> > > > +XFS Online Fsck Design
> > > > +======================
> > > > +
> > > > +This document captures the design of the online filesystem check
> > > > feature for
> > > > +XFS.
> > > > +The purpose of this document is threefold:
> > > > +
> > > > +- To help kernel distributors understand exactly what the XFS
> > > > online
> > > > fsck
> > > > +  feature is, and issues about which they should be aware.
> > > > +
> > > > +- To help people reading the code to familiarize themselves with
> > > > the
> > > > relevant
> > > > +  concepts and design points before they start digging into the
> > > > code.
> > > > +
> > > > +- To help developers maintaining the system by capturing the
> > > > reasons
> > > > +  supporting higher level decisionmaking.
> > > nit: decision making
> > 
> > Fixed.
> > 
> > > > +
> > > > +As the online fsck code is merged, the links in this document to
> > > > topic branches
> > > > +will be replaced with links to code.
> > > > +
> > > > +This document is licensed under the terms of the GNU Public
> > > > License,
> > > > v2.
> > > > +The primary author is Darrick J. Wong.
> > > > +
> > > > +This design document is split into seven parts.
> > > > +Part 1 defines what fsck tools are and the motivations for
> > > > writing a
> > > > new one.
> > > > +Parts 2 and 3 present a high level overview of how online fsck
> > > > process works
> > > > +and how it is tested to ensure correct functionality.
> > > > +Part 4 discusses the user interface and the intended usage modes
> > > > of
> > > > the new
> > > > +program.
> > > > +Parts 5 and 6 show off the high level components and how they
> > > > fit
> > > > together, and
> > > > +then present case studies of how each repair function actually
> > > > works.
> > > > +Part 7 sums up what has been discussed so far and speculates
> > > > about
> > > > what else
> > > > +might be built atop online fsck.
> > > > +
> > > > +.. contents:: Table of Contents
> > > > +   :local:
> > > > +
> > > 
> > > Something that I've noticed in my training sessions is that often
> > > times, less is more.  People really only absorb so much over a
> > > particular duration of time, so sometimes having too much detail in
> > > the
> > > context is not as helpful as you might think.  A lot of times,
> > > paraphrasing excerpts to reflect the same info in a more compact
> > > format
> > > will help you keep audience on track (a little longer at least). 
> > > 
> > > > +1. What is a Filesystem Check?
> > > > +==============================
> > > > +
> > > > +A Unix filesystem has three main jobs: to provide a hierarchy of
> > > > names through
> > > > +which application programs can associate arbitrary blobs of data
> > > > for
> > > > any
> > > > +length of time, to virtualize physical storage media across
> > > > those
> > > > names, and
> > > > +to retrieve the named data blobs at any time.
> > > Consider the following paraphrase:
> > > 
> > > A Unix filesystem has three main jobs:
> > >  * Provide a hierarchy of names by which applications access data
> > > for a
> > > length of time.
> > >  * Store or retrieve that data at any time.
> > >  * Virtualize physical storage media across those names
> > 
> > Ooh, listifying.  I did quite a bit of that to break up the walls of
> > text in earlier revisions, but apparently I missed this one.
> > 
> > > Also... I dont think it would be inappropriate to just skip the
> > > above,
> > > and jump right into fsck.  That's a very limited view of a
> > > filesystem,
> > > likely a reader seeking an fsck doc probably has some idea of what
> > > a fs
> > > is otherwise supposed to be doing.  
> > 
> > This will become part of the general kernel documentation, so we
> > can't
> > assume that all readers are going to know what a fs really does.
> > 
> > "A Unix filesystem has four main responsibilities:
> > 
> > - Provide a hierarchy of names through which application programs can
> >   associate arbitrary blobs of data for any length of time,
> > 
> > - Virtualize physical storage media across those names, and
> > 
> > - Retrieve the named data blobs at any time.
> > 
> > - Examine resource usage.
> > 
> > "Metadata directly supporting these functions (e.g. files,
> > directories,
> > space mappings) are sometimes called primary metadata.
> > Secondary metadata (e.g. reverse mapping and directory parent
> > pointers)
> > support operations internal to the filesystem, such as internal
> > consistency checking and reorganization."
> Sure, I think that sounds good and helps to set up the metadata
> concepts that are discussed later.
> > 
> > (I added those last two sentences in response to a point you made
> > below.)
> > 
> > > > +The filesystem check (fsck) tool examines all the metadata in a
> > > > filesystem
> > > > +to look for errors.
> > > > +Simple tools only check for obvious corruptions, but the more
> > > > sophisticated
> > > > +ones cross-reference metadata records to look for
> > > > inconsistencies.
> > > > +People do not like losing data, so most fsck tools also contain
> > > > some ability
> > > > +to deal with any problems found.
> > > 
> > > While simple tools can detect data corruptions, a filesystem check
> > > (fsck) uses metadata records as a cross-reference to find and
> > > correct
> > > more inconsistencies.
> > > 
> > > ?
> > 
> > Let's be careful with the term 'data corruption' here -- a lot of
> > people
> > (well ok me) will see that as *user* data corruption, whereas we're
> > talking about *metadata* corruption.
> > 
> > I think I'll rework that second sentence further:
> > 
> > "In addition to looking for obvious metadata corruptions, fsck also
> > cross-references different types of metadata records with each other
> > to
> > look for inconsistencies."
> > 
> Alrighty, that sounds good
> 
> > Since the really dumb fscks of the 1970s are a long ways past now.
> > 
> > > > +As a word of caution -- the primary goal of most Linux fsck
> > > > tools is
> > > > to restore
> > > > +the filesystem metadata to a consistent state, not to maximize
> > > > the
> > > > data
> > > > +recovered.
> > > > +That precedent will not be challenged here.
> > > > +
> > > > +Filesystems of the 20th century generally lacked any redundancy
> > > > in
> > > > the ondisk
> > > > +format, which means that fsck can only respond to errors by
> > > > erasing
> > > > files until
> > > > +errors are no longer detected.
> > > > +System administrators avoid data loss by increasing the number
> > > > of
> > > > separate
> > > > +storage systems through the creation of backups; 
> > > 
> > > 
> > > > and they avoid downtime by
> > > > +increasing the redundancy of each storage system through the
> > > > creation of RAID.
> > > Mmm, raids help more for hardware failures right?  They dont really
> > > have a notion of when the fs is corrupted.
> > 
> > Right.
> > 
> > > While an fsck can help
> > > navigate around a corruption possibly caused by a hardware failure,
> > > I
> > > think it's really a different kind of redundancy. I think I'd
> > > probably
> > > drop the last line and keep the selling point focused online
> > > repair.
> > 
> > Yes, RAIDs provide a totally different type of redundancy.  I decided
> > to
> > make this point specifically to counter the people who argue that
> > RAID
> > makes them impervious to corruption problems, etc.
> > 
> > This attitude seemed rather prevalent in the early days of btrfs and
> > a
> > certain other filesystem that Shall Not Be Named, even though the
> > btrfs
> > developers themselves acknowledge this distinction, given the
> > existence
> > of `btrfs scrub' and `btrfs check'.
> > 
> > However you do have a good point that this sentence doesn't add much
> > where it is.  I think I'll add it as a sidebar at the end of the
> > paragraph.
> > 
> > > > +More recent filesystem designs contain enough redundancy in
> > > > their
> > > > metadata that
> > > > +it is now possible to regenerate data structures when non-
> > > > catastrophic errors
> > > > +occur; 
> > > 
> > > 
> > > > this capability aids both strategies.
> > > > +Over the past few years, XFS has added a storage space reverse
> > > > mapping index to
> > > > +make it easy to find which files or metadata objects think they
> > > > own
> > > > a
> > > > +particular range of storage.
> > > > +Efforts are under way to develop a similar reverse mapping index
> > > > for
> > > > the naming
> > > > +hierarchy, which will involve storing directory parent pointers
> > > > in
> > > > each file.
> > > > +With these two pieces in place, XFS uses secondary information
> > > > to
> > > > perform more
> > > > +sophisticated repairs.
> > > This part here I think I would either let go or relocate.  The
> > > topic of
> > > this section is supposed to discuss roughly what a filesystem check
> > > is.
> > > Ideally so we can start talking about how ofsck is different.  It
> > > feels
> > > like a bit of a jump to suddenly hop into rmap and pptrs, and for
> > > "sophisticated repairs" that we havn't really gotten into the
> > > details
> > > of yet.  So I think it would read easier if we saved this part
> > > until we
> > > start talking about how they are used later.  
> > 
> > Agreed.
> > 
> > > > +
> > > > +TLDR; Show Me the Code!
> > > > +-----------------------
> > > > +
> > > > +Code is posted to the kernel.org git trees as follows:
> > > > +`kernel changes
> > > > <
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.g
> > > > it
> > > > /log/?h=repair-symlink>`_,
> > > > +`userspace changes
> > > > <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-
> > > > dev.
> > > > git/log/?h=scrub-media-scan-service>`_, and
> > > > +`QA test changes
> > > > <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-
> > > > dev.
> > > > git/log/?h=repair-dirs>`_.
> > > > +Each kernel patchset adding an online repair function will use
> > > > the
> > > > same branch
> > > > +name across the kernel, xfsprogs, and fstests git repos.
> > > > +
> > > > +Existing Tools
> > > > +--------------
> > > > +
> > > > +The online fsck tool described here will be the third tool in
> > > > the
> > > > history of
> > > > +XFS (on Linux) to check and repair filesystems.
> > > > +Two programs precede it:
> > > > +
> > > > +The first program, ``xfs_check``, was created as part of the XFS
> > > > debugger
> > > > +(``xfs_db``) and can only be used with unmounted filesystems.
> > > > +It walks all metadata in the filesystem looking for
> > > > inconsistencies
> > > > in the
> > > > +metadata, though it lacks any ability to repair what it finds.
> > > > +Due to its high memory requirements and inability to repair
> > > > things,
> > > > this
> > > > +program is now deprecated and will not be discussed further.
> > > > +
> > > > +The second program, ``xfs_repair``, was created to be faster and
> > > > more robust
> > > > +than the first program.
> > > > +Like its predecessor, it can only be used with unmounted
> > > > filesystems.
> > > > +It uses extent-based in-memory data structures to reduce memory
> > > > consumption,
> > > > +and tries to schedule readahead IO appropriately to reduce I/O
> > > > waiting time
> > > > +while it scans the metadata of the entire filesystem.
> > > > +The most important feature of this tool is its ability to
> > > > respond to
> > > > +inconsistencies in file metadata and directory tree by erasing
> > > > things as needed
> > > > +to eliminate problems.
> > > > +Space usage metadata are rebuilt from the observed file
> > > > metadata.
> > > > +
> > > > +Problem Statement
> > > > +-----------------
> > > > +
> > > > +The current XFS tools leave several problems unsolved:
> > > > +
> > > > +1. **User programs** suddenly **lose access** to information in
> > > > the
> > > > computer
> > > > +   when unexpected shutdowns occur as a result of silent
> > > > corruptions
> > > > in the
> > > > +   filesystem metadata.
> > > > +   These occur **unpredictably** and often without warning.
> > > 
> > > 
> > > 1. **User programs** suddenly **lose access** to the filesystem
> > >    when unexpected shutdowns occur as a result of silent
> > > corruptions
> > > that could have otherwise been avoided with an online repair
> > > 
> > > While some of these issues are not untrue, I think it makes sense
> > > to
> > > limit them to the issue you plan to solve, and therefore discuss.
> > 
> > Fair enough, it's not like one loses /all/ the data in the computer.
> > 
> > That said, we're still in the problem definition phase, so I don't
> > want
> > to mention online repair just yet.
> > 
> > > > +2. **Users** experience a **total loss of service** during the
> > > > recovery period
> > > > +   after an **unexpected shutdown** occurs.
> > > > +
> > > > +3. **Users** experience a **total loss of service** if the
> > > > filesystem is taken
> > > > +   offline to **look for problems** proactively.
> > > > +
> > > > +4. **Data owners** cannot **check the integrity** of their
> > > > stored
> > > > data without
> > > > +   reading all of it.
> > > 
> > > > +   This may expose them to substantial billing costs when a
> > > > linear
> > > > media scan
> > > > +   might suffice.
> > > Ok, I had to re-read this one a few times, but I think this reads a
> > > little cleaner:
> > > 
> > >     Customers that are billed for data egress may incur unnecessary
> > > cost when a background media scan on the host may have sufficed
> > > 
> > > ?
> > 
> > "...when a linear media scan performed by the storage system
> > administrator would suffice."
> > 
> That sounds fine to me
> 
> > I was tempted to say "storage owner" instead of "storage system
> > administrator" but that sounded a little too IBM.
> > 
> > > > +5. **System administrators** cannot **schedule** a maintenance
> > > > window to deal
> > > > +   with corruptions if they **lack the means** to assess
> > > > filesystem
> > > > health
> > > > +   while the filesystem is online.
> > > > +
> > > > +6. **Fleet monitoring tools** cannot **automate periodic
> > > > checks** of
> > > > filesystem
> > > > +   health when doing so requires **manual intervention** and
> > > > downtime.
> > > > +
> > > > +7. **Users** can be tricked into **doing things they do not
> > > > desire**
> > > > when
> > > > +   malicious actors **exploit quirks of Unicode** to place
> > > > misleading names
> > > > +   in directories.
> > > hrmm, I guess I'm not immediately extrapolating what things users
> > > are
> > > being tricked into doing, or how ofsck solves this?  Otherwise I
> > > might
> > > drop the last one here, I think the rest of the bullets are plenty
> > > of
> > > motivation.
> > 
> > The doc gets into this later[1], but it's possible to create two
> > entries
> > within the same directory that have different byte sequences in the
> > name
> > but render identically in file choosers.  These pathnames:
> > 
> > /home/djwong/Downloads/rustup.sh
> > /home/djwong/Downloads/rus<zero width space>tup.sh
> > 
> > refer to different files, but a naïve file open dialog will render
> > them
> > identically as "rustup.sh".  If the first is the Rust installer and
> > the
> > second name is actually a ransomware payload, I can victimize you by
> > tricking you into opening the wrong one.
> > 
> > Firefox had a whole CVE over this in 2018:
> > https://bugzilla.mozilla.org/show_bug.cgi?id=1438025
> > 
> > xfs_scrub is (so far) the only linux filesystem fsck tool that will
> > warn
> > system administrators about this kind of thing.
> > 
> > See generic/453 and generic/454.
> > 
> > [1] https://djwong.org/docs/xfs-online-fsck-design/#id108
> > 
> hmm ok, how about:
> 
> 7. Malicious attacks may use uncommon unicode characters to create file
> names that resemble normal files, which may go undetected until the
> filesystem is scanned.

They resemble *other filenames* in the same directory, normal or
otherwise.

Note that xattrs have the same problem -- a listing of attrs will show
two names that render identically but map to different things.  There's
less double-click danger there, at least.

Another class of unicode problem is that you can use directional
controls to spoof file extensions.  The sequence:

pu<right to left>txt.pl

renders as "pulp.txt" if you're not careful, but file managers think
it's actually a perl script file!  Granted, nobody should allow
execution of random a-x downloaded scripts.

There are enough weird twists to this sort of deception that I left #7
worded as broadly as I needed.

--D

> 
> ?
> 
> > > > +
> > > > +Given this definition of the problems to be solved and the
> > > > actors
> > > > who would
> > > > +benefit, the proposed solution is a third fsck tool that acts on
> > > > a
> > > > running
> > > > +filesystem.
> > > > +
> > > > +This new third program has three components: an in-kernel
> > > > facility
> > > > to check
> > > > +metadata, an in-kernel facility to repair metadata, and a
> > > > userspace
> > > > driver
> > > > +program to drive fsck activity on a live filesystem.
> > > > +``xfs_scrub`` is the name of the driver program.
> > > > +The rest of this document presents the goals and use cases of
> > > > the
> > > > new fsck
> > > > +tool, describes its major design points in connection to those
> > > > goals, and
> > > > +discusses the similarities and differences with existing tools.
> > > > +
> > > > ++---------------------------------------------------------------
> > > > ----
> > > > -------+
> > > > +|
> > > > **Note**:                                                        
> > > >     
> > > >     |
> > > > ++---------------------------------------------------------------
> > > > ----
> > > > -------+
> > > > +| Throughout this document, the existing offline fsck tool can
> > > > also
> > > > be     |
> > > > +| referred to by its current name
> > > > "``xfs_repair``".                        |
> > > > +| The userspace driver program for the new online fsck tool can
> > > > be         |
> > > > +| referred to as
> > > > "``xfs_scrub``".                                          |
> > > > +| The kernel portion of online fsck that validates metadata is
> > > > called      |
> > > > +| "online scrub", and portion of the kernel that fixes metadata
> > > > is
> > > > called  |
> > > > +| "online
> > > > repair".                                                        
> > > > |
> > > > ++---------------------------------------------------------------
> > > > ----
> > > > -------+
> > 
> > Errr ^^^^ is Evolution doing line wrapping here?
> > 
> > > Hmm, maybe here might be a good spot to move rmap and pptrs?  It's
> > > not
> > > otherwise clear to me what "secondary metadata" is.  If that is
> > > what it
> > > is meant to refer to, I think the reader will more intuitively make
> > > the
> > > connection if those two blurbs appear in the same context.
> > 
> > Ooh, you found a significant gap-- nowhere in this chapter do I
> > actually
> > define what is primary metadata.  Or secondary metadata.
> > 
> > > > +
> > > > +Secondary metadata indices enable the reconstruction of parts of
> > > > a
> > > > damaged
> > > > +primary metadata object from secondary information.
> > > 
> > > I would take out this blurb...
> > > > +XFS filesystems shard themselves into multiple primary objects
> > > > to
> > > > enable better
> > > > +performance on highly threaded systems and to contain the blast
> > > > radius when
> > > > +problems happen.
> > > 
> > > 
> > > > +The naming hierarchy is broken up into objects known as
> > > > directories
> > > > and files;
> > > > +and the physical space is split into pieces known as allocation
> > > > groups.
> > > And add here:
> > > 
> > > "This enables better performance on highly threaded systems and
> > > helps
> > > to contain corruptions when they occur."
> > > 
> > > I think that reads cleaner
> > 
> > Ok.  Mind if I reword this slightly?  The entire paragraph now reads
> > like this:
> > 
> > "The naming hierarchy is broken up into objects known as directories
> > and
> > files and the physical space is split into pieces known as allocation
> > groups.  Sharding enables better performance on highly parallel
> > systems
> > and helps to contain the damage when corruptions occur.  The division
> > of
> > the filesystem into principal objects (allocation groups and inodes)
> > means that there are ample opportunities to perform targeted checks
> > and
> > repairs on a subset of the filesystem."
> I think that sounds cleaner
> 
> > 
> > > > +The division of the filesystem into principal objects
> > > > (allocation
> > > > groups and
> > > > +inodes) means that there are ample opportunities to perform
> > > > targeted
> > > > checks and
> > > > +repairs on a subset of the filesystem.
> > > > +While this is going on, other parts continue processing IO
> > > > requests.
> > > > +Even if a piece of filesystem metadata can only be regenerated
> > > > by
> > > > scanning the
> > > > +entire system, the scan can still be done in the background
> > > > while
> > > > other file
> > > > +operations continue.
> > > > +
> > > > +In summary, online fsck takes advantage of resource sharding and
> > > > redundant
> > > > +metadata to enable targeted checking and repair operations while
> > > > the
> > > > system
> > > > +is running.
> > > > +This capability will be coupled to automatic system management
> > > > so
> > > > that
> > > > +autonomous self-healing of XFS maximizes service availability.
> > > > 
> > > 
> > > Nits and paraphrases aside, I think this looks pretty good?
> > 
> > Woot.  Thanks for digging in! :)
> > 
> Sure, no problem!
> 
> > > Allison
> > > 
> 

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 02/14] xfs: document the general theory underlying online fsck design
  2023-01-18  0:03         ` Allison Henderson
@ 2023-01-18  2:35           ` Darrick J. Wong
  0 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2023-01-18  2:35 UTC (permalink / raw)
  To: Allison Henderson
  Cc: Catherine Hoang, david, willy, linux-xfs, Chandan Babu,
	linux-fsdevel, hch

On Wed, Jan 18, 2023 at 12:03:13AM +0000, Allison Henderson wrote:
> On Wed, 2023-01-11 at 15:39 -0800, Darrick J. Wong wrote:
> > On Wed, Jan 11, 2023 at 01:25:12AM +0000, Allison Henderson wrote:
> > > On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <djwong@kernel.org>
> > > > 
> > > > Start the second chapter of the online fsck design documentation.
> > > > This covers the general theory underlying how online fsck works.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > > ---
> > > >  .../filesystems/xfs-online-fsck-design.rst         |  366
> > > > ++++++++++++++++++++
> > > >  1 file changed, 366 insertions(+)
> > > > 
> > > > 
> > > > diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst
> > > > b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > > index 25717ebb5f80..a03a7b9f0250 100644
> > > > --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> > > > +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > > @@ -197,3 +197,369 @@ metadata to enable targeted checking and
> > > > repair
> > > > operations while the system
> > > >  is running.
> > > >  This capability will be coupled to automatic system management
> > > > so
> > > > that
> > > >  autonomous self-healing of XFS maximizes service availability.
> > > > +
> > > > +2. Theory of Operation
> > > > +======================
> > > > +
> > > > +Because it is necessary for online fsck to lock and scan live
> > > > metadata objects,
> > > > +online fsck consists of three separate code components.
> > > > +The first is the userspace driver program ``xfs_scrub``, which
> > > > is
> > > > responsible
> > > > +for identifying individual metadata items, scheduling work items
> > > > for
> > > > them,
> > > > +reacting to the outcomes appropriately, and reporting results to
> > > > the
> > > > system
> > > > +administrator.
> > > > +The second and third are in the kernel, which implements
> > > > functions
> > > > to check
> > > > +and repair each type of online fsck work item.
> > > > +
> > > > ++---------------------------------------------------------------
> > > > ---+
> > > > +|
> > > > **Note**:                                                       
> > > > |
> > > > ++---------------------------------------------------------------
> > > > ---+
> > > > +| For brevity, this document shortens the phrase "online fsck
> > > > work |
> > > > +| item" to "scrub
> > > > item".                                           |
> > > > ++---------------------------------------------------------------
> > > > ---+
> > > > +
> > > > +Scrub item types are delineated in a manner consistent with the
> > > > Unix
> > > > design
> > > > +philosophy, which is to say that each item should handle one
> > > > aspect
> > > > of a
> > > > +metadata structure, and handle it well.
> > > > +
> > > > +Scope
> > > > +-----
> > > > +
> > > > +In principle, online fsck should be able to check and to repair
> > > > everything that
> > > > +the offline fsck program can handle.
> > > > +However, the adjective *online* brings with it the limitation
> > > > that
> > > > online fsck
> > > > +cannot deal with anything that prevents the filesystem from
> > > > going on
> > > > line, i.e.
> > > > +mounting.
> > > Are there really any other operations that do that other than
> > > mount?
> > 
> > No.
> > 
> > > I think this reads cleaner:
> > > 
> > > By definition, online fsck can only check and repair an online
> > > filesystem.  It cannot check mounting operations which start from
> > > an
> > > offline state.
> > 
> > Now that I think about this some more, this whole sentence doesn't
> > make
> > sense.  xfs_scrub can *definitely* detect and fix latent errors that
> > would prevent the /next/ mount from succeeding.  It's only the fuzz
> > test
> > suite that stumbles over this, and only because xfs_db cannot fuzz
> > mounted filesystems.
> > 
> > "However, online fsck cannot be running 100% of the time, which means
> > that latent errors may creep in after a scrub completes.
> > If these errors cause the next mount to fail, offline fsck is the
> > only
> > solution."
> Sure, that sounds fair
> 
> > 
> > > > +This limitation means that maintenance of the offline fsck tool
> > > > will
> > > > continue.
> > > > +A second limitation of online fsck is that it must follow the
> > > > same
> > > > resource
> > > > +sharing and lock acquisition rules as the regular filesystem.
> > > > +This means that scrub cannot take *any* shortcuts to save time,
> > > > because doing
> > > > +so could lead to concurrency problems.
> > > > +In other words, online fsck will never be able to fix 100% of
> > > > the
> > > > +inconsistencies that offline fsck can repair, 
> > > Hmm, what inconsistencies cannot repaired as a result of the "no
> > > shortcut" rule?  I'm all for keeping things short and to the point,
> > > but
> > > since this section is about scope, I'd give it at least a brief
> > > bullet
> > > list
> > 
> > Hmm.  I can't think of any off the top of my head.  Given the
> > rewording
> > earlier, I think it's more accurate to say:
> > 
> > "In other words, online fsck is not a complete replacement for
> > offline
> > fsck, and a complete run of online fsck may take longer than offline
> > fsck."
> That makes sense
> > 
> > > > and a complete run of online fsck
> > > > +may take longer.
> > > > +However, both of these limitations are acceptable tradeoffs to
> > > > satisfy the
> > > > +different motivations of online fsck, which are to **minimize
> > > > system
> > > > downtime**
> > > > +and to **increase predictability of operation**.
> > > > +
> > > > +.. _scrubphases:
> > > > +
> > > > +Phases of Work
> > > > +--------------
> > > > +
> > > > +The userspace driver program ``xfs_scrub`` splits the work of
> > > > checking and
> > > > +repairing an entire filesystem into seven phases.
> > > > +Each phase concentrates on checking specific types of scrub
> > > > items
> > > > and depends
> > > > +on the success of all previous phases.
> > > > +The seven phases are as follows:
> > > > +
> > > > +1. Collect geometry information about the mounted filesystem and
> > > > computer,
> > > > +   discover the online fsck capabilities of the kernel, and open
> > > > the
> > > > +   underlying storage devices.
> > > > +
> > > > +2. Check allocation group metadata, all realtime volume
> > > > metadata,
> > > > and all quota
> > > > +   files.
> > > > +   Each metadata structure is scheduled as a separate scrub
> > > > item.
> > > Like an intent item?
> > 
> > No, these scrub items are struct scrub_item objects that exist solely
> > within the userspace program code.
> > 
> > > > +   If corruption is found in the inode header or inode btree and
> > > > ``xfs_scrub``
> > > > +   is permitted to perform repairs, then those scrub items are
> > > > repaired to
> > > > +   prepare for phase 3.
> > > > +   Repairs are implemented by resubmitting the scrub item to the
> > > > kernel with
> > > If I'm understanding this correctly:
> > > Repairs are implemented as intent items that are queued and
> > > committed
> > > just as any filesystem operation.
> > > 
> > > ?
> > 
> > I don't want to go too deep into this prematurely, but...
> > 
> > xfs_scrub (the userspace program) needs to track which metadata
> > objects
> > have been checked and which ones need repairs.  The current codebase
> > (ab)uses struct xfs_scrub_metadata, but it's very memory inefficient.
> > I replaced it with a new struct scrub_item that stores (a) all the
> > handle information to identify the inode/AG/rt group/whatever; and
> > (b)
> > the state of all the checks that can be applied to that item:
> > 
> > struct scrub_item {
> >         /*
> >          * Information we need to call the scrub and repair ioctls.
> >          * Per-AG items should set the ino/gen fields to -1; per-
> > inode
> >          * items should set sri_agno to -1; and per-fs items should
> > set
> >          * all three fields to -1.  Or use the macros below.
> >          */
> >         __u64                   sri_ino;
> >         __u32                   sri_gen;
> >         __u32                   sri_agno;
> > 
> >         /* Bitmask of scrub types that were scheduled here. */
> >         __u32                   sri_selected;
> > 
> >         /* Scrub item state flags, one for each XFS_SCRUB_TYPE. */
> >         __u8                    sri_state[XFS_SCRUB_TYPE_NR];
> > 
> >         /* Track scrub and repair call retries for each scrub type.
> > */
> >         __u8                    sri_tries[XFS_SCRUB_TYPE_NR];
> > 
> >         /* Were there any corruption repairs needed? */
> >         bool                    sri_inconsistent:1;
> > 
> >         /* Are we revalidating after repairs? */
> >         bool                    sri_revalidate:1;
> > };
> > 
> > The first three fields are passed to the kernel via scrub ioctl and
> > describe a particular xfs domain (files, AGs, etc).  The rest of the
> > structure store state for each type of repair that can be performed
> > against that domain.
> > 
> > IOWs, xfs_scrub uses struct scrub_item objects to generate ioctl
> > calls
> > to the kernel to check and repair things.  The kernel reads the ioctl
> > information, figures out what needs to be done, and then does the
> > usual
> > get transaction -> lock things -> make updates -> commit dance to
> > make
> > corrections to the fs.  Those corrections include log intent items,
> > but
> > there's no tight coupling between log intent items and scrub_items.
> > 
> > Side note: The kernel repair code used to use intents to rebuild a
> > structure, but nowadays it use the btree bulk loader code to replace
> > btrees wholesale and in a single atomic commit.  Now we use them
> > primariliy to free preallocated space if the repair fails.
> 
> Oh ok, well how about just:
> 
> "Repairs are implemented by resubmitting the scrub item to the
> kernel through a designated ioctl with..."
> 
> ?

How about:

"Repairs are implemented by using the information in the scrub item to
resubmit the kernel scrub call with the repair flag enabled; this is
discussed in the next section.  Optimizations and all other repairs are
deferred to phase 4."

?

> > 
> > > > +   the repair flag enabled; this is discussed in the next
> > > > section.
> > > > +   Optimizations and all other repairs are deferred to phase 4.
> > > I guess I'll come back to it. 
> > > 
> > > > +
> > > > +3. Check all metadata of every file in the filesystem.
> > > > +   Each metadata structure is also scheduled as a separate scrub
> > > > item.
> > > > +   If repairs are needed, ``xfs_scrub`` is permitted to perform
> > > > repairs,
> > > If repairs are needed and ``xfs_scrub`` is permitted
> > 
> > Fixed.
> > 
> > > ?
> > > > +   and there were no problems detected during phase 2, then
> > > > those
> > > > scrub items
> > > > +   are repaired.
> > > > +   Optimizations and unsuccessful repairs are deferred to phase
> > > > 4.
> > > > +
> > > > +4. All remaining repairs and scheduled optimizations are
> > > > performed
> > > > during this
> > > > +   phase, if the caller permits them.
> > > > +   Before starting repairs, the summary counters are checked and
> > > > any
> > > Did we talk about summary counters yet?  Maybe worth a blub.
> > > Otherwise
> > > this may not make sense with out skipping ahead or into the code
> > 
> > Nope.  I'll add that to the previous patch when I introduce primary
> > and
> > secondary metadata.  Good catch!
> > 
> > "Summary metadata, as the name implies, condense information
> > contained
> > in primary metadata for performance reasons."
> 
> Ok, sounds good then
> > 
> > > > necessary
> > > > +   repairs are performed so that subsequent repairs will not
> > > > fail
> > > > the resource
> > > > +   reservation step due to wildly incorrect summary counters.
> > > > +   Unsuccessful repairs are requeued as long as forward progress
> > > > on
> > > > repairs is
> > > > +   made somewhere in the filesystem.
> > > > +   Free space in the filesystem is trimmed at the end of phase 4
> > > > if
> > > > the
> > > > +   filesystem is clean.
> > > > +
> > > > +5. By the start of this phase, all primary and secondary
> > > > filesystem
> > > > metadata
> > > > +   must be correct.
> > > I think maybe the definitions of primary and secondary metadata
> > > should
> > > move up before the phases section.  Otherwise the reader has to
> > > skip
> > > ahead to know what that means.
> > 
> > Yep, now primary, secondary, and summary metadata are defined in
> > section
> > 1.  Very good comment.
> > 
> > > > +   Summary counters such as the free space counts and quota
> > > > resource
> > > > counts
> > > > +   are checked and corrected.
> > > > +   Directory entry names and extended attribute names are
> > > > checked
> > > > for
> > > > +   suspicious entries such as control characters or confusing
> > > > Unicode sequences
> > > > +   appearing in names.
> > > > +
> > > > +6. If the caller asks for a media scan, read all allocated and
> > > > written data
> > > > +   file extents in the filesystem.
> > > > +   The ability to use hardware-assisted data file integrity
> > > > checking
> > > > is new
> > > > +   to online fsck; neither of the previous tools have this
> > > > capability.
> > > > +   If media errors occur, they will be mapped to the owning
> > > > files
> > > > and reported.
> > > > +
> > > > +7. Re-check the summary counters and present the caller with a
> > > > summary of
> > > > +   space usage and file counts.
> > > > +
> > > > +Steps for Each Scrub Item
> > > > +-------------------------
> > > > +
> > > > +The kernel scrub code uses a three-step strategy for checking
> > > > and
> > > > repairing
> > > > +the one aspect of a metadata object represented by a scrub item:
> > > > +
> > > > +1. The scrub item of interest is checked for corruptions; opportunities for
> > > > +   optimization; and for values that are directly controlled by
> > > > the
> > > > system
> > > > +   administrator but look suspicious.
> > > > +   If the item is not corrupt or does not need optimization,
> > > > resources are
> > > > +   released and the positive scan results are returned to
> > > > userspace.
> > > > +   If the item is corrupt or could be optimized but the caller
> > > > does
> > > > not permit
> > > > +   this, resources are released and the negative scan results
> > > > are
> > > > returned to
> > > > +   userspace.
> > > > +   Otherwise, the kernel moves on to the second step.
> > > > +
> > > > +2. The repair function is called to rebuild the data structure.
> > > > +   Repair functions generally choose to rebuild a structure from
> > > > other
> > > > metadata
> > > > +   rather than try to salvage the existing structure.
> > > > +   If the repair fails, the scan results from the first step are
> > > > returned to
> > > > +   userspace.
> > > > +   Otherwise, the kernel moves on to the third step.
> > > > +
> > > > +3. In the third step, the kernel runs the same checks over the
> > > > new
> > > > metadata
> > > > +   item to assess the efficacy of the repairs.
> > > > +   The results of the reassessment are returned to userspace.
> > > > +
> > > > +Classification of Metadata
> > > > +--------------------------
> > > > +
> > > > +Each type of metadata object (and therefore each type of scrub
> > > > item)
> > > > is
> > > > +classified as follows:
> > > > +
> > > > +Primary Metadata
> > > > +````````````````
> > > > +
> > > > +Metadata structures in this category should be most familiar to
> > > > filesystem
> > > > +users either because they are directly created by the user or
> > > > they
> > > > index
> > > > +objects created by the user
> > > I think I would just jump straight into a brief list.  The above is
> > > a
> > > bit vague, and documentation that tells you you should already know
> > > what it is, doesnt add much.  Again, I think too much poetry might
> > > be
> > > why you're having a hard time getting responses.
> > 
> > Done:
> > 
> > - Free space and reference count information
> > 
> > - Inode records and indexes
> > 
> > - Storage mapping information for file data
> > 
> > - Directories
> > 
> > - Extended attributes
> > 
> > - Symbolic links
> > 
> > - Quota limits
> > 
> > - Link counts
> > 
> > 
> > > > +Most filesystem objects fall into this class.
> > > Most filesystem objects created by users fall into this class, such
> > > as
> > > inode, directories, allocation groups and so on.
> > > > +Resource and lock acquisition for scrub code follows the same
> > > > order
> > > > as regular
> > > > +filesystem accesses.
> > > 
> > > Lock acquisition for these resources will follow the same order for
> > > scrub as a regular filesystem access.
> > 
> > Yes, that is clearer.  I think I'll phrase this more actively:
> > 
> > "Scrub obeys the same rules as regular filesystem accesses for
> > resource
> > and lock acquisition."
> 
> Ok, I think that sounds fine
> > 
> > > > +
> > > > +Primary metadata objects are the simplest for scrub to process.
> > > > +The principal filesystem object (either an allocation group or
> > > > an
> > > > inode) that
> > > > +owns the item being scrubbed is locked to guard against
> > > > concurrent
> > > > updates.
> > > > +The check function examines every record associated with the
> > > > type
> > > > for obvious
> > > > +errors and cross-references healthy records against other
> > > > metadata
> > > > to look for
> > > > +inconsistencies.
> > > > +Repairs for this class of scrub item are simple, since the
> > > > repair
> > > > function
> > > > +starts by holding all the resources acquired in the previous
> > > > step.
> > > > +The repair function scans available metadata as needed to record
> > > > all
> > > > the
> > > > +observations needed to complete the structure.
> > > > +Next, it stages the observations in a new ondisk structure and
> > > > commits it
> > > > +atomically to complete the repair.
> > > > +Finally, the storage from the old data structure is carefully
> > > > reaped.
> > > > +
> > > > +Because ``xfs_scrub`` locks a primary object for the duration of
> > > > the
> > > > repair,
> > > > +this is effectively an offline repair operation performed on a
> > > > subset of the
> > > > +filesystem.
> > > > +This minimizes the complexity of the repair code because it is
> > > > not
> > > > necessary to
> > > > +handle concurrent updates from other threads, nor is it
> > > > necessary to
> > > > access
> > > > +any other part of the filesystem.
> > > > +As a result, indexed structures can be rebuilt very quickly, and
> > > > programs
> > > > +trying to access the damaged structure will be blocked until
> > > > repairs
> > > > complete.
> > > > +The only infrastructure needed by the repair code are the
> > > > staging
> > > > area for
> > > > +observations and a means to write new structures to disk.
> > > > +Despite these limitations, the advantage that online repair
> > > > holds is
> > > > clear:
> > > > +targeted work on individual shards of the filesystem avoids
> > > > total
> > > > loss of
> > > > +service.
> > > > +
> > > > +This mechanism is described in section 2.1 ("Off-Line
> > > > Algorithm") of
> > > > +V. Srinivasan and M. J. Carey, `"Performance of On-Line Index
> > > > Construction
> > > > +Algorithms" <https://dl.acm.org/doi/10.5555/645336.649870>`_,
> > > Hmm, this article is not displaying for me.  If the link is
> > > abandoned,
> > > probably there's not much need to keep it around
> > 
> > The actual paper is not directly available through that ACM link, but
> > the DOI is what I used to track down a paper copy(!) of that paper as
> > published in a journal.
> > 
> > (In turn, that journal is "Advances in Database Technology - EDBT
> > 1992";
> > I found it in the NYU library.  Amazingly, they sold it to me.)
> Oh I see.  Dave had replied in a separate thread with a pdf version. 
> That might be a better link so that people do not have to buy a paper
> copy.

Yep, updated, thanks all!

> > 
> > > > +*Extending Database Technology*, pp. 293-309, 1992.
> > > > +
> > > > +Most primary metadata repair functions stage their intermediate
> > > > results in an
> > > > +in-memory array prior to formatting the new ondisk structure,
> > > > which
> > > > is very
> > > > +similar to the list-based algorithm discussed in section 2.3
> > > > ("List-
> > > > Based
> > > > +Algorithms") of Srinivasan.
> > > > +However, any data structure builder that maintains a resource
> > > > lock
> > > > for the
> > > > +duration of the repair is *always* an offline algorithm.
> > > > +
> > > > +Secondary Metadata
> > > > +``````````````````
> > > > +
> > > > +Metadata structures in this category reflect records found in
> > > > primary metadata,
> > > 
> > > such as rmap and parent pointer attributes.  But they are only
> > > needed...
> > > 
> > > ?
> > 
> > Euugh, this section needs some restructuring to get rid of redundant
> > sentences.  How about:
> > 
> > "Metadata structures in this category reflect records found in
> > primary
> > metadata, but are only needed for online fsck or for reorganization
> > of
> > the filesystem.
> > 
> > "Secondary metadata include:
> > 
> > - Reverse mapping information
> > 
> > - Directory parent pointers
> > 
> > "This class of metadata is difficult for scrub to process because
> > scrub
> > attaches to the secondary object but needs to check primary metadata,
> > which runs counter to the usual order of resource acquisition.
> > Frequently, this means that full filesystem scans are necessary to
> > rebuild the metadata.
> > Check functions..."
> 
> Yes I think that's much clearer :-)
> 
> > 
> > > > +but are only needed for online fsck or for reorganization of the
> > > > filesystem.
> > > > +Resource and lock acquisition for scrub code do not follow the
> > > > same
> > > > order as
> > > > +regular filesystem accesses, and may involve full filesystem
> > > > scans.
> > > > +
> > > > +Secondary metadata objects are difficult for scrub to process,
> > > > because scrub
> > > > +attaches to the secondary object but needs to check primary
> > > > metadata, which
> > > > +runs counter to the usual order of resource acquisition.
> > > bummer :-(
> > 
> > Yup.
> > 
> > > > +Check functions can be limited in scope to reduce runtime.
> > > > +Repairs, however, require a full scan of primary metadata, which
> > > > can
> > > > take a
> > > > +long time to complete.
> > > > +Under these conditions, ``xfs_scrub`` cannot lock resources for
> > > > the
> > > > entire
> > > > +duration of the repair.
> > > > +
> > > > +Instead, repair functions set up an in-memory staging structure
> > > > to
> > > > store
> > > > +observations.
> > > > +Depending on the requirements of the specific repair function,
> > > > the
> > > > staging
> > > 
> > > 
> > > > +index can have the same format as the ondisk structure, or it
> > > > can
> > > > have a design
> > > > +specific to that repair function.
> > > ...will have either the same format as the ondisk structure or a
> > > structure specific to the repair function.
> > 
> > Fixed.
> > 
> > > > +The next step is to release all locks and start the filesystem
> > > > scan.
> > > > +When the repair scanner needs to record an observation, the
> > > > staging
> > > > data are
> > > > +locked long enough to apply the update.
> > > > +Simultaneously, the repair function hooks relevant parts of the
> > > > filesystem to
> > > > +apply updates to the staging data if the update pertains to
> > > > an
> > > > object that
> > > > +has already been scanned by the index builder.
> > > While a scan is in progress, function hooks are used to apply
> > > filesystem updates to both the object and the staging data if the
> > > object has already been scanned.
> > > 
> > > ?
> > 
> > The hooks are used to apply updates to the repair staging data, but
> > they
> > don't apply regular filesystem updates.
> > 
> > The usual process runs something like this:
> > 
> >   Lock -> update -> update -> commit
> > 
> > With a scan in progress, say we hook the second update.  The
> > instruction
> > flow becomes:
> > 
> >   Lock -> update -> update -> hook -> update staging data -> commit
> > 
> > Maybe something along the following would be better?
> > 
> > "While the filesystem scan is in progress, the repair function hooks
> > the
> > filesystem so that it can apply pending filesystem updates to the
> > staging information."
> Ok, that sounds clearer then
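
A bare-bones sketch of that hook decision, with made-up names standing in for
the real scan cursor, locking, and staging helpers:

  /* Illustrative only; names do not match the kernel sources. */
  struct observation;
  struct staging_index;

  void lock_staging(struct staging_index *si);
  void unlock_staging(struct staging_index *si);
  void stage_observation(struct staging_index *si,
                         const struct observation *delta);

  struct live_scan {
          unsigned long long      cursor;   /* highest object id visited so far */
          struct staging_index    *staging; /* in-memory shadow of the new structure */
  };

  /* Called from the hooked filesystem update path. */
  void scan_hook(struct live_scan *ls, unsigned long long object_id,
                 const struct observation *delta)
  {
          lock_staging(ls->staging);
          /* Only fold in updates for objects the scanner has already visited. */
          if (object_id <= ls->cursor)
                  stage_observation(ls->staging, delta);
          unlock_staging(ls->staging);
  }
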
> 
> > 
> > > > +Once the scan is done, the owning object is re-locked, the live
> > > > data
> > > > is used to
> > > > +write a new ondisk structure, and the repairs are committed
> > > > atomically.
> > > > +The hooks are disabled and the staging area is freed.
> > > > +Finally, the storage from the old data structure is carefully
> > > > reaped.
> > > > +
> > > > +Introducing concurrency helps online repair avoid various
> > > > locking
> > > > problems, but
> > > > +comes at a high cost to code complexity.
> > > > +Live filesystem code has to be hooked so that the repair
> > > > function
> > > > can observe
> > > > +updates in progress.
> > > > +The staging area has to become a fully functional parallel
> > > > structure
> > > > so that
> > > > +updates can be merged from the hooks.
> > > > +Finally, the hook, the filesystem scan, and the inode locking
> > > > model
> > > > must be
> > > > +sufficiently well integrated that a hook event can decide if a
> > > > given
> > > > update
> > > > +should be applied to the staging structure.
> > > > +
> > > > +In theory, the scrub implementation could apply these same
> > > > techniques for
> > > > +primary metadata, but doing so would make it massively more
> > > > complex
> > > > and less
> > > > +performant.
> > > > +Programs attempting to access the damaged structures are not
> > > > blocked
> > > > from
> > > > +operation, which may cause application failure or an unplanned
> > > > filesystem
> > > > +shutdown.
> > > > +
> > > > +Inspiration for the secondary metadata repair strategy was drawn
> > > > from section
> > > > +2.4 of Srinivasan above, and sections 2 ("NSF: Index Build
> > > > Without
> > > > Side-File")
> > > > +and 3.1.1 ("Duplicate Key Insert Problem") in C. Mohan,
> > > > `"Algorithms
> > > > for
> > > > +Creating Indexes for Very Large Tables Without Quiescing
> > > > Updates"
> > > > +<https://dl.acm.org/doi/10.1145/130283.130337>`_, 1992.
> > > This one works
> > > 
> > > > +
> > > > +The sidecar index mentioned above bears some resemblance to the
> > > > side
> > > > file
> > > > +method mentioned in Srinivasan and Mohan.
> > > > +Their method consists of an index builder that extracts relevant
> > > > record data to
> > > > +build the new structure as quickly as possible; and an auxiliary
> > > > structure that
> > > > +captures all updates that would be committed to the index by
> > > > other
> > > > threads were
> > > > +the new index already online.
> > > > +After the index building scan finishes, the updates recorded in
> > > > the
> > > > side file
> > > > +are applied to the new index.
> > > > +To avoid conflicts between the index builder and other writer
> > > > threads, the
> > > > +builder maintains a publicly visible cursor that tracks the
> > > > progress
> > > > of the
> > > > +scan through the record space.
> > > > +To avoid duplication of work between the side file and the index
> > > > builder, side
> > > > +file updates are elided when the record ID for the update is
> > > > greater
> > > > than the
> > > > +cursor position within the record ID space.
> > > > +
> > > > +To minimize changes to the rest of the codebase, XFS online
> > > > repair
> > > > keeps the
> > > > +replacement index hidden until it's completely ready to go.
> > > > +In other words, there is no attempt to expose the keyspace of
> > > > the
> > > > new index
> > > > +while repair is running.
> > > > +The complexity of such an approach would be very high and
> > > > perhaps
> > > > more
> > > > +appropriate to building *new* indices.
> > > > +
> > > > +**Question**: Can the full scan and live update code used to
> > > > facilitate a
> > > > +repair also be used to implement a comprehensive check?
> > > > +
> > > > +*Answer*: Probably, though this has not yet been studied.
> > > I kinda feel like discussion Q&As need to be wrapped up before we
> > > can
> > > call things done.  If this is all there was to the answer, then
> > > lets
> > > clean out the discussion notes.
> > 
> > Oh, the situation here is worse than that -- in theory, check would
> > be
> > much stronger if each scrub function employed these live scans to
> > build
> > a shadow copy of the metadata and then compared the records of both.
> > 
> > However, that pushes the amount of work each scrubber has to do
> > much
> > higher, and the runtime of those scrubbers would go up.  The other
> > issue
> > is that live scan hooks would have to proliferate through much more
> > of
> > the filesystem.  That's rather more invasive to the codebase than
> > most
> > of fsck, so I want people to look at the usage models for the handful
> > of
> > scrubbers that really require it before I spread it around elsewhere.
> > Making that kind of change isn't that difficult, but I want to merge
> > this stuff before moving on to experimenting with improvements of
> > that
> > scale.
> 
> I see, well maybe it would be appropriate to just call it a possible
> future improvement for now, depending on how the use cases go and if
> the demand for it arises.

I'll go relabel these as "Future Work Questions".  Thanks for continuing
through! :)

--D

> > 
> > > > +
> > > > +Summary Information
> > > > +```````````````````
> > > > +
> > > Oh, perhaps this section could move up with the other metadata
> > > definitions.  That way the reader already has an idea of what these
> > > terms are referring to before we get into how they are used during
> > > the
> > > phases.
> > 
> > Yeah, I think/hope this will be less of a problem now that section 1
> > defines all three types of metadata.  The start of this section now
> > reads:
> > 
> > "Metadata structures in this last category summarize the contents of
> > primary metadata records.
> > These are often used to speed up resource usage queries, and are many
> > times smaller than the primary metadata which they represent.
> > 
> > Examples of summary information include:
> > 
> > - Summary counts of free space and inodes
> > 
> > - File link counts from directories
> > 
> > - Quota resource usage counts
> > 
> > "Check and repair require full filesystem scans, but resource and
> > lock
> > acquisition follow the same paths as regular filesystem accesses."
> Sounds good, I think that will help a lot
> 
> > 
> > > > +Metadata structures in this last category summarize the contents
> > > > of
> > > > primary
> > > > +metadata records.
> > > > +These are often used to speed up resource usage queries, and are
> > > > many times
> > > > +smaller than the primary metadata which they represent.
> > > > +Check and repair both require full filesystem scans, but
> > > > resource
> > > > and lock
> > > > +acquisition follow the same paths as regular filesystem
> > > > accesses.
> > > > +
> > > > +The superblock summary counters have special requirements due to
> > > > the
> > > > underlying
> > > > +implementation of the incore counters, and will be treated
> > > > separately.
> > > > +Check and repair of the other types of summary counters (quota
> > > > resource counts
> > > > +and file link counts) employ the same filesystem scanning and
> > > > hooking
> > > > +techniques as outlined above, but because the underlying data
> > > > are
> > > > sets of
> > > > +integer counters, the staging data need not be a fully
> > > > functional
> > > > mirror of the
> > > > +ondisk structure.
> > > > +
> > > > +Inspiration for quota and file link count repair strategies were
> > > > drawn from
> > > > +sections 2.12 ("Online Index Operations") through 2.14
> > > > ("Incremental
> > > > View
> > > > +Maintenace") of G.  Graefe, `"Concurrent Queries and Updates in
> > > > Summary Views
> > > > +and Their Indexes"
> > > > +<
> > > > http://www.odbms.org/wp-content/uploads/2014/06/Increment-locks.pdf
> > > > >`
> > > > _, 2011.
> > > I wonder if these citations would do better as foot notes?  Just to
> > > kinda keep the body of the document tidy and flowing well.
> > 
> > Yes, if this were a paginated document.
> > 
> > > > +
> > > > +Since quotas are non-negative integer counts of resource usage,
> > > > online
> > > > +quotacheck can use the incremental view deltas described in
> > > > section
> > > > 2.14 to
> > > > +track pending changes to the block and inode usage counts in
> > > > each
> > > > transaction,
> > > > +and commit those changes to a dquot side file when the
> > > > transaction
> > > > commits.
> > > > +Delta tracking is necessary for dquots because the index builder
> > > > scans inodes,
> > > > +whereas the data structure being rebuilt is an index of dquots.
> > > > +Link count checking combines the view deltas and commit step
> > > > into
> > > > one because
> > > > +it sets attributes of the objects being scanned instead of
> > > > writing
> > > > them to a
> > > > +separate data structure.
> > > > +Each online fsck function will be discussed as case studies
> > > > later in
> > > > this
> > > > +document.
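
To give a flavor of the delta idea before those case studies, here is a rough
sketch; the types and names below are invented for illustration and do not
match the real dquot or transaction structures:

  /* Illustrative only; the real dquot and transaction types differ. */
  struct shadow_dqfile;

  struct dqtrx_delta {
          unsigned int    dq_id;          /* which dquot this delta applies to */
          long long       bcount_delta;   /* blocks added or freed in this transaction */
          long long       icount_delta;   /* inodes added or freed in this transaction */
  };

  struct shadow_dquot {
          long long       bcount;         /* observed block usage */
          long long       icount;         /* observed inode usage */
  };

  struct shadow_dquot *shadow_dquot_get(struct shadow_dqfile *sidecar,
                                        unsigned int id);

  /* Hooked into transaction commit while the quotacheck inode scan runs. */
  void quotacheck_commit_hook(struct shadow_dqfile *sidecar,
                              const struct dqtrx_delta *delta)
  {
          struct shadow_dquot *sd = shadow_dquot_get(sidecar, delta->dq_id);

          sd->bcount += delta->bcount_delta;
          sd->icount += delta->icount_delta;
  }
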
> > > > +
> > > > +Risk Management
> > > > +---------------
> > > > +
> > > > +During the development of online fsck, several risk factors were
> > > > identified
> > > > +that may make the feature unsuitable for certain distributors
> > > > and
> > > > users.
> > > > +Steps can be taken to mitigate or eliminate those risks, though
> > > > at a
> > > > cost to
> > > > +functionality.
> > > > +
> > > > +- **Decreased performance**: Adding metadata indices to the
> > > > filesystem
> > > > +  increases the time cost of persisting changes to disk, and the
> > > > reverse space
> > > > +  mapping and directory parent pointers are no exception.
> > > > +  System administrators who require the maximum performance can
> > > > disable the
> > > > +  reverse mapping features at format time, though this choice
> > > > dramatically
> > > > +  reduces the ability of online fsck to find inconsistencies and
> > > > repair them.
> > > > +
> > > > +- **Incorrect repairs**: As with all software, there might be
> > > > defects in the
> > > > +  software that result in incorrect repairs being written to the
> > > > filesystem.
> > > > +  Systematic fuzz testing (detailed in the next section) is
> > > > employed
> > > > by the
> > > > +  authors to find bugs early, but it might not catch everything.
> > > > +  The kernel build system provides Kconfig options
> > > > (``CONFIG_XFS_ONLINE_SCRUB``
> > > > +  and ``CONFIG_XFS_ONLINE_REPAIR``) to enable distributors to
> > > > choose
> > > > not to
> > > > +  accept this risk.
> > > > +  The xfsprogs build system has a configure option (``--enable-
> > > > scrub=no``) that
> > > > +  disables building of the ``xfs_scrub`` binary, though this is
> > > > not
> > > > a risk
> > > > +  mitigation if the kernel functionality remains enabled.
> > > > +
> > > > +- **Inability to repair**: Sometimes, a filesystem is too badly
> > > > damaged to be
> > > > +  repairable.
> > > > +  If the keyspaces of several metadata indices overlap in some
> > > > manner but a
> > > > +  coherent narrative cannot be formed from records collected,
> > > > then
> > > > the repair
> > > > +  fails.
> > > > +  To reduce the chance that a repair will fail with a dirty
> > > > transaction and
> > > > +  render the filesystem unusable, the online repair functions
> > > > have
> > > > been
> > > > +  designed to stage and validate all new records before
> > > > committing
> > > > the new
> > > > +  structure.
> > > > +
> > > > +- **Misbehavior**: Online fsck requires many privileges -- raw
> > > > IO to
> > > > block
> > > > +  devices, opening files by handle, ignoring Unix discretionary
> > > > access control,
> > > > +  and the ability to perform administrative changes.
> > > > +  Running this automatically in the background scares people, so
> > > > the
> > > > systemd
> > > > +  background service is configured to run with only the
> > > > privileges
> > > > required.
> > > > +  Obviously, this cannot address certain problems like the
> > > > kernel
> > > > crashing or
> > > > +  deadlocking, but it should be sufficient to prevent the scrub
> > > > process from
> > > > +  escaping and reconfiguring the system.
> > > > +  The cron job does not have this protection.
> > > > +
> > > 
> > > I think the fuzz part is one I would consider letting go.  All
> > > features
> > > need to go through a period of stabilizing, and we cant really
> > > control
> > > how some people respond to it, so I don't think this part adds
> > > much.  I
> > > think the document would do well to be trimmed where it can so as
> > > to
> > > stay more focused 
> > 
> > It took me a minute to realize that this comment applies to the text
> > below it.  Right?
> Yes, sorry for confusion :-)
> 
> > 
> > > > +- **Fuzz Kiddiez**: There are many people now who seem to think
> > > > that
> > > > running
> > > > +  automated fuzz testing of ondisk artifacts to find mischievous
> > > > behavior and
> > > > +  spraying exploit code onto the public mailing list for instant
> > > > zero-day
> > > > +  disclosure is somehow of some social benefit.
> > 
> > I want to keep this bit because it keeps happening[2].  Some folks
> > (huawei/alibaba?) have started to try to fix the bugs that their
> > robots
> > find, and kudos to them!
> > 
> > You might have noticed that Googlers turned their firehose back on
> > and
> > once again aren't doing anything to fix the problems they find.  How
> > very Googley of them.
> > 
> > [2] https://lwn.net/Articles/904293/
> 
> Alrighty then
> > 
> > > > +  In the view of this author, the benefit is realized only when
> > > > the
> > > > fuzz
> > > > +  operators help to **fix** the flaws, but this opinion
> > > > apparently
> > > > is not
> > > > +  widely shared among security "researchers".
> > > > +  The XFS maintainers' continuing ability to manage these events
> > > > presents an
> > > > +  ongoing risk to the stability of the development process.
> > > > +  Automated testing should front-load some of the risk while the
> > > > feature is
> > > > +  considered EXPERIMENTAL.
> > > > +
> > > > +Many of these risks are inherent to software programming.
> > > > +Despite this, it is hoped that this new functionality will prove
> > > > useful in
> > > > +reducing unexpected downtime.
> > > > 
> > > 
> > > Paraphrasing and reorganizing suggestions aside, I think it looks
> > > pretty good
> > 
> > Ok, thank you!
> > 
> > --D
> > 
> > > Allison
> 

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 03/14] xfs: document the testing plan for online fsck
  2023-01-18  0:03     ` Allison Henderson
@ 2023-01-18  2:38       ` Darrick J. Wong
  0 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2023-01-18  2:38 UTC (permalink / raw)
  To: Allison Henderson
  Cc: Catherine Hoang, david, willy, linux-xfs, Chandan Babu,
	linux-fsdevel, hch

On Wed, Jan 18, 2023 at 12:03:17AM +0000, Allison Henderson wrote:
> On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Start the third chapter of the online fsck design documentation. 
> > This
> > covers the testing plan to make sure that both online and offline
> > fsck
> > can detect arbitrary problems and correct them without making things
> > worse.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  .../filesystems/xfs-online-fsck-design.rst         |  187
> > ++++++++++++++++++++
> >  1 file changed, 187 insertions(+)
> > 
> > 
> > diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst
> > b/Documentation/filesystems/xfs-online-fsck-design.rst
> > index a03a7b9f0250..d630b6bdbe4a 100644
> > --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> > +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> > @@ -563,3 +563,190 @@ functionality.
> >  Many of these risks are inherent to software programming.
> >  Despite this, it is hoped that this new functionality will prove
> > useful in
> >  reducing unexpected downtime.
> > +
> > +3. Testing Plan
> > +===============
> > +
> > +As stated before, fsck tools have three main goals:
> > +
> > +1. Detect inconsistencies in the metadata;
> > +
> > +2. Eliminate those inconsistencies; and
> > +
> > +3. Minimize further loss of data.
> > +
> > +Demonstrations of correct operation are necessary to build users'
> > confidence
> > +that the software behaves within expectations.
> > +Unfortunately, it was not really feasible to perform regular
> > exhaustive testing
> > +of every aspect of a fsck tool until the introduction of low-cost
> > virtual
> > +machines with high-IOPS storage.
> > +With ample hardware availability in mind, the testing strategy for
> > the online
> > +fsck project involves differential analysis against the existing
> > fsck tools and
> > +systematic testing of every attribute of every type of metadata
> > object.
> > +Testing can be split into four major categories, as discussed below.
> > +
> > +Integrated Testing with fstests
> > +-------------------------------
> > +
> > +The primary goal of any free software QA effort is to make testing
> > as
> > +inexpensive and widespread as possible to maximize the scaling
> > advantages of
> > +community.
> > +In other words, testing should maximize the breadth of filesystem
> > configuration
> > +scenarios and hardware setups.
> > +This improves code quality by enabling the authors of online fsck to
> > find and
> > +fix bugs early, and helps developers of new features to find
> > integration
> > +issues earlier in their development effort.
> > +
> > +The Linux filesystem community shares a common QA testing suite,
> > +`fstests
> > <https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/>`_, for
> > +functional and regression testing.
> > +Even before development work began on online fsck, fstests (when run
> > on XFS)
> > +would run both the ``xfs_check`` and ``xfs_repair -n`` commands on
> > the test and
> > +scratch filesystems between each test.
> > +This provides a level of assurance that the kernel and the fsck
> > tools stay in
> > +alignment about what constitutes consistent metadata.
> > +During development of the online checking code, fstests was modified
> > to run
> > +``xfs_scrub -n`` between each test to ensure that the new checking
> > code
> > +produces the same results as the two existing fsck tools.
> > +
> > +To start development of online repair, fstests was modified to run
> > +``xfs_repair`` to rebuild the filesystem's metadata indices between
> > tests.
> > +This ensures that offline repair does not crash, leave a corrupt
> > filesystem
> > +after it exits, or trigger complaints from the online check.
> > +This also established a baseline for what can and cannot be repaired
> > offline.
> > +To complete the first phase of development of online repair, fstests
> > was
> > +modified to be able to run ``xfs_scrub`` in a "force rebuild" mode.
> > +This enables a comparison of the effectiveness of online repair as
> > compared to
> > +the existing offline repair tools.
> > +
> > +General Fuzz Testing of Metadata Blocks
> > +---------------------------------------
> > +
> > +XFS benefits greatly from having a very robust debugging tool,
> > ``xfs_db``.
> > +
> > +Before development of online fsck even began, a set of fstests were
> > created
> > +to test the rather common fault that entire metadata blocks get
> > corrupted.
> > +This required the creation of fstests library code that can create a
> > filesystem
> > +containing every possible type of metadata object.
> > +Next, individual test cases were created to create a test
> > filesystem, identify
> > +a single block of a specific type of metadata object, trash it with
> > the
> > +existing ``blocktrash`` command in ``xfs_db``, and test the reaction
> > of a
> > +particular metadata validation strategy.
> > +
> > +This earlier test suite enabled XFS developers to test the ability
> > of the
> > +in-kernel validation functions and the ability of the offline fsck
> > tool to
> > +detect and eliminate the inconsistent metadata.
> > +This part of the test suite was extended to cover online fsck in
> > exactly the
> > +same manner.
> > +
> > +In other words, for a given fstests filesystem configuration:
> > +
> > +* For each metadata object existing on the filesystem:
> > +
> > +  * Write garbage to it
> > +
> > +  * Test the reactions of:
> > +
> > +    1. The kernel verifiers to stop obviously bad metadata
> > +    2. Offline repair (``xfs_repair``) to detect and fix
> > +    3. Online repair (``xfs_scrub``) to detect and fix
> > +
> > +Targeted Fuzz Testing of Metadata Records
> > +-----------------------------------------
> > +
> > +A quick conversation with the other XFS developers revealed that the
> > existing
> > +test infrastructure could be extended to provide 
> 
> "The testing plan for ofsck includes extending the existing test 
> infrastructure to provide..."
> 
> Took me a moment to notice we're not talking about history any more....

Ah.  Sorry about that.  The sentence now reads:

"The testing plan for online fsck includes extending the existing fs
testing infrastructure to provide a much more powerful facility:
targeted fuzz testing of every metadata field of every metadata object
in the filesystem."

> > a much more powerful
> > +facility: targeted fuzz testing of every metadata field of every
> > metadata
> > +object in the filesystem.
> > +``xfs_db`` can modify every field of every metadata structure in
> > every
> > +block in the filesystem to simulate the effects of memory corruption
> > and
> > +software bugs.
> > +Given that fstests already contains the ability to create a
> > filesystem
> > +containing every metadata format known to the filesystem, ``xfs_db``
> > can be
> > +used to perform exhaustive fuzz testing!
> > +
> > +For a given fstests filesystem configuration:
> > +
> > +* For each metadata object existing on the filesystem...
> > +
> > +  * For each record inside that metadata object...
> > +
> > +    * For each field inside that record...
> > +
> > +      * For each conceivable type of transformation that can be
> > applied to a bit field...
> > +
> > +        1. Clear all bits
> > +        2. Set all bits
> > +        3. Toggle the most significant bit
> > +        4. Toggle the middle bit
> > +        5. Toggle the least significant bit
> > +        6. Add a small quantity
> > +        7. Subtract a small quantity
> > +        8. Randomize the contents
> > +
> > +        * ...test the reactions of:
> > +
> > +          1. The kernel verifiers to stop obviously bad metadata
> > +          2. Offline checking (``xfs_repair -n``)
> > +          3. Offline repair (``xfs_repair``)
> > +          4. Online checking (``xfs_scrub -n``)
> > +          5. Online repair (``xfs_scrub``)
> > +          6. Both repair tools (``xfs_scrub`` and then
> > ``xfs_repair`` if online repair doesn't succeed)
> I like the indented bullet list format tho

Thanks!  I'm pleased that ... whatever renders this stuff ... actually
supports nested lists.

> > +
> > +This is quite the combinatoric explosion!
> > +
> > +Fortunately, having this much test coverage makes it easy for XFS
> > developers to
> > +check the responses of XFS' fsck tools.
> > +Since the introduction of the fuzz testing framework, these tests
> > have been
> > +used to discover incorrect repair code and missing functionality for
> > entire
> > +classes of metadata objects in ``xfs_repair``.
> > +The enhanced testing was used to finalize the deprecation of
> > ``xfs_check`` by
> > +confirming that ``xfs_repair`` could detect at least as many
> > corruptions as
> > +the older tool.
> > +
> > +These tests have been very valuable for ``xfs_scrub`` in the same
> > ways -- they
> > +allow the online fsck developers to compare online fsck against
> > offline fsck,
> > +and they enable XFS developers to find deficiencies in the code
> > base.
> > +
> > +Proposed patchsets include
> > +`general fuzzer improvements
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.g
> > it/log/?h=fuzzer-improvements>`_,
> > +`fuzzing baselines
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.g
> > it/log/?h=fuzz-baseline>`_,
> > +and `improvements in fuzz testing comprehensiveness
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.g
> > it/log/?h=more-fuzz-testing>`_.
> > +
> > +Stress Testing
> > +--------------
> > +
> > +A unique requirement to online fsck is the ability to operate on a
> > filesystem
> > +concurrently with regular workloads.
> > +Although it is of course impossible to run ``xfs_scrub`` with *zero*
> > observable
> > +impact on the running system, the online repair code should never
> > introduce
> > +inconsistencies into the filesystem metadata, and regular workloads
> > should
> > +never notice resource starvation.
> > +To verify that these conditions are being met, fstests has been
> > enhanced in
> > +the following ways:
> > +
> > +* For each scrub item type, create a test to exercise checking that
> > item type
> > +  while running ``fsstress``.
> > +* For each scrub item type, create a test to exercise repairing that
> > item type
> > +  while running ``fsstress``.
> > +* Race ``fsstress`` and ``xfs_scrub -n`` to ensure that checking the
> > whole
> > +  filesystem doesn't cause problems.
> > +* Race ``fsstress`` and ``xfs_scrub`` in force-rebuild mode to
> > ensure that
> > +  force-repairing the whole filesystem doesn't cause problems.
> > +* Race ``xfs_scrub`` in check and force-repair mode against
> > ``fsstress`` while
> > +  freezing and thawing the filesystem.
> > +* Race ``xfs_scrub`` in check and force-repair mode against
> > ``fsstress`` while
> > +  remounting the filesystem read-only and read-write.
> > +* The same, but running ``fsx`` instead of ``fsstress``.  (Not done
> > yet?)
> > +
> > +Success is defined by the ability to run all of these tests without
> > observing
> > +any unexpected filesystem shutdowns due to corrupted metadata,
> > kernel hang
> > +check warnings, or any other sort of mischief.
> 
> Seems reasonable.  Other than the one nit, I think this section reads
> pretty well.
> Reviewed-by: Allison Henderson <allison.henderson@oracle.com>

Woo!

--D

> Allison
> > +
> > +Proposed patchsets include `general stress testing
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.g
> > it/log/?h=race-scrub-and-mount-state-changes>`_
> > +and the `evolution of existing per-function stress testing
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.g
> > it/log/?h=refactor-scrub-stress>`_.
> > 
> 

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 04/14] xfs: document the user interface for online fsck
  2023-01-18  0:03     ` Allison Henderson
@ 2023-01-18  2:42       ` Darrick J. Wong
  0 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2023-01-18  2:42 UTC (permalink / raw)
  To: Allison Henderson
  Cc: Catherine Hoang, david, willy, linux-xfs, Chandan Babu,
	linux-fsdevel, hch

On Wed, Jan 18, 2023 at 12:03:29AM +0000, Allison Henderson wrote:
> On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Start the fourth chapter of the online fsck design documentation,
> > which
> > discusses the user interface and the background scrubbing service.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  .../filesystems/xfs-online-fsck-design.rst         |  114
> > ++++++++++++++++++++
> >  1 file changed, 114 insertions(+)
> > 
> > 
> > diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst
> > b/Documentation/filesystems/xfs-online-fsck-design.rst
> > index d630b6bdbe4a..42e82971e036 100644
> > --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> > +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> > @@ -750,3 +750,117 @@ Proposed patchsets include `general stress
> > testing
> >  <
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.g
> > it/log/?h=race-scrub-and-mount-state-changes>`_
> >  and the `evolution of existing per-function stress testing
> >  <
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.g
> > it/log/?h=refactor-scrub-stress>`_.
> > +
> > +4. User Interface
> > +=================
> > +
> > +The primary user of online fsck is the system administrator, just
> > like offline
> > +repair.
> > +Online fsck presents two modes of operation to administrators:
> > +A foreground CLI process for online fsck on demand, and a background
> > service
> > +that performs autonomous checking and repair.
> > +
> > +Checking on Demand
> > +------------------
> > +
> > +For administrators who want the absolute freshest information about
> > the
> > +metadata in a filesystem, ``xfs_scrub`` can be run as a foreground
> > process on
> > +a command line.
> > +The program checks every piece of metadata in the filesystem while
> > the
> > +administrator waits for the results to be reported, just like the
> > existing
> > +``xfs_repair`` tool.
> > +Both tools share a ``-n`` option to perform a read-only scan, and a
> > ``-v``
> > +option to increase the verbosity of the information reported.
> > +
> > +A new feature of ``xfs_scrub`` is the ``-x`` option, which employs
> > the error
> > +correction capabilities of the hardware to check data file contents.
> > +The media scan is not enabled by default because it may dramatically
> > increase
> > +program runtime and consume a lot of bandwidth on older storage
> > hardware.
> > +
> > +The output of a foreground invocation is captured in the system log.
> > +
> > +The ``xfs_scrub_all`` program walks the list of mounted filesystems
> > and
> > +initiates ``xfs_scrub`` for each of them in parallel.
> > +It serializes scans for any filesystems that resolve to the same top
> > level
> > +kernel block device to prevent resource overconsumption.
> > +
> > +Background Service
> > +------------------
> > +
> I'm assuming the below systemd services are configurable right?

Yes, through the standard systemd overriddes.

> > +To reduce the workload of system administrators, the ``xfs_scrub``
> > package
> > +provides a suite of `systemd <https://systemd.io/>`_ timers and
> > services that
> > +run online fsck automatically on weekends.
> by default.

Fixed.

> > +The background service configures scrub to run with as little
> > privilege as
> > +possible, the lowest CPU and IO priority, and in a CPU-constrained
> > single
> > +threaded mode.
> "This can be tuned at anytime to best suit the needs of the customer
> workload."

Fixed.

> Then I think you can drop the below line...
> > +It is hoped that this minimizes the amount of load generated on the
> > system and
> > +avoids starving regular workloads.

Done.

> > +The output of the background service is also captured in the system
> > log.
> > +If desired, reports of failures (either due to inconsistencies or
> > mere runtime
> > +errors) can be emailed automatically by setting the ``EMAIL_ADDR``
> > environment
> > +variable in the following service files:
> > +
> > +* ``xfs_scrub_fail@.service``
> > +* ``xfs_scrub_media_fail@.service``
> > +* ``xfs_scrub_all_fail.service``
> > +
> > +The decision to enable the background scan is left to the system
> > administrator.
> > +This can be done by enabling either of the following services:
> > +
> > +* ``xfs_scrub_all.timer`` on systemd systems
> > +* ``xfs_scrub_all.cron`` on non-systemd systems
> > +
> > +This automatic weekly scan is configured out of the box to perform
> > an
> > +additional media scan of all file data once per month.
> > +This is less foolproof than, say, storing file data block checksums,
> > but much
> > +more performant if application software provides its own integrity
> > checking,
> > +redundancy can be provided elsewhere above the filesystem, or the
> > storage
> > +device's integrity guarantees are deemed sufficient.
> > +
> > +The systemd unit file definitions have been subjected to a security
> > audit
> > +(as of systemd 249) to ensure that the xfs_scrub processes have as
> > little
> > +access to the rest of the system as possible.
> > +This was performed via ``systemd-analyze security``, after which
> > privileges
> > +were restricted to the minimum required, sandboxing was set up to
> > the maximal
> > +extent possible with system call filtering; and
> > access to the
> > +filesystem tree was restricted to the minimum needed to start the
> > program and
> > +access the filesystem being scanned.
> > +The service definition files restrict CPU usage to 80% of one CPU
> > core, and
> > +apply as nice of a priority to IO and CPU scheduling as possible.
> > +This measure was taken to minimize delays in the rest of the
> > filesystem.
> > +No such hardening has been performed for the cron job.
> > +
> > +Proposed patchset:
> > +`Enabling the xfs_scrub background service
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.g
> > it/log/?h=scrub-media-scan-service>`_.
> > +
> > +Health Reporting
> > +----------------
> > +
> > +XFS caches a summary of each filesystem's health status in memory.
> > +The information is updated whenever ``xfs_scrub`` is run, or
> > whenever
> > +inconsistencies are detected in the filesystem metadata during
> > regular
> > +operations.
> > +System administrators should use the ``health`` command of
> > ``xfs_spaceman`` to
> > +download this information into a human-readable format.
> > +If problems have been observed, the administrator can schedule a
> > reduced
> > +service window to run the online repair tool to correct the problem.
> > +Failing that, the administrator can decide to schedule a maintenance
> > window to
> > +run the traditional offline repair tool to correct the problem.
> > +
> > +**Question**: Should the health reporting integrate with the new
> > inotify fs
> > +error notification system?
> > +
> > +**Question**: Would it be helpful for sysadmins to have a daemon to
> > listen for
> > +corruption notifications and initiate a repair?
> > +
> > +*Answer*: These questions remain unanswered, but should be a part of
> > the
> > +conversation with early adopters and potential downstream users of
> > XFS.
> I think if there's been no commentary at this point then likely they
> can't be answered at this time.  Perhaps for now it is reasonable to
> just let the be a potential improvement in the future if the demand for
> it arises. In any case, I think we should probably clean out the Q&A
> discussion prompts.

I'll change them to "future work Q's" so I don't forget to pursue them
after part 1 is merged.

> Rest looks good tho

:-D  Thanks!

--D

> Allison
> 
> > +
> > +Proposed patchsets include
> > +`wiring up health reports to correction returns
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > log/?h=corruption-health-reports>`_
> > +and
> > +`preservation of sickness info during memory reclaim
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > log/?h=indirect-health-reporting>`_.
> > 
> 

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 05/14] xfs: document the filesystem metadata checking strategy
  2022-12-30 22:10   ` [PATCH 05/14] xfs: document the filesystem metadata checking strategy Darrick J. Wong
@ 2023-01-21  1:38     ` Allison Henderson
  2023-02-02 19:04       ` Darrick J. Wong
  0 siblings, 1 reply; 220+ messages in thread
From: Allison Henderson @ 2023-01-21  1:38 UTC (permalink / raw)
  To: djwong
  Cc: Catherine Hoang, david, willy, linux-xfs, Chandan Babu,
	linux-fsdevel, hch

On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Begin the fifth chapter of the online fsck design documentation,
> where
> we discuss the details of the data structures and algorithms used by
> the
> kernel to examine filesystem metadata and cross-reference it around
> the
> filesystem.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  .../filesystems/xfs-online-fsck-design.rst         |  579
> ++++++++++++++++++++
>  .../filesystems/xfs-self-describing-metadata.rst   |    1 
>  2 files changed, 580 insertions(+)
> 
> 
> diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst
> b/Documentation/filesystems/xfs-online-fsck-design.rst
> index 42e82971e036..f45bf97fa9c4 100644
> --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> @@ -864,3 +864,582 @@ Proposed patchsets include
>  and
>  `preservation of sickness info during memory reclaim
>  <
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> log/?h=indirect-health-reporting>`_.
> +
> +5. Kernel Algorithms and Data Structures
> +========================================
> +
> +This section discusses the key algorithms and data structures of the
> kernel
> +code that provide the ability to check and repair metadata while the
> system
> +is running.
> +The first chapters in this section reveal the pieces that provide
> the
> +foundation for checking metadata.
> +The remainder of this section presents the mechanisms through which
> XFS
> +regenerates itself.
> +
> +Self Describing Metadata
> +------------------------
> +
> +Starting with XFS version 5 in 2012, XFS updated the format of
> nearly every
> +ondisk block header to record a magic number, a checksum, a
> universally
> +"unique" identifier (UUID), an owner code, the ondisk address of the
> block,
> +and a log sequence number.
> +When loading a block buffer from disk, the magic number, UUID,
> owner, and
> +ondisk address confirm that the retrieved block matches the specific
> owner of
> +the current filesystem, and that the information contained in the
> block is
> +supposed to be found at the ondisk address.
> +The first three components enable checking tools to disregard
> alleged metadata
> +that doesn't belong to the filesystem, and the fourth component
> enables the
> +filesystem to detect lost writes.
Add...

"When ever a file system operation modifies a block, the change is
submitted to the journal as a transaction.  The journal then processes
these transactions marking them done once they are safely committed to
the disk"

At this point we haven't talked much at all about transactions or logs,
and we've just barely begun to cover blocks.  I think you at least want
a quick blip to describe the relation of these two things, or it may
not be clear why we suddenly jumped into logs.

> +
> +The logging code maintains the checksum and the log sequence number
> of the last
> +transactional update.
> +Checksums are useful for detecting torn writes and other mischief
"Checksums (or crc's) are useful for detecting incomplete or torn
writes as well as other discrepancies..."


> between the
> +computer and its storage devices.
> +Sequence number tracking enables log recovery to avoid applying out
> of date
> +log updates to the filesystem.
> +
> +These two features improve overall runtime resiliency by providing a
> means for
> +the filesystem to detect obvious corruption when reading metadata
> blocks from
> +disk, but these buffer verifiers cannot provide any consistency
> checking
> +between metadata structures.
> +
> +For more information, please see the documentation for
> +Documentation/filesystems/xfs-self-describing-metadata.rst
> +
> +Reverse Mapping
> +---------------
> +
> +The original design of XFS (circa 1993) is an improvement upon 1980s
> Unix
> +filesystem design.
> +In those days, storage density was expensive, CPU time was scarce,
> and
> +excessive seek time could kill performance.
> +For performance reasons, filesystem authors were reluctant to add
> redundancy to
> +the filesystem, even at the cost of data integrity.
> +Filesystem designers in the early 21st century chose different
> strategies to
> +increase internal redundancy -- either storing nearly identical
> copies of
> +metadata, or more space-efficient techniques such as erasure coding.
"such as erasure coding which may encode sections of the data with
redundant symbols and in more than one location"

That ties it into the next line.  If you go on to talk about a term you
have not previously defined, i think you want to either define it
quickly or just drop it all together.  Right now your goal is to just
give the reader context, so you want it to move quickly.

> +Obvious corruptions are typically repaired by copying replicas or
> +reconstructing from codes.
> +
I think I would have just jumped straight from xfs history to modern
xfs...
> +For XFS, a different redundancy strategy was chosen to modernize the
> design:
> +a secondary space usage index that maps allocated disk extents back
> to their
> +owners.
> +By adding a new index, the filesystem retains most of its ability to
> scale
> +well to heavily threaded workloads involving large datasets, since
> the primary
> +file metadata (the directory tree, the file block map, and the
> allocation
> +groups) remain unchanged.
> 

> +Although the reverse-mapping feature increases overhead costs for
> space
> +mapping activities just like any other system that improves
> redundancy, it
"Like any system that improves redundancy, the reverse-mapping feature
increases overhead costs for space mapping activities. However, it..."

> +has two critical advantages: first, the reverse index is key to
> enabling online
> +fsck and other requested functionality such as filesystem
> reorganization,
> +better media failure reporting, and shrinking.
> +Second, the different ondisk storage format of the reverse mapping
> btree
> +defeats device-level deduplication, because the filesystem requires
> real
> +redundancy.
> +
> +A criticism of adding the secondary index is that it does nothing to
> improve
> +the robustness of user data storage itself.
> +This is a valid point, but adding a new index for file data block
> checksums
> +increases write amplification and turns data overwrites into copy-
> writes, which
> +age the filesystem prematurely.
> +In keeping with thirty years of precedent, users who want file data
> integrity
> +can supply as powerful a solution as they require.
> +As for metadata, the complexity of adding a new secondary index of
> space usage
> +is much less than adding volume management and storage device
> mirroring to XFS
> +itself.
> +Perfection of RAID and volume management are best left to existing
> layers in
> +the kernel.
I think I would cull the entire above paragraph.  rmap, crc and raid
all have very different points of redundancy, so criticism that an
apple is not an orange or vice versa just feels like a shortsighted
comparison that's probably more of a distraction than anything.

Sometimes it feels like this document kinda gets off into tangents
like it's preemptively trying to position itself for an argument
that hasn't happened yet.  But I think it has the effect of pulling the
readers attention off topic into an argument they never thought to
consider in the first place.  The topic of this section is to explain
what rmap is.  So lets stay on topic and finish laying out that ground
work first before getting into how it compares to other solutions

> +
> +The information captured in a reverse space mapping record is as
> follows:
> +
> +.. code-block:: c
> +
> +       struct xfs_rmap_irec {
> +           xfs_agblock_t    rm_startblock;   /* extent start block
> */
> +           xfs_extlen_t     rm_blockcount;   /* extent length */
> +           uint64_t         rm_owner;        /* extent owner */
> +           uint64_t         rm_offset;       /* offset within the
> owner */
> +           unsigned int     rm_flags;        /* state flags */
> +       };
> +
> +The first two fields capture the location and size of the physical
> space,
> +in units of filesystem blocks.
> +The owner field tells scrub which metadata structure or file inode
> has been
> +assigned this space.
> +For space allocated to files, the offset field tells scrub where the
> space was
> +mapped within the file fork.
> +Finally, the flags field provides extra information about the space
> usage --
> +is this an attribute fork extent?  A file mapping btree extent?  Or
> an
> +unwritten data extent?
> +
> +Online filesystem checking judges the consistency of each primary
> metadata
> +record by comparing its information against all other space indices.
> +The reverse mapping index plays a key role in the consistency
> checking process
> +because it contains a centralized alternate copy of all space
> allocation
> +information.
> +Program runtime and ease of resource acquisition are the only real
> limits to
> +what online checking can consult.
> +For example, a file data extent mapping can be checked against:
> +
> +* The absence of an entry in the free space information.
> +* The absence of an entry in the inode index.
> +* The absence of an entry in the reference count data if the file is
> not
> +  marked as having shared extents.
> +* The correspondence of an entry in the reverse mapping information.
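
As a quick illustration of what the cross-references above amount to in code
(all helper names below are made up, not the actual kernel functions), checking
a single file data mapping might look roughly like:

  /* Illustrative only; the real cross-reference helpers have other names. */
  #include <stdbool.h>

  struct extent {
          unsigned long long      startblock;
          unsigned long long      blockcount;
  };

  bool free_space_contains(const struct extent *ext);
  bool inode_chunks_contain(const struct extent *ext);
  bool refcount_index_has(const struct extent *ext);
  bool rmap_has_owner(const struct extent *ext, unsigned long long owner,
                      unsigned long long offset);

  /* Cross-reference one file data mapping against the other space indices. */
  bool xref_file_mapping(const struct extent *ext, unsigned long long owner,
                         unsigned long long offset, bool shared)
  {
          if (free_space_contains(ext))
                  return false;   /* allocated space must not be marked free */
          if (inode_chunks_contain(ext))
                  return false;   /* data blocks must not overlap inode chunks */
          if (!shared && refcount_index_has(ext))
                  return false;   /* unshared files should have no refcount entry */
          return rmap_has_owner(ext, owner, offset);  /* positive confirmation */
  }
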
> +
> +A key observation here is that only the reverse mapping can provide
> a positive
> +affirmation of correctness if the primary metadata is in doubt.
if any of the above metadata is in doubt...

> +The checking code for most primary metadata follows a path similar
> to the
> +one outlined above.
> +
> +A second observation to make about this secondary index is that
> proving its
> +consistency with the primary metadata is difficult.

> +Demonstrating that a given reverse mapping record exactly
> corresponds to the
> +primary space metadata involves a full scan of all primary space
> metadata,
> +which is very time intensive.
"But why?" Wonders the reader. Just jump into an example:

"In order to verify that an rmap extent does not incorrectly over lap
with another record, we would need a full scan of all the other
records, which is time intensive."

?

And then the below is a separate observation right?  

> +Scanning activity for online fsck can only use non-blocking lock
> acquisition
> +primitives if the locking order is not the regular order as used by
> the rest of
> +the filesystem.
Lastly, it should be noted that most file system operations tend to
lock primary metadata before locking the secondary metadata.  This
means that scanning operations that acquire the secondary metadata
first may need to yield the secondary lock to filesystem operations
that have already acquired the primary lock. 

?

> +This means that forward progress during this part of a scan of the
> reverse
> +mapping data cannot be guaranteed if system load is especially
> heavy.
> +Therefore, it is not practical for online check to detect reverse
> mapping
> +records that lack a counterpart in the primary metadata.
Such as <quick list / quick example>

> +Instead, scrub relies on rigorous cross-referencing during the
> primary space
> +mapping structure checks.
> +

The below paragraph sounds like a re-cap?

"So to recap, reverse mappings also...."
> +Reverse mappings also play a key role in reconstruction of primary
> metadata.
> +The secondary information is general enough for online repair to
> synthesize a
> +complete copy of any primary space management metadata by locking
> that
> +resource, querying all reverse mapping indices looking for records
> matching
> +the relevant resource, and transforming the mapping into an
> appropriate format.
> +The details of how these records are staged, written to disk, and
> committed
> +into the filesystem are covered in subsequent sections.
I also think the section would be ok if you were to trim off this last
paragraph too.

> +
> +Checking and Cross-Referencing
> +------------------------------
> +
> +The first step of checking a metadata structure is to examine every
> record
> +contained within the structure and its relationship with the rest of
> the
> +system.
> +XFS contains multiple layers of checking to try to prevent
> inconsistent
> +metadata from wreaking havoc on the system.
> +Each of these layers contributes information that helps the kernel
> to make
> +five decisions about the health of a metadata structure:
> +
> +- Is a part of this structure obviously corrupt
> (``XFS_SCRUB_OFLAG_CORRUPT``) ?
> +- Is this structure inconsistent with the rest of the system
> +  (``XFS_SCRUB_OFLAG_XCORRUPT``) ?
> +- Is there so much damage around the filesystem that cross-
> referencing is not
> +  possible (``XFS_SCRUB_OFLAG_XFAIL``) ?
> +- Can the structure be optimized to improve performance or reduce
> the size of
> +  metadata (``XFS_SCRUB_OFLAG_PREEN``) ?
> +- Does the structure contain data that is not inconsistent but
> deserves review
> +  by the system administrator (``XFS_SCRUB_OFLAG_WARNING``) ?
> +
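To make the decision flags above concrete, here is a minimal userspace
sketch of how a caller might interpret them; only the flag names come from
the text, while the header path and the helper name are assumptions made
for illustration:

#include <stdio.h>
#include <xfs/xfs.h>    /* assumed location of the XFS_SCRUB_OFLAG_* flags */

/* Hypothetical helper: translate scrub output flags into a verdict. */
static void toy_report_scrub_flags(unsigned int sm_flags)
{
        if (sm_flags & XFS_SCRUB_OFLAG_CORRUPT)
                printf("structure is corrupt and needs repair\n");
        if (sm_flags & XFS_SCRUB_OFLAG_XCORRUPT)
                printf("structure disagrees with other metadata\n");
        if (sm_flags & XFS_SCRUB_OFLAG_XFAIL)
                printf("cross-referencing could not be completed\n");
        if (sm_flags & XFS_SCRUB_OFLAG_PREEN)
                printf("structure is valid but could be optimized\n");
        if (sm_flags & XFS_SCRUB_OFLAG_WARNING)
                printf("structure deserves administrator review\n");
}
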
> +The following sections describe how the metadata scrubbing process
> works.
> +
> +Metadata Buffer Verification
> +````````````````````````````
> +
> +The lowest layer of metadata protection in XFS are the metadata
> verifiers built
> +into the buffer cache.
> +These functions perform inexpensive internal consistency checking of
> the block
> +itself, and answer these questions:
> +
> +- Does the block belong to this filesystem?
> +
> +- Does the block belong to the structure that asked for the read?
> +  This assumes that metadata blocks only have one owner, which is
> always true
> +  in XFS.
> +
> +- Is the type of data stored in the block within a reasonable range
> of what
> +  scrub is expecting?
> +
> +- Does the physical location of the block match the location it was
> read from?
> +
> +- Does the block checksum match the data?
> +
> +The scope of the protections here are very limited -- verifiers can
> only
> +establish that the filesystem code is reasonably free of gross
> corruption bugs
> +and that the storage system is reasonably competent at retrieval.
> +Corruption problems observed at runtime cause the generation of
> health reports,
> +failed system calls, and in the extreme case, filesystem shutdowns
> if the
> +corrupt metadata forces the cancellation of a dirty transaction.
> +
> +Every online fsck scrubbing function is expected to read every
> ondisk metadata
> +block of a structure in the course of checking the structure.
> +Corruption problems observed during a check are immediately reported
> to
> +userspace as corruption; during a cross-reference, they are reported
> as a
> +failure to cross-reference once the full examination is complete.
> +Reads satisfied by a buffer already in cache (and hence already
> verified)
> +bypass these checks.
> +
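As a rough illustration of the questions a verifier answers, consider this
toy check; the structure layout and names are hypothetical stand-ins, not
the actual XFS buffer verifier interface:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical ondisk block header, for illustration only. */
struct toy_block_hdr {
        uint32_t magic;      /* identifies the structure type */
        uint32_t crc;        /* checksum over the rest of the block */
        uint64_t blkno;      /* where this block believes it lives */
        uint8_t  uuid[16];   /* which filesystem owns this block */
};

static bool toy_verify_block(const struct toy_block_hdr *hdr,
                             uint32_t want_magic, uint64_t read_from,
                             const uint8_t *fs_uuid, uint32_t computed_crc)
{
        if (hdr->magic != want_magic)                /* wrong structure type? */
                return false;
        if (memcmp(hdr->uuid, fs_uuid, 16) != 0)     /* wrong filesystem? */
                return false;
        if (hdr->blkno != read_from)                 /* misdirected write? */
                return false;
        if (hdr->crc != computed_crc)                /* bitrot or software bug? */
                return false;
        return true;
}
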
> +Internal Consistency Checks
> +```````````````````````````
> +
> +The next higher level of metadata protection is the internal record
"After the buffer cache, the next level of metadata protection is..."
> +verification code built into the filesystem.

> +These checks are split between the buffer verifiers, the in-
> filesystem users of
> +the buffer cache, and the scrub code itself, depending on the amount
> of higher
> +level context required.
> +The scope of checking is still internal to the block.
> +For performance reasons, regular code may skip some of these checks
> unless
> +debugging is enabled or a write is about to occur.
> +Scrub functions, of course, must check all possible problems.
I'd put this chunk after the list below.

> +Either way, these higher level checking functions answer these
> questions:
Then this becomes:
"These higher level checking functions..."

> +
> +- Does the type of data stored in the block match what scrub is
> expecting?
> +
> +- Does the block belong to the owning structure that asked for the
> read?
> +
> +- If the block contains records, do the records fit within the
> block?
> +
> +- If the block tracks internal free space information, is it
> consistent with
> +  the record areas?
> +
> +- Are the records contained inside the block free of obvious
> corruptions?
> +
> +Record checks in this category are more rigorous and more time-
> intensive.
> +For example, block pointers and inumbers are checked to ensure that
> they point
> +within the dynamically allocated parts of an allocation group and
> within
> +the filesystem.
> +Names are checked for invalid characters, and flags are checked for
> invalid
> +combinations.
> +Other record attributes are checked for sensible values.
> +Btree records spanning an interval of the btree keyspace are checked
> for
> +correct order and lack of mergeability (except for file fork
> mappings).
> +
> +Validation of Userspace-Controlled Record Attributes
> +````````````````````````````````````````````````````
> +
> +Various pieces of filesystem metadata are directly controlled by
> userspace.
> +Because of this nature, validation work cannot be more precise than
> checking
> +that a value is within the possible range.
> +These fields include:
> +
> +- Superblock fields controlled by mount options
> +- Filesystem labels
> +- File timestamps
> +- File permissions
> +- File size
> +- File flags
> +- Names present in directory entries, extended attribute keys, and
> filesystem
> +  labels
> +- Extended attribute key namespaces
> +- Extended attribute values
> +- File data block contents
> +- Quota limits
> +- Quota timer expiration (if resource usage exceeds the soft limit)
> +
> +Cross-Referencing Space Metadata
> +````````````````````````````````
> +
> +The next higher level of checking is cross-referencing records
> between metadata

I kinda like the list first so that the reader has an idea of what
these checks are before getting into discussion about them.  It just
makes it a little more obvious as to why it's "prohibitively expensive"
or "dependent on the context of the structure" after having just looked
at it.

The rest looks good from here.

Allison

> +structures.
> +For regular runtime code, the cost of these checks is considered to
> be
> +prohibitively expensive, but as scrub is dedicated to rooting out
> +inconsistencies, it must pursue all avenues of inquiry.
> +The exact set of cross-referencing is highly dependent on the
> context of the
> +data structure being checked.
> +
> +The XFS btree code has keyspace scanning functions that online fsck
> uses to
> +cross reference one structure with another.
> +Specifically, scrub can scan the key space of an index to determine
> if that
> +keyspace is fully, sparsely, or not at all mapped to records.
> +For the reverse mapping btree, it is possible to mask parts of the
> key for the
> +purposes of performing a keyspace scan so that scrub can decide if
> the rmap
> +btree contains records mapping a certain extent of physical space
> without the
> +sparseness of the rest of the rmap keyspace getting in the way.
> +
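The idea of a masked keyspace scan can be pictured with a hypothetical
result type; the enum and function below are stand-ins for whatever the
btree code actually provides:

/* Hypothetical answer to "how much of this keyspace is mapped?" */
enum toy_keyfill {
        TOY_KEYFILL_EMPTY,      /* no records overlap the queried range */
        TOY_KEYFILL_SPARSE,     /* part of the range is covered by records */
        TOY_KEYFILL_FULL,       /* every key in the range has a record */
};

/*
 * Sketch: while checking a refcount record, ask the rmap btree (with the
 * owner portion of the key masked off) whether the physical extent is
 * fully covered by reverse mappings.
 */
static int toy_xref_refcount_extent(enum toy_keyfill rmap_coverage)
{
        if (rmap_coverage != TOY_KEYFILL_FULL)
                return -1;      /* flag a cross-referencing inconsistency */
        return 0;
}
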
> +Btree blocks undergo the following checks before cross-referencing:
> +
> +- Does the type of data stored in the block match what scrub is
> expecting?
> +
> +- Does the block belong to the owning structure that asked for the
> read?
> +
> +- Do the records fit within the block?
> +
> +- Are the records contained inside the block free of obvious
> corruptions?
> +
> +- Are the name hashes in the correct order?
> +
> +- Do node pointers within the btree point to valid block addresses
> for the type
> +  of btree?
> +
> +- Do child pointers point towards the leaves?
> +
> +- Do sibling pointers point across the same level?
> +
> +- For each node block record, does the record key accurately reflect
> the contents
> +  of the child block?
> +
> +Space allocation records are cross-referenced as follows:
> +
> +1. Any space mentioned by any metadata structure is cross-
> referenced as
> +   follows:
> +
> +   - Does the reverse mapping index list only the appropriate owner
> as the
> +     owner of each block?
> +
> +   - Are none of the blocks claimed as free space?
> +
> +   - If these aren't file data blocks, are none of the blocks
> claimed as space
> +     shared by different owners?
> +
> +2. Btree blocks are cross-referenced as follows:
> +
> +   - Everything in class 1 above.
> +
> +   - If there's a parent node block, do the keys listed for this
> block match the
> +     keyspace of this block?
> +
> +   - Do the sibling pointers point to valid blocks?  Of the same
> level?
> +
> +   - Do the child pointers point to valid blocks?  Of the next level
> down?
> +
> +3. Free space btree records are cross-referenced as follows:
> +
> +   - Everything in class 1 and 2 above.
> +
> +   - Does the reverse mapping index list no owners of this space?
> +
> +   - Is this space not claimed by the inode index for inodes?
> +
> +   - Is it not mentioned by the reference count index?
> +
> +   - Is there a matching record in the other free space btree?
> +
> +4. Inode btree records are cross-referenced as follows:
> +
> +   - Everything in class 1 and 2 above.
> +
> +   - Is there a matching record in free inode btree?
> +
> +   - Do cleared bits in the holemask correspond with inode clusters?
> +
> +   - Do set bits in the freemask correspond with inode records with
> zero link
> +     count?
> +
> +5. Inode records are cross-referenced as follows:
> +
> +   - Everything in class 1.
> +
> +   - Do all the fields that summarize information about the file
> forks actually
> +     match those forks?
> +
> +   - Does each inode with zero link count correspond to a record in
> the free
> +     inode btree?
> +
> +6. File fork space mapping records are cross-referenced as follows:
> +
> +   - Everything in class 1 and 2 above.
> +
> +   - Is this space not mentioned by the inode btrees?
> +
> +   - If this is a CoW fork mapping, does it correspond to a CoW
> entry in the
> +     reference count btree?
> +
> +7. Reference count records are cross-referenced as follows:
> +
> +   - Everything in class 1 and 2 above.
> +
> +   - Within the space subkeyspace of the rmap btree (that is to say,
> all
> +     records mapped to a particular space extent and ignoring the
> owner info),
> +     are there the same number of reverse mapping records for each
> block as the
> +     reference count record claims?
> +
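To make the class 1 checks above concrete, here is a sketch of how one
extent's cross-references might be evaluated; the lookup answers are passed
in because the btree query helpers themselves are out of scope here:

#include <stdbool.h>

/* Answers gathered by querying the per-AG btrees for one extent. */
struct toy_xref_answers {
        bool rmap_lists_only_expected_owner;    /* reverse mapping agrees */
        bool bnobt_claims_free;                 /* free space btree claims it */
        bool refcountbt_claims_shared;          /* refcount btree says shared */
};

/* Sketch of the class 1 checks for an extent owned by metadata. */
static bool toy_xref_metadata_extent(const struct toy_xref_answers *a)
{
        if (!a->rmap_lists_only_expected_owner)
                return false;   /* ownership disagreement */
        if (a->bnobt_claims_free)
                return false;   /* allocated space also listed as free */
        if (a->refcountbt_claims_shared)
                return false;   /* metadata blocks are never shared */
        return true;
}
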
> +Proposed patchsets are the series to find gaps in
> +`refcount btree
> +<
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> log/?h=scrub-detect-refcount-gaps>`_,
> +`inode btree
> +<
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> log/?h=scrub-detect-inobt-gaps>`_, and
> +`rmap btree
> +<
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> log/?h=scrub-detect-rmapbt-gaps>`_ records;
> +to find
> +`mergeable records
> +<
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> log/?h=scrub-detect-mergeable-records>`_;
> +and to
> +`improve cross referencing with rmap
> +<
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> log/?h=scrub-strengthen-rmap-checking>`_
> +before starting a repair.
> +
> +Checking Extended Attributes
> +````````````````````````````
> +
> +Extended attributes implement a key-value store that enable
> fragments of data
> +to be attached to any file.
> +Both the kernel and userspace can access the keys and values,
> subject to
> +namespace and privilege restrictions.
> +Most typically these fragments are metadata about the file --
> origins, security
> +contexts, user-supplied labels, indexing information, etc.
> +
> +Names can be as long as 255 bytes and can exist in several different
> +namespaces.
> +Values can be as large as 64KB.
> +A file's extended attributes are stored in blocks mapped by the attr
> fork.
> +The mappings point to leaf blocks, remote value blocks, or dabtree
> blocks.
> +Block 0 in the attribute fork is always the top of the structure,
> but otherwise
> +each of the three types of blocks can be found at any offset in the
> attr fork.
> +Leaf blocks contain attribute key records that point to the name and
> the value.
> +Names are always stored elsewhere in the same leaf block.
> +Values that are less than 3/4 the size of a filesystem block are
> also stored
> +elsewhere in the same leaf block.
> +Remote value blocks contain values that are too large to fit inside
> a leaf.
> +If the leaf information exceeds a single filesystem block, a dabtree
> (also
> +rooted at block 0) is created to map hashes of the attribute names
> to leaf
> +blocks in the attr fork.
> +
> +Checking an extended attribute structure is not so straightforward
> due to the
> +lack of separation between attr blocks and index blocks.
> +Scrub must read each block mapped by the attr fork and ignore the
> non-leaf
> +blocks:
> +
> +1. Walk the dabtree in the attr fork (if present) to ensure that
> there are no
> +   irregularities in the blocks or dabtree mappings that do not
> point to
> +   attr leaf blocks.
> +
> +2. Walk the blocks of the attr fork looking for leaf blocks.
> +   For each entry inside a leaf:
> +
> +   a. Validate that the name does not contain invalid characters.
> +
> +   b. Read the attr value.
> +      This performs a named lookup of the attr name to ensure the
> correctness
> +      of the dabtree.
> +      If the value is stored in a remote block, this also validates
> the
> +      integrity of the remote value block.
> +
> +Checking and Cross-Referencing Directories
> +``````````````````````````````````````````
> +
> +The filesystem directory tree is a directed acyclic graph structure,
> with files
> +constituting the nodes, and directory entries (dirents) constituting
> the edges.
> +Directories are a special type of file containing a set of mappings
> from a
> +255-byte sequence (name) to an inumber.
> +These are called directory entries, or dirents for short.
> +Each directory file must have exactly one directory pointing to the
> file.
> +A root directory points to itself.
> +Directory entries point to files of any type.
> +Each non-directory file may have multiple directories point to it.
> +
> +In XFS, directories are implemented as a file containing up to three
> 32GB
> +partitions.
> +The first partition contains directory entry data blocks.
> +Each data block contains variable-sized records associating a user-
> provided
> +name with an inumber and, optionally, a file type.
> +If the directory entry data grows beyond one block, the second
> partition (which
> +exists as post-EOF extents) is populated with a block containing
> free space
> +information and an index that maps hashes of the dirent names to
> directory data
> +blocks in the first partition.
> +This makes directory name lookups very fast.
> +If this second partition grows beyond one block, the third partition
> is
> +populated with a linear array of free space information for faster
> +expansions.
> +If the free space has been separated and the second partition grows
> again
> +beyond one block, then a dabtree is used to map hashes of dirent
> names to
> +directory data blocks.
> +
> +Checking a directory is pretty straightforward:
> +
> +1. Walk the dabtree in the second partition (if present) to ensure
> that there
> +   are no irregularities in the blocks or dabtree mappings that do
> not point to
> +   dirent blocks.
> +
> +2. Walk the blocks of the first partition looking for directory
> entries.
> +   Each dirent is checked as follows:
> +
> +   a. Does the name contain no invalid characters?
> +
> +   b. Does the inumber correspond to an actual, allocated inode?
> +
> +   c. Does the child inode have a nonzero link count?
> +
> +   d. If a file type is included in the dirent, does it match the
> type of the
> +      inode?
> +
> +   e. If the child is a subdirectory, does the child's dotdot
> pointer point
> +      back to the parent?
> +
> +   f. If the directory has a second partition, perform a named
> lookup of the
> +      dirent name to ensure the correctness of the dabtree.
> +
> +3. Walk the free space list in the third partition (if present) to
> ensure that
> +   the free spaces it describes are really unused.
> +
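A compact sketch of the per-dirent checks in step 2; the 255-byte limit
comes from the text, while the notion that NUL and '/' are the invalid
characters is an assumption made for illustration:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Check 2a: reject empty, oversized, or illegal names. */
static bool toy_dirent_name_valid(const char *name, size_t len)
{
        if (len == 0 || len > 255)
                return false;
        for (size_t i = 0; i < len; i++)
                if (name[i] == '\0' || name[i] == '/')
                        return false;
        return true;
}

/* Facts about the child inode, gathered by the caller beforehand. */
struct toy_dirent_facts {
        bool inode_allocated;           /* check 2b */
        uint32_t link_count;            /* check 2c */
        bool ftype_matches_inode;       /* check 2d */
};

static bool toy_check_dirent(const char *name, size_t namelen,
                             const struct toy_dirent_facts *facts)
{
        return toy_dirent_name_valid(name, namelen) &&
               facts->inode_allocated &&
               facts->link_count > 0 &&
               facts->ftype_matches_inode;
}
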
> +Checking operations involving :ref:`parents <dirparent>` and
> +:ref:`file link counts <nlinks>` are discussed in more detail in
> later
> +sections.
> +
> +Checking Directory/Attribute Btrees
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +As stated in previous sections, the directory/attribute btree
> (dabtree) index
> +maps user-provided names to improve lookup times by avoiding linear
> scans.
> +Internally, it maps a 32-bit hash of the name to a block offset
> within the
> +appropriate file fork.
> +
> +The internal structure of a dabtree closely resembles the btrees
> that record
> +fixed-size metadata records -- each dabtree block contains a magic
> number, a
> +checksum, sibling pointers, a UUID, a tree level, and a log sequence
> number.
> +The format of leaf and node records are the same -- each entry
> points to the
> +next level down in the hierarchy, with dabtree node records pointing
> to dabtree
> +leaf blocks, and dabtree leaf records pointing to non-dabtree blocks
> elsewhere
> +in the fork.
> +
> +Checking and cross-referencing the dabtree is very similar to what
> is done for
> +space btrees:
> +
> +- Does the type of data stored in the block match what scrub is
> expecting?
> +
> +- Does the block belong to the owning structure that asked for the
> read?
> +
> +- Do the records fit within the block?
> +
> +- Are the records contained inside the block free of obvious
> corruptions?
> +
> +- Are the name hashes in the correct order?
> +
> +- Do node pointers within the dabtree point to valid fork offsets
> for dabtree
> +  blocks?
> +
> +- Do leaf pointers within the dabtree point to valid fork offsets
> for directory
> +  or attr leaf blocks?
> +
> +- Do child pointers point towards the leaves?
> +
> +- Do sibling pointers point across the same level?
> +
> +- For each dabtree node record, does the record key accurately reflect
> the
> +  contents of the child dabtree block?
> +
> +- For each dabtree leaf record, does the record key accurately reflect
> the
> +  contents of the directory or attr block?
> +
> +Cross-Referencing Summary Counters
> +``````````````````````````````````
> +
> +XFS maintains three classes of summary counters: available
> resources, quota
> +resource usage, and file link counts.
> +
> +In theory, the amount of available resources (data blocks, inodes,
> realtime
> +extents) can be found by walking the entire filesystem.
> +This would make for very slow reporting, so a transactional
> filesystem can
> +maintain summaries of this information in the superblock.
> +Cross-referencing these values against the filesystem metadata
> should be a
> +simple matter of walking the free space and inode metadata in each
> AG and the
> +realtime bitmap, but there are complications that will be discussed
> in
> +:ref:`more detail <fscounters>` later.
> +
> +:ref:`Quota usage <quotacheck>` and :ref:`file link count <nlinks>`
> +checking are sufficiently complicated to warrant separate sections.
> +
> +Post-Repair Reverification
> +``````````````````````````
> +
> +After performing a repair, the checking code is run a second time to
> validate
> +the new structure, and the results of the health assessment are
> recorded
> +internally and returned to the calling process.
> +This step is critical for enabling the system administrator to monitor
> the status
> +of the filesystem and the progress of any repairs.
> +For developers, it is a useful means to judge the efficacy of error
> detection
> +and correction in the online and offline checking tools.
> diff --git a/Documentation/filesystems/xfs-self-describing-
> metadata.rst b/Documentation/filesystems/xfs-self-describing-
> metadata.rst
> index b79dbf36dc94..a10c4ae6955e 100644
> --- a/Documentation/filesystems/xfs-self-describing-metadata.rst
> +++ b/Documentation/filesystems/xfs-self-describing-metadata.rst
> @@ -1,4 +1,5 @@
>  .. SPDX-License-Identifier: GPL-2.0
> +.. _xfs_self_describing_metadata:
>  
>  ============================
>  XFS Self Describing Metadata
> 


^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 06/14] xfs: document how online fsck deals with eventual consistency
  2022-12-30 22:10   ` [PATCH 06/14] xfs: document how online fsck deals with eventual consistency Darrick J. Wong
  2023-01-05  9:08     ` Amir Goldstein
@ 2023-01-31  6:11     ` Allison Henderson
  2023-02-02 19:55       ` Darrick J. Wong
  1 sibling, 1 reply; 220+ messages in thread
From: Allison Henderson @ 2023-01-31  6:11 UTC (permalink / raw)
  To: djwong
  Cc: Catherine Hoang, david, willy, linux-xfs, Chandan Babu,
	linux-fsdevel, hch

On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Writes to an XFS filesystem employ an eventual consistency update
> model
> to break up complex multistep metadata updates into small chained
> transactions.  This is generally good for performance and scalability
> because XFS doesn't need to prepare for enormous transactions, but it
> also means that online fsck must be careful not to attempt a fsck
> action
> unless it can be shown that there are no other threads processing a
> transaction chain.  This part of the design documentation covers the
> thinking behind the consistency model and how scrub deals with it.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  .../filesystems/xfs-online-fsck-design.rst         |  303
> ++++++++++++++++++++
>  1 file changed, 303 insertions(+)
> 
> 
> diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst
> b/Documentation/filesystems/xfs-online-fsck-design.rst
> index f45bf97fa9c4..419eb54ee200 100644
> --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> @@ -1443,3 +1443,306 @@ This step is critical for enabling system
> administrator to monitor the status
>  of the filesystem and the progress of any repairs.
>  For developers, it is a useful means to judge the efficacy of error
> detection
>  and correction in the online and offline checking tools.
> +
> +Eventual Consistency vs. Online Fsck
> +------------------------------------
> +
> +Midway through the development of online scrubbing, the fsstress
> tests
> +uncovered a misinteraction between online fsck and compound
> transaction chains
> +created by other writer threads that resulted in false reports of
> metadata
> +inconsistency.
> +The root cause of these reports is the eventual consistency model
> introduced by
> +the expansion of deferred work items and compound transaction chains
> when
> +reverse mapping and reflink were introduced.



> +
> +Originally, transaction chains were added to XFS to avoid deadlocks
> when
> +unmapping space from files.
> +Deadlock avoidance rules require that AGs only be locked in
> increasing order,
> +which makes it impossible (say) to use a single transaction to free
> a space
> +extent in AG 7 and then try to free a now superfluous block mapping
> btree block
> +in AG 3.
> +To avoid these kinds of deadlocks, XFS creates Extent Freeing Intent
> (EFI) log
> +items to commit to freeing some space in one transaction while
> deferring the
> +actual metadata updates to a fresh transaction.
> +The transaction sequence looks like this:
> +
> +1. The first transaction contains a physical update to the file's
> block mapping
> +   structures to remove the mapping from the btree blocks.
> +   It then attaches to the in-memory transaction an action item to
> schedule
> +   deferred freeing of space.
> +   Concretely, each transaction maintains a list of ``struct
> +   xfs_defer_pending`` objects, each of which maintains a list of
> ``struct
> +   xfs_extent_free_item`` objects.
> +   Returning to the example above, the action item tracks the
> freeing of both
> +   the unmapped space from AG 7 and the block mapping btree (BMBT)
> block from
> +   AG 3.
> +   Deferred frees recorded in this manner are committed in the log
> by creating
> +   an EFI log item from the ``struct xfs_extent_free_item`` object
> and
> +   attaching the log item to the transaction.
> +   When the log is persisted to disk, the EFI item is written into
> the ondisk
> +   transaction record.
> +   EFIs can list up to 16 extents to free, all sorted in AG order.
> +
> +2. The second transaction contains a physical update to the free
> space btrees
> +   of AG 3 to release the former BMBT block and a second physical
> update to the
> +   free space btrees of AG 7 to release the unmapped file space.
> +   Observe that the physical updates are resequenced in the
> correct order
> +   when possible.
> +   Attached to the transaction is an extent free done (EFD) log
> item.
> +   The EFD contains a pointer to the EFI logged in transaction #1 so
> that log
> +   recovery can tell if the EFI needs to be replayed.
> +
> +If the system goes down after transaction #1 is written back to the
> filesystem
> +but before #2 is committed, a scan of the filesystem metadata would
> show
> +inconsistent filesystem metadata because there would not appear to
> be any owner
> +of the unmapped space.
> +Happily, log recovery corrects this inconsistency for us -- when
> recovery finds
> +an intent log item but does not find a corresponding intent done
> item, it will
> +reconstruct the incore state of the intent item and finish it.
> +In the example above, the log must replay both frees described in
> the recovered
> +EFI to complete the recovery phase.
> +
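For readers who want to picture the bookkeeping in step 1, here is a
simplified sketch of the two objects involved; the field names are
illustrative rather than copied from the kernel sources:

#include <linux/list.h>
#include <linux/types.h>

/* Simplified sketch of one queued extent-free work item. */
struct toy_extent_free_item {
        struct list_head        list;           /* link into the pending work */
        u64                     startblock;     /* first block to free */
        u32                     blockcount;     /* length of the extent */
};

/* Simplified sketch of one pending deferred-work class in a transaction. */
struct toy_defer_pending {
        struct list_head        dfp_list;       /* link into the transaction */
        struct list_head        dfp_work;       /* list of work items above */
        unsigned int            dfp_count;      /* at most 16 extents per EFI */
        void                    *dfp_intent;    /* the EFI, once logged */
        void                    *dfp_done;      /* the EFD, once committed */
};
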
> +There are two subtleties to XFS' transaction chaining strategy to
> consider.
> +The first is that log items must be added to a transaction in the
> correct order
> +to prevent conflicts with principal objects that are not held by the
> +transaction.
> +In other words, all per-AG metadata updates for an unmapped block
> must be
> +completed before the last update to free the extent, and extents
> should not
> +be reallocated until that last update commits to the log.
> +The second subtlety comes from the fact that AG header buffers are
> (usually)
> +released between each transaction in a chain.
> +This means that other threads can observe an AG in an intermediate
> state,
> +but as long as the first subtlety is handled, this should not affect
> the
> +correctness of filesystem operations.
> +Unmounting the filesystem flushes all pending work to disk, which
> means that
> +offline fsck never sees the temporary inconsistencies caused by
> deferred work
> +item processing.
> +In this manner, XFS employs a form of eventual consistency to avoid
> deadlocks
> +and increase parallelism.
> +
> +During the design phase of the reverse mapping and reflink features,
> it was
> +decided that it was impractical to cram all the reverse mapping
> updates for a
> +single filesystem change into a single transaction because a single
> file
> +mapping operation can explode into many small updates:
> +
> +* The block mapping update itself
> +* A reverse mapping update for the block mapping update
> +* Fixing the freelist
> +* A reverse mapping update for the freelist fix
> +
> +* A shape change to the block mapping btree
> +* A reverse mapping update for the btree update
> +* Fixing the freelist (again)
> +* A reverse mapping update for the freelist fix
> +
> +* An update to the reference counting information
> +* A reverse mapping update for the refcount update
> +* Fixing the freelist (a third time)
> +* A reverse mapping update for the freelist fix
> +
> +* Freeing any space that was unmapped and not owned by any other
> file
> +* Fixing the freelist (a fourth time)
> +* A reverse mapping update for the freelist fix
> +
> +* Freeing the space used by the block mapping btree
> +* Fixing the freelist (a fifth time)
> +* A reverse mapping update for the freelist fix
> +
> +Free list fixups are not usually needed more than once per AG per
> transaction
> +chain, but it is theoretically possible if space is very tight.
> +For copy-on-write updates this is even worse, because this must be
> done once to
> +remove the space from a staging area and again to map it into the
> file!
> +
> +To deal with this explosion in a calm manner, XFS expands its use of
> deferred
> +work items to cover most reverse mapping updates and all refcount
> updates.
> +This reduces the worst case size of transaction reservations by
> breaking the
> +work into a long chain of small updates, which increases the degree
> of eventual
> +consistency in the system.
> +Again, this generally isn't a problem because XFS orders its
> deferred work
> +items carefully to avoid resource reuse conflicts between
> unsuspecting threads.
> +
> +However, online fsck changes the rules -- remember that although
> physical
> +updates to per-AG structures are coordinated by locking the buffers
> for AG
> +headers, buffer locks are dropped between transactions.
> +Once scrub acquires resources and takes locks for a data structure,
> it must do
> +all the validation work without releasing the lock.
> +If the main lock for a space btree is an AG header buffer lock,
> scrub may have
> +interrupted another thread that is midway through finishing a chain.
> +For example, if a thread performing a copy-on-write has completed a
> reverse
> +mapping update but not the corresponding refcount update, the two AG
> btrees
> +will appear inconsistent to scrub and an observation of corruption
> will be
> +recorded.  This observation will not be correct.
> +If a repair is attempted in this state, the results will be
> catastrophic!
> +
> +Several solutions to this problem were evaluated upon discovery of
> this flaw:


Hmm, so while having a really in-depth EFI example is insightful, I
wonder if it would be more organized to put it in a separate document
somewhere and just reference it.  As far as ofsck is concerned, I think
a lighter summary would do:


"Complex operations that modify multiple AGs are performed through a
series of transactions which are logged to a journal that an offline
fsck can either replay or discard.  Online fsck, however, must be able
to deal with these operations while they are still in progress.  This
presents a unique challenge for ofsck since a partially completed
transaction chain may present the appearance of inconsistencies, even
though the operations are functioning as intended. (For a more detailed
example, see <cite document here...>)  

The challenge then becomes how to avoid incorrectly repairing these
non-issues as doing so would cause more harm than help."

> +
> +1. Add a higher level lock to allocation groups and require writer
> threads to
> +   acquire the higher level lock in AG order before making any
> changes.
> +   This would be very difficult to implement in practice because it
> is
> +   difficult to determine which locks need to be obtained, and in
> what order,
> +   without simulating the entire operation.
> +   Performing a dry run of a file operation to discover necessary
> locks would
> +   make the filesystem very slow.
> +
> +2. Make the deferred work coordinator code aware of consecutive
> intent items
> +   targeting the same AG and have it hold the AG header buffers
> locked across
> +   the transaction roll between updates.
> +   This would introduce a lot of complexity into the coordinator
> since it is
> +   only loosely coupled with the actual deferred work items.
> +   It would also fail to solve the problem because deferred work
> items can
> +   generate new deferred subtasks, but all subtasks must be complete
> before
> +   work can start on a new sibling task.
Hmm, that one doesn't seem like it's really an option then :-(

> +
> +3. Teach online fsck to walk all transactions waiting for whichever
> lock(s)
> +   protect the data structure being scrubbed to look for pending
> operations.
> +   The checking and repair operations must factor these pending
> operations into
> +   the evaluations being performed.
> +   This solution is a nonstarter because it is *extremely* invasive
> to the main
> +   filesystem.
> +
> +4. Recognize that only online fsck has this requirement of total
> consistency
> +   of AG metadata, and that online fsck should be relatively rare as
> compared
> +   to filesystem change operations.
> +   For each AG, maintain a count of intent items targeting that AG.
> +   When online fsck wants to examine an AG, it should lock the AG
> header
> +   buffers to quiesce all transaction chains that want to modify
> that AG, and
> +   only proceed with the scrub if the count is zero.
> +   In other words, scrub only proceeds if it can lock the AG header
> buffers and
> +   there can't possibly be any intents in progress.
> +   This may lead to fairness and starvation issues, but regular
> filesystem
> +   updates take precedence over online fsck activity.
So basically it sounds like 4 is the only reasonable option?  If the
discussion concerning the other options has died down, I would clean
them out.  They're great for brain storming and invitations for
collaboration, but ideally the goal of any of that should be to narrow
down an agreed upon plan of action.  And the goal of your document
should make clear what that plan is.  So if no one has any objections
by now, maybe just tie it right into the last line:

"The challenge then becomes how to avoid incorrectly repairing these
non-issues as doing so would cause more harm than help. 
Fortunately only online fsck has this requirement of total
consistency..."

> +
> +Intent Drains
> +`````````````
> +
> +The fourth solution is implemented in the current iteration of
This solution is implemented...

> online fsck,
> +with atomic_t providing the active intent counter.
> +
> +There are two key properties to the drain mechanism.
> +First, the counter is incremented when a deferred work item is
> *queued* to a
> +transaction, and it is decremented after the associated intent done
> log item is
> +*committed* to another transaction.
> +The second property is that deferred work can be added to a
> transaction without
> +holding an AG header lock, but per-AG work items cannot be marked
> done without
> +locking that AG header buffer to log the physical updates and the
> intent done
> +log item.
> +The first property enables scrub to yield to running transaction
> chains, which
> +is an explicit deprioritization of online fsck to benefit file
> operations.
> +The second property of the drain is key to the correct coordination
> of scrub,
> +since scrub will always be able to decide if a conflict is possible.
> +
> +For regular filesystem code, the drain works as follows:
> +
> +1. Call the appropriate subsystem function to add a deferred work
> item to a
> +   transaction.
> +
> +2. The function calls ``xfs_drain_bump`` to increase the counter.
> +
> +3. When the deferred item manager wants to finish the deferred work
> item, it
> +   calls ``->finish_item`` to complete it.
> +
> +4. The ``->finish_item`` implementation logs some changes and calls
> +   ``xfs_drain_drop`` to decrease the sloppy counter and wake up any
> threads
> +   waiting on the drain.
> +
> +5. The subtransaction commits, which unlocks the resource associated
> with the
> +   intent item.
> +
> +For scrub, the drain works as follows:
> +
> +1. Lock the resource(s) associated with the metadata being scrubbed.
> +   For example, a scan of the refcount btree would lock the AGI and
> AGF header
> +   buffers.
> +
> +2. If the counter is zero (``xfs_drain_busy`` returns false), there
> are no
> +   chains in progress and the operation may proceed.
> +
> +3. Otherwise, release the resources grabbed in step 1.
> +
> +4. Wait for the intent counter to reach zero
> (``xfs_drain_intents``), then go
> +   back to step 1 unless a signal has been caught.
> +
> +To avoid polling in step 4, the drain provides a waitqueue for scrub
> threads to
> +be woken up whenever the intent count drops to zero.
I think all that makes sense
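For readers skimming along, here is a rough sketch of what the four helpers
mentioned above (``xfs_drain_bump``, ``xfs_drain_drop``, ``xfs_drain_busy``,
``xfs_drain_intents``) could boil down to, assuming an atomic counter paired
with a waitqueue; the toy bodies below are illustrative, not the actual
implementation:

#include <linux/atomic.h>
#include <linux/types.h>
#include <linux/wait.h>

struct toy_drain {
        atomic_t                dr_count;       /* intents queued but not yet done */
        wait_queue_head_t       dr_waiters;     /* scrub threads park here */
};

static void toy_drain_bump(struct toy_drain *dr)
{
        atomic_inc(&dr->dr_count);              /* deferred work item queued */
}

static void toy_drain_drop(struct toy_drain *dr)
{
        if (atomic_dec_and_test(&dr->dr_count)) /* intent done item committed */
                wake_up(&dr->dr_waiters);       /* let waiting scrubbers retry */
}

static bool toy_drain_busy(struct toy_drain *dr)
{
        return atomic_read(&dr->dr_count) > 0;
}

static int toy_drain_wait(struct toy_drain *dr)
{
        /* returns -ERESTARTSYS if a fatal signal arrives first */
        return wait_event_killable(dr->dr_waiters,
                                   atomic_read(&dr->dr_count) == 0);
}
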

> +
> +The proposed patchset is the
> +`scrub intent drain series
> +<
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> log/?h=scrub-drain-intents>`_.
> +
> +.. _jump_labels:
> +
> +Static Keys (aka Jump Label Patching)
> +`````````````````````````````````````
> +
> +Online fsck for XFS separates the regular filesystem from the
> checking and
> +repair code as much as possible.
> +However, there are a few parts of online fsck (such as the intent
> drains, and
> +later, live update hooks) where it is useful for the online fsck
> code to know
> +what's going on in the rest of the filesystem.
> +Since it is not expected that online fsck will be constantly running
> in the
> +background, it is very important to minimize the runtime overhead
> imposed by
> +these hooks when online fsck is compiled into the kernel but not
> actively
> +running on behalf of userspace.
> +Taking locks in the hot path of a writer thread to access a data
> structure only
> +to find that no further action is necessary is expensive -- on the
> author's
> +computer, this has an overhead of 40-50ns per access.
> +Fortunately, the kernel supports dynamic code patching, which
> enables XFS to
> +replace a static branch to hook code with ``nop`` sleds when online
> fsck isn't
> +running.
> +This sled has an overhead of however long it takes the instruction
> decoder to
> +skip past the sled, which seems to be on the order of less than 1ns
> and
> +does not access memory outside of instruction fetching.
> +
> +When online fsck enables the static key, the sled is replaced with
> an
> +unconditional branch to call the hook code.
> +The switchover is quite expensive (~22000ns) but is paid entirely by
> the
> +program that invoked online fsck, and can be amortized if multiple
> threads
> +enter online fsck at the same time, or if multiple filesystems are
> being
> +checked at the same time.
> +Changing the branch direction requires taking the CPU hotplug lock,
> and since
> +CPU initialization requires memory allocation, online fsck must be
> careful not
> +to change a static key while holding any locks or resources that
> could be
> +accessed in the memory reclaim paths.
> +To minimize contention on the CPU hotplug lock, care should be taken
> not to
> +enable or disable static keys unnecessarily.
> +
> +Because static keys are intended to minimize hook overhead for
> regular
> +filesystem operations when xfs_scrub is not running, the intended
> usage
> +patterns are as follows:
> +
> +- The hooked part of XFS should declare a static-scoped static key
> that
> +  defaults to false.
> +  The ``DEFINE_STATIC_KEY_FALSE`` macro takes care of this.
> +  The static key itself should be declared as a ``static`` variable.
> +
> +- When deciding to invoke code that's only used by scrub, the
> regular
> +  filesystem should call the ``static_branch_unlikely`` predicate to
> avoid the
> +  scrub-only hook code if the static key is not enabled.
> +
> +- The regular filesystem should export helper functions that call
> +  ``static_branch_inc`` to enable and ``static_branch_dec`` to
> disable the
> +  static key.
> +  Wrapper functions make it easy to compile out the relevant code if
> the kernel
> +  distributor turns off online fsck at build time.
> +
> +- Scrub functions wanting to turn on scrub-only XFS functionality
> should call
> +  the ``xchk_fshooks_enable`` from the setup function to enable a
> specific
> +  hook.
> +  This must be done before obtaining any resources that are used by
> memory
> +  reclaim.
> +  Callers had better be sure they really need the functionality
> gated by the
> +  static key; the ``TRY_HARDER`` flag is useful here.
> +
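Pulling the usage pattern above together into one place, a condensed sketch
follows; the hook and helper names are placeholders rather than the actual
scrub symbols:

#include <linux/jump_label.h>

/* Defaults to false, so the hot path costs only a nop sled. */
static DEFINE_STATIC_KEY_FALSE(toy_scrub_hooks_key);

/* Placeholder for the scrub-only work gated behind the key. */
static void toy_scrub_hook(void)
{
}

/* Called from the regular filesystem's hot path. */
static inline void toy_fs_hot_path(void)
{
        if (static_branch_unlikely(&toy_scrub_hooks_key))
                toy_scrub_hook();       /* only taken while scrub is running */
}

/* Wrappers that scrub setup and teardown use to flip the branch. */
void toy_scrub_hooks_enable(void)
{
        /* may sleep: patching the text takes the CPU hotplug lock */
        static_branch_inc(&toy_scrub_hooks_key);
}

void toy_scrub_hooks_disable(void)
{
        static_branch_dec(&toy_scrub_hooks_key);
}
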
> +Online scrub has resource acquisition helpers (e.g.
> ``xchk_perag_lock``) to
> +handle locking AGI and AGF buffers for all scrubber functions.
> +If it detects a conflict between scrub and the running transactions,
> it will
> +try to wait for intents to complete.
> +If the caller of the helper has not enabled the static key, the
> helper will
> +return -EDEADLOCK, which should result in the scrub being restarted
> with the
> +``TRY_HARDER`` flag set.
> +The scrub setup function should detect that flag, enable the static
> key, and
> +try the scrub again.
> +Scrub teardown disables all static keys obtained by
> ``xchk_fshooks_enable``.

Ok, this part here seems pretty well documented.  Organizing nits aside,
I think it looks good.

Allison

> +
> +For more information, please see the kernel documentation of
> +Documentation/staging/static-keys.rst.
> 


^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 07/14] xfs: document pageable kernel memory
  2022-12-30 22:10   ` [PATCH 07/14] xfs: document pageable kernel memory Darrick J. Wong
@ 2023-02-02  7:14     ` Allison Henderson
  2023-02-02 23:14       ` Darrick J. Wong
  0 siblings, 1 reply; 220+ messages in thread
From: Allison Henderson @ 2023-02-02  7:14 UTC (permalink / raw)
  To: djwong
  Cc: Catherine Hoang, david, willy, linux-xfs, Chandan Babu,
	linux-fsdevel, hch

On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Add a discussion of pageable kernel memory, since online fsck needs
> quite a bit more memory than most other parts of the filesystem to
> stage
> records and other information.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  .../filesystems/xfs-online-fsck-design.rst         |  490
> ++++++++++++++++++++
>  1 file changed, 490 insertions(+)
> 
> 
> diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst
> b/Documentation/filesystems/xfs-online-fsck-design.rst
> index 419eb54ee200..9d7a2ef1d0dd 100644
> --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> @@ -383,6 +383,8 @@ Algorithms") of Srinivasan.
>  However, any data structure builder that maintains a resource lock
> for the
>  duration of the repair is *always* an offline algorithm.
>  
> +.. _secondary_metadata:
> +
>  Secondary Metadata
>  ``````````````````
>  
> @@ -1746,3 +1748,491 @@ Scrub teardown disables all static keys
> obtained by ``xchk_fshooks_enable``.
>  
>  For more information, please see the kernel documentation of
>  Documentation/staging/static-keys.rst.
> +
> +.. _xfile:
> +
> +Pageable Kernel Memory
> +----------------------
> +
> +Demonstrations of the first few prototypes of online repair revealed
> new
> +technical requirements that were not originally identified.
> +For the first demonstration, the code walked whatever filesystem
> +metadata it needed to synthesize new records and inserted records
> into a new
> +btree as it found them.
> +This was subpar since any additional corruption or runtime errors
> encountered
> +during the walk would shut down the filesystem.
> +After remount, the blocks containing the half-rebuilt data structure
> would not
> +be accessible until another repair was attempted.
> +Solving the problem of half-rebuilt data structures will be
> discussed in the
> +next section.
> +
> +For the second demonstration, the synthesized records were instead
> stored in
> +kernel slab memory.
> +Doing so enabled online repair to abort without writing to the
> filesystem if
> +the metadata walk failed, which prevented online fsck from making
> things worse.
> +However, even this approach needed improving upon.
> +
> +There are four reasons why traditional Linux kernel memory
> management isn't
> +suitable for storing large datasets:
> +
> +1. Although it is tempting to allocate a contiguous block of memory
> to create a
> +   C array, this cannot easily be done in the kernel because it
> cannot be
> +   relied upon to allocate multiple contiguous memory pages.
> +
> +2. While disparate physical pages can be virtually mapped together,
> installed
> +   memory might still not be large enough to stage the entire record
> set in
> +   memory while constructing a new btree.
> +
> +3. To overcome these two difficulties, the implementation was
> adjusted to use
> +   doubly linked lists, which means every record object needed two
> 64-bit list
> +   head pointers, which is a lot of overhead.
> +
> +4. Kernel memory is pinned, which can drive the system out of
> memory, leading
> +   to OOM kills of unrelated processes.
> +
I think I might just jump to whatever the current plan is
instead of trying to keep a record of the dev history in the document.
I'm sure we're not done yet, dev really never is, so in order for the
documentation to be maintained, it would just get bigger and bigger to
keep documenting it this way.  It's not that the above isn't valuable,
but maybe a different kind of document really.


> +For the third iteration, attention swung back to the possibility of
> using

Due to the large volume of metadata that needs to be processed, ofsck
uses...

> +byte-indexed array-like storage to reduce the overhead of in-memory
> records.
> +At any given time, online repair does not need to keep the entire
> record set in
> +memory, which means that individual records can be paged out.
> +Creating new temporary files in the XFS filesystem to store
> intermediate data
> +was explored and rejected for some types of repairs because a
> filesystem with
> +compromised space and inode metadata should never be used to fix
> compromised
> +space or inode metadata.
> +However, the kernel already has a facility for byte-addressable and
> pageable
> +storage: shmfs.
> +In-kernel graphics drivers (most notably i915) take advantage of
> shmfs files
> +to store intermediate data that doesn't need to be in memory at all
> times, so
> +that usage precedent is already established.
> +Hence, the ``xfile`` was born!
> +
> +xfile Access Models
> +```````````````````
> +
> +A survey of the intended uses of xfiles suggested these use cases:
> +
> +1. Arrays of fixed-sized records (space management btrees, directory
> and
> +   extended attribute entries)
> +
> +2. Sparse arrays of fixed-sized records (quotas and link counts)
> +
> +3. Large binary objects (BLOBs) of variable sizes (directory and
> extended
> +   attribute names and values)
> +
> +4. Staging btrees in memory (reverse mapping btrees)
> +
> +5. Arbitrary contents (realtime space management)
> +
> +To support the first four use cases, high level data structures wrap
> the xfile
> +to share functionality between online fsck functions.
> +The rest of this section discusses the interfaces that the xfile
> presents to
> +four of those five higher level data structures.
> +The fifth use case is discussed in the :ref:`realtime summary
> <rtsummary>` case
> +study.
> +
> +The most general storage interface supported by the xfile enables
> the reading
> +and writing of arbitrary quantities of data at arbitrary offsets in
> the xfile.
> +This capability is provided by ``xfile_pread`` and ``xfile_pwrite``
> functions,
> +which behave similarly to their userspace counterparts.
> +XFS is very record-based, which suggests that the ability to load
> and store
> +complete records is important.
> +To support these cases, a pair of ``xfile_obj_load`` and
> ``xfile_obj_store``
> +functions are provided to read and persist objects into an xfile.
> +They are internally the same as pread and pwrite, except that they
> treat any
> +error as an out of memory error.
> +For online repair, squashing error conditions in this manner is an
> acceptable
> +behavior because the only reaction is to abort the operation back to
> userspace.
> +All five xfile use cases can be serviced by these four functions.
> +
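Assuming prototypes modeled on ``pread``/``pwrite`` (an approximation; the
authoritative signatures live in the proposed patches), staging one record
might look like this:

#include <linux/types.h>

struct xfile;                           /* opaque; backed by a shmfs file */

/* Assumed prototypes, modeled on pread/pwrite: */
int xfile_obj_load(struct xfile *xf, void *buf, size_t count, loff_t pos);
int xfile_obj_store(struct xfile *xf, const void *buf, size_t count, loff_t pos);

/* Illustrative fixed-size record to stage. */
struct toy_staged_rec {
        __u64   startblock;
        __u64   blockcount;
        __u64   owner;
};

/* Store record number @index; any shortfall is reported as an error. */
static int toy_stage_record(struct xfile *xf, __u64 index,
                            const struct toy_staged_rec *rec)
{
        return xfile_obj_store(xf, rec, sizeof(*rec),
                               index * sizeof(*rec));
}
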
> +However, no discussion of file access idioms is complete without
> answering the
> +question, "But what about mmap?"
I actually wouldn't spend too much time discussing solutions that
didn't work for whatever reason, unless someone's really asking for it.
I think this section would read just fine if you trimmed off the last
paragraph here.
 
> +It would be *much* more convenient if kernel code could access
> pageable kernel
> +memory with pointers, just like userspace code does with regular
> memory.
> +Like any other filesystem that uses the page cache, reads and writes
> of xfile
> +data lock the cache page and map it into the kernel address space
> for the
> +duration of the operation.
> +Unfortunately, shmfs can only write a file page to the swap device
> if the page
> +is unmapped and unlocked, which means the xfile risks causing OOM
> problems
> +unless it is careful not to pin too many pages.
> +Therefore, the xfile steers most of its users towards programmatic
> access so
> +that backing pages are not kept locked in memory for longer than is
> necessary.
> +However, for callers performing quick linear scans of xfile data,
> +``xfile_get_page`` and ``xfile_put_page`` functions are provided to
> pin a page
> +in memory.
> +So far, the only code to use these functions are the xfarray
> :ref:`sorting
> +<xfarray_sort>` algorithms.
> +
> +xfile Access Coordination
> +`````````````````````````
> +
> +For security reasons, xfiles must be owned privately by the kernel.
> +They are marked ``S_PRIVATE`` to prevent interference from the
> security system,
> +must never be mapped into process file descriptor tables, and their
> pages must
> +never be mapped into userspace processes.
> +
> +To avoid locking recursion issues with the VFS, all accesses to the
> shmfs file
> +are performed by manipulating the page cache directly.
> +xfile writes call the ``->write_begin`` and ``->write_end``
> functions of the
> +xfile's address space to grab writable pages, copy the caller's
> buffer into the
> +page, and release the pages.
> +xfile reads call ``shmem_read_mapping_page_gfp`` to grab pages
xfile readers
> directly before
> +copying the contents into the caller's buffer.
> +In other words, xfiles ignore the VFS read and write code paths to
> avoid
> +having to create a dummy ``struct kiocb`` and to avoid taking inode
> and
> +freeze locks.
> +
> +If an xfile is shared between threads to stage repairs, the caller
> must provide
> +its own locks to coordinate access.
Ofsck threads that share an xfile to stage repairs will use their
own locks to coordinate access with each other.

?
> +
> +.. _xfarray:
> +
> +Arrays of Fixed-Sized Records
> +`````````````````````````````
> +
> +In XFS, each type of indexed space metadata (free space, inodes,
> reference
> +counts, file fork space, and reverse mappings) consists of a set of
> fixed-size
> +records indexed with a classic B+ tree.
> +Directories have a set of fixed-size dirent records that point to
> the names,
> +and extended attributes have a set of fixed-size attribute keys that
> point to
> +names and values.
> +Quota counters and file link counters index records with numbers.
> +During a repair, scrub needs to stage new records during the
> gathering step and
> +retrieve them during the btree building step.
> +
> +Although this requirement can be satisfied by calling the read and
> write
> +methods of the xfile directly, it is simpler for callers for there
> to be a
> +higher level abstraction to take care of computing array offsets, to
> provide
> +iterator functions, and to deal with sparse records and sorting.
> +The ``xfarray`` abstraction presents a linear array for fixed-size
> records atop
> +the byte-accessible xfile.
> +
> +.. _xfarray_access_patterns:
> +
> +Array Access Patterns
> +^^^^^^^^^^^^^^^^^^^^^
> +
> +Array access patterns in online fsck tend to fall into three
> categories.
> +Iteration of records is assumed to be necessary for all cases and
> will be
> +covered in the next section.
> +
> +The first type of caller handles records that are indexed by
> position.
> +Gaps may exist between records, and a record may be updated multiple
> times
> +during the collection step.
> +In other words, these callers want a sparse linearly addressed table
> file.
> +The typical use case are quota records or file link count records.
> +Access to array elements is performed programmatically via
> ``xfarray_load`` and
> +``xfarray_store`` functions, which wrap the similarly-named xfile
> functions to
> +provide loading and storing of array elements at arbitrary array
> indices.
> +Gaps are defined to be null records, and null records are defined to
> be a
> +sequence of all zero bytes.
> +Null records are detected by calling ``xfarray_element_is_null``.
> +They are created either by calling ``xfarray_unset`` to null out an
> existing
> +record or by never storing anything to an array index.
> +
> +The second type of caller handles records that are not indexed by
> position
> +and do not require multiple updates to a record.
> +The typical use case here is rebuilding space btrees and key/value
> btrees.
> +These callers can add records to the array without caring about
> array indices
> +via the ``xfarray_append`` function, which stores a record at the
> end of the
> +array.
> +For callers that require records to be presentable in a specific
> order (e.g.
> +rebuilding btree data), the ``xfarray_sort`` function can arrange
> the sorted
> +records; this function will be covered later.
> +
> +The third type of caller is a bag, which is useful for counting
> records.
> +The typical use case here is constructing space extent reference
> counts from
> +reverse mapping information.
> +Records can be put in the bag in any order, they can be removed from
> the bag
> +at any time, and uniqueness of records is left to callers.
> +The ``xfarray_store_anywhere`` function is used to insert a record
> in any
> +null record slot in the bag; and the ``xfarray_unset`` function
> removes a
> +record from the bag.
> +
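As a sketch of the first access style (a sparse table indexed by number),
and assuming that loading a never-written slot yields the all-zeroes null
record described above (prototypes approximated):

#include <linux/types.h>

struct xfarray;                         /* opaque; fixed-size records atop an xfile */
typedef __u64 xfarray_idx_t;            /* assumed index type */

/* Assumed prototypes for the programmatic accessors named above: */
int xfarray_load(struct xfarray *array, xfarray_idx_t idx, void *ptr);
int xfarray_store(struct xfarray *array, xfarray_idx_t idx, const void *ptr);

/* Sparse table of link counts indexed by inode number. */
static int toy_bump_link_count(struct xfarray *counts, xfarray_idx_t ino)
{
        __u64   nlinks;
        int     error;

        error = xfarray_load(counts, ino, &nlinks);     /* a gap reads as zero */
        if (error)
                return error;
        nlinks++;
        return xfarray_store(counts, ino, &nlinks);
}
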
> +The proposed patchset is the
> +`big in-memory array
> +<
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> log/?h=big-array>`_.
> +
> +Iterating Array Elements
> +^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +Most users of the xfarray require the ability to iterate the records
> stored in
> +the array.
> +Callers can probe every possible array index with the following:
> +
> +.. code-block:: c
> +
> +       xfarray_idx_t i;
> +       foreach_xfarray_idx(array, i) {
> +           xfarray_load(array, i, &rec);
> +
> +           /* do something with rec */
> +       }
> +
> +All users of this idiom must be prepared to handle null records or
> must already
> +know that there aren't any.
> +
> +For xfarray users that want to iterate a sparse array, the
> ``xfarray_iter``
> +function ignores indices in the xfarray that have never been written
> to by
> +calling ``xfile_seek_data`` (which internally uses ``SEEK_DATA``) to
> skip areas
> +of the array that are not populated with memory pages.
> +Once it finds a page, it will skip the zeroed areas of the page.
> +
> +.. code-block:: c
> +
> +       xfarray_idx_t i = XFARRAY_CURSOR_INIT;
> +       while ((ret = xfarray_iter(array, &i, &rec)) == 1) {
> +           /* do something with rec */
> +       }
> +
> +.. _xfarray_sort:
> +
> +Sorting Array Elements
> +^^^^^^^^^^^^^^^^^^^^^^
> +
> +During the fourth demonstration of online repair, a community
> reviewer remarked
> +that for performance reasons, online repair ought to load batches of
> records
> +into btree record blocks instead of inserting records into a new
> btree one at a
> +time.
> +The btree insertion code in XFS is responsible for maintaining
> correct ordering
> +of the records, so naturally the xfarray must also support sorting
> the record
> +set prior to bulk loading.
> +
> +The sorting algorithm used in the xfarray is actually a combination
> of adaptive
> +quicksort and a heapsort subalgorithm in the spirit of
> +`Sedgewick <https://algs4.cs.princeton.edu/23quicksort/>`_ and
> +`pdqsort <https://github.com/orlp/pdqsort>`_, with customizations
> for the Linux
> +kernel.
> +To sort records in a reasonably short amount of time, ``xfarray``
> takes
> +advantage of the binary subpartitioning offered by quicksort, but it
> also uses
> +heapsort to hedge against performance collapse if the chosen
> quicksort pivots
> +are poor.
> +Both algorithms are (in general) O(n * lg(n)), but there is a wide
> performance
> +gulf between the two implementations.
> +
> +The Linux kernel already contains a reasonably fast implementation
> of heapsort.
> +It only operates on regular C arrays, which limits the scope of its
> usefulness.
> +There are two key places where the xfarray uses it:
> +
> +* Sorting any record subset backed by a single xfile page.
> +
> +* Loading a small number of xfarray records from potentially
> disparate parts
> +  of the xfarray into a memory buffer, and sorting the buffer.
> +
> +In other words, ``xfarray`` uses heapsort to constrain the nested
> recursion of
> +quicksort, thereby mitigating quicksort's worst runtime behavior.
> +
> +Choosing a quicksort pivot is a tricky business.
> +A good pivot splits the set to sort in half, leading to the divide
> and conquer
> +behavior that is crucial to  O(n * lg(n)) performance.
> +A poor pivot barely splits the subset at all, leading to O(n\
> :sup:`2`)
> +runtime.
> +The xfarray sort routine tries to avoid picking a bad pivot by
> sampling nine
> +records into a memory buffer and using the kernel heapsort to
> identify the
> +median of the nine.
> +
> +Most modern quicksort implementations employ Tukey's "ninther" to
> select a
> +pivot from a classic C array.
> +Typical ninther implementations pick three unique triads of records,
> sort each
> +of the triads, and then sort the middle value of each triad to
> determine the
> +ninther value.
> +As stated previously, however, xfile accesses are not entirely
> cheap.
> +It turned out to be much more performant to read the nine elements
> into a
> +memory buffer, run the kernel's in-memory heapsort on the buffer,
> and choose
> +the 4th element of that buffer as the pivot.
> +Tukey's ninthers are described in J. W. Tukey, `The ninther, a
> technique for
> +low-effort robust (resistant) location in large samples`, in
> *Contributions to
> +Survey Sampling and Applied Statistics*, edited by H. David,
> (Academic Press,
> +1978), pp. 251–257.
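> +
> +A sketch of this pivot selection follows.
> +The fixed record type, the comparator ``rmap_cmp``, and the bounds
> +``lo`` and ``hi`` are illustrative assumptions; ``sort`` is the kernel
> +heapsort, and error handling is omitted:
> +
> +.. code-block:: c
> +
> +       struct xfs_rmap_irec samples[9], pivot;
> +       xfarray_idx_t step = (hi - lo) / 8;
> +       int i;
> +
> +       /* Read nine evenly spaced records into a memory buffer. */
> +       for (i = 0; i < 9; i++)
> +           xfarray_load(array, lo + (i * step), &samples[i]);
> +
> +       /* One in-memory heapsort; no further xfile accesses needed. */
> +       sort(samples, 9, sizeof(samples[0]), rmap_cmp, NULL);
> +
> +       /* The middle element of the sorted buffer becomes the pivot. */
> +       pivot = samples[4];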
> +
> +The partitioning of quicksort is fairly textbook -- rearrange the
> record
> +subset around the pivot, then set up the current and next stack
> frames to
> +sort with the larger and the smaller halves of the pivot,
> respectively.
> +This keeps the stack space requirements to log2(record count).
> +
> +As a final performance optimization, the hi and lo scanning phase of
> quicksort
> +keeps examined xfile pages mapped in the kernel for as long as
> possible to
> +reduce map/unmap cycles.
> +Surprisingly, this reduces overall sort runtime by nearly half again
> after
> +accounting for the application of heapsort directly onto xfile
> pages.
This sorting section is insightful, but I think I'd be ok without it
too.  Or maybe save it for later in the document as an "implementation
details" section, or something similar.  It seems like there's still a
lot to cover about how ofsck works in general before we start drilling
into things like the runtime complexity of the sorting algorithm it
uses.  

> +
> +Blob Storage
> +````````````
> +
> +Extended attributes and directories add an additional requirement
> for staging
> +records: arbitrary byte sequences of finite length.
> +Each directory entry record needs to store entry name,
> +and each extended attribute needs to store both the attribute name
> and value.
> +The names, keys, and values can consume a large amount of memory, so
> the
> +``xfblob`` abstraction was created to simplify management of these
> blobs
> +atop an xfile.
> +
> +Blob arrays provide ``xfblob_load`` and ``xfblob_store`` functions
> to retrieve
> +and persist objects.
> +The store function returns a magic cookie for every object that it
> persists.
> +Later, callers provide this cookie to ``xfblob_load`` to recall the object.
> +The ``xfblob_free`` function frees a specific blob, and the
> ``xfblob_truncate``
> +function frees them all because compaction is not needed.
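> +
> +A sketch of the blob API usage follows; the cookie type and the exact
> +signatures are assumptions made for illustration:
> +
> +.. code-block:: c
> +
> +       xfblob_cookie cookie;
> +       unsigned char name[MAXNAMELEN];
> +       int error;
> +
> +       /* Persist the dirent name; keep the cookie in the staging record. */
> +       error = xfblob_store(blobs, &cookie, name, name_len);
> +
> +       /* Later, recall the name to add it to the temporary file. */
> +       error = xfblob_load(blobs, cookie, name, name_len);
> +
> +       /* Free all the blobs once the batch has been flushed. */
> +       xfblob_truncate(blobs);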
> +
> +The details of repairing directories and extended attributes will be
> discussed
> +in a subsequent section about atomic extent swapping.
> +However, it should be noted that these repair functions only use
> blob storage
> +to cache a small number of entries before adding them to a temporary
> ondisk
> +file, which is why compaction is not required.
> +
> +The proposed patchset is at the start of the
> +`extended attribute repair
> +<
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> log/?h=repair-xattrs>`_ series.
> +
> +.. _xfbtree:
> +
> +In-Memory B+Trees
> +`````````````````
> +
> +The chapter about :ref:`secondary metadata<secondary_metadata>`
> mentioned that
> +checking and repairing of secondary metadata commonly requires
> coordination
> +between a live metadata scan of the filesystem and writer threads
> that are
> +updating that metadata.
> +Keeping the scan data up to date requires the ability to
> propagate
> +metadata updates from the filesystem into the data being collected
> by the scan.
> +This *can* be done by appending concurrent updates into a separate
> log file and
> +applying them before writing the new metadata to disk, but this
> leads to
> +unbounded memory consumption if the rest of the system is very busy.
> +Another option is to skip the side-log and commit live updates from
> the
> +filesystem directly into the scan data, which trades more overhead
> for a lower
> +maximum memory requirement.
> +In both cases, the data structure holding the scan results must
> support indexed
> +access to perform well.
> +
> +Given that indexed lookups of scan data are required for both
> strategies, online
> +fsck employs the second strategy of committing live updates directly
> into
> +scan data.
> +Because xfarrays are not indexed and do not enforce record ordering,
> they
> +are not suitable for this task.
> +Conveniently, however, XFS has a library to create and maintain
> ordered reverse
> +mapping records: the existing rmap btree code!
> +If only there was a means to create one in memory.
> +
> +Recall that the :ref:`xfile <xfile>` abstraction represents memory
> pages as a
> +regular file, which means that the kernel can create byte or block
> addressable
> +virtual address spaces at will.
> +The XFS buffer cache specializes in abstracting IO to block-oriented
> +address spaces, which means that adaptation of the buffer cache to
> +interface with xfiles enables reuse of the entire btree library.
> +Btrees built atop an xfile are collectively known as ``xfbtrees``.
> +The next few sections describe how they actually work.
> +
> +The proposed patchset is the
> +`in-memory btree
> +<
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> log/?h=in-memory-btrees>`_
> +series.
> +
> +Using xfiles as a Buffer Cache Target
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +Two modifications are necessary to support xfiles as a buffer cache
> target.
> +The first is to make it possible for the ``struct xfs_buftarg``
> structure to
> +host the ``struct xfs_buf`` rhashtable, because normally those are
> held by a
> +per-AG structure.
> +The second change is to modify the buffer ``ioapply`` function to
> "read" cached
> +pages from the xfile and "write" cached pages back to the xfile.
> +Multiple access to individual buffers is controlled by the
> ``xfs_buf`` lock,
> +since the xfile does not provide any locking on its own.
> +With this adaptation in place, users of the xfile-backed buffer
> cache use
> +exactly the same APIs as users of the disk-backed buffer cache.
> +The separation between xfile and buffer cache implies higher memory
> usage since
> +they do not share pages, but this property could some day enable
> transactional
> +updates to an in-memory btree.
> +Today, however, it simply eliminates the need for new code.
> +
> +Space Management with an xfbtree
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +Space management for an xfile is very simple -- each btree block is
> one memory
> +page in size.
> +These blocks use the same header format as an on-disk btree, but the
> in-memory
> +block verifiers ignore the checksums, assuming that xfile memory is
> no more
> +corruption-prone than regular DRAM.
> +Reusing existing code here is more important than absolute memory
> efficiency.
> +
> +The very first block of an xfile backing an xfbtree contains a
> header block.
> +The header describes the owner, height, and the block number of the
> root
> +xfbtree block.
> +
> +To allocate a btree block, use ``xfile_seek_data`` to find a gap in
> the file.
> +If there are no gaps, create one by extending the length of the
> xfile.
> +Preallocate space for the block with ``xfile_prealloc``, and hand
> back the
> +location.
> +To free an xfbtree block, use ``xfile_discard`` (which internally
> uses
> +``FALLOC_FL_PUNCH_HOLE``) to remove the memory page from the xfile.
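> +
> +A sketch of the allocation and free paths follows; apart from the two
> +xfile calls named above, the helpers and structure fields here are
> +assumptions made for illustration:
> +
> +.. code-block:: c
> +
> +       loff_t pos;
> +       int error;
> +
> +       /* Look for a hole left behind by a previously freed block. */
> +       pos = xfbtree_find_free_block(xfbt);     /* assumed helper */
> +       if (pos < 0)
> +           pos = xfbtree_xfile_size(xfbt);      /* extend the xfile */
> +
> +       /* Make sure backing pages exist before handing out the block. */
> +       error = xfile_prealloc(xfbt->xfile, pos, xfbt->blocksize);
> +
> +       /* Freeing a block later punches its page out of the xfile. */
> +       xfile_discard(xfbt->xfile, pos, xfbt->blocksize);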
> +
> +Populating an xfbtree
> +^^^^^^^^^^^^^^^^^^^^^
> +
> +An online fsck function that wants to create an xfbtree should
> proceed as
> +follows:
> +
> +1. Call ``xfile_create`` to create an xfile.
> +
> +2. Call ``xfs_alloc_memory_buftarg`` to create a buffer cache target
> structure
> +   pointing to the xfile.
> +
> +3. Pass the buffer cache target, buffer ops, and other information
> to
> +   ``xfbtree_create`` to write an initial tree header and root block
> to the
> +   xfile.
> +   Each btree type should define a wrapper that passes necessary
> arguments to
> +   the creation function.
> +   For example, rmap btrees define ``xfs_rmapbt_mem_create`` to take
> care of
> +   all the necessary details for callers.
> +   A ``struct xfbtree`` object will be returned.
> +
> +4. Pass the xfbtree object to the btree cursor creation function for
> the
> +   btree type.
> +   Following the example above, ``xfs_rmapbt_mem_cursor`` takes care
> of this
> +   for callers.
> +
> +5. Pass the btree cursor to the regular btree functions to make
> queries against
> +   and to update the in-memory btree.
> +   For example, a btree cursor for an rmap xfbtree can be passed to
> the
> +   ``xfs_rmap_*`` functions just like any other btree cursor.
> +   See the :ref:`next section<xfbtree_commit>` for information on
> dealing with
> +   xfbtree updates that are logged to a transaction.
> +
> +6. When finished, delete the btree cursor, destroy the xfbtree object,
> +   free the buffer target, and then destroy the xfile to release all
> +   resources.  A sketch of this sequence follows.
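> +
> +In the sketch below, every function signature is an assumption made
> +for illustration; ``mp``, ``tp``, ``pag``, and ``rmap`` come from the
> +caller's context, and error handling is omitted:
> +
> +.. code-block:: c
> +
> +       struct xfile            *xfile;
> +       struct xfs_buftarg      *btp;
> +       struct xfbtree          *xfbt;
> +       struct xfs_btree_cur    *cur;
> +
> +       /* Steps 1-2: create the xfile and a buffer cache target for it. */
> +       xfile_create(mp, "rmap staging", 0, &xfile);
> +       xfs_alloc_memory_buftarg(mp, xfile, &btp);
> +
> +       /* Step 3: write the tree header and root block. */
> +       xfs_rmapbt_mem_create(mp, pag->pag_agno, btp, &xfbt);
> +
> +       /* Steps 4-5: a regular btree cursor and regular rmap functions. */
> +       cur = xfs_rmapbt_mem_cursor(mp, tp, xfbt);
> +       xfs_rmap_map_raw(cur, &rmap);
> +       xfs_btree_del_cursor(cur, 0);
> +
> +       /* Step 6: tear everything down again. */
> +       xfbtree_destroy(xfbt);
> +       xfs_free_buftarg(btp);
> +       xfile_destroy(xfile);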
> +
> +.. _xfbtree_commit:
> +
> +Committing Logged xfbtree Buffers
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +Although it is a clever hack to reuse the rmap btree code to handle
> the staging
> +structure, the ephemeral nature of the in-memory btree block storage
> presents
> +some challenges of its own.
> +The XFS transaction manager must not commit buffer log items for
> buffers backed
> +by an xfile because the log format does not understand updates for
> devices
> +other than the data device.
> +An ephemeral xfbtree probably will not exist by the time the AIL
> checkpoints
> +log transactions back into the filesystem, and certainly won't exist
> during
> +log recovery.
> +For these reasons, any code updating an xfbtree in transaction
> context must
> +remove the buffer log items from the transaction and write the
> updates into the
> +backing xfile before committing or cancelling the transaction.
> +
> +The ``xfbtree_trans_commit`` and ``xfbtree_trans_cancel`` functions
> implement
> +this functionality as follows:
> +
> +1. Find each buffer log item whose buffer targets the xfile.
> +
> +2. Record the dirty/ordered status of the log item.
> +
> +3. Detach the log item from the buffer.
> +
> +4. Queue the buffer to a special delwri list.
> +
> +5. Clear the transaction dirty flag if the only dirty log items were
> the ones
> +   that were detached in step 3.
> +
> +6. Submit the delwri list to commit the changes to the xfile, if the
> updates
> +   are being committed.
> +
> +After removing xfile logged buffers from the transaction in this
> manner, the
> +transaction can be committed or cancelled.
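> +
> +A rough sketch of the commit path follows; the buffer-matching and
> +detach helpers are assumed names, step 5 is omitted for brevity, and
> +locking details are skipped:
> +
> +.. code-block:: c
> +
> +       struct xfs_log_item *lip, *n;
> +       LIST_HEAD(buffer_list);
> +       bool dirty;
> +       int error;
> +
> +       list_for_each_entry_safe(lip, n, &tp->t_items, li_trans) {
> +           struct xfs_buf_log_item *bli;
> +           struct xfs_buf *bp;
> +
> +           if (lip->li_type != XFS_LI_BUF)
> +               continue;
> +           bli = container_of(lip, struct xfs_buf_log_item, bli_item);
> +           bp = bli->bli_buf;
> +           if (!xfbtree_buf_match(xfbt, bp))    /* assumed helper */
> +               continue;
> +
> +           /* Steps 2-4: note dirtiness, detach, queue for writeback. */
> +           dirty = test_bit(XFS_LI_DIRTY, &lip->li_flags);
> +           xfbtree_buf_detach(tp, bp);          /* assumed helper */
> +           if (dirty)
> +               xfs_buf_delwri_queue(bp, &buffer_list);
> +       }
> +
> +       /* Step 6: write the dirty buffers back into the xfile. */
> +       error = xfs_buf_delwri_submit(&buffer_list);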
Rest of this looks pretty good, organizing nits aside.

Allison

> 

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 05/14] xfs: document the filesystem metadata checking strategy
  2023-01-21  1:38     ` Allison Henderson
@ 2023-02-02 19:04       ` Darrick J. Wong
  2023-02-09  5:41         ` Allison Henderson
  0 siblings, 1 reply; 220+ messages in thread
From: Darrick J. Wong @ 2023-02-02 19:04 UTC (permalink / raw)
  To: Allison Henderson
  Cc: Catherine Hoang, david, willy, linux-xfs, Chandan Babu,
	linux-fsdevel, hch

On Sat, Jan 21, 2023 at 01:38:33AM +0000, Allison Henderson wrote:
> On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Begin the fifth chapter of the online fsck design documentation,
> > where
> > we discuss the details of the data structures and algorithms used by
> > the
> > kernel to examine filesystem metadata and cross-reference it around
> > the
> > filesystem.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  .../filesystems/xfs-online-fsck-design.rst         |  579
> > ++++++++++++++++++++
> >  .../filesystems/xfs-self-describing-metadata.rst   |    1 
> >  2 files changed, 580 insertions(+)
> > 
> > 
> > diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst
> > b/Documentation/filesystems/xfs-online-fsck-design.rst
> > index 42e82971e036..f45bf97fa9c4 100644
> > --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> > +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> > @@ -864,3 +864,582 @@ Proposed patchsets include
> >  and
> >  `preservation of sickness info during memory reclaim
> >  <
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > log/?h=indirect-health-reporting>`_.
> > +
> > +5. Kernel Algorithms and Data Structures
> > +========================================
> > +
> > +This section discusses the key algorithms and data structures of the
> > kernel
> > +code that provide the ability to check and repair metadata while the
> > system
> > +is running.
> > +The first chapters in this section reveal the pieces that provide
> > the
> > +foundation for checking metadata.
> > +The remainder of this section presents the mechanisms through which
> > XFS
> > +regenerates itself.
> > +
> > +Self Describing Metadata
> > +------------------------
> > +
> > +Starting with XFS version 5 in 2012, XFS updated the format of
> > nearly every
> > +ondisk block header to record a magic number, a checksum, a
> > universally
> > +"unique" identifier (UUID), an owner code, the ondisk address of the
> > block,
> > +and a log sequence number.
> > +When loading a block buffer from disk, the magic number, UUID,
> > owner, and
> > +ondisk address confirm that the retrieved block matches the specific
> > owner of
> > +the current filesystem, and that the information contained in the
> > block is
> > +supposed to be found at the ondisk address.
> > +The first three components enable checking tools to disregard
> > alleged metadata
> > +that doesn't belong to the filesystem, and the fourth component
> > enables the
> > +filesystem to detect lost writes.
> Add...
> 
> "When ever a file system operation modifies a block, the change is
> submitted to the journal as a transaction.  The journal then processes
> these transactions marking them done once they are safely committed to
> the disk"

Ok, I'll add that transition.  Though I'll s/journal/log/ since this is
xfs. :)

> At this point we haven't talked much at all about transactions or logs,
> and we've just barely begun to cover blocks.  I think you at least want
> a quick blip to describe the relation of these two things, or it may
> not be clear why we suddenly jumped into logs.

Point taken.  Thanks for the suggestion.

> > +
> > +The logging code maintains the checksum and the log sequence number
> > of the last
> > +transactional update.
> > +Checksums are useful for detecting torn writes and other mischief
> "Checksums (or crc's) are useful for detecting incomplete or torn
> writes as well as other discrepancies..."

Checksums are a general concept, whereas CRCs denote a particular family
of checksums.  The statement would still apply even if we used a
different family (e.g. erasure codes, cryptographic hash functions) of
function instead of crc32c.

I will, however, avoid the undefined term 'mischief'.  Thanks for the
correction.

"Checksums are useful for detecting torn writes and other discrepancies
that can be introduced between the computer and its storage devices."

> > between the
> > +computer and its storage devices.
> > +Sequence number tracking enables log recovery to avoid applying out
> > of date
> > +log updates to the filesystem.
> > +
> > +These two features improve overall runtime resiliency by providing a
> > means for
> > +the filesystem to detect obvious corruption when reading metadata
> > blocks from
> > +disk, but these buffer verifiers cannot provide any consistency
> > checking
> > +between metadata structures.
> > +
> > +For more information, please see the documentation for
> > +Documentation/filesystems/xfs-self-describing-metadata.rst
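> > +
> > +To make the field list above concrete, an illustrative composite
> > +header (not an actual ondisk structure; each block type has its own
> > +layout) might look like this:
> > +
> > +.. code-block:: c
> > +
> > +       struct xfs_v5_header_sketch {
> > +           __be32    magic;   /* structure magic number */
> > +           __be32    crc;     /* checksum of the block contents */
> > +           uuid_t    uuid;    /* filesystem identifier */
> > +           __be64    owner;   /* AG or inode that owns this block */
> > +           __be64    blkno;   /* ondisk address of this block */
> > +           __be64    lsn;     /* sequence number of the last write */
> > +       };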
> > +
> > +Reverse Mapping
> > +---------------
> > +
> > +The original design of XFS (circa 1993) is an improvement upon 1980s
> > Unix
> > +filesystem design.
> > +In those days, storage density was expensive, CPU time was scarce,
> > and
> > +excessive seek time could kill performance.
> > +For performance reasons, filesystem authors were reluctant to add
> > redundancy to
> > +the filesystem, even at the cost of data integrity.
> > +Filesystem designers in the early 21st century chose different
> > strategies to
> > +increase internal redundancy -- either storing nearly identical
> > copies of
> > +metadata, or more space-efficient techniques such as erasure coding.
> "such as erasure coding which may encode sections of the data with
> redundant symbols and in more than one location"
> 
> That ties it into the next line.  If you go on to talk about a term you
> have not previously defined, i think you want to either define it
> quickly or just drop it all together.  Right now your goal is to just
> give the reader context, so you want it to move quickly.

How about I shorten it to:

"...or more space-efficient encoding techniques." ?

and end the paragraph there?

> > +Obvious corruptions are typically repaired by copying replicas or
> > +reconstructing from codes.
> > +
> I think I would have just jumped straight from xfs history to modern
> xfs...
> > +For XFS, a different redundancy strategy was chosen to modernize the
> > design:
> > +a secondary space usage index that maps allocated disk extents back
> > to their
> > +owners.
> > +By adding a new index, the filesystem retains most of its ability to
> > scale
> > +well to heavily threaded workloads involving large datasets, since
> > the primary
> > +file metadata (the directory tree, the file block map, and the
> > allocation
> > +groups) remain unchanged.
> > 
> 
> > +Although the reverse-mapping feature increases overhead costs for
> > space
> > +mapping activities just like any other system that improves
> > redundancy, it
> "Like any system that improves redundancy, the reverse-mapping feature
> increases overhead costs for space mapping activities. However, it..."

I like this better.  These two sentences have been changed to read:

"Like any system that improves redundancy, the reverse-mapping feature
increases overhead costs for space mapping activities.  However, it has
two critical advantages: first, the reverse index is key to enabling
online fsck and other requested functionality such as free space
defragmentation, better media failure reporting, and filesystem
shrinking."

> > +has two critical advantages: first, the reverse index is key to
> > enabling online
> > +fsck and other requested functionality such as filesystem
> > reorganization,
> > +better media failure reporting, and shrinking.
> > +Second, the different ondisk storage format of the reverse mapping
> > btree
> > +defeats device-level deduplication, because the filesystem requires
> > real
> > +redundancy.
> > +
> > +A criticism of adding the secondary index is that it does nothing to
> > improve
> > +the robustness of user data storage itself.
> > +This is a valid point, but adding a new index for file data block
> > checksums
> > +increases write amplification and turns data overwrites into copy-
> > writes, which
> > +age the filesystem prematurely.
> > +In keeping with thirty years of precedent, users who want file data
> > integrity
> > +can supply as powerful a solution as they require.
> > +As for metadata, the complexity of adding a new secondary index of
> > space usage
> > +is much less than adding volume management and storage device
> > mirroring to XFS
> > +itself.
> > +Perfection of RAID and volume management are best left to existing
> > layers in
> > +the kernel.
> I think I would cull the entire above paragraph.  rmap, crc and raid
> all have very different points of redundancy, so criticism that an
> apple is not an orange or vice versa just feels like a shortsighted
> comparison that's probably more of a distraction than anything.
> 
> Sometimes it feels like this document kinda gets off into tangents
> like it's preemptively trying to position itself for an argument
> that hasn't happened yet.

> It does!  Each of the many tangents that you've pointed out is a
reaction to some discussion that we've had on the list, or at an
LSF, or <cough> fs nerds sniping on social media.  The reason I
capture all of these offtopic arguments is to discourage people from
wasting time rehashing discussions that were settled long ago.

Admittedly, that is a very defensive reaction on my part...

> But I think it has the effect of pulling the
> reader's attention off topic into an argument they never thought to
> consider in the first place.  The topic of this section is to explain
> what rmap is.  So let's stay on topic and finish laying out that ground
> work first before getting into how it compares to other solutions

...and you're right to point out that mentioning these things is
distracting and provides fuel to reignite a flamewar.  At the same time,
I think there's value in identifying the roads not taken, and why.

What if I turned these tangents into explicitly labelled sidebars?
Would that help readers who want to stick to the topic?

> > +
> > +The information captured in a reverse space mapping record is as
> > follows:
> > +
> > +.. code-block:: c
> > +
> > +       struct xfs_rmap_irec {
> > +           xfs_agblock_t    rm_startblock;   /* extent start block
> > */
> > +           xfs_extlen_t     rm_blockcount;   /* extent length */
> > +           uint64_t         rm_owner;        /* extent owner */
> > +           uint64_t         rm_offset;       /* offset within the
> > owner */
> > +           unsigned int     rm_flags;        /* state flags */
> > +       };
> > +
> > +The first two fields capture the location and size of the physical
> > space,
> > +in units of filesystem blocks.
> > +The owner field tells scrub which metadata structure or file inode
> > have been
> > +assigned this space.
> > +For space allocated to files, the offset field tells scrub where the
> > space was
> > +mapped within the file fork.
> > +Finally, the flags field provides extra information about the space
> > usage --
> > +is this an attribute fork extent?  A file mapping btree extent?  Or
> > an
> > +unwritten data extent?
> > +
> > +Online filesystem checking judges the consistency of each primary
> > metadata
> > +record by comparing its information against all other space indices.
> > +The reverse mapping index plays a key role in the consistency
> > checking process
> > +because it contains a centralized alternate copy of all space
> > allocation
> > +information.
> > +Program runtime and ease of resource acquisition are the only real
> > limits to
> > +what online checking can consult.
> > +For example, a file data extent mapping can be checked against:
> > +
> > +* The absence of an entry in the free space information.
> > +* The absence of an entry in the inode index.
> > +* The absence of an entry in the reference count data if the file is
> > not
> > +  marked as having shared extents.
> > +* The correspondence of an entry in the reverse mapping information.
> > +
> > +A key observation here is that only the reverse mapping can provide
> > a positive
> > +affirmation of correctness if the primary metadata is in doubt.
> if any of the above metadata is in doubt...

Fixed.

> > +The checking code for most primary metadata follows a path similar
> > to the
> > +one outlined above.
> > +
> > +A second observation to make about this secondary index is that
> > proving its
> > +consistency with the primary metadata is difficult.
> 
> > +Demonstrating that a given reverse mapping record exactly
> > corresponds to the
> > +primary space metadata involves a full scan of all primary space
> > metadata,
> > +which is very time intensive.
> "But why?" Wonders the reader. Just jump into an example:
> 
> "In order to verify that an rmap extent does not incorrectly over lap
> with another record, we would need a full scan of all the other
> records, which is time intensive."

I want to shorten it even further:

"Validating that reverse mapping records are correct requires a full
scan of all primary space metadata, which is very time intensive."

> 
> ?
> 
> And then the below is a separate observation right?  

Right.

> > +Scanning activity for online fsck can only use non-blocking lock
> > acquisition
> > +primitives if the locking order is not the regular order as used by
> > the rest of
> > +the filesystem.
> Lastly, it should be noted that most file system operations tend to
> lock primary metadata before locking the secondary metadata.

This isn't accurate -- metadata structures don't have separate locks.
So it's not true to say that we lock primary or secondary metadata.

We /can/ say that file operations lock the inode, then the AGI, then the
AGF; or that directory operations lock the parent and child ILOCKs in
inumber order; and that if scrub wants to take locks in any other order,
it can only do that via trylocks and backoff.

> This
> means that scanning operations that acquire the secondary metadata
> first may need to yield the secondary lock to filesystem operations
> that have already acquired the primary lock. 
> 
> ?
> 
> > +This means that forward progress during this part of a scan of the
> > reverse
> > +mapping data cannot be guaranteed if system load is especially
> > heavy.
> > +Therefore, it is not practical for online check to detect reverse
> > mapping
> > +records that lack a counterpart in the primary metadata.
> Such as <quick list / quick example>
> 
> > +Instead, scrub relies on rigorous cross-referencing during the
> > primary space
> > +mapping structure checks.

I've converted this section into a bullet list:

"There are several observations to make about reverse mapping indices:

"1. Reverse mappings can provide a positive affirmation of correctness if
any of the above primary metadata are in doubt.  The checking code for
most primary metadata follows a path similar to the one outlined above.

"2. Proving the consistency of secondary metadata with the primary
metadata is difficult because that requires a full scan of all primary
space metadata, which is very time intensive.  For example, checking a
reverse mapping record for a file extent mapping btree block requires
locking the file and searching the entire btree to confirm the block.
Instead, scrub relies on rigorous cross-referencing during the primary
space mapping structure checks.

"3. Consistency scans must use non-blocking lock acquisition primitives
if the required locking order is not the same order used by regular
filesystem operations.  This means that forward progress during this
part of a scan of the reverse mapping data cannot be guaranteed if
system load is heavy."

> > +
> 
> The below paragraph sounds like a re-cap?
> 
> "So to recap, reverse mappings also...."

Yep.

> > +Reverse mappings also play a key role in reconstruction of primary
> > metadata.
> > +The secondary information is general enough for online repair to
> > synthesize a
> > +complete copy of any primary space management metadata by locking
> > that
> > +resource, querying all reverse mapping indices looking for records
> > matching
> > +the relevant resource, and transforming the mapping into an
> > appropriate format.
> > +The details of how these records are staged, written to disk, and
> > committed
> > +into the filesystem are covered in subsequent sections.
> I also think the section would be ok if you were to trim off this last
> paragraph too.

Hm.  I still want to set up the expectation that there's more to come.
How about a brief two-sentence transition paragraph:

"In summary, reverse mappings play a key role in reconstruction of
primary metadata.  The details of how these records are staged, written
to disk, and committed into the filesystem are covered in subsequent
sections."

> 
> > +
> > +Checking and Cross-Referencing
> > +------------------------------
> > +
> > +The first step of checking a metadata structure is to examine every
> > record
> > +contained within the structure and its relationship with the rest of
> > the
> > +system.
> > +XFS contains multiple layers of checking to try to prevent
> > inconsistent
> > +metadata from wreaking havoc on the system.
> > +Each of these layers contributes information that helps the kernel
> > to make
> > +five decisions about the health of a metadata structure (a sketch of
> > +how these flags reach userspace follows the list):
> > +
> > +- Is a part of this structure obviously corrupt
> > (``XFS_SCRUB_OFLAG_CORRUPT``) ?
> > +- Is this structure inconsistent with the rest of the system
> > +  (``XFS_SCRUB_OFLAG_XCORRUPT``) ?
> > +- Is there so much damage around the filesystem that cross-
> > referencing is not
> > +  possible (``XFS_SCRUB_OFLAG_XFAIL``) ?
> > +- Can the structure be optimized to improve performance or reduce
> > the size of
> > +  metadata (``XFS_SCRUB_OFLAG_PREEN``) ?
> > +- Does the structure contain data that is not inconsistent but
> > deserves review
> > +  by the system administrator (``XFS_SCRUB_OFLAG_WARNING``) ?
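> > +
> > +These flags are reported to userspace in the ``sm_flags`` field of
> > +``struct xfs_scrub_metadata``; a minimal sketch of a caller checking
> > +them (error handling abbreviated) might be:
> > +
> > +.. code-block:: c
> > +
> > +       struct xfs_scrub_metadata sm = {
> > +           .sm_type = XFS_SCRUB_TYPE_BNOBT,
> > +           .sm_agno = 0,
> > +       };
> > +
> > +       error = ioctl(fd, XFS_IOC_SCRUB_METADATA, &sm);
> > +       if (!error && (sm.sm_flags & XFS_SCRUB_OFLAG_CORRUPT)) {
> > +           /* the AG 0 bnobt itself is bad; consider a repair */
> > +       }
> > +       if (!error && (sm.sm_flags & XFS_SCRUB_OFLAG_XCORRUPT)) {
> > +           /* the bnobt disagrees with other metadata */
> > +       }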
> > +
> > +The following sections describe how the metadata scrubbing process
> > works.
> > +
> > +Metadata Buffer Verification
> > +````````````````````````````
> > +
> > +The lowest layer of metadata protection in XFS are the metadata
> > verifiers built
> > +into the buffer cache.
> > +These functions perform inexpensive internal consistency checking of
> > the block
> > +itself, and answer these questions:
> > +
> > +- Does the block belong to this filesystem?
> > +
> > +- Does the block belong to the structure that asked for the read?
> > +  This assumes that metadata blocks only have one owner, which is
> > always true
> > +  in XFS.
> > +
> > +- Is the type of data stored in the block within a reasonable range
> > of what
> > +  scrub is expecting?
> > +
> > +- Does the physical location of the block match the location it was
> > read from?
> > +
> > +- Does the block checksum match the data?
> > +
> > +The scope of the protections here are very limited -- verifiers can
> > only
> > +establish that the filesystem code is reasonably free of gross
> > corruption bugs
> > +and that the storage system is reasonably competent at retrieval.
> > +Corruption problems observed at runtime cause the generation of
> > health reports,
> > +failed system calls, and in the extreme case, filesystem shutdowns
> > if the
> > +corrupt metadata force the cancellation of a dirty transaction.
> > +
> > +Every online fsck scrubbing function is expected to read every
> > ondisk metadata
> > +block of a structure in the course of checking the structure.
> > +Corruption problems observed during a check are immediately reported
> > to
> > +userspace as corruption; during a cross-reference, they are reported
> > as a
> > +failure to cross-reference once the full examination is complete.
> > +Reads satisfied by a buffer already in cache (and hence already
> > verified)
> > +bypass these checks.
> > +
> > +Internal Consistency Checks
> > +```````````````````````````
> > +
> > +The next higher level of metadata protection is the internal record
> "After the buffer cache, the next level of metadata protection is..."

Changed.  I'll do the same to the next section as well.

> > +verification code built into the filesystem.
> 
> > +These checks are split between the buffer verifiers, the in-
> > filesystem users of
> > +the buffer cache, and the scrub code itself, depending on the amount
> > of higher
> > +level context required.
> > +The scope of checking is still internal to the block.
> > +For performance reasons, regular code may skip some of these checks
> > unless
> > +debugging is enabled or a write is about to occur.
> > +Scrub functions, of course, must check all possible problems.
> I'd put this chunk after the list below.
> 
> > +Either way, these higher level checking functions answer these
> > questions:
> Then this becomes:
> "These higher level checking functions..."

Done.

> > +
> > +- Does the type of data stored in the block match what scrub is
> > expecting?
> > +
> > +- Does the block belong to the owning structure that asked for the
> > read?
> > +
> > +- If the block contains records, do the records fit within the
> > block?
> > +
> > +- If the block tracks internal free space information, is it
> > consistent with
> > +  the record areas?
> > +
> > +- Are the records contained inside the block free of obvious
> > corruptions?
> > +
> > +Record checks in this category are more rigorous and more time-
> > intensive.
> > +For example, block pointers and inumbers are checked to ensure that
> > they point
> > +within the dynamically allocated parts of an allocation group and
> > within
> > +the filesystem.
> > +Names are checked for invalid characters, and flags are checked for
> > invalid
> > +combinations.
> > +Other record attributes are checked for sensible values.
> > +Btree records spanning an interval of the btree keyspace are checked
> > for
> > +correct order and lack of mergeability (except for file fork
> > mappings).
> > +
> > +Validation of Userspace-Controlled Record Attributes
> > +````````````````````````````````````````````````````
> > +
> > +Various pieces of filesystem metadata are directly controlled by
> > userspace.
> > +Because of this nature, validation work cannot be more precise than
> > checking
> > +that a value is within the possible range.
> > +These fields include:
> > +
> > +- Superblock fields controlled by mount options
> > +- Filesystem labels
> > +- File timestamps
> > +- File permissions
> > +- File size
> > +- File flags
> > +- Names present in directory entries, extended attribute keys, and
> > filesystem
> > +  labels
> > +- Extended attribute key namespaces
> > +- Extended attribute values
> > +- File data block contents
> > +- Quota limits
> > +- Quota timer expiration (if resource usage exceeds the soft limit)
> > +
> > +Cross-Referencing Space Metadata
> > +````````````````````````````````
> > +
> > +The next higher level of checking is cross-referencing records
> > between metadata
> 
> I kinda like the list first so that the reader has an idea of what
> these checks are before getting into discussion about them.  It just
> makes it a little more obvious as to why it's "prohibitively expensive"
> or "dependent on the context of the structure" after having just looked
> at it

<nod>

> The rest looks good from here.

Woot.  Onto the next reply! :)

--D

> Allison
> 
> > +structures.
> > +For regular runtime code, the cost of these checks is considered to
> > be
> > +prohibitively expensive, but as scrub is dedicated to rooting out
> > +inconsistencies, it must pursue all avenues of inquiry.
> > +The exact set of cross-referencing is highly dependent on the
> > context of the
> > +data structure being checked.
> > +
> > +The XFS btree code has keyspace scanning functions that online fsck
> > uses to
> > +cross reference one structure with another.
> > +Specifically, scrub can scan the key space of an index to determine
> > if that
> > +keyspace is fully, sparsely, or not at all mapped to records.
> > +For the reverse mapping btree, it is possible to mask parts of the
> > key for the
> > +purposes of performing a keyspace scan so that scrub can decide if
> > the rmap
> > +btree contains records mapping a certain extent of physical space
> > without the
> > +sparseness of the rest of the rmap keyspace getting in the way.
> > +
> > +Btree blocks undergo the following checks before cross-referencing:
> > +
> > +- Does the type of data stored in the block match what scrub is
> > expecting?
> > +
> > +- Does the block belong to the owning structure that asked for the
> > read?
> > +
> > +- Do the records fit within the block?
> > +
> > +- Are the records contained inside the block free of obvious
> > corruptions?
> > +
> > +- Are the name hashes in the correct order?
> > +
> > +- Do node pointers within the btree point to valid block addresses
> > for the type
> > +  of btree?
> > +
> > +- Do child pointers point towards the leaves?
> > +
> > +- Do sibling pointers point across the same level?
> > +
> > +- For each node block record, does the record key accurately reflect
> > the contents
> > +  of the child block?
> > +
> > +Space allocation records are cross-referenced as follows:
> > +
> > +1. Any space mentioned by any metadata structure are cross-
> > referenced as
> > +   follows:
> > +
> > +   - Does the reverse mapping index list only the appropriate owner
> > as the
> > +     owner of each block?
> > +
> > +   - Are none of the blocks claimed as free space?
> > +
> > +   - If these aren't file data blocks, are none of the blocks
> > claimed as space
> > +     shared by different owners?
> > +
> > +2. Btree blocks are cross-referenced as follows:
> > +
> > +   - Everything in class 1 above.
> > +
> > +   - If there's a parent node block, do the keys listed for this
> > block match the
> > +     keyspace of this block?
> > +
> > +   - Do the sibling pointers point to valid blocks?  Of the same
> > level?
> > +
> > +   - Do the child pointers point to valid blocks?  Of the next level
> > down?
> > +
> > +3. Free space btree records are cross-referenced as follows:
> > +
> > +   - Everything in class 1 and 2 above.
> > +
> > +   - Does the reverse mapping index list no owners of this space?
> > +
> > +   - Is this space not claimed by the inode index for inodes?
> > +
> > +   - Is it not mentioned by the reference count index?
> > +
> > +   - Is there a matching record in the other free space btree?
> > +
> > +4. Inode btree records are cross-referenced as follows:
> > +
> > +   - Everything in class 1 and 2 above.
> > +
> > +   - Is there a matching record in free inode btree?
> > +
> > +   - Do cleared bits in the holemask correspond with inode clusters?
> > +
> > +   - Do set bits in the freemask correspond with inode records with
> > zero link
> > +     count?
> > +
> > +5. Inode records are cross-referenced as follows:
> > +
> > +   - Everything in class 1.
> > +
> > +   - Do all the fields that summarize information about the file
> > forks actually
> > +     match those forks?
> > +
> > +   - Does each inode with zero link count correspond to a record in
> > the free
> > +     inode btree?
> > +
> > +6. File fork space mapping records are cross-referenced as follows:
> > +
> > +   - Everything in class 1 and 2 above.
> > +
> > +   - Is this space not mentioned by the inode btrees?
> > +
> > +   - If this is a CoW fork mapping, does it correspond to a CoW
> > entry in the
> > +     reference count btree?
> > +
> > +7. Reference count records are cross-referenced as follows:
> > +
> > +   - Everything in class 1 and 2 above.
> > +
> > +   - Within the space subkeyspace of the rmap btree (that is to say,
> > all
> > +     records mapped to a particular space extent and ignoring the
> > owner info),
> > +     are there the same number of reverse mapping records for each
> > block as the
> > +     reference count record claims?
> > +
> > +Proposed patchsets are the series to find gaps in
> > +`refcount btree
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > log/?h=scrub-detect-refcount-gaps>`_,
> > +`inode btree
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > log/?h=scrub-detect-inobt-gaps>`_, and
> > +`rmap btree
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > log/?h=scrub-detect-rmapbt-gaps>`_ records;
> > +to find
> > +`mergeable records
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > log/?h=scrub-detect-mergeable-records>`_;
> > +and to
> > +`improve cross referencing with rmap
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > log/?h=scrub-strengthen-rmap-checking>`_
> > +before starting a repair.
> > +
> > +Checking Extended Attributes
> > +````````````````````````````
> > +
> > +Extended attributes implement a key-value store that enable
> > fragments of data
> > +to be attached to any file.
> > +Both the kernel and userspace can access the keys and values,
> > subject to
> > +namespace and privilege restrictions.
> > +Most typically these fragments are metadata about the file --
> > origins, security
> > +contexts, user-supplied labels, indexing information, etc.
> > +
> > +Names can be as long as 255 bytes and can exist in several different
> > +namespaces.
> > +Values can be as large as 64KB.
> > +A file's extended attributes are stored in blocks mapped by the attr
> > fork.
> > +The mappings point to leaf blocks, remote value blocks, or dabtree
> > blocks.
> > +Block 0 in the attribute fork is always the top of the structure,
> > but otherwise
> > +each of the three types of blocks can be found at any offset in the
> > attr fork.
> > +Leaf blocks contain attribute key records that point to the name and
> > the value.
> > +Names are always stored elsewhere in the same leaf block.
> > +Values that are less than 3/4 the size of a filesystem block are
> > also stored
> > +elsewhere in the same leaf block.
> > +Remote value blocks contain values that are too large to fit inside
> > a leaf.
> > +If the leaf information exceeds a single filesystem block, a dabtree
> > (also
> > +rooted at block 0) is created to map hashes of the attribute names
> > to leaf
> > +blocks in the attr fork.
> > +
> > +Checking an extended attribute structure is not so straightforward
> > due to the
> > +lack of separation between attr blocks and index blocks.
> > +Scrub must read each block mapped by the attr fork and ignore the
> > non-leaf
> > +blocks:
> > +
> > +1. Walk the dabtree in the attr fork (if present) to ensure that
> > there are no
> > +   irregularities in the blocks or dabtree mappings that do not
> > point to
> > +   attr leaf blocks.
> > +
> > +2. Walk the blocks of the attr fork looking for leaf blocks.
> > +   For each entry inside a leaf:
> > +
> > +   a. Validate that the name does not contain invalid characters.
> > +
> > +   b. Read the attr value.
> > +      This performs a named lookup of the attr name to ensure the
> > correctness
> > +      of the dabtree.
> > +      If the value is stored in a remote block, this also validates
> > the
> > +      integrity of the remote value block.
> > +
> > +Checking and Cross-Referencing Directories
> > +``````````````````````````````````````````
> > +
> > +The filesystem directory tree is a directed acyclic graph structure,
> > with files
> > +constituting the nodes, and directory entries (dirents) constituting
> > the edges.
> > +Directories are a special type of file containing a set of mappings
> > from a
> > +255-byte sequence (name) to an inumber.
> > +These are called directory entries, or dirents for short.
> > +Each directory file must have exactly one directory pointing to the
> > file.
> > +A root directory points to itself.
> > +Directory entries point to files of any type.
> > +Each non-directory file may have multiple directories point to it.
> > +
> > +In XFS, directories are implemented as a file containing up to three
> > 32GB
> > +partitions.
> > +The first partition contains directory entry data blocks.
> > +Each data block contains variable-sized records associating a user-
> > provided
> > +name with an inumber and, optionally, a file type.
> > +If the directory entry data grows beyond one block, the second
> > partition (which
> > +exists as post-EOF extents) is populated with a block containing
> > free space
> > +information and an index that maps hashes of the dirent names to
> > directory data
> > +blocks in the first partition.
> > +This makes directory name lookups very fast.
> > +If this second partition grows beyond one block, the third partition
> > is
> > +populated with a linear array of free space information for faster
> > +expansions.
> > +If the free space has been separated and the second partition grows
> > again
> > +beyond one block, then a dabtree is used to map hashes of dirent
> > names to
> > +directory data blocks.
> > +
> > +Checking a directory is pretty straightforward:
> > +
> > +1. Walk the dabtree in the second partition (if present) to ensure
> > that there
> > +   are no irregularities in the blocks or dabtree mappings that do
> > not point to
> > +   dirent blocks.
> > +
> > +2. Walk the blocks of the first partition looking for directory
> > entries.
> > +   Each dirent is checked as follows:
> > +
> > +   a. Does the name contain no invalid characters?
> > +
> > +   b. Does the inumber correspond to an actual, allocated inode?
> > +
> > +   c. Does the child inode have a nonzero link count?
> > +
> > +   d. If a file type is included in the dirent, does it match the
> > type of the
> > +      inode?
> > +
> > +   e. If the child is a subdirectory, does the child's dotdot
> > pointer point
> > +      back to the parent?
> > +
> > +   f. If the directory has a second partition, perform a named
> > lookup of the
> > +      dirent name to ensure the correctness of the dabtree.
> > +
> > +3. Walk the free space list in the third partition (if present) to
> > ensure that
> > +   the free spaces it describes are really unused.
> > +
> > +Checking operations involving :ref:`parents <dirparent>` and
> > +:ref:`file link counts <nlinks>` are discussed in more detail in
> > later
> > +sections.
> > +
> > +Checking Directory/Attribute Btrees
> > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > +
> > +As stated in previous sections, the directory/attribute btree
> > (dabtree) index
> > +maps user-provided names to improve lookup times by avoiding linear
> > scans.
> > +Internally, it maps a 32-bit hash of the name to a block offset
> > within the
> > +appropriate file fork.
> > +
> > +The internal structure of a dabtree closely resembles the btrees
> > that record
> > +fixed-size metadata records -- each dabtree block contains a magic
> > number, a
> > +checksum, sibling pointers, a UUID, a tree level, and a log sequence
> > number.
> > +The format of leaf and node records are the same -- each entry
> > points to the
> > +next level down in the hierarchy, with dabtree node records pointing
> > to dabtree
> > +leaf blocks, and dabtree leaf records pointing to non-dabtree blocks
> > elsewhere
> > +in the fork.
> > +
> > +Checking and cross-referencing the dabtree is very similar to what
> > is done for
> > +space btrees:
> > +
> > +- Does the type of data stored in the block match what scrub is
> > expecting?
> > +
> > +- Does the block belong to the owning structure that asked for the
> > read?
> > +
> > +- Do the records fit within the block?
> > +
> > +- Are the records contained inside the block free of obvious
> > corruptions?
> > +
> > +- Are the name hashes in the correct order?
> > +
> > +- Do node pointers within the dabtree point to valid fork offsets
> > for dabtree
> > +  blocks?
> > +
> > +- Do leaf pointers within the dabtree point to valid fork offsets
> > for directory
> > +  or attr leaf blocks?
> > +
> > +- Do child pointers point towards the leaves?
> > +
> > +- Do sibling pointers point across the same level?
> > +
> > +- For each dabtree node record, does the record key accurately reflect
> > the
> > +  contents of the child dabtree block?
> > +
> > +- For each dabtree leaf record, does the record key accurately reflect
> > the
> > +  contents of the directory or attr block?
> > +
> > +Cross-Referencing Summary Counters
> > +``````````````````````````````````
> > +
> > +XFS maintains three classes of summary counters: available
> > resources, quota
> > +resource usage, and file link counts.
> > +
> > +In theory, the amount of available resources (data blocks, inodes,
> > realtime
> > +extents) can be found by walking the entire filesystem.
> > +This would make for very slow reporting, so a transactional
> > filesystem can
> > +maintain summaries of this information in the superblock.
> > +Cross-referencing these values against the filesystem metadata
> > should be a
> > +simple matter of walking the free space and inode metadata in each
> > AG and the
> > +realtime bitmap, but there are complications that will be discussed
> > in
> > +:ref:`more detail <fscounters>` later.
> > +
> > +:ref:`Quota usage <quotacheck>` and :ref:`file link count <nlinks>`
> > +checking are sufficiently complicated to warrant separate sections.
> > +
> > +Post-Repair Reverification
> > +``````````````````````````
> > +
> > +After performing a repair, the checking code is run a second time to
> > validate
> > +the new structure, and the results of the health assessment are
> > recorded
> > +internally and returned to the calling process.
> > +This step is critical for enabling the system administrator to monitor
> > the status
> > +of the filesystem and the progress of any repairs.
> > +For developers, it is a useful means to judge the efficacy of error
> > detection
> > +and correction in the online and offline checking tools.
> > diff --git a/Documentation/filesystems/xfs-self-describing-
> > metadata.rst b/Documentation/filesystems/xfs-self-describing-
> > metadata.rst
> > index b79dbf36dc94..a10c4ae6955e 100644
> > --- a/Documentation/filesystems/xfs-self-describing-metadata.rst
> > +++ b/Documentation/filesystems/xfs-self-describing-metadata.rst
> > @@ -1,4 +1,5 @@
> >  .. SPDX-License-Identifier: GPL-2.0
> > +.. _xfs_self_describing_metadata:
> >  
> >  ============================
> >  XFS Self Describing Metadata
> > 
> 

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 06/14] xfs: document how online fsck deals with eventual consistency
  2023-01-31  6:11     ` Allison Henderson
@ 2023-02-02 19:55       ` Darrick J. Wong
  2023-02-09  5:41         ` Allison Henderson
  0 siblings, 1 reply; 220+ messages in thread
From: Darrick J. Wong @ 2023-02-02 19:55 UTC (permalink / raw)
  To: Allison Henderson
  Cc: Catherine Hoang, david, willy, linux-xfs, Chandan Babu,
	linux-fsdevel, hch

On Tue, Jan 31, 2023 at 06:11:30AM +0000, Allison Henderson wrote:
> On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Writes to an XFS filesystem employ an eventual consistency update
> > model
> > to break up complex multistep metadata updates into small chained
> > transactions.  This is generally good for performance and scalability
> > because XFS doesn't need to prepare for enormous transactions, but it
> > also means that online fsck must be careful not to attempt a fsck
> > action
> > unless it can be shown that there are no other threads processing a
> > transaction chain.  This part of the design documentation covers the
> > thinking behind the consistency model and how scrub deals with it.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  .../filesystems/xfs-online-fsck-design.rst         |  303
> > ++++++++++++++++++++
> >  1 file changed, 303 insertions(+)
> > 
> > 
> > diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst
> > b/Documentation/filesystems/xfs-online-fsck-design.rst
> > index f45bf97fa9c4..419eb54ee200 100644
> > --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> > +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> > @@ -1443,3 +1443,306 @@ This step is critical for enabling system
> > administrator to monitor the status
> >  of the filesystem and the progress of any repairs.
> >  For developers, it is a useful means to judge the efficacy of error
> > detection
> >  and correction in the online and offline checking tools.
> > +
> > +Eventual Consistency vs. Online Fsck
> > +------------------------------------
> > +
> > +Midway through the development of online scrubbing, the fsstress
> > tests
> > +uncovered a misinteraction between online fsck and compound
> > transaction chains
> > +created by other writer threads that resulted in false reports of
> > metadata
> > +inconsistency.
> > +The root cause of these reports is the eventual consistency model
> > introduced by
> > +the expansion of deferred work items and compound transaction chains
> > when
> > +reverse mapping and reflink were introduced.
> 
> 
> 

Was there supposed to be a comment here?

> > +
> > +Originally, transaction chains were added to XFS to avoid deadlocks
> > when
> > +unmapping space from files.
> > +Deadlock avoidance rules require that AGs only be locked in
> > increasing order,
> > +which makes it impossible (say) to use a single transaction to free
> > a space
> > +extent in AG 7 and then try to free a now superfluous block mapping
> > btree block
> > +in AG 3.
> > +To avoid these kinds of deadlocks, XFS creates Extent Freeing Intent
> > (EFI) log
> > +items to commit to freeing some space in one transaction while
> > deferring the
> > +actual metadata updates to a fresh transaction.
> > +The transaction sequence looks like this:
> > +
> > +1. The first transaction contains a physical update to the file's
> > block mapping
> > +   structures to remove the mapping from the btree blocks.
> > +   It then attaches to the in-memory transaction an action item to
> > schedule
> > +   deferred freeing of space.
> > +   Concretely, each transaction maintains a list of ``struct
> > +   xfs_defer_pending`` objects, each of which maintains a list of
> > ``struct
> > +   xfs_extent_free_item`` objects.
> > +   Returning to the example above, the action item tracks the
> > freeing of both
> > +   the unmapped space from AG 7 and the block mapping btree (BMBT)
> > block from
> > +   AG 3.
> > +   Deferred frees recorded in this manner are committed in the log
> > by creating
> > +   an EFI log item from the ``struct xfs_extent_free_item`` object
> > and
> > +   attaching the log item to the transaction.
> > +   When the log is persisted to disk, the EFI item is written into
> > the ondisk
> > +   transaction record.
> > +   EFIs can list up to 16 extents to free, all sorted in AG order.
> > +
> > +2. The second transaction contains a physical update to the free
> > space btrees
> > +   of AG 3 to release the former BMBT block and a second physical
> > update to the
> > +   free space btrees of AG 7 to release the unmapped file space.
> > +   Observe that the physical updates are resequenced in the
> > correct order
> > +   when possible.
> > +   Attached to the transaction is an extent free done (EFD) log
> > item.
> > +   The EFD contains a pointer to the EFI logged in transaction #1 so
> > that log
> > +   recovery can tell if the EFI needs to be replayed.
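> > +
> > +A sketch of the caller side of this two-transaction chain; the exact
> > +signatures are assumptions, and error handling is abbreviated:
> > +
> > +.. code-block:: c
> > +
> > +       /* Transaction #1: unmap the blocks; the frees are deferred. */
> > +       error = xfs_bunmapi(tp, ip, offset_fsb, len_fsb, 0, 1, &done);
> > +
> > +       /*
> > +        * Finishing the deferred work logs the EFI, rolls to
> > +        * transaction #2, frees the extents in AG order, and logs
> > +        * the EFD.
> > +        */
> > +       error = xfs_defer_finish(&tp);
> > +       error = xfs_trans_commit(tp);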
> > +
> > +If the system goes down after transaction #1 is written back to the
> > filesystem
> > +but before #2 is committed, a scan of the filesystem metadata would
> > show
> > +inconsistent filesystem metadata because there would not appear to
> > be any owner
> > +of the unmapped space.
> > +Happily, log recovery corrects this inconsistency for us -- when
> > recovery finds
> > +an intent log item but does not find a corresponding intent done
> > item, it will
> > +reconstruct the incore state of the intent item and finish it.
> > +In the example above, the log must replay both frees described in
> > the recovered
> > +EFI to complete the recovery phase.
> > +
> > +There are two subtleties to XFS' transaction chaining strategy to
> > consider.
> > +The first is that log items must be added to a transaction in the
> > correct order
> > +to prevent conflicts with principal objects that are not held by the
> > +transaction.
> > +In other words, all per-AG metadata updates for an unmapped block
> > must be
> > +completed before the last update to free the extent, and extents
> > should not
> > +be reallocated until that last update commits to the log.
> > +The second subtlety comes from the fact that AG header buffers are
> > (usually)
> > +released between each transaction in a chain.
> > +This means that other threads can observe an AG in an intermediate
> > state,
> > +but as long as the first subtlety is handled, this should not affect
> > the
> > +correctness of filesystem operations.
> > +Unmounting the filesystem flushes all pending work to disk, which
> > means that
> > +offline fsck never sees the temporary inconsistencies caused by
> > deferred work
> > +item processing.
> > +In this manner, XFS employs a form of eventual consistency to avoid
> > deadlocks
> > +and increase parallelism.
> > +
> > +During the design phase of the reverse mapping and reflink features,
> > it was
> > +decided that it was impractical to cram all the reverse mapping
> > updates for a
> > +single filesystem change into a single transaction because a single
> > file
> > +mapping operation can explode into many small updates:
> > +
> > +* The block mapping update itself
> > +* A reverse mapping update for the block mapping update
> > +* Fixing the freelist
> > +* A reverse mapping update for the freelist fix
> > +
> > +* A shape change to the block mapping btree
> > +* A reverse mapping update for the btree update
> > +* Fixing the freelist (again)
> > +* A reverse mapping update for the freelist fix
> > +
> > +* An update to the reference counting information
> > +* A reverse mapping update for the refcount update
> > +* Fixing the freelist (a third time)
> > +* A reverse mapping update for the freelist fix
> > +
> > +* Freeing any space that was unmapped and not owned by any other
> > file
> > +* Fixing the freelist (a fourth time)
> > +* A reverse mapping update for the freelist fix
> > +
> > +* Freeing the space used by the block mapping btree
> > +* Fixing the freelist (a fifth time)
> > +* A reverse mapping update for the freelist fix
> > +
> > +Free list fixups are not usually needed more than once per AG per
> > transaction
> > +chain, but it is theoretically possible if space is very tight.
> > +For copy-on-write updates this is even worse, because this must be
> > done once to
> > +remove the space from a staging area and again to map it into the
> > file!
> > +
> > +To deal with this explosion in a calm manner, XFS expands its use of
> > deferred
> > +work items to cover most reverse mapping updates and all refcount
> > updates.
> > +This reduces the worst case size of transaction reservations by
> > breaking the
> > +work into a long chain of small updates, which increases the degree
> > of eventual
> > +consistency in the system.
> > +Again, this generally isn't a problem because XFS orders its
> > deferred work
> > +items carefully to avoid resource reuse conflicts between
> > unsuspecting threads.
> > +
> > +However, online fsck changes the rules -- remember that although
> > physical
> > +updates to per-AG structures are coordinated by locking the buffers
> > for AG
> > +headers, buffer locks are dropped between transactions.
> > +Once scrub acquires resources and takes locks for a data structure,
> > it must do
> > +all the validation work without releasing the lock.
> > +If the main lock for a space btree is an AG header buffer lock,
> > scrub may have
> > +interrupted another thread that is midway through finishing a chain.
> > +For example, if a thread performing a copy-on-write has completed a
> > reverse
> > +mapping update but not the corresponding refcount update, the two AG
> > btrees
> > +will appear inconsistent to scrub and an observation of corruption
> > will be
> > +recorded.  This observation will not be correct.
> > +If a repair is attempted in this state, the results will be
> > catastrophic!
> > +
> > +Several solutions to this problem were evaluated upon discovery of
> > this flaw:
> 
> 
> Hmm, so while having a really in-depth EFI example is insightful, I
> wonder if it would be more organized to put it in a separate document
> somewhere and just reference it.  As far as ofsck is concerned, I think
> a lighter summary would do:
> 
> 
> "Complex operations that modify multiple AGs are performed through a
> series of transactions which are logged to a journal that an offline
> fsck can either replay or discard.  Online fsck however, must be able
> to deal with these operations while they are still in progress.  This
> presents a unique challenge for ofsck since a partially completed
> transaction chain may present the appearance of inconsistencies, even
> though the operations are functioning as intended. (For a more detailed
> example, see <cite document here...>)  
> 
> The challenge then becomes how to avoid incorrectly repairing these
> non-issues as doing so would cause more harm than help."

I agree that this topic needs a much shorter introduction before moving
on to the gory details.  How does this strike you?

"Complex operations can make modifications to multiple per-AG data
structures with a chain of transactions.  These chains, once committed
to the log, are restarted during log recovery if the system crashes
while processing the chain.  Because the AG header buffers are unlocked
between transactions within a chain, online checking must coordinate
with chained operations that are in progress to avoid incorrectly
detecting inconsistencies due to pending chains.  Furthermore, online
repair must not run when operations are pending because the metadata are
temporarily inconsistent with each other, and rebuilding is not
possible."

"Only online fsck has this requirement of total consistency of AG
metadata, and should be relatively rare as compared to filesystem change
operations.  Online fsck coordinates with transaction chains as follows:

* "For each AG, maintain a count of intent items targetting that AG.
  The count should be bumped whenever a new item is added to the chain.
  The count should be dropped when the filesystem has locked the AG
  header buffers and finished the work.

* "When online fsck wants to examine an AG, it should lock the AG header
  buffers to quiesce all transaction chains that want to modify that AG.
  If the count is zero, proceed with the checking operation.  If it is
  nonzero, cycle the buffer locks to allow the chain to make forward
  progress.

"This may lead to online fsck taking a long time to complete, but
regular filesystem updates take precedence over background checking
activity.  Details about the discovery of this situation are presented
in the <next section>, and details about the solution are presented
<after that>."

These gory details of how I recognized the problem are a subsection of
the main heading, and anyone who wants to know them can read it.
Readers who'd rather move on to the solution can jump directly to the
"Intent Drains" section.  The <bracketed> text are hyperlinks.

> > +
> > +1. Add a higher level lock to allocation groups and require writer
> > threads to
> > +   acquire the higher level lock in AG order before making any
> > changes.
> > +   This would be very difficult to implement in practice because it
> > is
> > +   difficult to determine which locks need to be obtained, and in
> > what order,
> > +   without simulating the entire operation.
> > +   Performing a dry run of a file operation to discover necessary
> > locks would
> > +   make the filesystem very slow.
> > +
> > +2. Make the deferred work coordinator code aware of consecutive
> > intent items
> > +   targeting the same AG and have it hold the AG header buffers
> > locked across
> > +   the transaction roll between updates.
> > +   This would introduce a lot of complexity into the coordinator
> > since it is
> > +   only loosely coupled with the actual deferred work items.
> > +   It would also fail to solve the problem because deferred work
> > items can
> > +   generate new deferred subtasks, but all subtasks must be complete
> > before
> > +   work can start on a new sibling task.
> Hmm, that one doesn't seem like it's really an option then :-(

Right.  Now that this section has become its own gory details
subsection, the sentence preceding the numbered list becomes:

"Several other solutions to this problem were evaluated upon discovery
of this flaw and rejected:"

> > +
> > +3. Teach online fsck to walk all transactions waiting for whichever
> > lock(s)
> > +   protect the data structure being scrubbed to look for pending
> > operations.
> > +   The checking and repair operations must factor these pending
> > operations into
> > +   the evaluations being performed.
> > +   This solution is a nonstarter because it is *extremely* invasive
> > to the main
> > +   filesystem.
> > +
> > +4. Recognize that only online fsck has this requirement of total
> > consistency
> > +   of AG metadata, and that online fsck should be relatively rare as
> > compared
> > +   to filesystem change operations.
> > +   For each AG, maintain a count of intent items targeting that AG.
> > +   When online fsck wants to examine an AG, it should lock the AG
> > header
> > +   buffers to quiesce all transaction chains that want to modify
> > that AG, and
> > +   only proceed with the scrub if the count is zero.
> > +   In other words, scrub only proceeds if it can lock the AG header
> > buffers and
> > +   there can't possibly be any intents in progress.
> > +   This may lead to fairness and starvation issues, but regular
> > filesystem
> > +   updates take precedence over online fsck activity.
> So basically it sounds like 4 is the only reasonable option?

Yes.

> If the discussion concerning the other options have died down, I would
> clean them out.

That's just the problem -- I've sent this and the code patches to the
list several times now, and mostly haven't heard any solid replies.  So
it's a bit premature to take it out, and again it might be useful to
capture the roads not taken.

> They're great for brainstorming and invitations for
> collaboration, but ideally the goal of any of that should be to narrow
> down an agreed upon plan of action.  And the goal of your document
> should make clear what that plan is.  So if no one has any objections
> by now, maybe just tie it right into the last line:
> 
> "The challenge then becomes how to avoid incorrectly repairing these
> non-issues as doing so would cause more harm than help. 
> Fortunately only online fsck has this requirement of total
> consistency..."

> > +
> > +Intent Drains
> > +`````````````
> > +
> > +The fourth solution is implemented in the current iteration of
> This solution is implemented...

"Online fsck uses an atomic intent item counter and lock cycling to
coordinate with transaction chains.  There are two key properties to the
drain mechanism..."

> > online fsck,
> > +with atomic_t providing the active intent counter.
> > +
> > +There are two key properties to the drain mechanism.
> > +First, the counter is incremented when a deferred work item is
> > *queued* to a
> > +transaction, and it is decremented after the associated intent done
> > log item is
> > +*committed* to another transaction.
> > +The second property is that deferred work can be added to a
> > transaction without
> > +holding an AG header lock, but per-AG work items cannot be marked
> > done without
> > +locking that AG header buffer to log the physical updates and the
> > intent done
> > +log item.
> > +The first property enables scrub to yield to running transaction
> > chains, which
> > +is an explicit deprioritization of online fsck to benefit file
> > operations.
> > +The second property of the drain is key to the correct coordination
> > of scrub,
> > +since scrub will always be able to decide if a conflict is possible.
> > +
> > +For regular filesystem code, the drain works as follows:
> > +
> > +1. Call the appropriate subsystem function to add a deferred work
> > item to a
> > +   transaction.
> > +
> > +2. The function calls ``xfs_drain_bump`` to increase the counter.
> > +
> > +3. When the deferred item manager wants to finish the deferred work
> > item, it
> > +   calls ``->finish_item`` to complete it.
> > +
> > +4. The ``->finish_item`` implementation logs some changes and calls
> > +   ``xfs_drain_drop`` to decrease the sloppy counter and wake up any
> > threads
> > +   waiting on the drain.
> > +
> > +5. The subtransaction commits, which unlocks the resource associated
> > with the
> > +   intent item.
> > +
> > +For scrub, the drain works as follows:
> > +
> > +1. Lock the resource(s) associated with the metadata being scrubbed.
> > +   For example, a scan of the refcount btree would lock the AGI and
> > AGF header
> > +   buffers.
> > +
> > +2. If the counter is zero (``xfs_drain_busy`` returns false), there
> > are no
> > +   chains in progress and the operation may proceed.
> > +
> > +3. Otherwise, release the resources grabbed in step 1.
> > +
> > +4. Wait for the intent counter to reach zero
> > (``xfs_drain_intents``), then go
> > +   back to step 1 unless a signal has been caught.
> > +
> > +To avoid polling in step 4, the drain provides a waitqueue for scrub
> > threads to
> > +be woken up whenever the intent count drops to zero.
> I think all that makes sense

Good! :)
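
To make steps 1-4 of the scrub-side algorithm above concrete, a rough sketch
of the retry loop might look like the following.  Only ``xfs_drain_busy`` and
``xfs_drain_intents`` come from the quoted text; the xchk_example_* helpers
and the way the drain is looked up are assumptions of this sketch:

.. code-block:: c

    int xchk_example_lock_ag(struct example_scrub_ctx *sc)
    {
            int     error;

            do {
                    /* Step 1: lock the AGI and AGF header buffers. */
                    error = xchk_example_grab_ag_headers(sc);
                    if (error)
                            return error;

                    /* Step 2: no chains in flight?  Start checking. */
                    if (!xfs_drain_busy(xchk_example_ag_drain(sc)))
                            return 0;

                    /* Step 3: back off so the chain can make progress. */
                    xchk_example_release_ag_headers(sc);

                    /* Step 4: sleep until the intent count reaches zero. */
                    error = xfs_drain_intents(xchk_example_ag_drain(sc));
            } while (!error);

            return error;   /* e.g. interrupted by a fatal signal */
    }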

> > +
> > +The proposed patchset is the
> > +`scrub intent drain series
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > log/?h=scrub-drain-intents>`_.
> > +
> > +.. _jump_labels:
> > +
> > +Static Keys (aka Jump Label Patching)
> > +`````````````````````````````````````
> > +
> > +Online fsck for XFS separates the regular filesystem from the
> > checking and
> > +repair code as much as possible.
> > +However, there are a few parts of online fsck (such as the intent
> > drains, and
> > +later, live update hooks) where it is useful for the online fsck
> > code to know
> > +what's going on in the rest of the filesystem.
> > +Since it is not expected that online fsck will be constantly running
> > in the
> > +background, it is very important to minimize the runtime overhead
> > imposed by
> > +these hooks when online fsck is compiled into the kernel but not
> > actively
> > +running on behalf of userspace.
> > +Taking locks in the hot path of a writer thread to access a data
> > structure only
> > +to find that no further action is necessary is expensive -- on the
> > author's
> > +computer, this has an overhead of 40-50ns per access.
> > +Fortunately, the kernel supports dynamic code patching, which
> > enables XFS to
> > +replace a static branch to hook code with ``nop`` sleds when online
> > fsck isn't
> > +running.
> > +This sled has an overhead of however long it takes the instruction
> > decoder to
> > +skip past the sled, which seems to be on the order of less than 1ns
> > and
> > +does not access memory outside of instruction fetching.
> > +
> > +When online fsck enables the static key, the sled is replaced with
> > an
> > +unconditional branch to call the hook code.
> > +The switchover is quite expensive (~22000ns) but is paid entirely by
> > the
> > +program that invoked online fsck, and can be amortized if multiple
> > threads
> > +enter online fsck at the same time, or if multiple filesystems are
> > being
> > +checked at the same time.
> > +Changing the branch direction requires taking the CPU hotplug lock,
> > and since
> > +CPU initialization requires memory allocation, online fsck must be
> > careful not
> > +to change a static key while holding any locks or resources that
> > could be
> > +accessed in the memory reclaim paths.
> > +To minimize contention on the CPU hotplug lock, care should be taken
> > not to
> > +enable or disable static keys unnecessarily.
> > +
> > +Because static keys are intended to minimize hook overhead for
> > regular
> > +filesystem operations when xfs_scrub is not running, the intended
> > usage
> > +patterns are as follows:
> > +
> > +- The hooked part of XFS should declare a static-scoped static key
> > that
> > +  defaults to false.
> > +  The ``DEFINE_STATIC_KEY_FALSE`` macro takes care of this.
> > +  The static key itself should be declared as a ``static`` variable.
> > +
> > +- When deciding to invoke code that's only used by scrub, the
> > regular
> > +  filesystem should call the ``static_branch_unlikely`` predicate to
> > avoid the
> > +  scrub-only hook code if the static key is not enabled.
> > +
> > +- The regular filesystem should export helper functions that call
> > +  ``static_branch_inc`` to enable and ``static_branch_dec`` to
> > disable the
> > +  static key.
> > +  Wrapper functions make it easy to compile out the relevant code if
> > the kernel
> > +  distributor turns off online fsck at build time.
> > +
> > +- Scrub functions wanting to turn on scrub-only XFS functionality
> > should call
> > +  ``xchk_fshooks_enable`` from the setup function to enable a
> > specific
> > +  hook.
> > +  This must be done before obtaining any resources that are used by
> > memory
> > +  reclaim.
> > +  Callers had better be sure they really need the functionality
> > gated by the
> > +  static key; the ``TRY_HARDER`` flag is useful here.
> > +
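A minimal sketch of the usage pattern in the list above, using only the
generic kernel jump label API; the hook function, its argument type, and the
xfs_example_* names are made up for illustration:

.. code-block:: c

    #include <linux/jump_label.h>

    void xfs_example_hook_slowpath(const struct xfs_example_update *p);

    /* Defaults to false, so the hot path compiles to a nop sled. */
    static DEFINE_STATIC_KEY_FALSE(xfs_example_hooks_switch);

    /* Regular filesystem hot path: skip the hook unless scrub armed it. */
    static inline void xfs_example_hook(const struct xfs_example_update *p)
    {
            if (static_branch_unlikely(&xfs_example_hooks_switch))
                    xfs_example_hook_slowpath(p);
    }

    /* Wrappers exported to scrub; easy to compile out with the feature. */
    void xfs_example_hooks_enable(void)
    {
            static_branch_inc(&xfs_example_hooks_switch);
    }

    void xfs_example_hooks_disable(void)
    {
            static_branch_dec(&xfs_example_hooks_switch);
    }
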
> > +Online scrub has resource acquisition helpers (e.g.
> > ``xchk_perag_lock``) to
> > +handle locking AGI and AGF buffers for all scrubber functions.
> > +If it detects a conflict between scrub and the running transactions,
> > it will
> > +try to wait for intents to complete.
> > +If the caller of the helper has not enabled the static key, the
> > helper will
> > +return -EDEADLOCK, which should result in the scrub being restarted
> > with the
> > +``TRY_HARDER`` flag set.
> > +The scrub setup function should detect that flag, enable the static
> > key, and
> > +try the scrub again.
> > +Scrub teardown disables all static keys obtained by
> > ``xchk_fshooks_enable``.
> 
> Ok, this part here seems pretty well documented.  Organizing nits aside
> I think it looks good.

Thanks for digging into all of this!

--D

> Allison
> 
> > +
> > +For more information, please see the kernel documentation of
> > +Documentation/staging/static-keys.rst.
> > 
> 

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 07/14] xfs: document pageable kernel memory
  2023-02-02  7:14     ` Allison Henderson
@ 2023-02-02 23:14       ` Darrick J. Wong
  2023-02-09  5:41         ` Allison Henderson
  0 siblings, 1 reply; 220+ messages in thread
From: Darrick J. Wong @ 2023-02-02 23:14 UTC (permalink / raw)
  To: Allison Henderson
  Cc: Catherine Hoang, david, willy, linux-xfs, Chandan Babu,
	linux-fsdevel, hch

On Thu, Feb 02, 2023 at 07:14:22AM +0000, Allison Henderson wrote:
> On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Add a discussion of pageable kernel memory, since online fsck needs
> > quite a bit more memory than most other parts of the filesystem to
> > stage
> > records and other information.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  .../filesystems/xfs-online-fsck-design.rst         |  490
> > ++++++++++++++++++++
> >  1 file changed, 490 insertions(+)
> > 
> > 
> > diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst
> > b/Documentation/filesystems/xfs-online-fsck-design.rst
> > index 419eb54ee200..9d7a2ef1d0dd 100644
> > --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> > +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> > @@ -383,6 +383,8 @@ Algorithms") of Srinivasan.
> >  However, any data structure builder that maintains a resource lock
> > for the
> >  duration of the repair is *always* an offline algorithm.
> >  
> > +.. _secondary_metadata:
> > +
> >  Secondary Metadata
> >  ``````````````````
> >  
> > @@ -1746,3 +1748,491 @@ Scrub teardown disables all static keys
> > obtained by ``xchk_fshooks_enable``.
> >  
> >  For more information, please see the kernel documentation of
> >  Documentation/staging/static-keys.rst.
> > +
> > +.. _xfile:
> > +
> > +Pageable Kernel Memory
> > +----------------------
> > +
> > +Demonstrations of the first few prototypes of online repair revealed
> > new
> > +technical requirements that were not originally identified.
> > +For the first demonstration, the code walked whatever filesystem
> > +metadata it needed to synthesize new records and inserted records
> > into a new
> > +btree as it found them.
> > +This was subpar since any additional corruption or runtime errors
> > encountered
> > +during the walk would shut down the filesystem.
> > +After remount, the blocks containing the half-rebuilt data structure
> > would not
> > +be accessible until another repair was attempted.
> > +Solving the problem of half-rebuilt data structures will be
> > discussed in the
> > +next section.
> > +
> > +For the second demonstration, the synthesized records were instead
> > stored in
> > +kernel slab memory.
> > +Doing so enabled online repair to abort without writing to the
> > filesystem if
> > +the metadata walk failed, which prevented online fsck from making
> > things worse.
> > +However, even this approach needed improving upon.
> > +
> > +There are four reasons why traditional Linux kernel memory
> > management isn't
> > +suitable for storing large datasets:
> > +
> > +1. Although it is tempting to allocate a contiguous block of memory
> > to create a
> > +   C array, this cannot easily be done in the kernel because it
> > cannot be
> > +   relied upon to allocate multiple contiguous memory pages.
> > +
> > +2. While disparate physical pages can be virtually mapped together,
> > installed
> > +   memory might still not be large enough to stage the entire record
> > set in
> > +   memory while constructing a new btree.
> > +
> > +3. To overcome these two difficulties, the implementation was
> > adjusted to use
> > +   doubly linked lists, which means every record object needed two
> > 64-bit list
> > +   head pointers, which is a lot of overhead.
> > +
> > +4. Kernel memory is pinned, which can drive the system out of
> > memory, leading
> > +   to OOM kills of unrelated processes.
> > +
> I think I might just jump to whatever the current plan is
> instead of trying to keep a record of the dev history in the document.
> I'm sure we're not done yet, dev really never is, so in order for the
> documentation to be maintained, it would just get bigger and bigger to
> keep documenting it this way.  It's not that the above isn't valuable,
> but maybe a different kind of document really.

OK, I've shortened this introduction to outline the requirements, and
trimmed the historical information to a sidebar:

"Some online checking functions work by scanning the filesystem to build
a shadow copy of an ondisk metadata structure in memory and comparing
the two copies. For online repair to rebuild a metadata structure, it
must compute the record set that will be stored in the new structure
before it can persist that new structure to disk. Ideally, repairs
complete with a single atomic commit that introduces a new data
structure. To meet these goals, the kernel needs to collect a large
amount of information in a place that doesn’t require the correct
operation of the filesystem.

"Kernel memory isn’t suitable because:

*   Allocating a contiguous region of memory to create a C array is very
    difficult, especially on 32-bit systems.

*   Linked lists of records introduce double pointer overhead which is
    very high and eliminate the possibility of indexed lookups.

*   Kernel memory is pinned, which can drive the system into OOM
    conditions.

*   The system might not have sufficient memory to stage all the
    information.

"At any given time, online fsck does not need to keep the entire record
set in memory, which means that individual records can be paged out if
necessary. Continued development of online fsck demonstrated that the
ability to perform indexed data storage would also be very useful.
Fortunately, the Linux kernel already has a facility for
byte-addressable and pageable storage: tmpfs. In-kernel graphics drivers
(most notably i915) take advantage of tmpfs files to store intermediate
data that doesn’t need to be in memory at all times, so that usage
precedent is already established. Hence, the xfile was born!

Historical Sidebar
------------------

"The first edition of online repair inserted records into a new btree as
it found them, which failed because the filesystem could shut down with a
partially built data structure that would be live after recovery finished.

"The second edition solved the half-rebuilt structure problem by storing
everything in memory, but frequently ran the system out of memory.

"The third edition solved the OOM problem by using linked lists, but the
list overhead was extreme."

> 
> 
> > +For the third iteration, attention swung back to the possibility of
> > using
> 
> Due to the large volume of metadata that needs to be processed, ofsck
> uses...
> 
> > +byte-indexed array-like storage to reduce the overhead of in-memory
> > records.
> > +At any given time, online repair does not need to keep the entire
> > record set in
> > +memory, which means that individual records can be paged out.
> > +Creating new temporary files in the XFS filesystem to store
> > intermediate data
> > +was explored and rejected for some types of repairs because a
> > filesystem with
> > +compromised space and inode metadata should never be used to fix
> > compromised
> > +space or inode metadata.
> > +However, the kernel already has a facility for byte-addressable and
> > pageable
> > +storage: shmfs.
> > +In-kernel graphics drivers (most notably i915) take advantage of
> > shmfs files
> > +to store intermediate data that doesn't need to be in memory at all
> > times, so
> > +that usage precedent is already established.
> > +Hence, the ``xfile`` was born!
> > +
> > +xfile Access Models
> > +```````````````````
> > +
> > +A survey of the intended uses of xfiles suggested these use cases:
> > +
> > +1. Arrays of fixed-sized records (space management btrees, directory
> > and
> > +   extended attribute entries)
> > +
> > +2. Sparse arrays of fixed-sized records (quotas and link counts)
> > +
> > +3. Large binary objects (BLOBs) of variable sizes (directory and
> > extended
> > +   attribute names and values)
> > +
> > +4. Staging btrees in memory (reverse mapping btrees)
> > +
> > +5. Arbitrary contents (realtime space management)
> > +
> > +To support the first four use cases, high level data structures wrap
> > the xfile
> > +to share functionality between online fsck functions.
> > +The rest of this section discusses the interfaces that the xfile
> > presents to
> > +four of those five higher level data structures.
> > +The fifth use case is discussed in the :ref:`realtime summary
> > <rtsummary>` case
> > +study.
> > +
> > +The most general storage interface supported by the xfile enables
> > the reading
> > +and writing of arbitrary quantities of data at arbitrary offsets in
> > the xfile.
> > +This capability is provided by ``xfile_pread`` and ``xfile_pwrite``
> > functions,
> > +which behave similarly to their userspace counterparts.
> > +XFS is very record-based, which suggests that the ability to load
> > and store
> > +complete records is important.
> > +To support these cases, a pair of ``xfile_obj_load`` and
> > ``xfile_obj_store``
> > +functions are provided to read and persist objects into an xfile.
> > +They are internally the same as pread and pwrite, except that they
> > treat any
> > +error as an out of memory error.
> > +For online repair, squashing error conditions in this manner is an
> > acceptable
> > +behavior because the only reaction is to abort the operation back to
> > userspace.
> > +All five xfile use cases can be serviced by these four functions.
> > +
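As a small illustration of the record-based interface described above, a
caller might wrap the two object functions like this.  The record layout, the
index arithmetic, and the exact argument order of ``xfile_obj_load`` and
``xfile_obj_store`` are assumptions of this sketch:

.. code-block:: c

    struct example_rec {
            __u64   key;
            __u64   value;
    };

    /* Persist one fixed-size record at a position derived from its index. */
    static int example_store_rec(struct xfile *xf, __u64 idx,
                                 const struct example_rec *rec)
    {
            /* Any failure is treated as ENOMEM and aborts back to userspace. */
            return xfile_obj_store(xf, rec, sizeof(*rec), idx * sizeof(*rec));
    }

    /* Read the record back; a never-written slot reads back as zeroes. */
    static int example_load_rec(struct xfile *xf, __u64 idx,
                                struct example_rec *rec)
    {
            return xfile_obj_load(xf, rec, sizeof(*rec), idx * sizeof(*rec));
    }
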
> > +However, no discussion of file access idioms is complete without
> > answering the
> > +question, "But what about mmap?"
> I actually wouldn't spend too much time discussing solutions that
> didn't work for whatever reason, unless someone's really asking for it.
>  I think this section would read just fine to trim off the last
> paragraph here

Since I wrote this, I've been experimenting with wiring up the tmpfs
file page cache folios to the xfs buffer cache.  Pinning the folios in
this manner makes it so that online fsck can (more or less) directly
access the xfile contents.  Much to my surprise, this has actually held
up in testing, so ... it's no longer a solution that "didn't really
work". :)

I also need to s/page/folio/ now that willy has finished that
conversion.  This section has been rewritten as such:

"However, no discussion of file access idioms is complete without
answering the question, “But what about mmap?” It is convenient to
access storage directly with pointers, just like userspace code does
with regular memory. Online fsck must not drive the system into OOM
conditions, which means that xfiles must be responsive to memory
reclamation. tmpfs can only push a pagecache folio to the swap cache if
the folio is neither pinned nor locked, which means the xfile must not
pin too many folios.

"Short term direct access to xfile contents is done by locking the
pagecache folio and mapping it into kernel address space. Programmatic
access (e.g. pread and pwrite) uses this mechanism. Folio locks are not
supposed to be held for long periods of time, so long term direct access
to xfile contents is done by bumping the folio refcount, mapping it into
kernel address space, and dropping the folio lock. These long term users
must be responsive to memory reclaim by hooking into the shrinker
infrastructure to know when to release folios.

"The xfile_get_page and xfile_put_page functions are provided to
retrieve the (locked) folio that backs part of an xfile and to release
it. The only users of these folio lease functions are the xfarray
sorting algorithms and the in-memory btrees."

> > +It would be *much* more convenient if kernel code could access
> > pageable kernel
> > +memory with pointers, just like userspace code does with regular
> > memory.
> > +Like any other filesystem that uses the page cache, reads and writes
> > of xfile
> > +data lock the cache page and map it into the kernel address space
> > for the
> > +duration of the operation.
> > +Unfortunately, shmfs can only write a file page to the swap device
> > if the page
> > +is unmapped and unlocked, which means the xfile risks causing OOM
> > problems
> > +unless it is careful not to pin too many pages.
> > +Therefore, the xfile steers most of its users towards programmatic
> > access so
> > +that backing pages are not kept locked in memory for longer than is
> > necessary.
> > +However, for callers performing quick linear scans of xfile data,
> > +``xfile_get_page`` and ``xfile_put_page`` functions are provided to
> > pin a page
> > +in memory.
> > +So far, the only code to use these functions are the xfarray
> > :ref:`sorting
> > +<xfarray_sort>` algorithms.
> > +
> > +xfile Access Coordination
> > +`````````````````````````
> > +
> > +For security reasons, xfiles must be owned privately by the kernel.
> > +They are marked ``S_PRIVATE`` to prevent interference from the
> > security system,
> > +must never be mapped into process file descriptor tables, and their
> > pages must
> > +never be mapped into userspace processes.
> > +
> > +To avoid locking recursion issues with the VFS, all accesses to the
> > shmfs file
> > +are performed by manipulating the page cache directly.
> > +xfile writes call the ``->write_begin`` and ``->write_end``
> > functions of the
> > +xfile's address space to grab writable pages, copy the caller's
> > buffer into the
> > +page, and release the pages.
> > +xfile reads call ``shmem_read_mapping_page_gfp`` to grab pages
> xfile readers

OK.

> > directly before
> > +copying the contents into the caller's buffer.
> > +In other words, xfiles ignore the VFS read and write code paths to
> > avoid
> > +having to create a dummy ``struct kiocb`` and to avoid taking inode
> > and
> > +freeze locks.
> > +
> > +If an xfile is shared between threads to stage repairs, the caller
> > must provide
> > +its own locks to coordinate access.
> Ofsck threads that share an xfile between stage repairs will use their
> own locks to coordinate access with each other.
> 
> ?

Hm.  I wonder if there's a misunderstanding here?

Online fsck functions themselves are single-threaded, which is to say
that they themselves neither queue workers nor start kthreads.  However,
an xfile created by a running fsck function can be accessed from other
threads if the fsck function also hooks itself into filesystem code.

The live update section has a nice diagram of how that works:
https://djwong.org/docs/xfs-online-fsck-design/#filesystem-hooks

> > +
> > +.. _xfarray:
> > +
> > +Arrays of Fixed-Sized Records
> > +`````````````````````````````
> > +
> > +In XFS, each type of indexed space metadata (free space, inodes,
> > reference
> > +counts, file fork space, and reverse mappings) consists of a set of
> > fixed-size
> > +records indexed with a classic B+ tree.
> > +Directories have a set of fixed-size dirent records that point to
> > the names,
> > +and extended attributes have a set of fixed-size attribute keys that
> > point to
> > +names and values.
> > +Quota counters and file link counters index records with numbers.
> > +During a repair, scrub needs to stage new records during the
> > gathering step and
> > +retrieve them during the btree building step.
> > +
> > +Although this requirement can be satisfied by calling the read and
> > write
> > +methods of the xfile directly, it is simpler for callers for there
> > to be a
> > +higher level abstraction to take care of computing array offsets, to
> > provide
> > +iterator functions, and to deal with sparse records and sorting.
> > +The ``xfarray`` abstraction presents a linear array for fixed-size
> > records atop
> > +the byte-accessible xfile.
> > +
> > +.. _xfarray_access_patterns:
> > +
> > +Array Access Patterns
> > +^^^^^^^^^^^^^^^^^^^^^
> > +
> > +Array access patterns in online fsck tend to fall into three
> > categories.
> > +Iteration of records is assumed to be necessary for all cases and
> > will be
> > +covered in the next section.
> > +
> > +The first type of caller handles records that are indexed by
> > position.
> > +Gaps may exist between records, and a record may be updated multiple
> > times
> > +during the collection step.
> > +In other words, these callers want a sparse linearly addressed table
> > file.
> > +Typical use cases are quota records or file link count records.
> > +Access to array elements is performed programmatically via
> > ``xfarray_load`` and
> > +``xfarray_store`` functions, which wrap the similarly-named xfile
> > functions to
> > +provide loading and storing of array elements at arbitrary array
> > indices.
> > +Gaps are defined to be null records, and null records are defined to
> > be a
> > +sequence of all zero bytes.
> > +Null records are detected by calling ``xfarray_element_is_null``.
> > +They are created either by calling ``xfarray_unset`` to null out an
> > existing
> > +record or by never storing anything to an array index.
> > +
> > +The second type of caller handles records that are not indexed by
> > position
> > +and do not require multiple updates to a record.
> > +The typical use case here is rebuilding space btrees and key/value
> > btrees.
> > +These callers can add records to the array without caring about
> > array indices
> > +via the ``xfarray_append`` function, which stores a record at the
> > end of the
> > +array.
> > +For callers that require records to be presentable in a specific
> > order (e.g.
> > +rebuilding btree data), the ``xfarray_sort`` function can arrange
> > the sorted
> > +records; this function will be covered later.
> > +
> > +The third type of caller is a bag, which is useful for counting
> > records.
> > +The typical use case here is constructing space extent reference
> > counts from
> > +reverse mapping information.
> > +Records can be put in the bag in any order, they can be removed from
> > the bag
> > +at any time, and uniqueness of records is left to callers.
> > +The ``xfarray_store_anywhere`` function is used to insert a record
> > in any
> > +null record slot in the bag; and the ``xfarray_unset`` function
> > removes a
> > +record from the bag.
> > +
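To illustrate the first two access patterns above, a sketch of a link count
table and a btree staging array might look like this.  The record layouts and
the exact signatures of the xfarray functions are assumptions for
illustration:

.. code-block:: c

    struct example_nlink_rec {
            __u32   nlinks;
    };

    struct example_rmap_rec {
            __u64   startblock;
            __u64   blockcount;
            __u64   owner;
    };

    /* Pattern 1: sparse, index-addressed updates, e.g. link counts. */
    static int example_bump_nlink(struct xfarray *counts, __u64 ino)
    {
            struct example_nlink_rec        rec = { };
            int                             error;

            /* A gap reads back as a null (all-zero) record. */
            error = xfarray_load(counts, ino, &rec);
            if (error)
                    return error;
            rec.nlinks++;
            return xfarray_store(counts, ino, &rec);
    }

    /* Pattern 2: order-agnostic staging of records for a btree rebuild. */
    static int example_stage_rmap(struct xfarray *stash,
                                  const struct example_rmap_rec *rec)
    {
            /* Sort the array with xfarray_sort() before bulk loading. */
            return xfarray_append(stash, rec);
    }
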
> > +The proposed patchset is the
> > +`big in-memory array
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > log/?h=big-array>`_.
> > +
> > +Iterating Array Elements
> > +^^^^^^^^^^^^^^^^^^^^^^^^
> > +
> > +Most users of the xfarray require the ability to iterate the records
> > stored in
> > +the array.
> > +Callers can probe every possible array index with the following:
> > +
> > +.. code-block:: c
> > +
> > +       xfarray_idx_t i;
> > +       foreach_xfarray_idx(array, i) {
> > +           xfarray_load(array, i, &rec);
> > +
> > +           /* do something with rec */
> > +       }
> > +
> > +All users of this idiom must be prepared to handle null records or
> > must already
> > +know that there aren't any.
> > +
> > +For xfarray users that want to iterate a sparse array, the
> > ``xfarray_iter``
> > +function ignores indices in the xfarray that have never been written
> > to by
> > +calling ``xfile_seek_data`` (which internally uses ``SEEK_DATA``) to
> > skip areas
> > +of the array that are not populated with memory pages.
> > +Once it finds a page, it will skip the zeroed areas of the page.
> > +
> > +.. code-block:: c
> > +
> > +       xfarray_idx_t i = XFARRAY_CURSOR_INIT;
> > +       while ((ret = xfarray_iter(array, &i, &rec)) == 1) {
> > +           /* do something with rec */
> > +       }
> > +
> > +.. _xfarray_sort:
> > +
> > +Sorting Array Elements
> > +^^^^^^^^^^^^^^^^^^^^^^
> > +
> > +During the fourth demonstration of online repair, a community
> > reviewer remarked
> > +that for performance reasons, online repair ought to load batches of
> > records
> > +into btree record blocks instead of inserting records into a new
> > btree one at a
> > +time.
> > +The btree insertion code in XFS is responsible for maintaining
> > correct ordering
> > +of the records, so naturally the xfarray must also support sorting
> > the record
> > +set prior to bulk loading.
> > +
> > +The sorting algorithm used in the xfarray is actually a combination
> > of adaptive
> > +quicksort and a heapsort subalgorithm in the spirit of
> > +`Sedgewick <https://algs4.cs.princeton.edu/23quicksort/>`_ and
> > +`pdqsort <https://github.com/orlp/pdqsort>`_, with customizations
> > for the Linux
> > +kernel.
> > +To sort records in a reasonably short amount of time, ``xfarray``
> > takes
> > +advantage of the binary subpartitioning offered by quicksort, but it
> > also uses
> > +heapsort to hedge against performance collapse if the chosen
> > quicksort pivots
> > +are poor.
> > +Both algorithms are (in general) O(n * lg(n)), but there is a wide
> > performance
> > +gulf between the two implementations.
> > +
> > +The Linux kernel already contains a reasonably fast implementation
> > of heapsort.
> > +It only operates on regular C arrays, which limits the scope of its
> > usefulness.
> > +There are two key places where the xfarray uses it:
> > +
> > +* Sorting any record subset backed by a single xfile page.
> > +
> > +* Loading a small number of xfarray records from potentially
> > disparate parts
> > +  of the xfarray into a memory buffer, and sorting the buffer.
> > +
> > +In other words, ``xfarray`` uses heapsort to constrain the nested
> > recursion of
> > +quicksort, thereby mitigating quicksort's worst runtime behavior.
> > +
> > +Choosing a quicksort pivot is a tricky business.
> > +A good pivot splits the set to sort in half, leading to the divide
> > and conquer
> > +behavior that is crucial to  O(n * lg(n)) performance.
> > +A poor pivot barely splits the subset at all, leading to O(n\
> > :sup:`2`)
> > +runtime.
> > +The xfarray sort routine tries to avoid picking a bad pivot by
> > sampling nine
> > +records into a memory buffer and using the kernel heapsort to
> > identify the
> > +median of the nine.
> > +
> > +Most modern quicksort implementations employ Tukey's "ninther" to
> > select a
> > +pivot from a classic C array.
> > +Typical ninther implementations pick three unique triads of records,
> > sort each
> > +of the triads, and then sort the middle value of each triad to
> > determine the
> > +ninther value.
> > +As stated previously, however, xfile accesses are not entirely
> > cheap.
> > +It turned out to be much more performant to read the nine elements
> > into a
> > +memory buffer, run the kernel's in-memory heapsort on the buffer,
> > and choose
> > +the 4th element of that buffer as the pivot.
> > +Tukey's ninthers are described in J. W. Tukey, `The ninther, a
> > technique for
> > +low-effort robust (resistant) location in large samples`, in
> > *Contributions to
> > +Survey Sampling and Applied Statistics*, edited by H. David,
> > (Academic Press,
> > +1978), pp. 251–257.
> > +
> > +The partitioning of quicksort is fairly textbook -- rearrange the
> > record
> > +subset around the pivot, then set up the current and next stack
> > frames to
> > +sort with the larger and the smaller halves of the pivot,
> > respectively.
> > +This keeps the stack space requirements to log2(record count).
> > +
> > +As a final performance optimization, the hi and lo scanning phase of
> > quicksort
> > +keeps examined xfile pages mapped in the kernel for as long as
> > possible to
> > +reduce map/unmap cycles.
> > +Surprisingly, this reduces overall sort runtime by nearly half again
> > after
> > +accounting for the application of heapsort directly onto xfile
> > pages.
> This sorting section is insightful, but I think I'd be ok without it
> too.  Or maybe save it for later in the document as an "implementation
> details" section, or something similar.  It seems like there's still a
> lot to cover about how ofsck works in general before we start drilling
> into things like the runtime complexity of the sorting algorithm it
> uses.  

How about I demote the details of how sorting works to a case study?

> > +
> > +Blob Storage
> > +````````````
> > +
> > +Extended attributes and directories add an additional requirement
> > for staging
> > +records: arbitrary byte sequences of finite length.
> > +Each directory entry record needs to store the entry name,
> > +and each extended attribute needs to store both the attribute name
> > and value.
> > +The names, keys, and values can consume a large amount of memory, so
> > the
> > +``xfblob`` abstraction was created to simplify management of these
> > blobs
> > +atop an xfile.
> > +
> > +Blob arrays provide ``xfblob_load`` and ``xfblob_store`` functions
> > to retrieve
> > +and persist objects.
> > +The store function returns a magic cookie for every object that it
> > persists.
> > +Later, callers provide this cookie to ``xfblob_load`` to recall
> > the object.
> > +The ``xfblob_free`` function frees a specific blob, and the
> > ``xfblob_truncate``
> > +function frees them all because compaction is not needed.
> > +
> > +The details of repairing directories and extended attributes will be
> > discussed
> > +in a subsequent section about atomic extent swapping.
> > +However, it should be noted that these repair functions only use
> > blob storage
> > +to cache a small number of entries before adding them to a temporary
> > ondisk
> > +file, which is why compaction is not required.
> > +
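A small sketch of the cookie-based interface described above; the cookie type
name and the argument order are assumptions of this illustration:

.. code-block:: c

    /* Stash a dirent name; the returned cookie is the only handle to it. */
    static int example_stash_name(struct xfblob *blobs, const char *name,
                                  size_t namelen, xfblob_cookie *cookie)
    {
            return xfblob_store(blobs, cookie, name, namelen);
    }

    /* Recall the name into a caller-supplied buffer of the same length. */
    static int example_recall_name(struct xfblob *blobs, xfblob_cookie cookie,
                                   char *name, size_t namelen)
    {
            return xfblob_load(blobs, cookie, name, namelen);
    }
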
> > +The proposed patchset is at the start of the
> > +`extended attribute repair
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > log/?h=repair-xattrs>`_ series.
> > +
> > +.. _xfbtree:
> > +
> > +In-Memory B+Trees
> > +`````````````````
> > +
> > +The chapter about :ref:`secondary metadata<secondary_metadata>`
> > mentioned that
> > +checking and repairing of secondary metadata commonly requires
> > coordination
> > +between a live metadata scan of the filesystem and writer threads
> > that are
> > +updating that metadata.
> > +Keeping the scan data up to date requires the ability to
> > propagate
> > +metadata updates from the filesystem into the data being collected
> > by the scan.
> > +This *can* be done by appending concurrent updates into a separate
> > log file and
> > +applying them before writing the new metadata to disk, but this
> > leads to
> > +unbounded memory consumption if the rest of the system is very busy.
> > +Another option is to skip the side-log and commit live updates from
> > the
> > +filesystem directly into the scan data, which trades more overhead
> > for a lower
> > +maximum memory requirement.
> > +In both cases, the data structure holding the scan results must
> > support indexed
> > +access to perform well.
> > +
> > +Given that indexed lookups of scan data are required for both
> > strategies, online
> > +fsck employs the second strategy of committing live updates directly
> > into
> > +scan data.
> > +Because xfarrays are not indexed and do not enforce record ordering,
> > they
> > +are not suitable for this task.
> > +Conveniently, however, XFS has a library to create and maintain
> > ordered reverse
> > +mapping records: the existing rmap btree code!
> > +If only there was a means to create one in memory.
> > +
> > +Recall that the :ref:`xfile <xfile>` abstraction represents memory
> > pages as a
> > +regular file, which means that the kernel can create byte or block
> > addressable
> > +virtual address spaces at will.
> > +The XFS buffer cache specializes in abstracting IO to block-
> > oriented  address
> > +spaces, which means that adaptation of the buffer cache to interface
> > with
> > +xfiles enables reuse of the entire btree library.
> > +Btrees built atop an xfile are collectively known as ``xfbtrees``.
> > +The next few sections describe how they actually work.
> > +
> > +The proposed patchset is the
> > +`in-memory btree
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > log/?h=in-memory-btrees>`_
> > +series.
> > +
> > +Using xfiles as a Buffer Cache Target
> > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > +
> > +Two modifications are necessary to support xfiles as a buffer cache
> > target.
> > +The first is to make it possible for the ``struct xfs_buftarg``
> > structure to
> > +host the ``struct xfs_buf`` rhashtable, because normally those are
> > held by a
> > +per-AG structure.
> > +The second change is to modify the buffer ``ioapply`` function to
> > "read" cached
> > +pages from the xfile and "write" cached pages back to the xfile.
> > +Multiple access to individual buffers is controlled by the
> > ``xfs_buf`` lock,
> > +since the xfile does not provide any locking on its own.
> > +With this adaptation in place, users of the xfile-backed buffer
> > cache use
> > +exactly the same APIs as users of the disk-backed buffer cache.
> > +The separation between xfile and buffer cache implies higher memory
> > usage since
> > +they do not share pages, but this property could some day enable
> > transactional
> > +updates to an in-memory btree.
> > +Today, however, it simply eliminates the need for new code.
> > +
> > +Space Management with an xfbtree
> > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > +
> > +Space management for an xfile is very simple -- each btree block is
> > one memory
> > +page in size.
> > +These blocks use the same header format as an on-disk btree, but the
> > in-memory
> > +block verifiers ignore the checksums, assuming that xfile memory is
> > no more
> > +corruption-prone than regular DRAM.
> > +Reusing existing code here is more important than absolute memory
> > efficiency.
> > +
> > +The very first block of an xfile backing an xfbtree contains a
> > header block.
> > +The header describes the owner, height, and the block number of the
> > root
> > +xfbtree block.
> > +
> > +To allocate a btree block, use ``xfile_seek_data`` to find a gap in
> > the file.
> > +If there are no gaps, create one by extending the length of the
> > xfile.
> > +Preallocate space for the block with ``xfile_prealloc``, and hand
> > back the
> > +location.
> > +To free an xfbtree block, use ``xfile_discard`` (which internally
> > uses
> > +``FALLOC_FL_PUNCH_HOLE``) to remove the memory page from the xfile.
> > +
> > +Populating an xfbtree
> > +^^^^^^^^^^^^^^^^^^^^^
> > +
> > +An online fsck function that wants to create an xfbtree should
> > proceed as
> > +follows:
> > +
> > +1. Call ``xfile_create`` to create an xfile.
> > +
> > +2. Call ``xfs_alloc_memory_buftarg`` to create a buffer cache target
> > structure
> > +   pointing to the xfile.
> > +
> > +3. Pass the buffer cache target, buffer ops, and other information
> > to
> > +   ``xfbtree_create`` to write an initial tree header and root block
> > to the
> > +   xfile.
> > +   Each btree type should define a wrapper that passes necessary
> > arguments to
> > +   the creation function.
> > +   For example, rmap btrees define ``xfs_rmapbt_mem_create`` to take
> > care of
> > +   all the necessary details for callers.
> > +   A ``struct xfbtree`` object will be returned.
> > +
> > +4. Pass the xfbtree object to the btree cursor creation function for
> > the
> > +   btree type.
> > +   Following the example above, ``xfs_rmapbt_mem_cursor`` takes care
> > of this
> > +   for callers.
> > +
> > +5. Pass the btree cursor to the regular btree functions to make
> > queries against
> > +   and to update the in-memory btree.
> > +   For example, a btree cursor for an rmap xfbtree can be passed to
> > the
> > +   ``xfs_rmap_*`` functions just like any other btree cursor.
> > +   See the :ref:`next section<xfbtree_commit>` for information on
> > dealing with
> > +   xfbtree updates that are logged to a transaction.
> > +
> > +6. When finished, delete the btree cursor, destroy the xfbtree
> > object, free the
> > +   buffer target, and destroy the xfile to release all
> > resources.
> > +
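Putting the six steps above together for an in-memory rmap btree might look
roughly like this.  The function signatures are simplified guesses and the
error unwinding is elided:

.. code-block:: c

    int example_setup_rmap_xfbtree(struct xfs_mount *mp)
    {
            struct xfile            *xf;
            struct xfs_buftarg      *btp;
            struct xfbtree          *xfbt;
            struct xfs_btree_cur    *cur;
            int                     error;

            error = xfile_create(mp, "rmap staging", &xf);          /* 1 */
            if (error)
                    return error;
            error = xfs_alloc_memory_buftarg(mp, xf, &btp);         /* 2 */
            if (error)
                    return error;
            error = xfs_rmapbt_mem_create(mp, btp, &xfbt);          /* 3 */
            if (error)
                    return error;
            cur = xfs_rmapbt_mem_cursor(mp, NULL, xfbt);            /* 4 */

            /* 5: call the usual xfs_rmap_* functions with this cursor. */

            xfs_btree_del_cursor(cur, error);                       /* 6 */
            /* ...then destroy xfbt, free btp, and destroy xf. */
            return error;
    }
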
> > +.. _xfbtree_commit:
> > +
> > +Committing Logged xfbtree Buffers
> > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > +
> > +Although it is a clever hack to reuse the rmap btree code to handle
> > the staging
> > +structure, the ephemeral nature of the in-memory btree block storage
> > presents
> > +some challenges of its own.
> > +The XFS transaction manager must not commit buffer log items for
> > buffers backed
> > +by an xfile because the log format does not understand updates for
> > devices
> > +other than the data device.
> > +An ephemeral xfbtree probably will not exist by the time the AIL
> > checkpoints
> > +log transactions back into the filesystem, and certainly won't exist
> > during
> > +log recovery.
> > +For these reasons, any code updating an xfbtree in transaction
> > context must
> > +remove the buffer log items from the transaction and write the
> > updates into the
> > +backing xfile before committing or cancelling the transaction.
> > +
> > +The ``xfbtree_trans_commit`` and ``xfbtree_trans_cancel`` functions
> > implement
> > +this functionality as follows:
> > +
> > +1. Find each buffer log item whose buffer targets the xfile.
> > +
> > +2. Record the dirty/ordered status of the log item.
> > +
> > +3. Detach the log item from the buffer.
> > +
> > +4. Queue the buffer to a special delwri list.
> > +
> > +5. Clear the transaction dirty flag if the only dirty log items were
> > the ones
> > +   that were detached in step 3.
> > +
> > +6. Submit the delwri list to commit the changes to the xfile, if the
> > updates
> > +   are being committed.
> > +
> > +After removing xfile logged buffers from the transaction in this
> > manner, the
> > +transaction can be committed or cancelled.
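
The six steps above roughly translate into a loop of the following shape.  The
example_* helpers stand in for details this sketch does not try to get right;
only the overall flow mirrors the numbered list:

.. code-block:: c

    void example_xfbtree_trans_commit(struct xfbtree *xfbt,
                                      struct xfs_trans *tp)
    {
            LIST_HEAD(delwri);
            struct example_buf      *bp;

            /* Step 1: walk the buffer log items whose buffers target the xfile. */
            example_for_each_xfile_buffer(tp, xfbt, bp) {
                    /* Step 2: record the dirty/ordered state of the log item. */
                    bool    dirty = example_buf_item_dirty(bp);

                    /* Step 3: detach the log item from the buffer. */
                    example_detach_buf_item(tp, bp, dirty);

                    /* Step 4: queue the buffer for writeback to the xfile. */
                    example_delwri_queue(bp, &delwri);
            }

            /* Step 5: if only detached items were dirty, the transaction is clean. */
            if (!example_trans_has_other_dirty_items(tp))
                    example_trans_clear_dirty(tp);

            /* Step 6: push the staged buffers into the backing xfile. */
            example_delwri_submit(&delwri);
    }
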
> Rest of this looks pretty good, organizing nits aside.

Cool, thank you!!

--D

> Allison
> 
> > 

^ permalink raw reply	[flat|nested] 220+ messages in thread

* [PATCH v24.3 12/14] xfs: document directory tree repairs
  2022-12-30 22:10   ` [PATCH 12/14] xfs: document directory tree repairs Darrick J. Wong
  2023-01-14  2:32     ` [PATCH v24.2 " Darrick J. Wong
@ 2023-02-03  2:12     ` Darrick J. Wong
  2023-02-25  7:33       ` Allison Henderson
  1 sibling, 1 reply; 220+ messages in thread
From: Darrick J. Wong @ 2023-02-03  2:12 UTC (permalink / raw)
  To: linux-xfs, willy, chandan.babu, allison.henderson, linux-fsdevel,
	hch, catherine.hoang, david

From: Darrick J. Wong <djwong@kernel.org>

Directory tree repairs are the least complete part of online fsck, due
to the lack of directory parent pointers.  However, even without that
feature, we can still make some corrections to the directory tree -- we
can salvage as many directory entries as we can from a damaged
directory, and we can reattach orphaned inodes to the lost+found, just
as xfs_repair does now.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
v24.2: updated with my latest thoughts about how to use parent pointers
v24.3: updated to reflect the online fsck code I built for parent pointers
---
 .../filesystems/xfs-online-fsck-design.rst         |  410 ++++++++++++++++++++
 1 file changed, 410 insertions(+)

diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst
index af7755fe0107..51d040e4a2d0 100644
--- a/Documentation/filesystems/xfs-online-fsck-design.rst
+++ b/Documentation/filesystems/xfs-online-fsck-design.rst
@@ -4359,3 +4359,413 @@ The proposed patchset is the
 `extended attribute repair
 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-xattrs>`_
 series.
+
+Fixing Directories
+------------------
+
+Fixing directories is difficult with currently available filesystem features,
+since directory entries are not redundant.
+The offline repair tool scans all inodes to find files with nonzero link count,
+and then it scans all directories to establish parentage of those linked files.
+Damaged files and directories are zapped, and files with no parent are
+moved to the ``/lost+found`` directory.
+It does not try to salvage anything.
+
+The best that online repair can do at this time is to read directory data
+blocks and salvage any dirents that look plausible, correct link counts, and
+move orphans back into the directory tree.
+The salvage process is discussed in the case study at the end of this section.
+The :ref:`file link count fsck <nlinks>` code takes care of fixing link counts
+and moving orphans to the ``/lost+found`` directory.
+
+Case Study: Salvaging Directories
+`````````````````````````````````
+
+Unlike extended attributes, directory blocks are all the same size, so
+salvaging directories is straightforward:
+
+1. Find the parent of the directory.
+   If the dotdot entry is readable, try to confirm that the alleged
+   parent has a child entry pointing back to the directory being repaired.
+   Otherwise, walk the filesystem to find it.
+
+2. Walk the first partition of the data fork of the directory to find the
+   directory entry data blocks.
+   When one is found,
+
+   a. Walk the directory data block to find candidate entries.
+      When an entry is found:
+
+      i. Check the name for problems, and ignore the name if there are any.
+
+      ii. Retrieve the inumber and grab the inode.
+          If that succeeds, add the name, inode number, and file type to the
+          staging xfarray and xfblob.
+
+3. If the memory usage of the xfarray and xfblob exceeds a certain amount of
+   memory or there are no more directory data blocks to examine, unlock the
+   directory and add the staged dirents into the temporary directory.
+   Truncate the staging files.
+
+4. Use atomic extent swapping to exchange the new and old directory structures.
+   The old directory blocks are now attached to the temporary file.
+
+5. Reap the temporary file.
+
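A rough sketch of the inner loop of steps 2 and 3 above, with every helper
name and the staging record layout invented for illustration:

.. code-block:: c

    struct example_staged_dirent {
            xfblob_cookie   name_cookie;    /* name bytes live in the xfblob */
            __u64           ino;
            __u8            ftype;
    };

    static int example_salvage_block(struct example_dirblock *blk,
                                     struct xfarray *dirents,
                                     struct xfblob *names)
    {
            struct example_dirent_iter      iter;
            struct example_staged_dirent    rec;
            int                             error;

            example_for_each_candidate_dirent(blk, &iter) {
                    /* 2a.i: skip entries whose names look like garbage. */
                    if (!example_name_is_plausible(iter.name, iter.namelen))
                            continue;

                    /* 2a.ii: skip entries whose inodes cannot be grabbed. */
                    if (example_iget(iter.ino) != 0)
                            continue;

                    error = xfblob_store(names, &rec.name_cookie,
                                         iter.name, iter.namelen);
                    if (error)
                            return error;
                    rec.ino = iter.ino;
                    rec.ftype = iter.ftype;
                    error = xfarray_append(dirents, &rec);
                    if (error)
                            return error;
            }

            return 0;
    }
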
+**Future Work Question**: Should repair revalidate the dentry cache when
+rebuilding a directory?
+
+*Answer*: Yes, though the current dentry cache code doesn't provide a means
+to walk every dentry of a specific directory.
+If the cache contains an entry that the salvaging code does not find, the
+repair cannot proceed.
+
+**Future Work Question**: Can the dentry cache know about a directory entry
+that cannot be salvaged?
+
+*Answer*: In theory, the dentry cache should be a subset of the directory
+entries on disk because there's no way to load a dentry without having
+something to read in the directory.
+However, it is possible for a coherency problem to be introduced if the ondisk
+structures become corrupt *after* the cache loads.
+In theory it is necessary to scan all dentry cache entries for a directory to
+ensure that one of the following applies:
+
+1. The cached dentry reflects an ondisk dirent in the new directory.
+
+2. The cached dentry no longer has a corresponding ondisk dirent in the new
+   directory and the dentry can be purged from the cache.
+
+3. The cached dentry no longer has an ondisk dirent but the dentry cannot be
+   purged.
+   This is bad.
+
+As mentioned above, the dentry cache does not have a means to walk all the
+dentries with a particular directory as a parent.
+This makes detecting situations #2 and #3 impossible, and remains an
+interesting question for research.
+
+The proposed patchset is the
+`directory repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-dirs>`_
+series.
+
+Parent Pointers
+```````````````
+
+The lack of secondary directory metadata hinders directory tree reconstruction
+in much the same way that the historic lack of reverse space mapping
+information once hindered reconstruction of filesystem space metadata.
+The parent pointer feature, however, makes total directory reconstruction
+possible.
+
+Directory parent pointers were first proposed as an XFS feature more than a
+decade ago by SGI.
+Each link from a parent directory to a child file is mirrored with an extended
+attribute in the child that could be used to identify the parent directory.
+Unfortunately, this early implementation had major shortcomings and was never
+merged into Linux XFS:
+
+1. The XFS codebase of the late 2000s did not have the infrastructure to
+   enforce strong referential integrity in the directory tree.
+   It did not guarantee that a change in a forward link would always be
+   followed up with the corresponding change to the reverse links.
+
+2. Referential integrity was not integrated into offline repair.
+   Checking and repairs were performed on mounted filesystems without taking
+   any kernel or inode locks to coordinate access.
+   It is not clear how this actually worked properly.
+
+3. The extended attribute did not record the name of the directory entry in the
+   parent, so the SGI parent pointer implementation cannot be used to reconnect
+   the directory tree.
+
+4. Extended attribute forks only support 65,536 extents, which means that
+   parent pointer attribute creation is likely to fail at some point before the
+   maximum file link count is achieved.
+
+Allison Henderson, Chandan Babu, and Catherine Hoang are working on a second
+implementation that solves all shortcomings of the first.
+During 2022, Allison introduced log intent items to track physical
+manipulations of the extended attribute structures.
+This solves the referential integrity problem by making it possible to commit
+a dirent update and a parent pointer update in the same transaction.
+Chandan increased the maximum extent counts of both data and attribute forks,
+thereby addressing the fourth problem.
+
+To solve the third problem, parent pointers include the dirent name and
+location of the entry within the parent directory.
+In other words, child files use extended attributes to store pointers to
+parents in the form ``(parent_inum, parent_gen, dirent_pos) → (dirent_name)``.
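+
+Viewed as a C structure, one such record might look like the sketch below.
+The structure and field names are hypothetical; the actual ondisk encoding is
+revisited in the unresolved question later in this section.
+
+.. code-block:: c
+
+        /* Hypothetical in-memory view of a parent pointer; not an ondisk format. */
+        struct parent_ptr_rec {
+                uint64_t        parent_inum;    /* parent directory inumber */
+                uint32_t        parent_gen;     /* parent inode generation */
+                uint32_t        dirent_pos;     /* dirent offset within the parent */
+                uint8_t         namelen;        /* length of dirent_name */
+                uint8_t         dirent_name[];  /* dirent name, up to 255 bytes */
+        };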
+
+On a filesystem with parent pointers, the directory checking process can be
+strengthened to ensure that the target of each dirent also contains a parent
+pointer pointing back to the dirent.
+Likewise, each parent pointer can be checked by ensuring that the target of
+each parent pointer is a directory and that it contains a dirent matching
+the parent pointer.
+Both online and offline repair can use this strategy.
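+
+A sketch of the strengthened forward check follows; the types and helpers are
+hypothetical stand-ins for the real lookup code.
+
+.. code-block:: c
+
+        /*
+         * Sketch only: each dirent must be mirrored by a parent pointer in
+         * the child naming this directory, this dirent offset, and this name.
+         * All helpers shown here are hypothetical.
+         */
+        static int
+        check_dirent_has_parent_pointer(
+                struct scrub_ctx        *sc,
+                struct found_dirent     *de)
+        {
+                if (!child_has_matching_pptr(de->child_inum, sc->dir_inum,
+                                             sc->dir_gen, de->pos, de->name))
+                        return mark_cross_reference_failure(sc);
+                return 0;
+        }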
+
+Case Study: Repairing Directories with Parent Pointers
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Directory rebuilding uses a :ref:`coordinated inode scan <iscan>` and
+a :ref:`directory entry live update hook <liveupdate>` as follows:
+
+1. Set up a temporary directory for generating the new directory structure,
+   an xfblob for storing entry names, and an xfarray for stashing directory
+   updates.
+
+2. Set up an inode scanner and hook into the directory entry code to receive
+   updates on directory operations.
+
+3. For each parent pointer found in each file scanned, decide if the parent
+   pointer references the directory of interest.
+   If so:
+
+   a. Stash an addname entry for this dirent in the xfarray for later.
+
+   b. When finished scanning that file, flush the stashed updates to the
+      temporary directory.
+
+4. For each live directory update received via the hook, decide if the child
+   has already been scanned.
+   If so:
+
+   a. Stash an addname or removename entry for this dirent update in the
+      xfarray for later.
+      We cannot write directly to the temporary directory because hook
+      functions are not allowed to modify filesystem metadata.
+      Instead, we stash updates in the xfarray and rely on the scanner thread
+      to apply the stashed updates to the temporary directory.
+
+5. When the scan is complete, atomically swap the contents of the temporary
+   directory and the directory being repaired.
+   The temporary directory now contains the damaged directory structure.
+
+6. Reap the temporary directory.
+
+7. Update the dirent position field of parent pointers as necessary.
+   This may require the queuing of a substantial number of xattr log intent
+   items.
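+
+Both the scanner in step 3 and the hook in step 4 stash their work in the same
+xfarray.
+One element of that array might be sketched as follows; the names are
+hypothetical.
+
+.. code-block:: c
+
+        /* Hypothetical sketch of one stashed directory update. */
+        enum dir_update_op {
+                DIR_UPDATE_ADDNAME,     /* add this dirent to the temp dir */
+                DIR_UPDATE_REMOVENAME,  /* remove it from the temp dir */
+        };
+
+        struct stashed_dir_update {
+                enum dir_update_op      op;
+                uint64_t                child_inum;     /* dirent target inumber */
+                uint8_t                 ftype;          /* file type, if recorded */
+                uint32_t                namelen;        /* name length */
+                uint64_t                name_cookie;    /* name offset in the xfblob */
+        };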
+
+The proposed patchset is the
+`parent pointers directory repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-online-dir-repair>`_
+series.
+
+**Unresolved Question**: How will repair ensure that the ``dirent_pos`` fields
+match in the reconstructed directory?
+
+*Answer*: There are a few ways to solve this problem:
+
+1. The field could be designated advisory, since the other three values are
+   sufficient to find the entry in the parent.
+   However, this makes indexed key lookup impossible while repairs are ongoing.
+
+2. We could allow creating directory entries at specified offsets, which solves
+   the referential integrity problem but runs the risk that dirent creation
+   will fail due to conflicts with the free space in the directory.
+
+   These conflicts could be resolved by appending the directory entry and
+   amending the xattr code to support updating an xattr key and reindexing the
+   dabtree, though this would have to be performed with the parent directory
+   still locked.
+
+3. Same as above, but remove the old parent pointer entry and add a new one
+   atomically.
+
+4. Change the ondisk xattr format to ``(parent_inum, name) → (parent_gen)``,
+   which would provide the attr name uniqueness that we require, without
+   forcing repair code to update the dirent position.
+   Unfortunately, this requires changes to the xattr code to support attr
+   names as long as 263 bytes.
+
+5. Change the ondisk xattr format to ``(parent_inum, hash(name)) →
+   (name, parent_gen)``.
+   If the hash is sufficiently resistant to collisions (e.g. sha256) then
+   this should provide the attr name uniqueness that we require.
+   Names shorter than 247 bytes could be stored directly.
+
+Case Study: Repairing Parent Pointers
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Online reconstruction of a file's parent pointer information works similarly to
+directory reconstruction:
+
+1. Set up a temporary file for generating a new extended attribute structure,
+   an xfblob for storing parent pointer names, and an xfarray for stashing
+   parent pointer updates.
+
+2. Set up an inode scanner and hook into the directory entry code to receive
+   updates on directory operations.
+
+3. For each directory entry found in each directory scanned, decide if the
+   dirent references the file of interest.
+   If so:
+
+   a. Stash an addpptr entry for this parent pointer in the xfblob and xfarray
+      for later.
+
+   b. When finished scanning the directory, flush the stashed updates to the
+      temporary file.
+
+4. For each live directory update received via the hook, decide if the parent
+   has already been scanned.
+   If so:
+
+   a. Stash an addpptr or removepptr entry for this dirent update in the
+      xfarray for later.
+      We cannot write parent pointers directly to the temporary file because
+      hook functions are not allowed to modify filesystem metadata.
+      Instead, we stash updates in the xfarray and rely on the scanner thread
+      to apply the stashed parent pointer updates to the temporary file.
+
+5. Copy all non-parent pointer extended attributes to the temporary file.
+
+6. When the scan is complete, atomically swap the attribute fork of the
+   temporary file and the file being repaired.
+   The temporary file now contains the damaged extended attribute structure.
+
+7. Reap the temporary file.
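+
+The hook in step 4 may only stash work for the scanner thread, as in the
+following sketch; all types and helpers are hypothetical.
+
+.. code-block:: c
+
+        /*
+         * Sketch of the live update hook for parent pointer repair.  Hooks
+         * cannot modify filesystem metadata, so the update is merely stashed
+         * for the scanner thread to apply later.
+         */
+        static void
+        pptr_repair_dirent_hook(
+                struct pptr_repair_ctx          *rp,
+                const struct dirent_update      *upd)
+        {
+                /* Ignore dirents that do not point to the file being repaired. */
+                if (upd->child_inum != rp->target_inum)
+                        return;
+
+                /* Updates in unscanned parents will be found by the scan itself. */
+                if (!iscan_already_visited(rp->iscan, upd->parent_inum))
+                        return;
+
+                /* Stash an addpptr or removepptr record for later replay. */
+                stash_pptr_update(rp, upd->is_unlink ? PPTR_REMOVE : PPTR_ADD, upd);
+        }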
+
+The proposed patchset is the
+`parent pointers repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-online-parent-repair>`_
+series.
+
+Digression: Offline Checking of Parent Pointers
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Examining parent pointers in offline repair works differently because corrupt
+files are erased long before directory tree connectivity checks are performed.
+Parent pointer checks are therefore a second pass to be added to the existing
+connectivity checks:
+
+1. After the set of surviving files has been established (i.e. phase 6),
+   walk the surviving directories of each AG in the filesystem.
+   This is already performed as part of the connectivity checks.
+
+2. For each directory entry found, record the name in an xfblob, and store
+   ``(child_ag_inum, parent_inum, parent_gen, dirent_pos)`` tuples in a
+   per-AG in-memory slab.
+
+3. For each AG in the filesystem,
+
+   a. Sort the per-AG tuples in order of child_ag_inum, parent_inum, and
+      dirent_pos.
+
+   b. For each inode in the AG,
+
+      1. Scan the inode for parent pointers.
+         Record the names in a per-file xfblob, and store ``(parent_inum,
+         parent_gen, dirent_pos)`` tuples in a per-file slab.
+
+      2. Sort the per-file tuples in order of parent_inum and dirent_pos.
+
+      3. Position one slab cursor at the start of the inode's records in the
+         per-AG tuple slab.
+         This should be trivial since the per-AG tuples are in child inumber
+         order.
+
+      4. Position a second slab cursor at the start of the per-file tuple slab.
+
+      5. Iterate the two cursors in lockstep, comparing the parent_inum and
+         dirent_pos fields of the records under each cursor.
+
+         a. Tuples in the per-AG list but not the per-file list are missing and
+            need to be written to the inode.
+
+         b. Tuples in the per-file list but not the per-AG list are dangling
+            and need to be removed from the inode.
+
+         c. For tuples in both lists, update the parent_gen and name components
+            of the parent pointer if necessary.
+
+4. Move on to examining link counts, as we do today.
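+
+The lockstep comparison in step 3.b.5 is an ordinary merge join of two sorted
+record streams.
+The sketch below models it with plain arrays; the record layout and the three
+action helpers are hypothetical.
+
+.. code-block:: c
+
+        #include <stddef.h>
+        #include <stdint.h>
+
+        struct pptr_rec {
+                uint64_t        parent_inum;
+                uint32_t        parent_gen;
+                uint32_t        dirent_pos;
+        };
+
+        /* Order by (parent_inum, dirent_pos), matching the sorts in step 3. */
+        static int
+        pptr_cmp(const struct pptr_rec *a, const struct pptr_rec *b)
+        {
+                if (a->parent_inum != b->parent_inum)
+                        return a->parent_inum < b->parent_inum ? -1 : 1;
+                if (a->dirent_pos != b->dirent_pos)
+                        return a->dirent_pos < b->dirent_pos ? -1 : 1;
+                return 0;
+        }
+
+        static void
+        compare_pptrs(const struct pptr_rec *ag, size_t nr_ag,
+                      const struct pptr_rec *file, size_t nr_file)
+        {
+                size_t  i = 0, j = 0;
+
+                while (i < nr_ag || j < nr_file) {
+                        int cmp;
+
+                        if (i == nr_ag)
+                                cmp = 1;        /* only per-file records left */
+                        else if (j == nr_file)
+                                cmp = -1;       /* only per-AG records left */
+                        else
+                                cmp = pptr_cmp(&ag[i], &file[j]);
+
+                        if (cmp < 0)
+                                add_missing_pptr(&ag[i++]);             /* 5.a */
+                        else if (cmp > 0)
+                                remove_dangling_pptr(&file[j++]);       /* 5.b */
+                        else {
+                                /* 5.c: fix name/parent_gen if necessary */
+                                update_pptr_if_needed(&ag[i], &file[j]);
+                                i++;
+                                j++;
+                        }
+                }
+        }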
+
+The proposed patchset is the
+`offline parent pointers repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=pptrs-repair>`_
+series.
+
+Rebuilding directories from parent pointers in offline repair is very
+challenging because it currently uses a single-pass scan of the filesystem
+during phase 3 to decide which files are corrupt enough to be zapped.
+This scan would have to be converted into a multi-pass scan:
+
+1. The first pass of the scan zaps corrupt inodes, forks, and attributes
+   much as it does now.
+   Corrupt directories are noted but not zapped.
+
+2. The next pass records parent pointers pointing to the directories noted
+   as being corrupt in the first pass.
+   This second pass may have to happen after the phase 4 scan for duplicate
+   blocks, if phase 4 is also capable of zapping directories.
+
+3. The third pass resets corrupt directories to an empty shortform directory.
+   Free space metadata has not been ensured yet, so repair cannot yet use the
+   directory building code in libxfs.
+
+4. At the start of phase 6, space metadata have been rebuilt.
+   Use the parent pointer information recorded during step 2 to reconstruct
+   the dirents and add them to the now-empty directories.
+
+This code has not yet been constructed.
+
+.. _orphanage:
+
+The Orphanage
+-------------
+
+Filesystems present files as a directed, and hopefully acyclic, graph.
+In other words, a tree.
+The root of the filesystem is a directory, and each entry in a directory points
+downwards either to more subdirectories or to non-directory files.
+Unfortunately, a disruption in the directory graph pointers results in a
+disconnected graph, which makes files impossible to access via regular path
+resolution.
+The directory parent pointer online scrub code can detect a dotdot entry
+pointing to a parent directory that doesn't have a link back to the child
+directory, and the file link count checker can detect a file that isn't pointed
+to by any directory in the filesystem.
+If the file in question has a positive link count, the file is an orphan.
+
+When orphans are found, they should be reconnected to the directory tree.
+Offline fsck solves the problem by creating a directory ``/lost+found`` to
+serve as an orphanage, and linking orphan files into the orphanage by using the
+inumber as the name.
+Reparenting a file to the orphanage does not reset any of its permissions or
+ACLs.
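+
+For example, generating the orphanage entry name from the inumber is as simple
+as the following sketch (not the actual repair code):
+
+.. code-block:: c
+
+        #include <inttypes.h>
+        #include <stdio.h>
+
+        /* Orphans are linked into /lost+found under their inumber. */
+        static void
+        orphanage_name(char *buf, size_t len, uint64_t inumber)
+        {
+                snprintf(buf, len, "%" PRIu64, inumber);  /* inode 131 -> "131" */
+        }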
+
+This process is more involved in the kernel than it is in userspace.
+The directory and file link count repair setup functions must use the regular
+VFS mechanisms to create the orphanage directory with all the necessary
+security attributes and dentry cache entries, just like a regular directory
+tree modification.
+
+Orphaned files are adopted by the orphanage as follows:
+
+1. Call ``xrep_orphanage_try_create`` at the start of the scrub setup function
+   to try to ensure that the lost and found directory actually exists.
+   This also attaches the orphanage directory to the scrub context.
+
+2. If the decision is made to reconnect a file, take the IOLOCK of both the
+   orphanage and the file being reattached.
+   The ``xrep_orphanage_iolock_two`` function follows the inode locking
+   strategy discussed earlier.
+
+3. Call ``xrep_orphanage_compute_blkres`` and ``xrep_orphanage_compute_name``
+   to compute the new name in the orphanage and the block reservation required.
+
+4. Use ``xrep_orphanage_adoption_prep`` to reserve resources for the repair
+   transaction.
+
+5. Call ``xrep_orphanage_adopt`` to reparent the orphaned file into the lost
+   and found, and update the kernel dentry cache.
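+
+Chained together, the adoption sequence might resemble the sketch below.
+The function names come from the steps above, but the signatures, arguments,
+and error handling shown here are assumptions, not the actual prototypes.
+
+.. code-block:: c
+
+        /* Sketch of an orphan adoption; signatures are assumed, not real. */
+        static int
+        adopt_orphan(
+                struct xfs_scrub        *sc)
+        {
+                int                     error;
+
+                /* Step 2: lock the orphanage and the orphan in the agreed order. */
+                error = xrep_orphanage_iolock_two(sc);
+                if (error)
+                        return error;
+
+                /* Step 3: compute the new name and the block reservation. */
+                error = xrep_orphanage_compute_name(sc);
+                if (error)
+                        goto out_unlock;
+                error = xrep_orphanage_compute_blkres(sc);
+                if (error)
+                        goto out_unlock;
+
+                /* Step 4: attach resources to the repair transaction. */
+                error = xrep_orphanage_adoption_prep(sc);
+                if (error)
+                        goto out_unlock;
+
+                /* Step 5: reparent into the lost and found. */
+                error = xrep_orphanage_adopt(sc);
+
+        out_unlock:
+                orphanage_iolock_release(sc);   /* hypothetical unlock helper */
+                return error;
+        }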
+
+The proposed patches are in the
+`orphanage adoption
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-orphanage>`_
+series.

^ permalink raw reply related	[flat|nested] 220+ messages in thread

* Re: [PATCH 05/14] xfs: document the filesystem metadata checking strategy
  2023-02-02 19:04       ` Darrick J. Wong
@ 2023-02-09  5:41         ` Allison Henderson
  0 siblings, 0 replies; 220+ messages in thread
From: Allison Henderson @ 2023-02-09  5:41 UTC (permalink / raw)
  To: djwong
  Cc: Catherine Hoang, david, willy, linux-xfs, Chandan Babu,
	linux-fsdevel, hch

On Thu, 2023-02-02 at 11:04 -0800, Darrick J. Wong wrote:
> On Sat, Jan 21, 2023 at 01:38:33AM +0000, Allison Henderson wrote:
> > On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > Begin the fifth chapter of the online fsck design documentation,
> > > where
> > > we discuss the details of the data structures and algorithms used
> > > by
> > > the
> > > kernel to examine filesystem metadata and cross-reference it
> > > around
> > > the
> > > filesystem.
> > > 
> > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > ---
> > >  .../filesystems/xfs-online-fsck-design.rst         |  579
> > > ++++++++++++++++++++
> > >  .../filesystems/xfs-self-describing-metadata.rst   |    1 
> > >  2 files changed, 580 insertions(+)
> > > 
> > > 
> > > diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst
> > > b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > index 42e82971e036..f45bf97fa9c4 100644
> > > --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> > > +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > @@ -864,3 +864,582 @@ Proposed patchsets include
> > >  and
> > >  `preservation of sickness info during memory reclaim
> > >  <
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > > log/?h=indirect-health-reporting>`_.
> > > +
> > > +5. Kernel Algorithms and Data Structures
> > > +========================================
> > > +
> > > +This section discusses the key algorithms and data structures of
> > > the
> > > kernel
> > > +code that provide the ability to check and repair metadata while
> > > the
> > > system
> > > +is running.
> > > +The first chapters in this section reveal the pieces that
> > > provide
> > > the
> > > +foundation for checking metadata.
> > > +The remainder of this section presents the mechanisms through
> > > which
> > > XFS
> > > +regenerates itself.
> > > +
> > > +Self Describing Metadata
> > > +------------------------
> > > +
> > > +Starting with XFS version 5 in 2012, XFS updated the format of
> > > nearly every
> > > +ondisk block header to record a magic number, a checksum, a
> > > universally
> > > +"unique" identifier (UUID), an owner code, the ondisk address of
> > > the
> > > block,
> > > +and a log sequence number.
> > > +When loading a block buffer from disk, the magic number, UUID,
> > > owner, and
> > > +ondisk address confirm that the retrieved block matches the
> > > specific
> > > owner of
> > > +the current filesystem, and that the information contained in
> > > the
> > > block is
> > > +supposed to be found at the ondisk address.
> > > +The first three components enable checking tools to disregard
> > > alleged metadata
> > > +that doesn't belong to the filesystem, and the fourth component
> > > enables the
> > > +filesystem to detect lost writes.
> > Add...
> > 
> > "When ever a file system operation modifies a block, the change is
> > submitted to the journal as a transaction.  The journal then
> > processes
> > these transactions marking them done once they are safely committed
> > to
> > the disk"
> 
> Ok, I'll add that transition.  Though I'll s/journal/log/ since this
> is
> xfs. :)
> 
> > At this point we havnt talked much at all about transactions or
> > logs,
> > and we've just barely begin to cover blocks.  I think you at least
> > want
> > a quick blip to describe the relation of these two things, or it
> > may
> > not be clear why we suddenly jumped into logs.
> 
> Point taken.  Thanks for the suggestion.
> 
> > > +
> > > +The logging code maintains the checksum and the log sequence
> > > number
> > > of the last
> > > +transactional update.
> > > +Checksums are useful for detecting torn writes and other
> > > mischief
> > "Checksums (or crc's) are useful for detecting incomplete or torn
> > writes as well as other discrepancies..."
> 
> Checksums are a general concept, whereas CRCs denote a particular
> family
> of checksums.  The statement would still apply even if we used a
> different family (e.g. erasure codes, cryptographic hash functions)
> of
> function instead of crc32c.
> 
> I will, however, avoid the undefined term 'mischief'.  Thanks for the
> correction.
> 
> "Checksums are useful for detecting torn writes and other
> discrepancies
> that can be introduced between the computer and its storage devices."
> 
> > > between the
> > > +computer and its storage devices.
> > > +Sequence number tracking enables log recovery to avoid applying
> > > out
> > > of date
> > > +log updates to the filesystem.
> > > +
> > > +These two features improve overall runtime resiliency by
> > > providing a
> > > means for
> > > +the filesystem to detect obvious corruption when reading
> > > metadata
> > > blocks from
> > > +disk, but these buffer verifiers cannot provide any consistency
> > > checking
> > > +between metadata structures.
> > > +
> > > +For more information, please see the documentation for
> > > +Documentation/filesystems/xfs-self-describing-metadata.rst
> > > +
> > > +Reverse Mapping
> > > +---------------
> > > +
> > > +The original design of XFS (circa 1993) is an improvement upon
> > > 1980s
> > > Unix
> > > +filesystem design.
> > > +In those days, storage density was expensive, CPU time was
> > > scarce,
> > > and
> > > +excessive seek time could kill performance.
> > > +For performance reasons, filesystem authors were reluctant to
> > > add
> > > redundancy to
> > > +the filesystem, even at the cost of data integrity.
> > > +Filesystems designers in the early 21st century choose different
> > > strategies to
> > > +increase internal redundancy -- either storing nearly identical
> > > copies of
> > > +metadata, or more space-efficient techniques such as erasure
> > > coding.
> > "such as erasure coding which may encode sections of the data with
> > redundant symbols and in more than one location"
> > 
> > That ties it into the next line.  If you go on to talk about a term
> > you
> > have not previously defined, i think you want to either define it
> > quickly or just drop it all together.  Right now your goal is to
> > just
> > give the reader context, so you want it to move quickly.
> 
> How about I shorten it to:
> 
> "...or more space-efficient encoding techniques." ?
Sure, I think that would be fine

> 
> and end the paragraph there?
> 
> > > +Obvious corruptions are typically repaired by copying replicas
> > > or
> > > +reconstructing from codes.
> > > +
> > I think I would have just jumped straight from xfs history to
> > modern
> > xfs...
> > > +For XFS, a different redundancy strategy was chosen to modernize
> > > the
> > > design:
> > > +a secondary space usage index that maps allocated disk extents
> > > back
> > > to their
> > > +owners.
> > > +By adding a new index, the filesystem retains most of its
> > > ability to
> > > scale
> > > +well to heavily threaded workloads involving large datasets,
> > > since
> > > the primary
> > > +file metadata (the directory tree, the file block map, and the
> > > allocation
> > > +groups) remain unchanged.
> > > 
> > 
> > > +Although the reverse-mapping feature increases overhead costs
> > > for
> > > space
> > > +mapping activities just like any other system that improves
> > > redundancy, it
> > "Like any system that improves redundancy, the reverse-mapping
> > feature
> > increases overhead costs for space mapping activities. However,
> > it..."
> 
> I like this better.  These two sentences have been changed to read:
> 
> "Like any system that improves redundancy, the reverse-mapping
> feature
> increases overhead costs for space mapping activities.  However, it
> has
> two critical advantages: first, the reverse index is key to enabling
> online fsck and other requested functionality such as free space
> defragmentation, better media failure reporting, and filesystem
> shrinking."
Alrighty, sounds good

> 
> > > +has two critical advantages: first, the reverse index is key to
> > > enabling online
> > > +fsck and other requested functionality such as filesystem
> > > reorganization,
> > > +better media failure reporting, and shrinking.
> > > +Second, the different ondisk storage format of the reverse
> > > mapping
> > > btree
> > > +defeats device-level deduplication, because the filesystem
> > > requires
> > > real
> > > +redundancy.
> > > +
> > > +A criticism of adding the secondary index is that it does
> > > nothing to
> > > improve
> > > +the robustness of user data storage itself.
> > > +This is a valid point, but adding a new index for file data
> > > block
> > > checksums
> > > +increases write amplification and turns data overwrites into
> > > copy-
> > > writes, which
> > > +age the filesystem prematurely.
> > > +In keeping with thirty years of precedent, users who want file
> > > data
> > > integrity
> > > +can supply as powerful a solution as they require.
> > > +As for metadata, the complexity of adding a new secondary index
> > > of
> > > space usage
> > > +is much less than adding volume management and storage device
> > > mirroring to XFS
> > > +itself.
> > > +Perfection of RAID and volume management are best left to
> > > existing
> > > layers in
> > > +the kernel.
> > I think I would cull the entire above paragraph.  rmap, crc and
> > raid
> > all have very different points of redundancy, so criticism that an
> > apple is not an orange or visavis just feels like a shortsighted
> > comparison that's probably more of a distraction than anything.
> > 
> > Sometimes it feels like this document kinda gets off into tangents
> > like it's preemptively trying to position it's self for an argument
> > that hasn't happened yet.
> 
> It does!  Each of the many tangents that you've pointed out are a
> reaction to some discussion that we've had on the list, or at an
> LSF, or <cough> fs nerds sniping on social media.  The reason I
> capture all of these offtopic arguments is to discourage people from
> wasting time rehashing discussions that were settled long ago.
> 
> Admittedly, that is a very defensive reaction on my part...
> 
> > But I think it has the effect of pulling the
> > readers attention off topic into an argument they never thought to
> > consider in the first place.  The topic of this section is to
> > explain
> > what rmap is.  So lets stay on topic and finish laying out that
> > ground
> > work first before getting into how it compares to other solutions
> 
> ...and you're right to point out that mentioning these things is
> distracting and provides fuel to reignite a flamewar.  At the same
> time,
> I think there's value in identifying the roads not taken, and why.
> 
> What if I turned these tangents into explicitly labelled sidebars?
> Would that help readers who want to stick to the topic?
> 
Sure, I think that would be a reasonable compromise

> > > +
> > > +The information captured in a reverse space mapping record is as
> > > follows:
> > > +
> > > +.. code-block:: c
> > > +
> > > +       struct xfs_rmap_irec {
> > > +           xfs_agblock_t    rm_startblock;   /* extent start
> > > block
> > > */
> > > +           xfs_extlen_t     rm_blockcount;   /* extent length */
> > > +           uint64_t         rm_owner;        /* extent owner */
> > > +           uint64_t         rm_offset;       /* offset within
> > > the
> > > owner */
> > > +           unsigned int     rm_flags;        /* state flags */
> > > +       };
> > > +
> > > +The first two fields capture the location and size of the
> > > physical
> > > space,
> > > +in units of filesystem blocks.
> > > +The owner field tells scrub which metadata structure or file
> > > inode
> > > have been
> > > +assigned this space.
> > > +For space allocated to files, the offset field tells scrub where
> > > the
> > > space was
> > > +mapped within the file fork.
> > > +Finally, the flags field provides extra information about the
> > > space
> > > usage --
> > > +is this an attribute fork extent?  A file mapping btree extent? 
> > > Or
> > > an
> > > +unwritten data extent?
> > > +
> > > +Online filesystem checking judges the consistency of each
> > > primary
> > > metadata
> > > +record by comparing its information against all other space
> > > indices.
> > > +The reverse mapping index plays a key role in the consistency
> > > checking process
> > > +because it contains a centralized alternate copy of all space
> > > allocation
> > > +information.
> > > +Program runtime and ease of resource acquisition are the only
> > > real
> > > limits to
> > > +what online checking can consult.
> > > +For example, a file data extent mapping can be checked against:
> > > +
> > > +* The absence of an entry in the free space information.
> > > +* The absence of an entry in the inode index.
> > > +* The absence of an entry in the reference count data if the
> > > file is
> > > not
> > > +  marked as having shared extents.
> > > +* The correspondence of an entry in the reverse mapping
> > > information.
> > > +
> > > +A key observation here is that only the reverse mapping can
> > > provide
> > > a positive
> > > +affirmation of correctness if the primary metadata is in doubt.
> > if any of the above metadata is in doubt...
> 
> Fixed.
> 
> > > +The checking code for most primary metadata follows a path
> > > similar
> > > to the
> > > +one outlined above.
> > > +
> > > +A second observation to make about this secondary index is that
> > > proving its
> > > +consistency with the primary metadata is difficult.
> > 
> > > +Demonstrating that a given reverse mapping record exactly
> > > corresponds to the
> > > +primary space metadata involves a full scan of all primary space
> > > metadata,
> > > +which is very time intensive.
> > "But why?" Wonders the reader. Just jump into an example:
> > 
> > "In order to verify that an rmap extent does not incorrectly over
> > lap
> > with another record, we would need a full scan of all the other
> > records, which is time intensive."
> 
> I want to shorten it even further:
> 
> "Validating that reverse mapping records are correct requires a full
> scan of all primary space metadata, which is very time intensive."
Ok, I think that sounds fine

> 
> > 
> > ?
> > 
> > And then the below is a separate observation right?  
> 
> Right.
> 
> > > +Scanning activity for online fsck can only use non-blocking lock
> > > acquisition
> > > +primitives if the locking order is not the regular order as used
> > > by
> > > the rest of
> > > +the filesystem.
> > Lastly, it should be noted that most file system operations tend to
> > lock primary metadata before locking the secondary metadata.
> 
> This isn't accurate -- metadata structures don't have separate locks.
> So it's not true to say that we lock primary or secondary metadata.
> 
> We /can/ say that file operations lock the inode, then the AGI, then
> the
> AGF; or that directory operations lock the parent and child ILOCKs in
> inumber order; and that if scrub wants to take locks in any other
> order,
> it can only do that via trylocks and backoff.
I see, ok maybe giving one or both of those examples is clearer then

> 
> > This
> > means that scanning operations that acquire the secondary metadata
> > first may need to yield the secondary lock to filesystem operations
> > that have already acquired the primary lock. 
> > 
> > ?
> > 
> > > +This means that forward progress during this part of a scan of
> > > the
> > > reverse
> > > +mapping data cannot be guaranteed if system load is especially
> > > heavy.
> > > +Therefore, it is not practical for online check to detect
> > > reverse
> > > mapping
> > > +records that lack a counterpart in the primary metadata.
> > Such as <quick list / quick example>
> > 
> > > +Instead, scrub relies on rigorous cross-referencing during the
> > > primary space
> > > +mapping structure checks.
> 
> I've converted this section into a bullet list:
> 
> "There are several observations to make about reverse mapping
> indices:
> 
> "1. Reverse mappings can provide a positive affirmation of
> correctness if
> any of the above primary metadata are in doubt.  The checking code
> for
> most primary metadata follows a path similar to the one outlined
> above.
> 
> "2. Proving the consistency of secondary metadata with the primary
> metadata is difficult because that requires a full scan of all
> primary
> space metadata, which is very time intensive.  For example, checking
> a
> reverse mapping record for a file extent mapping btree block requires
> locking the file and searching the entire btree to confirm the block.
> Instead, scrub relies on rigorous cross-referencing during the
> primary
> space mapping structure checks.
> 
> "3. Consistency scans must use non-blocking lock acquisition
> primitives
> if the required locking order is not the same order used by regular
> filesystem operations.  This means that forward progress during this
> part of a scan of the reverse mapping data cannot be guaranteed if
> system load is heavy."
Ok, I think that reads cleaner

> 
> > > +
> > 
> > The below paragraph sounds like a re-cap?
> > 
> > "So to recap, reverse mappings also...."
> 
> Yep.
> 
> > > +Reverse mappings also play a key role in reconstruction of
> > > primary
> > > metadata.
> > > +The secondary information is general enough for online repair to
> > > synthesize a
> > > +complete copy of any primary space management metadata by
> > > locking
> > > that
> > > +resource, querying all reverse mapping indices looking for
> > > records
> > > matching
> > > +the relevant resource, and transforming the mapping into an
> > > appropriate format.
> > > +The details of how these records are staged, written to disk,
> > > and
> > > committed
> > > +into the filesystem are covered in subsequent sections.
> > I also think the section would be ok if you were to trim off this
> > last
> > paragraph too.
> 
> Hm.  I still want to set up the expectation that there's more to
> come.
> How about a brief two-sentence transition paragraph:
> 
> "In summary, reverse mappings play a key role in reconstruction of
> primary metadata.  The details of how these records are staged,
> written
> to disk, and committed into the filesystem are covered in subsequent
> sections."
Ok, I think that's a cleaner wrap up
> 
> > 
> > > +
> > > +Checking and Cross-Referencing
> > > +------------------------------
> > > +
> > > +The first step of checking a metadata structure is to examine
> > > every
> > > record
> > > +contained within the structure and its relationship with the
> > > rest of
> > > the
> > > +system.
> > > +XFS contains multiple layers of checking to try to prevent
> > > inconsistent
> > > +metadata from wreaking havoc on the system.
> > > +Each of these layers contributes information that helps the
> > > kernel
> > > to make
> > > +three decisions about the health of a metadata structure:
> > > +
> > > +- Is a part of this structure obviously corrupt
> > > (``XFS_SCRUB_OFLAG_CORRUPT``) ?
> > > +- Is this structure inconsistent with the rest of the system
> > > +  (``XFS_SCRUB_OFLAG_XCORRUPT``) ?
> > > +- Is there so much damage around the filesystem that cross-
> > > referencing is not
> > > +  possible (``XFS_SCRUB_OFLAG_XFAIL``) ?
> > > +- Can the structure be optimized to improve performance or
> > > reduce
> > > the size of
> > > +  metadata (``XFS_SCRUB_OFLAG_PREEN``) ?
> > > +- Does the structure contain data that is not inconsistent but
> > > deserves review
> > > +  by the system administrator (``XFS_SCRUB_OFLAG_WARNING``) ?
> > > +
> > > +The following sections describe how the metadata scrubbing
> > > process
> > > works.
> > > +
> > > +Metadata Buffer Verification
> > > +````````````````````````````
> > > +
> > > +The lowest layer of metadata protection in XFS are the metadata
> > > verifiers built
> > > +into the buffer cache.
> > > +These functions perform inexpensive internal consistency
> > > checking of
> > > the block
> > > +itself, and answer these questions:
> > > +
> > > +- Does the block belong to this filesystem?
> > > +
> > > +- Does the block belong to the structure that asked for the
> > > read?
> > > +  This assumes that metadata blocks only have one owner, which
> > > is
> > > always true
> > > +  in XFS.
> > > +
> > > +- Is the type of data stored in the block within a reasonable
> > > range
> > > of what
> > > +  scrub is expecting?
> > > +
> > > +- Does the physical location of the block match the location it
> > > was
> > > read from?
> > > +
> > > +- Does the block checksum match the data?
> > > +
> > > +The scope of the protections here are very limited -- verifiers
> > > can
> > > only
> > > +establish that the filesystem code is reasonably free of gross
> > > corruption bugs
> > > +and that the storage system is reasonably competent at
> > > retrieval.
> > > +Corruption problems observed at runtime cause the generation of
> > > health reports,
> > > +failed system calls, and in the extreme case, filesystem
> > > shutdowns
> > > if the
> > > +corrupt metadata force the cancellation of a dirty transaction.
> > > +
> > > +Every online fsck scrubbing function is expected to read every
> > > ondisk metadata
> > > +block of a structure in the course of checking the structure.
> > > +Corruption problems observed during a check are immediately
> > > reported
> > > to
> > > +userspace as corruption; during a cross-reference, they are
> > > reported
> > > as a
> > > +failure to cross-reference once the full examination is
> > > complete.
> > > +Reads satisfied by a buffer already in cache (and hence already
> > > verified)
> > > +bypass these checks.
> > > +
> > > +Internal Consistency Checks
> > > +```````````````````````````
> > > +
> > > +The next higher level of metadata protection is the internal
> > > record
> > "After the buffer cache, the next level of metadata protection
> > is..."
> 
> Changed.  I'll do the same to the next section as well.
> 
> > > +verification code built into the filesystem.
> > 
> > > +These checks are split between the buffer verifiers, the in-
> > > filesystem users of
> > > +the buffer cache, and the scrub code itself, depending on the
> > > amount
> > > of higher
> > > +level context required.
> > > +The scope of checking is still internal to the block.
> > > +For performance reasons, regular code may skip some of these
> > > checks
> > > unless
> > > +debugging is enabled or a write is about to occur.
> > > +Scrub functions, of course, must check all possible problems.
> > I'd put this chunk after the list below.
> > 
> > > +Either way, these higher level checking functions answer these
> > > questions:
> > Then this becomes:
> > "These higher level checking functions..."
> 
> Done.
> 
> > > +
> > > +- Does the type of data stored in the block match what scrub is
> > > expecting?
> > > +
> > > +- Does the block belong to the owning structure that asked for
> > > the
> > > read?
> > > +
> > > +- If the block contains records, do the records fit within the
> > > block?
> > > +
> > > +- If the block tracks internal free space information, is it
> > > consistent with
> > > +  the record areas?
> > > +
> > > +- Are the records contained inside the block free of obvious
> > > corruptions?
> > > +
> > > +Record checks in this category are more rigorous and more time-
> > > intensive.
> > > +For example, block pointers and inumbers are checked to ensure
> > > that
> > > they point
> > > +within the dynamically allocated parts of an allocation group
> > > and
> > > within
> > > +the filesystem.
> > > +Names are checked for invalid characters, and flags are checked
> > > for
> > > invalid
> > > +combinations.
> > > +Other record attributes are checked for sensible values.
> > > +Btree records spanning an interval of the btree keyspace are
> > > checked
> > > for
> > > +correct order and lack of mergeability (except for file fork
> > > mappings).
> > > +
> > > +Validation of Userspace-Controlled Record Attributes
> > > +````````````````````````````````````````````````````
> > > +
> > > +Various pieces of filesystem metadata are directly controlled by
> > > userspace.
> > > +Because of this nature, validation work cannot be more precise
> > > than
> > > checking
> > > +that a value is within the possible range.
> > > +These fields include:
> > > +
> > > +- Superblock fields controlled by mount options
> > > +- Filesystem labels
> > > +- File timestamps
> > > +- File permissions
> > > +- File size
> > > +- File flags
> > > +- Names present in directory entries, extended attribute keys,
> > > and
> > > filesystem
> > > +  labels
> > > +- Extended attribute key namespaces
> > > +- Extended attribute values
> > > +- File data block contents
> > > +- Quota limits
> > > +- Quota timer expiration (if resource usage exceeds the soft
> > > limit)
> > > +
> > > +Cross-Referencing Space Metadata
> > > +````````````````````````````````
> > > +
> > > +The next higher level of checking is cross-referencing records
> > > between metadata
> > 
> > I kinda like the list first so that the reader has an idea of what
> > these checks are before getting into discussion about them.  It
> > just
> > makes it a little more obvious as to why it's "prohibitively
> > expensive"
> > or "dependent on the context of the structure" after having just
> > looked
> > at it
> 
> <nod>
> 
> > The rest looks good from here.
> 
> Woot.  Onto the next reply! :)
> 
> --D
> 
> > Allison
> > 
> > > +structures.
> > > +For regular runtime code, the cost of these checks is considered
> > > to
> > > be
> > > +prohibitively expensive, but as scrub is dedicated to rooting
> > > out
> > > +inconsistencies, it must pursue all avenues of inquiry.
> > > +The exact set of cross-referencing is highly dependent on the
> > > context of the
> > > +data structure being checked.
> > > +
> > > +The XFS btree code has keyspace scanning functions that online
> > > fsck
> > > uses to
> > > +cross reference one structure with another.
> > > +Specifically, scrub can scan the key space of an index to
> > > determine
> > > if that
> > > +keyspace is fully, sparsely, or not at all mapped to records.
> > > +For the reverse mapping btree, it is possible to mask parts of
> > > the
> > > key for the
> > > +purposes of performing a keyspace scan so that scrub can decide
> > > if
> > > the rmap
> > > +btree contains records mapping a certain extent of physical
> > > space
> > > without the
> > > +sparseness of the rest of the rmap keyspace getting in the way.
> > > +
> > > +Btree blocks undergo the following checks before cross-
> > > referencing:
> > > +
> > > +- Does the type of data stored in the block match what scrub is
> > > expecting?
> > > +
> > > +- Does the block belong to the owning structure that asked for
> > > the
> > > read?
> > > +
> > > +- Do the records fit within the block?
> > > +
> > > +- Are the records contained inside the block free of obvious
> > > corruptions?
> > > +
> > > +- Are the name hashes in the correct order?
> > > +
> > > +- Do node pointers within the btree point to valid block
> > > addresses
> > > for the type
> > > +  of btree?
> > > +
> > > +- Do child pointers point towards the leaves?
> > > +
> > > +- Do sibling pointers point across the same level?
> > > +
> > > +- For each node block record, does the record key accurately
> > > reflect
> > > the contents
> > > +  of the child block?
> > > +
> > > +Space allocation records are cross-referenced as follows:
> > > +
> > > +1. Any space mentioned by any metadata structure are cross-
> > > referenced as
> > > +   follows:
> > > +
> > > +   - Does the reverse mapping index list only the appropriate
> > > owner
> > > as the
> > > +     owner of each block?
> > > +
> > > +   - Are none of the blocks claimed as free space?
> > > +
> > > +   - If these aren't file data blocks, are none of the blocks
> > > claimed as space
> > > +     shared by different owners?
> > > +
> > > +2. Btree blocks are cross-referenced as follows:
> > > +
> > > +   - Everything in class 1 above.
> > > +
> > > +   - If there's a parent node block, do the keys listed for this
> > > block match the
> > > +     keyspace of this block?
> > > +
> > > +   - Do the sibling pointers point to valid blocks?  Of the same
> > > level?
> > > +
> > > +   - Do the child pointers point to valid blocks?  Of the next
> > > level
> > > down?
> > > +
> > > +3. Free space btree records are cross-referenced as follows:
> > > +
> > > +   - Everything in class 1 and 2 above.
> > > +
> > > +   - Does the reverse mapping index list no owners of this
> > > space?
> > > +
> > > +   - Is this space not claimed by the inode index for inodes?
> > > +
> > > +   - Is it not mentioned by the reference count index?
> > > +
> > > +   - Is there a matching record in the other free space btree?
> > > +
> > > +4. Inode btree records are cross-referenced as follows:
> > > +
> > > +   - Everything in class 1 and 2 above.
> > > +
> > > +   - Is there a matching record in free inode btree?
> > > +
> > > +   - Do cleared bits in the holemask correspond with inode
> > > clusters?
> > > +
> > > +   - Do set bits in the freemask correspond with inode records
> > > with
> > > zero link
> > > +     count?
> > > +
> > > +5. Inode records are cross-referenced as follows:
> > > +
> > > +   - Everything in class 1.
> > > +
> > > +   - Do all the fields that summarize information about the file
> > > forks actually
> > > +     match those forks?
> > > +
> > > +   - Does each inode with zero link count correspond to a record
> > > in
> > > the free
> > > +     inode btree?
> > > +
> > > +6. File fork space mapping records are cross-referenced as
> > > follows:
> > > +
> > > +   - Everything in class 1 and 2 above.
> > > +
> > > +   - Is this space not mentioned by the inode btrees?
> > > +
> > > +   - If this is a CoW fork mapping, does it correspond to a CoW
> > > entry in the
> > > +     reference count btree?
> > > +
> > > +7. Reference count records are cross-referenced as follows:
> > > +
> > > +   - Everything in class 1 and 2 above.
> > > +
> > > +   - Within the space subkeyspace of the rmap btree (that is to
> > > say,
> > > all
> > > +     records mapped to a particular space extent and ignoring
> > > the
> > > owner info),
> > > +     are there the same number of reverse mapping records for
> > > each
> > > block as the
> > > +     reference count record claims?
> > > +
> > > +Proposed patchsets are the series to find gaps in
> > > +`refcount btree
> > > +<
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > > log/?h=scrub-detect-refcount-gaps>`_,
> > > +`inode btree
> > > +<
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > > log/?h=scrub-detect-inobt-gaps>`_, and
> > > +`rmap btree
> > > +<
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > > log/?h=scrub-detect-rmapbt-gaps>`_ records;
> > > +to find
> > > +`mergeable records
> > > +<
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > > log/?h=scrub-detect-mergeable-records>`_;
> > > +and to
> > > +`improve cross referencing with rmap
> > > +<
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > > log/?h=scrub-strengthen-rmap-checking>`_
> > > +before starting a repair.
> > > +
> > > +Checking Extended Attributes
> > > +````````````````````````````
> > > +
> > > +Extended attributes implement a key-value store that enable
> > > fragments of data
> > > +to be attached to any file.
> > > +Both the kernel and userspace can access the keys and values,
> > > subject to
> > > +namespace and privilege restrictions.
> > > +Most typically these fragments are metadata about the file --
> > > origins, security
> > > +contexts, user-supplied labels, indexing information, etc.
> > > +
> > > +Names can be as long as 255 bytes and can exist in several
> > > different
> > > +namespaces.
> > > +Values can be as large as 64KB.
> > > +A file's extended attributes are stored in blocks mapped by the
> > > attr
> > > fork.
> > > +The mappings point to leaf blocks, remote value blocks, or
> > > dabtree
> > > blocks.
> > > +Block 0 in the attribute fork is always the top of the
> > > structure,
> > > but otherwise
> > > +each of the three types of blocks can be found at any offset in
> > > the
> > > attr fork.
> > > +Leaf blocks contain attribute key records that point to the name
> > > and
> > > the value.
> > > +Names are always stored elsewhere in the same leaf block.
> > > +Values that are less than 3/4 the size of a filesystem block are
> > > also stored
> > > +elsewhere in the same leaf block.
> > > +Remote value blocks contain values that are too large to fit
> > > inside
> > > a leaf.
> > > +If the leaf information exceeds a single filesystem block, a
> > > dabtree
> > > (also
> > > +rooted at block 0) is created to map hashes of the attribute
> > > names
> > > to leaf
> > > +blocks in the attr fork.
> > > +
> > > +Checking an extended attribute structure is not so
> > > straightforward
> > > due to the
> > > +lack of separation between attr blocks and index blocks.
> > > +Scrub must read each block mapped by the attr fork and ignore
> > > the
> > > non-leaf
> > > +blocks:
> > > +
> > > +1. Walk the dabtree in the attr fork (if present) to ensure that
> > > there are no
> > > +   irregularities in the blocks or dabtree mappings that do not
> > > point to
> > > +   attr leaf blocks.
> > > +
> > > +2. Walk the blocks of the attr fork looking for leaf blocks.
> > > +   For each entry inside a leaf:
> > > +
> > > +   a. Validate that the name does not contain invalid
> > > characters.
> > > +
> > > +   b. Read the attr value.
> > > +      This performs a named lookup of the attr name to ensure
> > > the
> > > correctness
> > > +      of the dabtree.
> > > +      If the value is stored in a remote block, this also
> > > validates
> > > the
> > > +      integrity of the remote value block.
> > > +
> > > +Checking and Cross-Referencing Directories
> > > +``````````````````````````````````````````
> > > +
> > > +The filesystem directory tree is a directed acyclic graph
> > > structure,
> > > with files
> > > +constituting the nodes, and directory entries (dirents)
> > > constituting
> > > the edges.
> > > +Directories are a special type of file containing a set of
> > > mappings
> > > from a
> > > +255-byte sequence (name) to an inumber.
> > > +These are called directory entries, or dirents for short.
> > > +Each directory file must have exactly one directory pointing to
> > > the
> > > file.
> > > +A root directory points to itself.
> > > +Directory entries point to files of any type.
> > > +Each non-directory file may have multiple directories point to
> > > it.
> > > +
> > > +In XFS, directories are implemented as a file containing up to
> > > three
> > > 32GB
> > > +partitions.
> > > +The first partition contains directory entry data blocks.
> > > +Each data block contains variable-sized records associating a
> > > user-
> > > provided
> > > +name with an inumber and, optionally, a file type.
> > > +If the directory entry data grows beyond one block, the second
> > > partition (which
> > > +exists as post-EOF extents) is populated with a block containing
> > > free space
> > > +information and an index that maps hashes of the dirent names to
> > > directory data
> > > +blocks in the first partition.
> > > +This makes directory name lookups very fast.
> > > +If this second partition grows beyond one block, the third
> > > partition
> > > is
> > > +populated with a linear array of free space information for
> > > faster
> > > +expansions.
> > > +If the free space has been separated and the second partition
> > > grows
> > > again
> > > +beyond one block, then a dabtree is used to map hashes of dirent
> > > names to
> > > +directory data blocks.
> > > +
> > > +Checking a directory is pretty straightforward:
> > > +
> > > +1. Walk the dabtree in the second partition (if present) to
> > > ensure
> > > that there
> > > +   are no irregularities in the blocks or dabtree mappings that
> > > do
> > > not point to
> > > +   dirent blocks.
> > > +
> > > +2. Walk the blocks of the first partition looking for directory
> > > entries.
> > > +   Each dirent is checked as follows:
> > > +
> > > +   a. Does the name contain no invalid characters?
> > > +
> > > +   b. Does the inumber correspond to an actual, allocated inode?
> > > +
> > > +   c. Does the child inode have a nonzero link count?
> > > +
> > > +   d. If a file type is included in the dirent, does it match
> > > the
> > > type of the
> > > +      inode?
> > > +
> > > +   e. If the child is a subdirectory, does the child's dotdot
> > > pointer point
> > > +      back to the parent?
> > > +
> > > +   f. If the directory has a second partition, perform a named
> > > lookup of the
> > > +      dirent name to ensure the correctness of the dabtree.
> > > +
> > > +3. Walk the free space list in the third partition (if present)
> > > to
> > > ensure that
> > > +   the free spaces it describes are really unused.
> > > +
> > > +Checking operations involving :ref:`parents <dirparent>` and
> > > +:ref:`file link counts <nlinks>` are discussed in more detail in
> > > later
> > > +sections.
> > > +
> > > +Checking Directory/Attribute Btrees
> > > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > +
> > > +As stated in previous sections, the directory/attribute btree
> > > (dabtree) index
> > > +maps user-provided names to improve lookup times by avoiding
> > > linear
> > > scans.
> > > +Internally, it maps a 32-bit hash of the name to a block offset
> > > within the
> > > +appropriate file fork.
> > > +
> > > +The internal structure of a dabtree closely resembles the btrees
> > > that record
> > > +fixed-size metadata records -- each dabtree block contains a
> > > magic
> > > number, a
> > > +checksum, sibling pointers, a UUID, a tree level, and a log
> > > sequence
> > > number.
> > > +The format of leaf and node records are the same -- each entry
> > > points to the
> > > +next level down in the hierarchy, with dabtree node records
> > > pointing
> > > to dabtree
> > > +leaf blocks, and dabtree leaf records pointing to non-dabtree
> > > blocks
> > > elsewhere
> > > +in the fork.
> > > +
> > > +Checking and cross-referencing the dabtree is very similar to
> > > what
> > > is done for
> > > +space btrees:
> > > +
> > > +- Does the type of data stored in the block match what scrub is
> > > expecting?
> > > +
> > > +- Does the block belong to the owning structure that asked for
> > > the
> > > read?
> > > +
> > > +- Do the records fit within the block?
> > > +
> > > +- Are the records contained inside the block free of obvious
> > > corruptions?
> > > +
> > > +- Are the name hashes in the correct order?
> > > +
> > > +- Do node pointers within the dabtree point to valid fork
> > > offsets
> > > for dabtree
> > > +  blocks?
> > > +
> > > +- Do leaf pointers within the dabtree point to valid fork
> > > offsets
> > > for directory
> > > +  or attr leaf blocks?
> > > +
> > > +- Do child pointers point towards the leaves?
> > > +
> > > +- Do sibling pointers point across the same level?
> > > +
> > > +- For each dabtree node record, does the record key accurately
> > > reflect
> > > the
> > > +  contents of the child dabtree block?
> > > +
> > > +- For each dabtree leaf record, does the record key accurately
> > > reflect
> > > the
> > > +  contents of the directory or attr block?
> > > +
> > > +Cross-Referencing Summary Counters
> > > +``````````````````````````````````
> > > +
> > > +XFS maintains three classes of summary counters: available
> > > resources, quota
> > > +resource usage, and file link counts.
> > > +
> > > +In theory, the amount of available resources (data blocks,
> > > inodes,
> > > realtime
> > > +extents) can be found by walking the entire filesystem.
> > > +This would make for very slow reporting, so a transactional
> > > filesystem can
> > > +maintain summaries of this information in the superblock.
> > > +Cross-referencing these values against the filesystem metadata
> > > should be a
> > > +simple matter of walking the free space and inode metadata in
> > > each
> > > AG and the
> > > +realtime bitmap, but there are complications that will be
> > > discussed
> > > in
> > > +:ref:`more detail <fscounters>` later.
> > > +
> > > +:ref:`Quota usage <quotacheck>` and :ref:`file link count
> > > <nlinks>`
> > > +checking are sufficiently complicated to warrant separate
> > > sections.
> > > +
> > > +Post-Repair Reverification
> > > +``````````````````````````
> > > +
> > > +After performing a repair, the checking code is run a second
> > > time to
> > > validate
> > > +the new structure, and the results of the health assessment are
> > > recorded
> > > +internally and returned to the calling process.
> > > +This step is critical for enabling system administrator to
> > > monitor
> > > the status
> > > +of the filesystem and the progress of any repairs.
> > > +For developers, it is a useful means to judge the efficacy of
> > > error
> > > detection
> > > +and correction in the online and offline checking tools.
> > > diff --git a/Documentation/filesystems/xfs-self-describing-
> > > metadata.rst b/Documentation/filesystems/xfs-self-describing-
> > > metadata.rst
> > > index b79dbf36dc94..a10c4ae6955e 100644
> > > --- a/Documentation/filesystems/xfs-self-describing-metadata.rst
> > > +++ b/Documentation/filesystems/xfs-self-describing-metadata.rst
> > > @@ -1,4 +1,5 @@
> > >  .. SPDX-License-Identifier: GPL-2.0
> > > +.. _xfs_self_describing_metadata:
> > >  
> > >  ============================
> > >  XFS Self Describing Metadata
> > > 
> > 


^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 07/14] xfs: document pageable kernel memory
  2023-02-02 23:14       ` Darrick J. Wong
@ 2023-02-09  5:41         ` Allison Henderson
  2023-02-09 23:14           ` Darrick J. Wong
  0 siblings, 1 reply; 220+ messages in thread
From: Allison Henderson @ 2023-02-09  5:41 UTC (permalink / raw)
  To: djwong
  Cc: Catherine Hoang, david, willy, linux-xfs, Chandan Babu,
	linux-fsdevel, hch

On Thu, 2023-02-02 at 15:14 -0800, Darrick J. Wong wrote:
> On Thu, Feb 02, 2023 at 07:14:22AM +0000, Allison Henderson wrote:
> > On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > Add a discussion of pageable kernel memory, since online fsck
> > > needs
> > > quite a bit more memory than most other parts of the filesystem
> > > to
> > > stage
> > > records and other information.
> > > 
> > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > ---
> > >  .../filesystems/xfs-online-fsck-design.rst         |  490
> > > ++++++++++++++++++++
> > >  1 file changed, 490 insertions(+)
> > > 
> > > 
> > > diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst
> > > b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > index 419eb54ee200..9d7a2ef1d0dd 100644
> > > --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> > > +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > @@ -383,6 +383,8 @@ Algorithms") of Srinivasan.
> > >  However, any data structure builder that maintains a resource
> > > lock
> > > for the
> > >  duration of the repair is *always* an offline algorithm.
> > >  
> > > +.. _secondary_metadata:
> > > +
> > >  Secondary Metadata
> > >  ``````````````````
> > >  
> > > @@ -1746,3 +1748,491 @@ Scrub teardown disables all static keys
> > > obtained by ``xchk_fshooks_enable``.
> > >  
> > >  For more information, please see the kernel documentation of
> > >  Documentation/staging/static-keys.rst.
> > > +
> > > +.. _xfile:
> > > +
> > > +Pageable Kernel Memory
> > > +----------------------
> > > +
> > > +Demonstrations of the first few prototypes of online repair
> > > revealed
> > > new
> > > +technical requirements that were not originally identified.
> > > +For the first demonstration, the code walked whatever filesystem
> > > +metadata it needed to synthesize new records and inserted
> > > records
> > > into a new
> > > +btree as it found them.
> > > +This was subpar since any additional corruption or runtime
> > > errors
> > > encountered
> > > +during the walk would shut down the filesystem.
> > > +After remount, the blocks containing the half-rebuilt data
> > > structure
> > > would not
> > > +be accessible until another repair was attempted.
> > > +Solving the problem of half-rebuilt data structures will be
> > > discussed in the
> > > +next section.
> > > +
> > > +For the second demonstration, the synthesized records were
> > > instead
> > > stored in
> > > +kernel slab memory.
> > > +Doing so enabled online repair to abort without writing to the
> > > filesystem if
> > > +the metadata walk failed, which prevented online fsck from
> > > making
> > > things worse.
> > > +However, even this approach needed improving upon.
> > > +
> > > +There are four reasons why traditional Linux kernel memory
> > > management isn't
> > > +suitable for storing large datasets:
> > > +
> > > +1. Although it is tempting to allocate a contiguous block of
> > > memory
> > > to create a
> > > +   C array, this cannot easily be done in the kernel because it
> > > cannot be
> > > +   relied upon to allocate multiple contiguous memory pages.
> > > +
> > > +2. While disparate physical pages can be virtually mapped
> > > together,
> > > installed
> > > +   memory might still not be large enough to stage the entire
> > > record
> > > set in
> > > +   memory while constructing a new btree.
> > > +
> > > +3. To overcome these two difficulties, the implementation was
> > > adjusted to use
> > > +   doubly linked lists, which means every record object needed
> > > two
> > > 64-bit list
> > > +   head pointers, which is a lot of overhead.
> > > +
> > > +4. Kernel memory is pinned, which can drive the system out of
> > > memory, leading
> > > +   to OOM kills of unrelated processes.
> > > +
> > I think I might just jump to whatever the current plan is instead of
> > trying to keep a record of the dev history in the document.
> > I'm sure we're not done yet, dev really never is, so in order for the
> > documentation to be maintained, it would just get bigger and bigger to
> > keep documenting it this way.  It's not that the above isn't valuable,
> > but maybe a different kind of document really.
> 
> OK, I've shortened this introduction to outline the requirements, and
> trimmed the historical information to a sidebar:
> 
> "Some online checking functions work by scanning the filesystem to
> build
> a shadow copy of an ondisk metadata structure in memory and comparing
> the two copies. For online repair to rebuild a metadata structure, it
> must compute the record set that will be stored in the new structure
> before it can persist that new structure to disk. Ideally, repairs
> complete with a single atomic commit that introduces a new data
> structure. To meet these goals, the kernel needs to collect a large
> amount of information in a place that doesn’t require the correct
> operation of the filesystem.
> 
> "Kernel memory isn’t suitable because:
> 
> *   Allocating a contiguous region of memory to create a C array is
> very
>     difficult, especially on 32-bit systems.
> 
> *   Linked lists of records introduce double pointer overhead which
> is
>     very high and eliminate the possibility of indexed lookups.
> 
> *   Kernel memory is pinned, which can drive the system into OOM
>     conditions.
> 
> *   The system might not have sufficient memory to stage all the
>     information.
> 
> "At any given time, online fsck does not need to keep the entire
> record
> set in memory, which means that individual records can be paged out
> if
> necessary. Continued development of online fsck demonstrated that the
> ability to perform indexed data storage would also be very useful.
> Fortunately, the Linux kernel already has a facility for
> byte-addressable and pageable storage: tmpfs. In-kernel graphics
> drivers
> (most notably i915) take advantage of tmpfs files to store
> intermediate
> data that doesn’t need to be in memory at all times, so that usage
> precedent is already established. Hence, the xfile was born!
> 
> Historical Sidebar
> ------------------
> 
> "The first edition of online repair inserted records into a new btree
> as
> it found them, which failed because filesystem could shut down with a
> built data structure, which would be live after recovery finished.
> 
> "The second edition solved the half-rebuilt structure problem by
> storing
> everything in memory, but frequently ran the system out of memory.
> 
> "The third edition solved the OOM problem by using linked lists, but
> the
> list overhead was extreme."
Ok, I think that's cleaner

> 
> > 
> > 
> > > +For the third iteration, attention swung back to the possibility
> > > of
> > > using
> > 
> > Due to the large volume of metadata that needs to be processed,
> > ofsck
> > uses...
> > 
> > > +byte-indexed array-like storage to reduce the overhead of in-
> > > memory
> > > records.
> > > +At any given time, online repair does not need to keep the
> > > entire
> > > record set in
> > > +memory, which means that individual records can be paged out.
> > > +Creating new temporary files in the XFS filesystem to store
> > > intermediate data
> > > +was explored and rejected for some types of repairs because a
> > > filesystem with
> > > +compromised space and inode metadata should never be used to fix
> > > compromised
> > > +space or inode metadata.
> > > +However, the kernel already has a facility for byte-addressable
> > > and
> > > pageable
> > > +storage: shmfs.
> > > +In-kernel graphics drivers (most notably i915) take advantage of
> > > shmfs files
> > > +to store intermediate data that doesn't need to be in memory at
> > > all
> > > times, so
> > > +that usage precedent is already established.
> > > +Hence, the ``xfile`` was born!
> > > +
> > > +xfile Access Models
> > > +```````````````````
> > > +
> > > +A survey of the intended uses of xfiles suggested these use
> > > cases:
> > > +
> > > +1. Arrays of fixed-sized records (space management btrees,
> > > directory
> > > and
> > > +   extended attribute entries)
> > > +
> > > +2. Sparse arrays of fixed-sized records (quotas and link counts)
> > > +
> > > +3. Large binary objects (BLOBs) of variable sizes (directory and
> > > extended
> > > +   attribute names and values)
> > > +
> > > +4. Staging btrees in memory (reverse mapping btrees)
> > > +
> > > +5. Arbitrary contents (realtime space management)
> > > +
> > > +To support the first four use cases, high level data structures
> > > wrap
> > > the xfile
> > > +to share functionality between online fsck functions.
> > > +The rest of this section discusses the interfaces that the xfile
> > > presents to
> > > +four of those five higher level data structures.
> > > +The fifth use case is discussed in the :ref:`realtime summary
> > > <rtsummary>` case
> > > +study.
> > > +
> > > +The most general storage interface supported by the xfile
> > > enables
> > > the reading
> > > +and writing of arbitrary quantities of data at arbitrary offsets
> > > in
> > > the xfile.
> > > +This capability is provided by ``xfile_pread`` and
> > > ``xfile_pwrite``
> > > functions,
> > > +which behave similarly to their userspace counterparts.
> > > +XFS is very record-based, which suggests that the ability to
> > > load
> > > and store
> > > +complete records is important.
> > > +To support these cases, a pair of ``xfile_obj_load`` and
> > > ``xfile_obj_store``
> > > +functions are provided to read and persist objects into an
> > > xfile.
> > > +They are internally the same as pread and pwrite, except that
> > > they
> > > treat any
> > > +error as an out of memory error.
> > > +For online repair, squashing error conditions in this manner is
> > > an
> > > acceptable
> > > +behavior because the only reaction is to abort the operation
> > > back to
> > > userspace.
> > > +All five xfile use cases can be serviced by these four functions.
> > > +
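> > > +A minimal sketch of the load/store idiom follows; the signatures
> > > +shown here are assumptions for illustration rather than a final API:
> > > +
> > > +.. code-block:: c
> > > +
> > > +       /* Stage one fixed-size record at a caller-computed offset. */
> > > +       loff_t  pos = idx * sizeof(rec);
> > > +       int     error;
> > > +
> > > +       error = xfile_obj_store(xf, &rec, sizeof(rec), pos);
> > > +       if (error)
> > > +               return error;   /* reported as out of memory */
> > > +
> > > +       /* Later, recall the record for the btree builder. */
> > > +       error = xfile_obj_load(xf, &rec, sizeof(rec), pos);
> > > +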
> > > +However, no discussion of file access idioms is complete without
> > > answering the
> > > +question, "But what about mmap?"
> > I actually wouldn't spend too much time discussing solutions that
> > didn't work for whatever reason, unless someone's really asking for
> > it.  I think this section would read just fine if we trim off the
> > last paragraph here
> 
> Since I wrote this, I've been experimenting with wiring up the tmpfs
> file page cache folios to the xfs buffer cache.  Pinning the folios
> in
> this manner makes it so that online fsck can (more or less) directly
> access the xfile contents.  Much to my surprise, this has actually
> held
> up in testing, so ... it's no longer a solution that "didn't really
> work". :)
> 
> I also need to s/page/folio/ now that willy has finished that
> conversion.  This section has been rewritten as such:
> 
> "However, no discussion of file access idioms is complete without
> answering the question, “But what about mmap?” It is convenient to
> access storage directly with pointers, just like userspace code does
> with regular memory. Online fsck must not drive the system into OOM
> conditions, which means that xfiles must be responsive to memory
> reclamation. tmpfs can only push a pagecache folio to the swap cache
> if
> the folio is neither pinned nor locked, which means the xfile must
> not
> pin too many folios.
> 
> "Short term direct access to xfile contents is done by locking the
> pagecache folio and mapping it into kernel address space.
> Programmatic
> access (e.g. pread and pwrite) uses this mechanism. Folio locks are
> not
> supposed to be held for long periods of time, so long term direct
> access
> to xfile contents is done by bumping the folio refcount, mapping it
> into
> kernel address space, and dropping the folio lock. These long term
> users
> must be responsive to memory reclaim by hooking into the shrinker
> infrastructure to know when to release folios.
> 
> "The xfile_get_page and xfile_put_page functions are provided to
> retrieve the (locked) folio that backs part of an xfile and to
> release
> it. The only code to use these folio lease functions are the xfarray
> sorting algorithms and the in-memory btrees."
Alrighty, sounds like a good update then

> 
> > > +It would be *much* more convenient if kernel code could access
> > > pageable kernel
> > > +memory with pointers, just like userspace code does with regular
> > > memory.
> > > +Like any other filesystem that uses the page cache, reads and
> > > writes
> > > of xfile
> > > +data lock the cache page and map it into the kernel address
> > > space
> > > for the
> > > +duration of the operation.
> > > +Unfortunately, shmfs can only write a file page to the swap
> > > device
> > > if the page
> > > +is unmapped and unlocked, which means the xfile risks causing
> > > OOM
> > > problems
> > > +unless it is careful not to pin too many pages.
> > > +Therefore, the xfile steers most of its users towards
> > > programmatic
> > > access so
> > > +that backing pages are not kept locked in memory for longer than
> > > is
> > > necessary.
> > > +However, for callers performing quick linear scans of xfile
> > > data,
> > > +``xfile_get_page`` and ``xfile_put_page`` functions are provided
> > > to
> > > pin a page
> > > +in memory.
> > > +So far, the only code to use these functions is the xfarray
> > > :ref:`sorting
> > > +<xfarray_sort>` algorithms.
> > > +
> > > +xfile Access Coordination
> > > +`````````````````````````
> > > +
> > > +For security reasons, xfiles must be owned privately by the
> > > kernel.
> > > +They are marked ``S_PRIVATE`` to prevent interference from the
> > > security system,
> > > +must never be mapped into process file descriptor tables, and
> > > their
> > > pages must
> > > +never be mapped into userspace processes.
> > > +
> > > +To avoid locking recursion issues with the VFS, all accesses to
> > > the
> > > shmfs file
> > > +are performed by manipulating the page cache directly.
> > > +xfile writes call the ``->write_begin`` and ``->write_end``
> > > functions of the
> > > +xfile's address space to grab writable pages, copy the caller's
> > > buffer into the
> > > +page, and release the pages.
> > > +xfile reads call ``shmem_read_mapping_page_gfp`` to grab pages
> > xfile readers
> 
> OK.
> 
> > > directly before
> > > +copying the contents into the caller's buffer.
> > > +In other words, xfiles ignore the VFS read and write code paths
> > > to
> > > avoid
> > > +having to create a dummy ``struct kiocb`` and to avoid taking
> > > inode
> > > and
> > > +freeze locks.
> > > +
> > > +If an xfile is shared between threads to stage repairs, the
> > > caller
> > > must provide
> > > +its own locks to coordinate access.
> > Ofsck threads that share an xfile between stage repairs will use
> > their
> > own locks to coordinate access with each other.
> > 
> > ?
> 
> Hm.  I wonder if there's a misunderstanding here?
> 
> Online fsck functions themselves are single-threaded, which is to say
> that they themselves neither queue workers nor start kthreads. 
> However,
> an xfile created by a running fsck function can be accessed from other
> threads if the fsck function also hooks itself into filesystem code.
> 
> The live update section has a nice diagram of how that works:
> https://djwong.org/docs/xfs-online-fsck-design/#filesystem-hooks
> 

Oh ok, I think I got hung up on who the callers were.  How about
"xfiles shared between threads running from hooked filesystem functions
will use their own locks to coordinate access with each other."

> > > +
> > > +.. _xfarray:
> > > +
> > > +Arrays of Fixed-Sized Records
> > > +`````````````````````````````
> > > +
> > > +In XFS, each type of indexed space metadata (free space, inodes,
> > > reference
> > > +counts, file fork space, and reverse mappings) consists of a set
> > > of
> > > fixed-size
> > > +records indexed with a classic B+ tree.
> > > +Directories have a set of fixed-size dirent records that point
> > > to
> > > the names,
> > > +and extended attributes have a set of fixed-size attribute keys
> > > that
> > > point to
> > > +names and values.
> > > +Quota counters and file link counters index records with
> > > numbers.
> > > +During a repair, scrub needs to stage new records during the
> > > gathering step and
> > > +retrieve them during the btree building step.
> > > +
> > > +Although this requirement can be satisfied by calling the read
> > > and
> > > write
> > > +methods of the xfile directly, it is simpler for callers for
> > > there
> > > to be a
> > > +higher level abstraction to take care of computing array
> > > offsets, to
> > > provide
> > > +iterator functions, and to deal with sparse records and sorting.
> > > +The ``xfarray`` abstraction presents a linear array for fixed-
> > > size
> > > records atop
> > > +the byte-accessible xfile.
> > > +
> > > +.. _xfarray_access_patterns:
> > > +
> > > +Array Access Patterns
> > > +^^^^^^^^^^^^^^^^^^^^^
> > > +
> > > +Array access patterns in online fsck tend to fall into three
> > > categories.
> > > +Iteration of records is assumed to be necessary for all cases
> > > and
> > > will be
> > > +covered in the next section.
> > > +
> > > +The first type of caller handles records that are indexed by
> > > position.
> > > +Gaps may exist between records, and a record may be updated
> > > multiple
> > > times
> > > +during the collection step.
> > > +In other words, these callers want a sparse linearly addressed
> > > table
> > > file.
> > > +The typical use cases are quota records or file link count
> > > records.
> > > +Access to array elements is performed programmatically via
> > > ``xfarray_load`` and
> > > +``xfarray_store`` functions, which wrap the similarly-named
> > > xfile
> > > functions to
> > > +provide loading and storing of array elements at arbitrary array
> > > indices.
> > > +Gaps are defined to be null records, and null records are
> > > defined to
> > > be a
> > > +sequence of all zero bytes.
> > > +Null records are detected by calling
> > > ``xfarray_element_is_null``.
> > > +They are created either by calling ``xfarray_unset`` to null out
> > > an
> > > existing
> > > +record or by never storing anything to an array index.
> > > +
> > > +The second type of caller handles records that are not indexed
> > > by
> > > position
> > > +and do not require multiple updates to a record.
> > > +The typical use case here is rebuilding space btrees and
> > > key/value
> > > btrees.
> > > +These callers can add records to the array without caring about
> > > array indices
> > > +via the ``xfarray_append`` function, which stores a record at
> > > the
> > > end of the
> > > +array.
> > > +For callers that require records to be presentable in a specific
> > > order (e.g.
> > > +rebuilding btree data), the ``xfarray_sort`` function can
> > > arrange
> > > the sorted
> > > +records; this function will be covered later.
> > > +
> > > +The third type of caller is a bag, which is useful for counting
> > > records.
> > > +The typical use case here is constructing space extent reference
> > > counts from
> > > +reverse mapping information.
> > > +Records can be put in the bag in any order, they can be removed
> > > from
> > > the bag
> > > +at any time, and uniqueness of records is left to callers.
> > > +The ``xfarray_store_anywhere`` function is used to insert a
> > > record
> > > in any
> > > +null record slot in the bag; and the ``xfarray_unset`` function
> > > removes a
> > > +record from the bag.
> > > +
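> > > +A rough sketch of the three access styles, with argument lists that
> > > +are assumptions for illustration and error handling elided:
> > > +
> > > +.. code-block:: c
> > > +
> > > +       /* 1. Sparse, positionally indexed records, e.g. link counts. */
> > > +       xfarray_store(counts, ino_idx, &nlink_rec);
> > > +       xfarray_load(counts, ino_idx, &nlink_rec);
> > > +       if (xfarray_element_is_null(counts, &nlink_rec))
> > > +               ;       /* never written, or explicitly unset */
> > > +
> > > +       /* 2. Append-only records destined for a new btree. */
> > > +       xfarray_append(rmap_records, &rmap_rec);
> > > +
> > > +       /* 3. Bag semantics: store in any free slot, remove at will. */
> > > +       xfarray_store_anywhere(refcount_bag, &rc_rec);
> > > +       xfarray_unset(refcount_bag, rc_idx);    /* index noted at store time */
> > > +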
> > > +The proposed patchset is the
> > > +`big in-memory array
> > > +<
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > > log/?h=big-array>`_.
> > > +
> > > +Iterating Array Elements
> > > +^^^^^^^^^^^^^^^^^^^^^^^^
> > > +
> > > +Most users of the xfarray require the ability to iterate the
> > > records
> > > stored in
> > > +the array.
> > > +Callers can probe every possible array index with the following:
> > > +
> > > +.. code-block:: c
> > > +
> > > +       xfarray_idx_t i;
> > > +       foreach_xfarray_idx(array, i) {
> > > +           xfarray_load(array, i, &rec);
> > > +
> > > +           /* do something with rec */
> > > +       }
> > > +
> > > +All users of this idiom must be prepared to handle null records
> > > or
> > > must already
> > > +know that there aren't any.
> > > +
> > > +For xfarray users that want to iterate a sparse array, the
> > > ``xfarray_iter``
> > > +function ignores indices in the xfarray that have never been
> > > written
> > > to by
> > > +calling ``xfile_seek_data`` (which internally uses
> > > ``SEEK_DATA``) to
> > > skip areas
> > > +of the array that are not populated with memory pages.
> > > +Once it finds a page, it will skip the zeroed areas of the page.
> > > +
> > > +.. code-block:: c
> > > +
> > > +       xfarray_idx_t i = XFARRAY_CURSOR_INIT;
> > > +       while ((ret = xfarray_iter(array, &i, &rec)) == 1) {
> > > +           /* do something with rec */
> > > +       }
> > > +
> > > +.. _xfarray_sort:
> > > +
> > > +Sorting Array Elements
> > > +^^^^^^^^^^^^^^^^^^^^^^
> > > +
> > > +During the fourth demonstration of online repair, a community
> > > reviewer remarked
> > > +that for performance reasons, online repair ought to load
> > > batches of
> > > records
> > > +into btree record blocks instead of inserting records into a new
> > > btree one at a
> > > +time.
> > > +The btree insertion code in XFS is responsible for maintaining
> > > correct ordering
> > > +of the records, so naturally the xfarray must also support
> > > sorting
> > > the record
> > > +set prior to bulk loading.
> > > +
> > > +The sorting algorithm used in the xfarray is actually a
> > > combination
> > > of adaptive
> > > +quicksort and a heapsort subalgorithm in the spirit of
> > > +`Sedgewick <https://algs4.cs.princeton.edu/23quicksort/>`_ and
> > > +`pdqsort <https://github.com/orlp/pdqsort>`_, with
> > > customizations
> > > for the Linux
> > > +kernel.
> > > +To sort records in a reasonably short amount of time,
> > > ``xfarray``
> > > takes
> > > +advantage of the binary subpartitioning offered by quicksort,
> > > but it
> > > also uses
> > > +heapsort to hedge against performance collapse if the chosen
> > > quicksort pivots
> > > +are poor.
> > > +Both algorithms are (in general) O(n * lg(n)), but there is a
> > > wide
> > > performance
> > > +gulf between the two implementations.
> > > +
> > > +The Linux kernel already contains a reasonably fast
> > > implementation
> > > of heapsort.
> > > +It only operates on regular C arrays, which limits the scope of
> > > its
> > > usefulness.
> > > +There are two key places where the xfarray uses it:
> > > +
> > > +* Sorting any record subset backed by a single xfile page.
> > > +
> > > +* Loading a small number of xfarray records from potentially
> > > disparate parts
> > > +  of the xfarray into a memory buffer, and sorting the buffer.
> > > +
> > > +In other words, ``xfarray`` uses heapsort to constrain the
> > > nested
> > > recursion of
> > > +quicksort, thereby mitigating quicksort's worst runtime
> > > behavior.
> > > +
> > > +Choosing a quicksort pivot is a tricky business.
> > > +A good pivot splits the set to sort in half, leading to the
> > > divide
> > > and conquer
> > > +behavior that is crucial to  O(n * lg(n)) performance.
> > > +A poor pivot barely splits the subset at all, leading to O(n\
> > > :sup:`2`)
> > > +runtime.
> > > +The xfarray sort routine tries to avoid picking a bad pivot by
> > > sampling nine
> > > +records into a memory buffer and using the kernel heapsort to
> > > identify the
> > > +median of the nine.
> > > +
> > > +Most modern quicksort implementations employ Tukey's "ninther"
> > > to
> > > select a
> > > +pivot from a classic C array.
> > > +Typical ninther implementations pick three unique triads of
> > > records,
> > > sort each
> > > +of the triads, and then sort the middle value of each triad to
> > > determine the
> > > +ninther value.
> > > +As stated previously, however, xfile accesses are not entirely
> > > cheap.
> > > +It turned out to be much more performant to read the nine
> > > elements
> > > into a
> > > +memory buffer, run the kernel's in-memory heapsort on the
> > > buffer,
> > > and choose
> > > +the 4th element of that buffer as the pivot.
> > > +Tukey's ninthers are described in J. W. Tukey, `The ninther, a
> > > technique for
> > > +low-effort robust (resistant) location in large samples`, in
> > > *Contributions to
> > > +Survey Sampling and Applied Statistics*, edited by H. David,
> > > (Academic Press,
> > > +1978), pp. 251–257.
> > > +
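> > > +A sketch of that pivot selection; the record type and comparison
> > > +helper are hypothetical, and ``lo``/``hi`` bound the subset being
> > > +partitioned:
> > > +
> > > +.. code-block:: c
> > > +
> > > +       struct xfs_rec  samples[9], pivot;
> > > +       int             i;
> > > +
> > > +       /* Sample nine records spread across the subset [lo, hi]. */
> > > +       for (i = 0; i < 9; i++)
> > > +               xfarray_load(array, lo + i * ((hi - lo) / 8), &samples[i]);
> > > +
> > > +       /* Kernel heapsort on the small in-memory buffer... */
> > > +       sort(samples, 9, sizeof(samples[0]), xfs_rec_cmp, NULL);
> > > +
> > > +       /* ...and the median of the nine becomes the pivot. */
> > > +       pivot = samples[4];
> > > +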
> > > +The partitioning of quicksort is fairly textbook -- rearrange
> > > the
> > > record
> > > +subset around the pivot, then set up the current and next stack
> > > frames to
> > > +sort with the larger and the smaller halves of the pivot,
> > > respectively.
> > > +This keeps the stack space requirements to log2(record count).
> > > +
> > > +As a final performance optimization, the hi and lo scanning
> > > phase of
> > > quicksort
> > > +keeps examined xfile pages mapped in the kernel for as long as
> > > possible to
> > > +reduce map/unmap cycles.
> > > +Surprisingly, this reduces overall sort runtime by nearly half
> > > again
> > > after
> > > +accounting for the application of heapsort directly onto xfile
> > > pages.
> > This sorting section is insightful, but I think I'd be ok without it
> > too.  Or maybe save it for later in the document as an
> > "implementation
> > details" section, or something similar.  It seems like there's
> > still a
> > lot to cover about how ofsck works in general before we start
> > drilling
> > into things like the runtime complexity of the sorting algorithm it
> > uses.  
> 
> How about I demote the details of how sorting works to a case study?
Sure, sounds good
> 
> > > +
> > > +Blob Storage
> > > +````````````
> > > +
> > > +Extended attributes and directories add an additional
> > > requirement
> > > for staging
> > > +records: arbitrary byte sequences of finite length.
> > > +Each directory entry record needs to store the entry name,
> > > +and each extended attribute needs to store both the attribute
> > > name
> > > and value.
> > > +The names, keys, and values can consume a large amount of
> > > memory, so
> > > the
> > > +``xfblob`` abstraction was created to simplify management of
> > > these
> > > blobs
> > > +atop an xfile.
> > > +
> > > +Blob arrays provide ``xfblob_load`` and ``xfblob_store``
> > > functions
> > > to retrieve
> > > +and persist objects.
> > > +The store function returns a magic cookie for every object that
> > > it
> > > persists.
> > > +Later, callers provide this cookie to ``xfblob_load`` to
> > > recall
> > > the object.
> > > +The ``xfblob_free`` function frees a specific blob, and the
> > > ``xfblob_truncate``
> > > +function frees them all because compaction is not needed.
> > > +
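> > > +A short sketch of the blob idiom; the signatures are assumptions
> > > +for illustration:
> > > +
> > > +.. code-block:: c
> > > +
> > > +       xfblob_cookie   cookie;
> > > +       int             error;
> > > +
> > > +       /* Persist a directory entry name and remember where it went. */
> > > +       error = xfblob_store(blobs, &cookie, name, namelen);
> > > +       if (error)
> > > +               return error;
> > > +
> > > +       /* Recall the name when flushing entries to the temporary file. */
> > > +       error = xfblob_load(blobs, cookie, namebuf, namelen);
> > > +
> > > +       /* Release a staged name once it is no longer needed. */
> > > +       xfblob_free(blobs, cookie);
> > > +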
> > > +The details of repairing directories and extended attributes
> > > will be
> > > discussed
> > > +in a subsequent section about atomic extent swapping.
> > > +However, it should be noted that these repair functions only use
> > > blob storage
> > > +to cache a small number of entries before adding them to a
> > > temporary
> > > ondisk
> > > +file, which is why compaction is not required.
> > > +
> > > +The proposed patchset is at the start of the
> > > +`extended attribute repair
> > > +<
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > > log/?h=repair-xattrs>`_ series.
> > > +
> > > +.. _xfbtree:
> > > +
> > > +In-Memory B+Trees
> > > +`````````````````
> > > +
> > > +The chapter about :ref:`secondary metadata<secondary_metadata>`
> > > mentioned that
> > > +checking and repairing of secondary metadata commonly requires
> > > coordination
> > > +between a live metadata scan of the filesystem and writer
> > > threads
> > > that are
> > > +updating that metadata.
> > > +Keeping the scan data up to date requires the ability
> > > to
> > > propagate
> > > +metadata updates from the filesystem into the data being
> > > collected
> > > by the scan.
> > > +This *can* be done by appending concurrent updates into a
> > > separate
> > > log file and
> > > +applying them before writing the new metadata to disk, but this
> > > leads to
> > > +unbounded memory consumption if the rest of the system is very
> > > busy.
> > > +Another option is to skip the side-log and commit live updates
> > > from
> > > the
> > > +filesystem directly into the scan data, which trades more
> > > overhead
> > > for a lower
> > > +maximum memory requirement.
> > > +In both cases, the data structure holding the scan results must
> > > support indexed
> > > +access to perform well.
> > > +
> > > +Given that indexed lookups of scan data are required for both
> > > strategies, online
> > > +fsck employs the second strategy of committing live updates
> > > directly
> > > into
> > > +scan data.
> > > +Because xfarrays are not indexed and do not enforce record
> > > ordering,
> > > they
> > > +are not suitable for this task.
> > > +Conveniently, however, XFS has a library to create and maintain
> > > ordered reverse
> > > +mapping records: the existing rmap btree code!
> > > +If only there was a means to create one in memory.
> > > +
> > > +Recall that the :ref:`xfile <xfile>` abstraction represents
> > > memory
> > > pages as a
> > > +regular file, which means that the kernel can create byte or
> > > block
> > > addressable
> > > +virtual address spaces at will.
> > > +The XFS buffer cache specializes in abstracting IO to block-
> > > oriented  address
> > > +spaces, which means that adaptation of the buffer cache to
> > > interface
> > > with
> > > +xfiles enables reuse of the entire btree library.
> > > +Btrees built atop an xfile are collectively known as
> > > ``xfbtrees``.
> > > +The next few sections describe how they actually work.
> > > +
> > > +The proposed patchset is the
> > > +`in-memory btree
> > > +<
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > > log/?h=in-memory-btrees>`_
> > > +series.
> > > +
> > > +Using xfiles as a Buffer Cache Target
> > > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > +
> > > +Two modifications are necessary to support xfiles as a buffer
> > > cache
> > > target.
> > > +The first is to make it possible for the ``struct xfs_buftarg``
> > > structure to
> > > +host the ``struct xfs_buf`` rhashtable, because normally those
> > > are
> > > held by a
> > > +per-AG structure.
> > > +The second change is to modify the buffer ``ioapply`` function
> > > to
> > > "read" cached
> > > +pages from the xfile and "write" cached pages back to the xfile.
> > > +Multiple access to individual buffers is controlled by the
> > > ``xfs_buf`` lock,
> > > +since the xfile does not provide any locking on its own.
> > > +With this adaptation in place, users of the xfile-backed buffer
> > > cache use
> > > +exactly the same APIs as users of the disk-backed buffer cache.
> > > +The separation between xfile and buffer cache implies higher
> > > memory
> > > usage since
> > > +they do not share pages, but this property could some day enable
> > > transactional
> > > +updates to an in-memory btree.
> > > +Today, however, it simply eliminates the need for new code.
> > > +
> > > +Space Management with an xfbtree
> > > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > +
> > > +Space management for an xfile is very simple -- each btree block
> > > is
> > > one memory
> > > +page in size.
> > > +These blocks use the same header format as an on-disk btree, but
> > > the
> > > in-memory
> > > +block verifiers ignore the checksums, assuming that xfile memory
> > > is
> > > no more
> > > +corruption-prone than regular DRAM.
> > > +Reusing existing code here is more important than absolute
> > > memory
> > > efficiency.
> > > +
> > > +The very first block of an xfile backing an xfbtree contains a
> > > header block.
> > > +The header describes the owner, height, and the block number of
> > > the
> > > root
> > > +xfbtree block.
> > > +
> > > +To allocate a btree block, use ``xfile_seek_data`` to find a gap
> > > in
> > > the file.
> > > +If there are no gaps, create one by extending the length of the
> > > xfile.
> > > +Preallocate space for the block with ``xfile_prealloc``, and
> > > hand
> > > back the
> > > +location.
> > > +To free an xfbtree block, use ``xfile_discard`` (which
> > > internally
> > > uses
> > > +``FALLOC_FL_PUNCH_HOLE``) to remove the memory page from the
> > > xfile.
> > > +
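> > > +Sketched out below; ``xfbno_to_pos`` is a hypothetical helper that
> > > +converts block numbers to xfile offsets, and error handling is
> > > +elided:
> > > +
> > > +.. code-block:: c
> > > +
> > > +       /* Allocate: find or create a gap, then reserve its memory. */
> > > +       error = xfile_prealloc(xf, xfbno_to_pos(newbno), PAGE_SIZE);
> > > +
> > > +       /* Free: punch out the backing page so it can be reused. */
> > > +       xfile_discard(xf, xfbno_to_pos(oldbno), PAGE_SIZE);
> > > +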
> > > +Populating an xfbtree
> > > +^^^^^^^^^^^^^^^^^^^^^
> > > +
> > > +An online fsck function that wants to create an xfbtree should
> > > +proceed as follows (a rough sketch appears after the list):
> > > +
> > > +1. Call ``xfile_create`` to create an xfile.
> > > +
> > > +2. Call ``xfs_alloc_memory_buftarg`` to create a buffer cache
> > > target
> > > structure
> > > +   pointing to the xfile.
> > > +
> > > +3. Pass the buffer cache target, buffer ops, and other
> > > information
> > > to
> > > +   ``xfbtree_create`` to write an initial tree header and root
> > > block
> > > to the
> > > +   xfile.
> > > +   Each btree type should define a wrapper that passes necessary
> > > arguments to
> > > +   the creation function.
> > > +   For example, rmap btrees define ``xfs_rmapbt_mem_create`` to
> > > take
> > > care of
> > > +   all the necessary details for callers.
> > > +   A ``struct xfbtree`` object will be returned.
> > > +
> > > +4. Pass the xfbtree object to the btree cursor creation function
> > > for
> > > the
> > > +   btree type.
> > > +   Following the example above, ``xfs_rmapbt_mem_cursor`` takes
> > > care
> > > of this
> > > +   for callers.
> > > +
> > > +5. Pass the btree cursor to the regular btree functions to make
> > > queries against
> > > +   and to update the in-memory btree.
> > > +   For example, a btree cursor for an rmap xfbtree can be passed
> > > to
> > > the
> > > +   ``xfs_rmap_*`` functions just like any other btree cursor.
> > > +   See the :ref:`next section<xfbtree_commit>` for information
> > > on
> > > dealing with
> > > +   xfbtree updates that are logged to a transaction.
> > > +
> > > +6. When finished, delete the btree cursor, destroy the xfbtree
> > > object, free the
> > > +   buffer target, and then destroy the xfile to release all
> > > resources.
> > > +
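> > > +A condensed sketch of that sequence for an rmap xfbtree, with error
> > > +handling elided; the argument lists and the teardown helper names
> > > +are assumptions for illustration:
> > > +
> > > +.. code-block:: c
> > > +
> > > +       struct xfs_buftarg      *btp;
> > > +       struct xfbtree          *xfbt;
> > > +       struct xfs_btree_cur    *cur;
> > > +       struct xfile            *xf;
> > > +
> > > +       /* mp, tp, and agno are assumed to be in scope. */
> > > +       xfile_create("rmap staging", 0, &xf);           /* step 1 */
> > > +       btp = xfs_alloc_memory_buftarg(mp, xf);         /* step 2 */
> > > +       xfbt = xfs_rmapbt_mem_create(mp, agno, btp);    /* step 3 */
> > > +       cur = xfs_rmapbt_mem_cursor(mp, tp, xfbt);      /* step 4 */
> > > +
> > > +       /* step 5: stage records through the usual xfs_rmap_* calls */
> > > +
> > > +       xfs_btree_del_cursor(cur, 0);                   /* step 6 */
> > > +       xfbtree_destroy(xfbt);
> > > +       xfs_free_buftarg(btp);
> > > +       xfile_destroy(xf);
> > > +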
> > > +.. _xfbtree_commit:
> > > +
> > > +Committing Logged xfbtree Buffers
> > > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > +
> > > +Although it is a clever hack to reuse the rmap btree code to
> > > handle
> > > the staging
> > > +structure, the ephemeral nature of the in-memory btree block
> > > storage
> > > presents
> > > +some challenges of its own.
> > > +The XFS transaction manager must not commit buffer log items for
> > > buffers backed
> > > +by an xfile because the log format does not understand updates
> > > for
> > > devices
> > > +other than the data device.
> > > +An ephemeral xfbtree probably will not exist by the time the AIL
> > > checkpoints
> > > +log transactions back into the filesystem, and certainly won't
> > > exist
> > > during
> > > +log recovery.
> > > +For these reasons, any code updating an xfbtree in transaction
> > > context must
> > > +remove the buffer log items from the transaction and write the
> > > updates into the
> > > +backing xfile before committing or cancelling the transaction.
> > > +
> > > +The ``xfbtree_trans_commit`` and ``xfbtree_trans_cancel``
> > > functions
> > > implement
> > > +this functionality as follows:
> > > +
> > > +1. Find each buffer log item whose buffer targets the xfile.
> > > +
> > > +2. Record the dirty/ordered status of the log item.
> > > +
> > > +3. Detach the log item from the buffer.
> > > +
> > > +4. Queue the buffer to a special delwri list.
> > > +
> > > +5. Clear the transaction dirty flag if the only dirty log items
> > > were
> > > the ones
> > > +   that were detached in step 3.
> > > +
> > > +6. Submit the delwri list to commit the changes to the xfile, if
> > > the
> > > updates
> > > +   are being committed.
> > > +
> > > +After removing xfile logged buffers from the transaction in this
> > > manner, the
> > > +transaction can be committed or cancelled.
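> > > +
> > > +From the caller's point of view, the idiom is roughly the following
> > > +(details assumed for illustration):
> > > +
> > > +.. code-block:: c
> > > +
> > > +       /* ... make xfbtree updates within transaction tp ... */
> > > +
> > > +       if (!cancelling)
> > > +               error = xfbtree_trans_commit(xfbt, tp); /* flush to the xfile */
> > > +       else
> > > +               xfbtree_trans_cancel(xfbt, tp);         /* discard the updates */
> > > +
> > > +       /* Only now is it safe to commit or cancel tp itself. */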
> > Rest of this looks pretty good, organizing nits aside.
> 
> Cool, thank you!!
> 
> --D
> 
> > Allison
> > 
> > > 


^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 06/14] xfs: document how online fsck deals with eventual consistency
  2023-02-02 19:55       ` Darrick J. Wong
@ 2023-02-09  5:41         ` Allison Henderson
  0 siblings, 0 replies; 220+ messages in thread
From: Allison Henderson @ 2023-02-09  5:41 UTC (permalink / raw)
  To: djwong
  Cc: Catherine Hoang, david, willy, linux-xfs, Chandan Babu,
	linux-fsdevel, hch

On Thu, 2023-02-02 at 11:55 -0800, Darrick J. Wong wrote:
> On Tue, Jan 31, 2023 at 06:11:30AM +0000, Allison Henderson wrote:
> > On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > Writes to an XFS filesystem employ an eventual consistency update
> > > model
> > > to break up complex multistep metadata updates into small chained
> > > transactions.  This is generally good for performance and
> > > scalability
> > > because XFS doesn't need to prepare for enormous transactions,
> > > but it
> > > also means that online fsck must be careful not to attempt a fsck
> > > action
> > > unless it can be shown that there are no other threads processing
> > > a
> > > transaction chain.  This part of the design documentation covers
> > > the
> > > thinking behind the consistency model and how scrub deals with
> > > it.
> > > 
> > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > ---
> > >  .../filesystems/xfs-online-fsck-design.rst         |  303
> > > ++++++++++++++++++++
> > >  1 file changed, 303 insertions(+)
> > > 
> > > 
> > > diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst
> > > b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > index f45bf97fa9c4..419eb54ee200 100644
> > > --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> > > +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > @@ -1443,3 +1443,306 @@ This step is critical for enabling system
> > > administrator to monitor the status
> > >  of the filesystem and the progress of any repairs.
> > >  For developers, it is a useful means to judge the efficacy of
> > > error
> > > detection
> > >  and correction in the online and offline checking tools.
> > > +
> > > +Eventual Consistency vs. Online Fsck
> > > +------------------------------------
> > > +
> > > +Midway through the development of online scrubbing, the fsstress
> > > tests
> > > +uncovered a misinteraction between online fsck and compound
> > > transaction chains
> > > +created by other writer threads that resulted in false reports
> > > of
> > > metadata
> > > +inconsistency.
> > > +The root cause of these reports is the eventual consistency
> > > model
> > > introduced by
> > > +the expansion of deferred work items and compound transaction
> > > chains
> > > when
> > > +reverse mapping and reflink were introduced.
> > 
> > 
> > 
> 
> Was there supposed to be a comment here?
No, sometimes I'll fiddle with paraphrasing, but if it's not enough of
an improvement, I'll scrap it.  I think Evolution leaves the white
space

> 
> > > +
> > > +Originally, transaction chains were added to XFS to avoid
> > > deadlocks
> > > when
> > > +unmapping space from files.
> > > +Deadlock avoidance rules require that AGs only be locked in
> > > increasing order,
> > > +which makes it impossible (say) to use a single transaction to
> > > free
> > > a space
> > > +extent in AG 7 and then try to free a now superfluous block
> > > mapping
> > > btree block
> > > +in AG 3.
> > > +To avoid these kinds of deadlocks, XFS creates Extent Freeing
> > > Intent
> > > (EFI) log
> > > +items to commit to freeing some space in one transaction while
> > > deferring the
> > > +actual metadata updates to a fresh transaction.
> > > +The transaction sequence looks like this:
> > > +
> > > +1. The first transaction contains a physical update to the
> > > file's
> > > block mapping
> > > +   structures to remove the mapping from the btree blocks.
> > > +   It then attaches to the in-memory transaction an action item
> > > to
> > > schedule
> > > +   deferred freeing of space.
> > > +   Concretely, each transaction maintains a list of ``struct
> > > +   xfs_defer_pending`` objects, each of which maintains a list
> > > of
> > > ``struct
> > > +   xfs_extent_free_item`` objects.
> > > +   Returning to the example above, the action item tracks the
> > > freeing of both
> > > +   the unmapped space from AG 7 and the block mapping btree
> > > (BMBT)
> > > block from
> > > +   AG 3.
> > > +   Deferred frees recorded in this manner are committed in the
> > > log
> > > by creating
> > > +   an EFI log item from the ``struct xfs_extent_free_item``
> > > object
> > > and
> > > +   attaching the log item to the transaction.
> > > +   When the log is persisted to disk, the EFI item is written
> > > into
> > > the ondisk
> > > +   transaction record.
> > > +   EFIs can list up to 16 extents to free, all sorted in AG
> > > order.
> > > +
> > > +2. The second transaction contains a physical update to the free
> > > space btrees
> > > +   of AG 3 to release the former BMBT block and a second
> > > physical
> > > update to the
> > > +   free space btrees of AG 7 to release the unmapped file space.
> > > +   Observe that the physical updates are resequenced in the
> > > correct order
> > > +   when possible.
> > > +   Attached to the transaction is an extent free done (EFD)
> > > log
> > > item.
> > > +   The EFD contains a pointer to the EFI logged in transaction
> > > #1 so
> > > that log
> > > +   recovery can tell if the EFI needs to be replayed.
> > > +
> > > +If the system goes down after transaction #1 is written back to
> > > the
> > > filesystem
> > > +but before #2 is committed, a scan of the filesystem metadata
> > > would
> > > show
> > > +inconsistent filesystem metadata because there would not appear
> > > to
> > > be any owner
> > > +of the unmapped space.
> > > +Happily, log recovery corrects this inconsistency for us -- when
> > > recovery finds
> > > +an intent log item but does not find a corresponding intent done
> > > item, it will
> > > +reconstruct the incore state of the intent item and finish it.
> > > +In the example above, the log must replay both frees described
> > > in
> > > the recovered
> > > +EFI to complete the recovery phase.
> > > +
> > > +There are two subtleties to XFS' transaction chaining strategy
> > > to
> > > consider.
> > > +The first is that log items must be added to a transaction in
> > > the
> > > correct order
> > > +to prevent conflicts with principal objects that are not held by
> > > the
> > > +transaction.
> > > +In other words, all per-AG metadata updates for an unmapped
> > > block
> > > must be
> > > +completed before the last update to free the extent, and extents
> > > should not
> > > +be reallocated until that last update commits to the log.
> > > +The second subtlety comes from the fact that AG header buffers
> > > are
> > > (usually)
> > > +released between each transaction in a chain.
> > > +This means that other threads can observe an AG in an
> > > intermediate
> > > state,
> > > +but as long as the first subtlety is handled, this should not
> > > affect
> > > the
> > > +correctness of filesystem operations.
> > > +Unmounting the filesystem flushes all pending work to disk,
> > > which
> > > means that
> > > +offline fsck never sees the temporary inconsistencies caused by
> > > deferred work
> > > +item processing.
> > > +In this manner, XFS employs a form of eventual consistency to
> > > avoid
> > > deadlocks
> > > +and increase parallelism.
> > > +
> > > +During the design phase of the reverse mapping and reflink
> > > features,
> > > it was
> > > +decided that it was impractical to cram all the reverse mapping
> > > updates for a
> > > +single filesystem change into a single transaction because a
> > > single
> > > file
> > > +mapping operation can explode into many small updates:
> > > +
> > > +* The block mapping update itself
> > > +* A reverse mapping update for the block mapping update
> > > +* Fixing the freelist
> > > +* A reverse mapping update for the freelist fix
> > > +
> > > +* A shape change to the block mapping btree
> > > +* A reverse mapping update for the btree update
> > > +* Fixing the freelist (again)
> > > +* A reverse mapping update for the freelist fix
> > > +
> > > +* An update to the reference counting information
> > > +* A reverse mapping update for the refcount update
> > > +* Fixing the freelist (a third time)
> > > +* A reverse mapping update for the freelist fix
> > > +
> > > +* Freeing any space that was unmapped and not owned by any other
> > > file
> > > +* Fixing the freelist (a fourth time)
> > > +* A reverse mapping update for the freelist fix
> > > +
> > > +* Freeing the space used by the block mapping btree
> > > +* Fixing the freelist (a fifth time)
> > > +* A reverse mapping update for the freelist fix
> > > +
> > > +Free list fixups are not usually needed more than once per AG
> > > per
> > > transaction
> > > +chain, but it is theoretically possible if space is very tight.
> > > +For copy-on-write updates this is even worse, because this must
> > > be
> > > done once to
> > > +remove the space from a staging area and again to map it into
> > > the
> > > file!
> > > +
> > > +To deal with this explosion in a calm manner, XFS expands its
> > > use of
> > > deferred
> > > +work items to cover most reverse mapping updates and all
> > > refcount
> > > updates.
> > > +This reduces the worst case size of transaction reservations by
> > > breaking the
> > > +work into a long chain of small updates, which increases the
> > > degree
> > > of eventual
> > > +consistency in the system.
> > > +Again, this generally isn't a problem because XFS orders its
> > > deferred work
> > > +items carefully to avoid resource reuse conflicts between
> > > unsuspecting threads.
> > > +
> > > +However, online fsck changes the rules -- remember that although
> > > physical
> > > +updates to per-AG structures are coordinated by locking the
> > > buffers
> > > for AG
> > > +headers, buffer locks are dropped between transactions.
> > > +Once scrub acquires resources and takes locks for a data
> > > structure,
> > > it must do
> > > +all the validation work without releasing the lock.
> > > +If the main lock for a space btree is an AG header buffer lock,
> > > scrub may have
> > > +interrupted another thread that is midway through finishing a
> > > chain.
> > > +For example, if a thread performing a copy-on-write has
> > > completed a
> > > reverse
> > > +mapping update but not the corresponding refcount update, the
> > > two AG
> > > btrees
> > > +will appear inconsistent to scrub and an observation of
> > > corruption
> > > will be
> > > +recorded.  This observation will not be correct.
> > > +If a repair is attempted in this state, the results will be
> > > catastrophic!
> > > +
> > > +Several solutions to this problem were evaluated upon discovery
> > > of
> > > this flaw:
> > 
> > 
> > Hmm, so while having a really in-depth EFI example is insightful, I
> > wonder if it would be more organized to put it in a separate
> > document
> > somewhere and just reference it.  As far as ofsck is concerned, I
> > think
> > a lighter summary would do:
> > 
> > 
> > "Complex operations that modify multiple AGs are performed through
> > a
> > series of transactions which are logged to a journal that an
> > offline
> > fsck can either replay or discard.  Online fsck, however, must be
> > able
> > to deal with these operations while they are still in progress. 
> > This
> > presents a unique challenge for ofsck since a partially completed
> > transaction chain may present the appearance of inconsistencies,
> > even
> > though the operations are functioning as intended. (For a more
> > detailed
> > example, see <cite document here...>)  
> > 
> > The challenge then becomes how to avoid incorrectly repairing these
> > non-issues as doing so would cause more harm than help."
> 
> I agree that this topic needs a much shorter introduction before
> moving
> on to the gory details.  How does this strike you?
> 
> "Complex operations can make modifications to multiple per-AG data
> structures with a chain of transactions.  These chains, once
> committed
> to the log, are restarted during log recovery if the system crashes
> while processing the chain.  Because the AG header buffers are
> unlocked
> between transactions within a chain, online checking must coordinate
> with chained operations that are in progress to avoid incorrectly
> detecting inconsistencies due to pending chains.  Furthermore, online
> repair must not run when operations are pending because the metadata
> are
> temporarily inconsistent with each other, and rebuilding is not
> possible."
> 
> "Only online fsck has this requirement of total consistency of AG
> metadata, and should be relatively rare as compared to filesystem
> change
> operations.  Online fsck coordinates with transaction chains as
> follows:
> 
> * "For each AG, maintain a count of intent items targetting that AG.
>   The count should be bumped whenever a new item is added to the
> chain.
>   The count should be dropped when the filesystem has locked the AG
>   header buffers and finished the work.
> 
> * "When online fsck wants to examine an AG, it should lock the AG
> header
>   buffers to quiesce all transaction chains that want to modify that
> AG.
>   If the count is zero, proceed with the checking operation.  If it
> is
>   nonzero, cycle the buffer locks to allow the chain to make forward
>   progress.
> 
> "This may lead to online fsck taking a long time to complete, but
> regular filesystem updates take precedence over background checking
> activity.  Details about the discovery of this situation are
> presented
> in the <next section>, and details about the solution are presented
> <after that>."
> 
> These gory details of how I recognized the problem are a subsection
> of
> the main heading, and anyone who wants to know them can read it.
> Readers who'd rather move on to the solution can jump directly to the
> "Intent Drains" section.  The <bracketed> text are hyperlinks.
Ok, I think that works.  Much lighter, and more to the point for ofsck
> 
> > > +
> > > +1. Add a higher level lock to allocation groups and require
> > > writer
> > > threads to
> > > +   acquire the higher level lock in AG order before making any
> > > changes.
> > > +   This would be very difficult to implement in practice because
> > > it
> > > is
> > > +   difficult to determine which locks need to be obtained, and
> > > in
> > > what order,
> > > +   without simulating the entire operation.
> > > +   Performing a dry run of a file operation to discover
> > > necessary
> > > locks would
> > > +   make the filesystem very slow.
> > > +
> > > +2. Make the deferred work coordinator code aware of consecutive
> > > intent items
> > > +   targeting the same AG and have it hold the AG header buffers
> > > locked across
> > > +   the transaction roll between updates.
> > > +   This would introduce a lot of complexity into the coordinator
> > > since it is
> > > +   only loosely coupled with the actual deferred work items.
> > > +   It would also fail to solve the problem because deferred work
> > > items can
> > > +   generate new deferred subtasks, but all subtasks must be
> > > complete
> > > before
> > > +   work can start on a new sibling task.
> > Hmm, that one doesn't seem like it's really an option then :-(
> 
> Right.  Now that this section has become its own gory details
> subsection, the sentence preceding the numbered list becomes:
> 
> "Several other solutions to this problem were evaluated upon
> discovery
> of this flaw and rejected:"
Ok

> 
> > > +
> > > +3. Teach online fsck to walk all transactions waiting for
> > > whichever
> > > lock(s)
> > > +   protect the data structure being scrubbed to look for pending
> > > operations.
> > > +   The checking and repair operations must factor these pending
> > > operations into
> > > +   the evaluations being performed.
> > > +   This solution is a nonstarter because it is *extremely*
> > > invasive
> > > to the main
> > > +   filesystem.
> > > +
> > > +4. Recognize that only online fsck has this requirement of total
> > > consistency
> > > +   of AG metadata, and that online fsck should be relatively
> > > rare as
> > > compared
> > > +   to filesystem change operations.
> > > +   For each AG, maintain a count of intent items targeting
> > > AG.
> > > +   When online fsck wants to examine an AG, it should lock the
> > > AG
> > > header
> > > +   buffers to quiesce all transaction chains that want to modify
> > > that AG, and
> > > +   only proceed with the scrub if the count is zero.
> > > +   In other words, scrub only proceeds if it can lock the AG
> > > header
> > > buffers and
> > > +   there can't possibly be any intents in progress.
> > > +   This may lead to fairness and starvation issues, but regular
> > > filesystem
> > > +   updates take precedence over online fsck activity.
> > So basically it sounds like 4 is the only reasonable option?
> 
> Yes.
> 
> > If the discussion concerning the other options have died down, I
> > would
> > clean them out.
> 
> That's just the problem -- I've sent this and the code patches to the
> list several times now, and mostly haven't heard any solid replies. 
> So
> it's a bit premature to take it out, and again it might be useful to
> capture the roads not taken.
> 
> > They're great for brainstorming and invitations for
> > collaboration, but ideally the goal of any of that should be to
> > narrow
> > down an agreed upon plan of action.  And the goal of your document
> > should make clear what that plan is.  So if no one has any
> > objections
> > by now, maybe just tie it right into the last line:
> > 
> > "The challenge then becomes how to avoid incorrectly repairing
> > these
> > non-issues as doing so would cause more harm than help. 
> > Fortunately only online fsck has this requirement of total
> > consistency..."
> 
> > > +
> > > +Intent Drains
> > > +`````````````
> > > +
> > > +The fourth solution is implemented in the current iteration of
> > This solution is implemented...
> 
> "Online fsck uses an atomic intent item counter and lock cycling to
> coordinate with transaction chains.  There are two key properties to
> the
> drain mechanism..."
Ok, sounds fine
> 
> > > online fsck,
> > > +with atomic_t providing the active intent counter.
> > > +
> > > +There are two key properties to the drain mechanism.
> > > +First, the counter is incremented when a deferred work item is
> > > *queued* to a
> > > +transaction, and it is decremented after the associated intent
> > > done
> > > log item is
> > > +*committed* to another transaction.
> > > +The second property is that deferred work can be added to a
> > > transaction without
> > > +holding an AG header lock, but per-AG work items cannot be
> > > marked
> > > done without
> > > +locking that AG header buffer to log the physical updates and
> > > the
> > > intent done
> > > +log item.
> > > +The first property enables scrub to yield to running transaction
> > > chains, which
> > > +is an explicit deprioritization of online fsck to benefit file
> > > operations.
> > > +The second property of the drain is key to the correct
> > > coordination
> > > of scrub,
> > > +since scrub will always be able to decide if a conflict is
> > > possible.
> > > +
> > > +For regular filesystem code, the drain works as follows:
> > > +
> > > +1. Call the appropriate subsystem function to add a deferred
> > > work
> > > item to a
> > > +   transaction.
> > > +
> > > +2. The function calls ``xfs_drain_bump`` to increase the
> > > counter.
> > > +
> > > +3. When the deferred item manager wants to finish the deferred
> > > work
> > > item, it
> > > +   calls ``->finish_item`` to complete it.
> > > +
> > > +4. The ``->finish_item`` implementation logs some changes and
> > > calls
> > > +   ``xfs_drain_drop`` to decrease the sloppy counter and wake up
> > > any
> > > threads
> > > +   waiting on the drain.
> > > +
> > > +5. The subtransaction commits, which unlocks the resource
> > > associated
> > > with the
> > > +   intent item.
> > > +
> > > +For scrub, the drain works as follows:
> > > +
> > > +1. Lock the resource(s) associated with the metadata being
> > > scrubbed.
> > > +   For example, a scan of the refcount btree would lock the AGI
> > > and
> > > AGF header
> > > +   buffers.
> > > +
> > > +2. If the counter is zero (``xfs_drain_busy`` returns false),
> > > there
> > > are no
> > > +   chains in progress and the operation may proceed.
> > > +
> > > +3. Otherwise, release the resources grabbed in step 1.
> > > +
> > > +4. Wait for the intent counter to reach zero
> > > (``xfs_drain_intents``), then go
> > > +   back to step 1 unless a signal has been caught.
> > > +
> > > +To avoid polling in step 4, the drain provides a waitqueue for
> > > scrub
> > > threads to
> > > +be woken up whenever the intent count drops to zero.
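
In case it helps readers picture the two properties, here's a rough
sketch of what the four drain helpers named above could boil down to.
Only the helper names and the counter-plus-waitqueue idea come from the
text; the struct name and field layout are made up here:

    #include <linux/atomic.h>
    #include <linux/types.h>
    #include <linux/wait.h>

    /* One of these would live in each per-AG structure. */
    struct example_intent_drain {
            atomic_t                dr_count;   /* unfinished intents */
            struct wait_queue_head  dr_waiters; /* scrub waits here */
    };

    /* Bump the count when a deferred work item is queued. */
    static inline void xfs_drain_bump(struct example_intent_drain *dr)
    {
            atomic_inc(&dr->dr_count);
    }

    /* Drop the count after the intent done item is committed. */
    static inline void xfs_drain_drop(struct example_intent_drain *dr)
    {
            if (atomic_dec_and_test(&dr->dr_count))
                    wake_up(&dr->dr_waiters);
    }

    /* Scrub step 2: is a chain still in progress against this AG? */
    static inline bool xfs_drain_busy(struct example_intent_drain *dr)
    {
            return atomic_read(&dr->dr_count) > 0;
    }

    /* Scrub step 4: sleep until zero, or until a fatal signal. */
    static inline int xfs_drain_intents(struct example_intent_drain *dr)
    {
            return wait_event_killable(dr->dr_waiters,
                            !xfs_drain_busy(dr));
    }

Just a sketch to make the scrub flow concrete, not the actual patch.
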
> > I think all that makes sense
> 
> Good! :)
> 
> > > +
> > > +The proposed patchset is the
> > > +`scrub intent drain series
> > > +<
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > > log/?h=scrub-drain-intents>`_.
> > > +
> > > +.. _jump_labels:
> > > +
> > > +Static Keys (aka Jump Label Patching)
> > > +`````````````````````````````````````
> > > +
> > > +Online fsck for XFS separates the regular filesystem from the
> > > checking and
> > > +repair code as much as possible.
> > > +However, there are a few parts of online fsck (such as the
> > > intent
> > > drains, and
> > > +later, live update hooks) where it is useful for the online fsck
> > > code to know
> > > +what's going on in the rest of the filesystem.
> > > +Since it is not expected that online fsck will be constantly
> > > running
> > > in the
> > > +background, it is very important to minimize the runtime
> > > overhead
> > > imposed by
> > > +these hooks when online fsck is compiled into the kernel but not
> > > actively
> > > +running on behalf of userspace.
> > > +Taking locks in the hot path of a writer thread to access a data
> > > structure only
> > > +to find that no further action is necessary is expensive -- on
> > > the
> > > author's
> > > +computer, this has an overhead of 40-50ns per access.
> > > +Fortunately, the kernel supports dynamic code patching, which
> > > enables XFS to
> > > +replace a static branch to hook code with ``nop`` sleds when
> > > online
> > > fsck isn't
> > > +running.
> > > +This sled has an overhead of however long it takes the
> > > instruction
> > > decoder to
> > > +skip past the sled, which seems to be on the order of less than
> > > 1ns
> > > and
> > > +does not access memory outside of instruction fetching.
> > > +
> > > +When online fsck enables the static key, the sled is replaced
> > > with
> > > an
> > > +unconditional branch to call the hook code.
> > > +The switchover is quite expensive (~22000ns) but is paid
> > > entirely by
> > > the
> > > +program that invoked online fsck, and can be amortized if
> > > multiple
> > > threads
> > > +enter online fsck at the same time, or if multiple filesystems
> > > are
> > > being
> > > +checked at the same time.
> > > +Changing the branch direction requires taking the CPU hotplug
> > > lock,
> > > and since
> > > +CPU initialization requires memory allocation, online fsck must
> > > be
> > > careful not
> > > +to change a static key while holding any locks or resources that
> > > could be
> > > +accessed in the memory reclaim paths.
> > > +To minimize contention on the CPU hotplug lock, care should be
> > > taken
> > > not to
> > > +enable or disable static keys unnecessarily.
> > > +
> > > +Because static keys are intended to minimize hook overhead for
> > > regular
> > > +filesystem operations when xfs_scrub is not running, the
> > > intended
> > > usage
> > > +patterns are as follows:
> > > +
> > > +- The hooked part of XFS should declare a static-scoped static
> > > key
> > > that
> > > +  defaults to false.
> > > +  The ``DEFINE_STATIC_KEY_FALSE`` macro takes care of this.
> > > +  The static key itself should be declared as a ``static``
> > > variable.
> > > +
> > > +- When deciding to invoke code that's only used by scrub, the
> > > regular
> > > +  filesystem should call the ``static_branch_unlikely``
> > > predicate to
> > > avoid the
> > > +  scrub-only hook code if the static key is not enabled.
> > > +
> > > +- The regular filesystem should export helper functions that
> > > call
> > > +  ``static_branch_inc`` to enable and ``static_branch_dec`` to
> > > disable the
> > > +  static key.
> > > +  Wrapper functions make it easy to compile out the relevant
> > > code if
> > > the kernel
> > > +  distributor turns off online fsck at build time.
> > > +
> > > +- Scrub functions wanting to turn on scrub-only XFS
> > > functionality
> > > should call
> > > +  the ``xchk_fshooks_enable`` from the setup function to enable
> > > a
> > > specific
> > > +  hook.
> > > +  This must be done before obtaining any resources that are used
> > > by
> > > memory
> > > +  reclaim.
> > > +  Callers had better be sure they really need the functionality
> > > gated by the
> > > +  static key; the ``TRY_HARDER`` flag is useful here.
> > > +
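
To make the usage pattern above concrete, here's a minimal sketch.
DEFINE_STATIC_KEY_FALSE, static_branch_unlikely, static_branch_inc, and
static_branch_dec are the real jump label APIs; everything named
xfs_example_* is invented for illustration:

    #include <linux/jump_label.h>

    struct xfs_mount;

    /* Scrub-only slow path; an empty stub here for illustration. */
    static void xfs_example_hook_slowpath(struct xfs_mount *mp)
    {
    }

    /* Static-scoped key, defaulting to false (hook disabled). */
    static DEFINE_STATIC_KEY_FALSE(xfs_example_hooks_switch);

    /* Hot path: costs only a nop sled while scrub isn't running. */
    static inline void xfs_example_hook(struct xfs_mount *mp)
    {
            if (static_branch_unlikely(&xfs_example_hooks_switch))
                    xfs_example_hook_slowpath(mp);
    }

    /* Wrappers exported to scrub so the key can be compiled out. */
    void xfs_example_hooks_enable(void)
    {
            static_branch_inc(&xfs_example_hooks_switch);
    }

    void xfs_example_hooks_disable(void)
    {
            static_branch_dec(&xfs_example_hooks_switch);
    }
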
> > > +Online scrub has resource acquisition helpers (e.g.
> > > ``xchk_perag_lock``) to
> > > +handle locking AGI and AGF buffers for all scrubber functions.
> > > +If it detects a conflict between scrub and the running
> > > transactions,
> > > it will
> > > +try to wait for intents to complete.
> > > +If the caller of the helper has not enabled the static key, the
> > > helper will
> > > +return -EDEADLOCK, which should result in the scrub being
> > > restarted
> > > with the
> > > +``TRY_HARDER`` flag set.
> > > +The scrub setup function should detect that flag, enable the
> > > static
> > > key, and
> > > +try the scrub again.
> > > +Scrub teardown disables all static keys obtained by
> > > ``xchk_fshooks_enable``.
> > 
> > Ok, this part here seems pretty well documented.  Organizing nits
> > aside
> > I think it looks good.
> 
> Thanks for digging into all of this!
> 
> --D
> 
> > Allison
> > 
> > > +
> > > +For more information, please see the kernel documentation of
> > > +Documentation/staging/static-keys.rst.
> > > 
> > 


^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 08/14] xfs: document btree bulk loading
  2022-12-30 22:10   ` [PATCH 08/14] xfs: document btree bulk loading Darrick J. Wong
@ 2023-02-09  5:47     ` Allison Henderson
  2023-02-10  0:24       ` Darrick J. Wong
  0 siblings, 1 reply; 220+ messages in thread
From: Allison Henderson @ 2023-02-09  5:47 UTC (permalink / raw)
  To: djwong
  Cc: Catherine Hoang, david, willy, linux-xfs, Chandan Babu,
	linux-fsdevel, hch

On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Add a discussion of the btree bulk loading code, which makes it easy
> to
> take an in-memory recordset and write it out to disk in an efficient
> manner.  This also enables atomic switchover from the old to the new
> structure with minimal potential for leaking the old blocks.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  .../filesystems/xfs-online-fsck-design.rst         |  632
> ++++++++++++++++++++
>  1 file changed, 632 insertions(+)
> 
> 
> diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst
> b/Documentation/filesystems/xfs-online-fsck-design.rst
> index 9d7a2ef1d0dd..eb61d867e55c 100644
> --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> @@ -2236,3 +2236,635 @@ this functionality as follows:
>  
>  After removing xfile logged buffers from the transaction in this
> manner, the
>  transaction can be committed or cancelled.
> +
> +Bulk Loading of Ondisk B+Trees
> +------------------------------
> +
> +As mentioned previously, early iterations of online repair built new
> btree
> +structures by creating a new btree and adding observations
> individually.
> +Loading a btree one record at a time had a slight advantage of not
> requiring
> +the incore records to be sorted prior to commit, but was very slow
> and leaked
> +blocks if the system went down during a repair.
> +Loading records one at a time also meant that repair could not
> control the
> +loading factor of the blocks in the new btree.
> +
> +Fortunately, the venerable ``xfs_repair`` tool had a more efficient
> means for
> +rebuilding a btree index from a collection of records -- bulk btree
> loading.
> +This was implemented rather inefficiently code-wise, since
> ``xfs_repair``
> +had separate copy-pasted implementations for each btree type.
> +
> +To prepare for online fsck, each of the four bulk loaders was
> studied, notes
> +were taken, and the four were refactored into a single generic btree
> bulk
> +loading mechanism.
> +Those notes in turn have been refreshed and are presented below.
> +
> +Geometry Computation
> +````````````````````
> +
> +The zeroth step of bulk loading is to assemble the entire record set
> that will
> +be stored in the new btree, and sort the records.
> +Next, call ``xfs_btree_bload_compute_geometry`` to compute the shape
> of the
> +btree from the record set, the type of btree, and any load factor
> preferences.
> +This information is required for resource reservation.
> +
> +First, the geometry computation computes the minimum and maximum
> records that
> +will fit in a leaf block from the size of a btree block and the size
> of the
> +block header.
> +Roughly speaking, the maximum number of records is::
> +
> +        maxrecs = (block_size - header_size) / record_size
> +
> +The XFS design specifies that btree blocks should be merged when
> possible,
> +which means the minimum number of records is half of maxrecs::
> +
> +        minrecs = maxrecs / 2
> +
> +The next variable to determine is the desired loading factor.
> +This must be at least minrecs and no more than maxrecs.
> +Choosing minrecs is undesirable because it wastes half the block.
> +Choosing maxrecs is also undesirable because adding a single record
> to each
> +newly rebuilt leaf block will cause a tree split, which causes a
> noticeable
> +drop in performance immediately afterwards.
> +The default loading factor was chosen to be 75% of maxrecs, which
> provides a
> +reasonably compact structure without any immediate split penalties.
	default_lload_factor = (maxrecs + minrecs) / 2;
> +If space is tight, the loading factor will be set to maxrecs to try
> to avoid
> +running out of space::
> +
> +        leaf_load_factor = enough space ? (maxrecs + minrecs) / 2 :
> maxrecs
	leaf_load_factor = enough space ? default_lload_factor :
maxrecs;

Just more readable, I think.

> +
> +Load factor is computed for btree node blocks using the combined
> size of the
> +btree key and pointer as the record size::
> +
> +        maxrecs = (block_size - header_size) / (key_size + ptr_size)
> +        minrecs = maxrecs / 2
	default_nload_factor = (maxrecs + minrecs) / 2;

> +        node_load_factor = enough space ? (maxrecs + minrecs) / 2 :
> maxrecs
	node_load_factor = enough space ? default_nload_factor :
maxrecs;
> +
> +Once that's done, the number of leaf blocks required to store the
> record set
> +can be computed as::
> +
> +        leaf_blocks = ceil(record_count / leaf_load_factor)
> +
> +The number of node blocks needed to point to the next level down in
> the tree
> +is computed as::
> +
> +        n_blocks = (n == 0 ? leaf_blocks : node_blocks[n])
> +        node_blocks[n + 1] = ceil(n_blocks / node_load_factor)
> +
> +The entire computation is performed recursively until the current
> level only
> +needs one block.
> +The resulting geometry is as follows:
> +
> +- For AG-rooted btrees, this level is the root level, so the height
> of the new
> +  tree is ``level + 1`` and the space needed is the summation of the
> number of
> +  blocks on each level.
> +
> +- For inode-rooted btrees where the records in the top level do not
> fit in the
> +  inode fork area, the height is ``level + 2``, the space needed is
> the
> +  summation of the number of blocks on each level, and the inode
> fork points to
> +  the root block.
> +
> +- For inode-rooted btrees where the records in the top level can be
> stored in
> +  the inode fork area, then the root block can be stored in the
> inode, the
> +  height is ``level + 1``, and the space needed is one less than the
> summation
> +  of the number of blocks on each level.
> +  This only becomes relevant when non-bmap btrees gain the ability
> to root in
> +  an inode, which is a future patchset and only included here for
> completeness.
> +
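Pulling the formulas above into one place, here's a back-of-the-envelope
model of the computation.  It's a plain userspace sketch, not the real
xfs_btree_bload_compute_geometry, and all of the names are invented:

    #include <stdbool.h>
    #include <stdint.h>

    /* Returns the height; *nr_blocks is the total block reservation. */
    static unsigned int example_bload_geometry(uint64_t nr_records,
                    unsigned int leaf_maxrecs, unsigned int node_maxrecs,
                    bool space_is_tight, uint64_t *nr_blocks)
    {
            /* default load factor is (maxrecs + minrecs) / 2, i.e. 75% */
            unsigned int leaf_load = space_is_tight ? leaf_maxrecs :
                            (leaf_maxrecs + leaf_maxrecs / 2) / 2;
            unsigned int node_load = space_is_tight ? node_maxrecs :
                            (node_maxrecs + node_maxrecs / 2) / 2;
            uint64_t level_blocks = (nr_records + leaf_load - 1) / leaf_load;
            unsigned int height = 1;

            *nr_blocks = level_blocks;

            /* add node levels until one block can point to the level below */
            while (level_blocks > 1) {
                    level_blocks = (level_blocks + node_load - 1) / node_load;
                    *nr_blocks += level_blocks;
                    height++;
            }
            return height;  /* AG-rooted case; see the list above */
    }

For example, with maxrecs around 250 for leaves and 500 for nodes, a
million records work out to roughly 5350 leaf blocks, 15 node blocks,
and a single root, i.e. a height-3 tree.
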
> +.. _newbt:
> +
> +Reserving New B+Tree Blocks
> +```````````````````````````
> +
> +Once repair knows the number of blocks needed for the new btree, it
> allocates
> +those blocks using the free space information.
> +Each reserved extent is tracked separately by the btree builder
> state data.
> +To improve crash resilience, the reservation code also logs an
> Extent Freeing
> +Intent (EFI) item in the same transaction as each space allocation
> and attaches
> +its in-memory ``struct xfs_extent_free_item`` object to the space
> reservation.
> +If the system goes down, log recovery will use the unfinished EFIs
> to free the
> +unused space, leaving the filesystem unchanged.
> +
> +Each time the btree builder claims a block for the btree from a
> reserved
> +extent, it updates the in-memory reservation to reflect the claimed
> space.
> +Block reservation tries to allocate as much contiguous space as
> possible to
> +reduce the number of EFIs in play.
> +
> +While repair is writing these new btree blocks, the EFIs created for
> the space
> +reservations pin the tail of the ondisk log.
> +It's possible that other parts of the system will remain busy and
> push the head
> +of the log towards the pinned tail.
> +To avoid livelocking the filesystem, the EFIs must not pin the tail
> of the log
> +for too long.
> +To alleviate this problem, the dynamic relogging capability of the
> deferred ops
> +mechanism is reused here to commit a transaction at the log head
> containing an
> +EFD for the old EFI and new EFI at the head.
> +This enables the log to release the old EFI to keep the log moving
> forwards.
> +
> +EFIs have a role to play during the commit and reaping phases;
> please see the
> +next section and the section about :ref:`reaping<reaping>` for more
> details.
> +
> +Proposed patchsets are the
> +`bitmap rework
> +<
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> log/?h=repair-bitmap-rework>`_
> +and the
> +`preparation for bulk loading btrees
> +<
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> log/?h=repair-prep-for-bulk-loading>`_.
> +
> +
> +Writing the New Tree
> +````````````````````
> +
> +This part is pretty simple -- the btree builder
> (``xfs_btree_bulkload``) claims
> +a block from the reserved list, writes the new btree block header,
> fills the
> +rest of the block with records, and adds the new leaf block to a
> list of
> +written blocks.
> +Sibling pointers are set every time a new block is added to the
> level.
> +When it finishes writing the record leaf blocks, it moves on to the
> node
> +blocks.
> +To fill a node block, it walks each block in the next level down in
> the tree
> +to compute the relevant keys and write them into the parent node.
> +When it reaches the root level, it is ready to commit the new btree!
I think most of this is as straightforward as it can be, but it's a
lot of visualizing too, which makes me wonder if it would benefit from
a simple illustration if possible.

On a side note: in a prior team I discovered that power points, while a
lot of work, were also really effective for quickly moving a crowd of
people through connected graph navigation/manipulations, because each
one of these steps was another slide that illustrated how the structure
evolved through the updates.  I realize that's not something that fits
in the scheme of a document like this, but maybe something supplemental
to add later.  While it was a time eater, I noticed a lot of confused
expressions just seemed to shake loose, so sometimes it was worth it.


> +
> +The first step to commit the new btree is to persist the btree
> blocks to disk
> +synchronously.
> +This is a little complicated because a new btree block could have
> been freed
> +in the recent past, so the builder must use
> ``xfs_buf_delwri_queue_here`` to
> +remove the (stale) buffer from the AIL list before it can write the
> new blocks
> +to disk.
> +Blocks are queued for IO using a delwri list and written in one
> large batch
> +with ``xfs_buf_delwri_submit``.
> +
> +Once the new blocks have been persisted to disk, control returns to
> the
> +individual repair function that called the bulk loader.
> +The repair function must log the location of the new root in a
> transaction,
> +clean up the space reservations that were made for the new btree,
> and reap the
> +old metadata blocks:
> +
> +1. Commit the location of the new btree root.
> +
> +2. For each incore reservation:
> +
> +   a. Log Extent Freeing Done (EFD) items for all the space that was
> consumed
> +      by the btree builder.  The new EFDs must point to the EFIs
> attached to
> +      the reservation to prevent log recovery from freeing the new
> blocks.
> +
> +   b. For unclaimed portions of incore reservations, create a
> regular deferred
> +      extent free work item to free the unused space later in the
> +      transaction chain.
> +
> +   c. The EFDs and EFIs logged in steps 2a and 2b must not overrun
> the
> +      reservation of the committing transaction.
> +      If the btree loading code suspects this might be about to
> happen, it must
> +      call ``xrep_defer_finish`` to clear out the deferred work and
> obtain a
> +      fresh transaction.
> +
> +3. Clear out the deferred work a second time to finish the commit
> and clean
> +   the repair transaction.
> +
> +The transaction rolling in steps 2c and 3 represents a weakness in
> the repair
> +algorithm, because a log flush and a crash before the end of the
> reap step can
> +result in space leaking.
> +Online repair functions minimize the chances of this occurring by
> using very
> +large transactions, each of which can accommodate many thousands of
> block freeing
> +instructions.
> +Repair moves on to reaping the old blocks, which will be presented
> in a
> +subsequent :ref:`section<reaping>` after a few case studies of bulk
> loading.
> +
> +Case Study: Rebuilding the Inode Index
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +The high level process to rebuild the inode index btree is:
> +
> +1. Walk the reverse mapping records to generate ``struct
> xfs_inobt_rec``
> +   records from the inode chunk information and a bitmap of the old
> inode btree
> +   blocks.
> +
> +2. Append the records to an xfarray in inode order.
> +
> +3. Use the ``xfs_btree_bload_compute_geometry`` function to compute
> the number
> +   of blocks needed for the inode btree.
> +   If the free space inode btree is enabled, call it again to
> estimate the
> +   geometry of the finobt.
> +
> +4. Allocate the number of blocks computed in the previous step.
> +
> +5. Use ``xfs_btree_bload`` to write the xfarray records to btree
> blocks and
> +   generate the internal node blocks.
> +   If the free space inode btree is enabled, call it again to load
> the finobt.
> +
> +6. Commit the location of the new btree root block(s) to the AGI.
> +
> +7. Reap the old btree blocks using the bitmap created in step 1.
> +
> +Details are as follows.
> +
> +The inode btree maps inumbers to the ondisk location of the
> associated
> +inode records, which means that the inode btrees can be rebuilt from
> the
> +reverse mapping information.
> +Reverse mapping records with an owner of ``XFS_RMAP_OWN_INOBT``
> mark the
> +location of the old inode btree blocks.
> +Each reverse mapping record with an owner of ``XFS_RMAP_OWN_INODES``
> marks the
> +location of at least one inode cluster buffer.
> +A cluster is the smallest number of ondisk inodes that can be
> allocated or
> +freed in a single transaction; it is never smaller than 1 fs block
> or 4 inodes.
> +
> +For the space represented by each inode cluster, ensure that there
> are no
> +records in the free space btrees nor any records in the reference
> count btree.
> +If there are, the space metadata inconsistencies are reason enough
> to abort the
> +operation.
> +Otherwise, read each cluster buffer to check that its contents
> appear to be
> +ondisk inodes and to decide if the file is allocated
> +(``xfs_dinode.i_mode != 0``) or free (``xfs_dinode.i_mode == 0``).
> +Accumulate the results of successive inode cluster buffer reads
> until there is
> +enough information to fill a single inode chunk record, which is 64
> consecutive
> +numbers in the inumber keyspace.
> +If the chunk is sparse, the chunk record may include holes.
> +
> +Once the repair function accumulates one chunk's worth of data, it
> calls
> +``xfarray_append`` to add the inode btree record to the xfarray.
> +This xfarray is walked twice during the btree creation step -- once
> to populate
> +the inode btree with all inode chunk records, and a second time to
> populate the
> +free inode btree with records for chunks that have free non-sparse
> inodes.
> +The number of records for the inode btree is the number of xfarray
> records,
> +but the record count for the free inode btree has to be computed as
> inode chunk
> +records are stored in the xfarray.
> +
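Here's a toy model of the accumulation step, just to make the prose
concrete.  The record layout is simplified (the real incore inobt record
also carries a hole mask and an allocated-inode count) and everything
named example_* is invented:

    #include <stdbool.h>
    #include <stdint.h>

    #define EXAMPLE_INODES_PER_CHUNK        64

    struct example_inobt_rec {
            uint64_t        startino;   /* first inumber in the chunk */
            uint64_t        freemask;   /* one bit per free inode */
            uint8_t         freecount;
    };

    /*
     * i_modes[] holds the ondisk i_mode of each inode in the chunk,
     * gathered from one or more cluster buffer reads; is_hole[] marks
     * the parts of a sparse chunk that have no inodes at all.
     */
    static void example_accumulate_chunk(uint64_t startino,
                    const uint16_t *i_modes, const bool *is_hole,
                    struct example_inobt_rec *rec)
    {
            unsigned int i;

            rec->startino = startino;
            rec->freemask = 0;
            rec->freecount = 0;

            for (i = 0; i < EXAMPLE_INODES_PER_CHUNK; i++) {
                    if (is_hole[i])
                            continue;           /* sparse region */
                    if (i_modes[i] == 0) {      /* unallocated inode */
                            rec->freemask |= 1ULL << i;
                            rec->freecount++;
                    }
            }
            /* the caller would now xfarray_append() this record */
    }
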
> +The proposed patchset is the
> +`AG btree repair
> +<
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> log/?h=repair-ag-btrees>`_
> +series.
> +
> +Case Study: Rebuilding the Space Reference Counts
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +The high level process to rebuild the reference count btree is:
> +
> +1. Walk the reverse mapping records to generate ``struct
> xfs_refcount_irec``
> +   records for any space having more than one reverse mapping and
> add them to
> +   the xfarray.
> +   Any records owned by ``XFS_RMAP_OWN_COW`` are also added to the
> xfarray.
Any records owned by ``XFS_RMAP_OWN_COW`` are also added to the xfarray
even if they only have one mapping

?

You haven't mentioned any owners being disallowed, you've only stated
that you're collecting records with more than one rmap, so that would
be the inferred meaning.  

Also, I think you need to mention why.  The documentation is starting
to read a little more like pseudo code, but if it's not explaining why
it's doing things, we may as well just go to the code.

> +   Use any records owned by ``XFS_RMAP_OWN_REFC`` to create a bitmap
> of old
> +   refcount btree blocks.
> +
> +2. Sort the records in physical extent order, putting the CoW
> staging extents
> +   at the end of the xfarray.
Why?

> +
> +3. Use the ``xfs_btree_bload_compute_geometry`` function to compute
> the number
> +   of blocks needed for the new tree.
> +
> +4. Allocate the number of blocks computed in the previous step.
> +
> +5. Use ``xfs_btree_bload`` to write the xfarray records to btree
> blocks and
> +   generate the internal node blocks.
> +
> +6. Commit the location of new btree root block to the AGF.
> +
> +7. Reap the old btree blocks using the bitmap created in step 1.
> +
> +Details are as follows; the same algorithm is used by ``xfs_repair``
> to
> +generate refcount information from reverse mapping records.
> +
> +Reverse mapping records are used to rebuild the reference count
> information.
> +Reference counts are required for correct operation of copy on write
> for shared
> +file data.
> +Imagine the reverse mapping entries as rectangles representing
> extents of
> +physical blocks, and that the rectangles can be laid down to allow
> them to
> +overlap each other.
> +From the diagram below, it is apparent that a reference count record
> must start
> +or end wherever the height of the stack changes.
> +In other words, the record emission stimulus is level-triggered::
> +
> +                        █    ███
> +              ██      █████ ████   ███        ██████
> +        ██   ████     ███████████ ████     █████████
> +        ████████████████████████████████ ███████████
> +        ^ ^  ^^ ^^    ^ ^^ ^^^  ^^^^  ^ ^^ ^  ^     ^
> +        2 1  23 21    3 43 234  2123  1 01 2  3     0
> +
> +The ondisk reference count btree does not store the refcount == 0
> cases because
> +the free space btree already records which blocks are free.
> +Extents being used to stage copy-on-write operations should be the
> only records
> +with refcount == 1.
So here you explain it... I think maybe the pseudo code would read
more easily if you put it after the high level explanations of what
we're doing.

> +Single-owner file blocks aren't recorded in either the free space or
> the
> +reference count btrees.
> +
> +Given the reverse mapping btree which orders records by physical
> block number,
> +a starting physical block (``sp``), a bag-like data structure to
> hold mappings
> +that cover ``sp``, and the next physical block where the level
> changes
> +(``np``), reference count information is constructed from reverse
> mapping data
> +as follows:
> +
> +While there are still unprocessed mappings in the reverse mapping
> btree:
> +
> +1. Set ``sp`` to the physical block of the next unprocessed reverse
> mapping
> +   record.
> +
> +2. Add to the bag all the reverse mappings where ``rm_startblock``
> == ``sp``.
Hmm, if this were code, I could tag the rm_startblock symbol, but that
doesn't work for a document.  While I could go look at the code to
answer this, you want your document to explain the code, not the other
way around... further commentary below...

> +
> +3. Set ``np`` to the physical block where the bag size will change.
> +   This is the minimum of (``rm_startblock`` of the next unprocessed
> mapping)
> +   and (``rm_startblock`` + ``rm_blockcount`` of each mapping in the
> bag).
> +
> +4. Record the bag size as ``old_bag_size``.
> +
> +5. While the bag isn't empty,
> +
> +   a. Remove from the bag all mappings where ``rm_startblock`` +
> +      ``rm_blockcount`` == ``np``.
> +
> +   b. Add to the bag all reverse mappings where ``rm_startblock`` ==
> ``np``.
> +
> +   c. If the bag size isn't ``old_bag_size``, store the refcount
> record
> +      ``(sp, np - sp, old_bag_size)`` in the refcount xfarray.
> +
> +   d. If the bag is empty, break out of this inner loop.
> +
> +   e. Set ``old_bag_size`` to ``bag_size``.
> +
> +   f. Set ``sp`` = ``np``.
> +
> +   g. Set ``np`` to the physical block where the bag size will
> change.
> +      Go to step 3 above.
I don't think verbalizing literal lines of code is any more explanatory
than the code.  I think it's easier to just give the high level
description and then go look at it.

I notice you have the exact same verbiage in the code, you could just
link it:
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=repair-ag-btrees&id=771fa17dd5fd7d3d125c61232c4390e8f7ac0fb0#:~:text=*%20While%20there%20are%20still%20unprocessed%20rmaps%20in%20the%20array,and%20(startblock%20%2B%20len%20of%20each%20rmap%20in%20the%20bag)
.

Also, that may cut down on future maintenance if this ever changes,
since people might not think to update the document along with the code.


Hmm, just thinking outside the box, what do you think of this method of
presentation:
 
  - Iterate over btree records                                      tinyurl.com/4mp3j3pw
     - Find the corresponding reverse mapping                       tinyurl.com/27n7h5fa
     - Collect all shared mappings with the same starting block     tinyurl.com/mwdfy52b
     - Advance to the next block with a ref count change            tinyurl.com/28689ufz
       This position will either be the next unprocessed rmap, or the
       combined length of all the collected mappings, whichever is smaller
     - Iterate over the collected mappings,                         tinyurl.com/ye673rwa
        - Remove all mappings that start after this position        tinyurl.com/22yp7p6u
        - Re-collect all mappings that start on this position       tinyurl.com/2p8vytmv
        - If the size of the collection increased, update the ref count    tinyurl.com/ecu7tud7
        - If more mappings were found, advance to the next block with      tinyurl.com/47p4dfac
          a ref count change.  Continue until no more mappings are found

It pulls the pseudo code up to a little higher level, plus the quick
links let people jump deeper if needed with all the navigation
utilities they are used to.  I just found a quick url shortener, so I'm
not really sure how long they keep those, but maybe we can find a more
appropriate shortener.

> +
> +The bag-like structure in this case is a type 2 xfarray as discussed
> in the
> +:ref:`xfarray access patterns<xfarray_access_patterns>` section.
> +Reverse mappings are added to the bag using
> ``xfarray_store_anywhere`` and
> +removed via ``xfarray_unset``.
> +Bag members are examined through ``xfarray_iter`` loops.
> +
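
Since the steps above are dense, here's a compact userspace model of the
same loop.  The "bag" is a plain array instead of an xfarray, the emit
step just prints, overflow isn't checked, and it only advances the
record start when the count actually changes; it's purely illustrative:

    #include <stdint.h>
    #include <stdio.h>

    struct rmap { uint64_t start, len; };

    #define BAG_MAX 64

    static uint64_t rmap_end(const struct rmap *r)
    {
            return r->start + r->len;
    }

    /* The real code stores (start, len, refcount) in the refcount
     * xfarray and keeps only refcount >= 2 plus the CoW staging
     * extents. */
    static void emit(uint64_t start, uint64_t len, unsigned int refcount)
    {
            printf("refcount: start %llu len %llu count %u\n",
                   (unsigned long long)start, (unsigned long long)len,
                   refcount);
    }

    /* maps[] must be sorted by start, as an rmap btree walk produces. */
    static void example_build_refcounts(const struct rmap *maps,
                                        unsigned int nr)
    {
            struct rmap bag[BAG_MAX];
            unsigned int i = 0;

            while (i < nr) {
                    uint64_t sp = maps[i].start;
                    unsigned int bag_size = 0;

                    /* step 2: pull in every mapping that starts at sp */
                    while (i < nr && maps[i].start == sp)
                            bag[bag_size++] = maps[i++];

                    while (bag_size > 0) {
                            unsigned int old_bag_size = bag_size;
                            unsigned int j, k;
                            /* step 3: next block where the size changes */
                            uint64_t np = (i < nr) ? maps[i].start :
                                                     UINT64_MAX;

                            for (j = 0; j < bag_size; j++)
                                    if (rmap_end(&bag[j]) < np)
                                            np = rmap_end(&bag[j]);

                            /* step 5a: drop mappings that end at np */
                            for (j = k = 0; j < bag_size; j++)
                                    if (rmap_end(&bag[j]) != np)
                                            bag[k++] = bag[j];
                            bag_size = k;

                            /* step 5b: add mappings that start at np */
                            while (i < nr && maps[i].start == np)
                                    bag[bag_size++] = maps[i++];

                            /* step 5c: the count changed; emit [sp, np) */
                            if (bag_size != old_bag_size) {
                                    emit(sp, np - sp, old_bag_size);
                                    sp = np;
                            }
                    }
            }
    }

Running it over the rectangles in the diagram above emits a record at
each point where the stack height changes, which is the level-triggered
behavior the text describes.
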
> +The proposed patchset is the
> +`AG btree repair
> +<
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> log/?h=repair-ag-btrees>`_
> +series.
> +
> +Case Study: Rebuilding File Fork Mapping Indices
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +The high level process to rebuild a data/attr fork mapping btree is:
> +
> +1. Walk the reverse mapping records to generate ``struct
> xfs_bmbt_rec``
> +   records from the reverse mapping records for that inode and fork.
> +   Append these records to an xfarray.
> +   Compute the bitmap of the old bmap btree blocks from the
> ``BMBT_BLOCK``
> +   records.
> +
> +2. Use the ``xfs_btree_bload_compute_geometry`` function to compute
> the number
> +   of blocks needed for the new tree.
> +
> +3. Sort the records in file offset order.
> +
> +4. If the extent records would fit in the inode fork immediate area,
> commit the
> +   records to that immediate area and skip to step 8.
> +
> +5. Allocate the number of blocks computed in the previous step.
> +
> +6. Use ``xfs_btree_bload`` to write the xfarray records to btree
> blocks and
> +   generate the internal node blocks.
> +
> +7. Commit the new btree root block to the inode fork immediate area.
> +
> +8. Reap the old btree blocks using the bitmap created in step 1.
This description is not bad, but I had a hard time finding something
that resembled the description in the link below.  Maybe it's in a
different branch?

> +
> +There are some complications here:
> +First, it's possible to move the fork offset to adjust the sizes of
> the
> +immediate areas if the data and attr forks are not both in BMBT
> format.
> +Second, if there are sufficiently few fork mappings, it may be
> possible to use
> +EXTENTS format instead of BMBT, which may require a conversion.
> +Third, the incore extent map must be reloaded carefully to avoid
> disturbing
> +any delayed allocation extents.
> +
> +The proposed patchset is the
> +`file repair
> +<
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> log/?h=repair-inodes>`_
> +series.
So I'm assuming links to kernel.org are acceptable as it looks like you
use them here, but it does imply that they need to sort of live
forever, or at least as long as any document that uses them?

> +
> +.. _reaping:
> +
> +Reaping Old Metadata Blocks
> +---------------------------
> +
> +Whenever online fsck builds a new data structure to replace one that
> is
> +suspect, there is a question of how to find and dispose of the
> blocks that
> +belonged to the old structure.
> +The laziest method of course is not to deal with them at all, but
> this slowly
> +leads to service degradations as space leaks out of the filesystem.
> +Hopefully, someone will schedule a rebuild of the free space
> information to
> +plug all those leaks.
> +Offline repair rebuilds all space metadata after recording the usage
> of
> +the files and directories that it decides not to clear, hence it can
> build new
> +structures in the discovered free space and avoid the question of
> reaping.
> +
> +As part of a repair, online fsck relies heavily on the reverse
> mapping records
> +to find space that is owned by the corresponding rmap owner yet
> truly free.
> +Cross referencing rmap records with other rmap records is necessary
> because
> +there may be other data structures that also think they own some of
> those
> +blocks (e.g. crosslinked trees).
> +Permitting the block allocator to hand them out again will not push
> the system
> +towards consistency.
> +
> +For space metadata, the process of finding extents to dispose of
> generally
> +follows this format:
> +
> +1. Create a bitmap of space used by data structures that must be
> preserved.
> +   The space reservations used to create the new metadata can be
> used here if
> +   the same rmap owner code is used to denote all of the objects
> being rebuilt.
> +
> +2. Survey the reverse mapping data to create a bitmap of space owned
> by the
> +   same ``XFS_RMAP_OWN_*`` number for the metadata that is being
> preserved.
> +
> +3. Use the bitmap disunion operator to subtract (1) from (2).
> +   The remaining set bits represent candidate extents that could be
> freed.
> +   The process moves on to step 4 below.
> +
> +Repairs for file-based metadata such as extended attributes,
> directories,
> +symbolic links, quota files and realtime bitmaps are performed by
> building a
> +new structure attached to a temporary file and swapping the forks.
> +Afterward, the mappings in the old file fork are the candidate
> blocks for
> +disposal.
> +
> +The process for disposing of old extents is as follows:
> +
> +4. For each candidate extent, count the number of reverse mapping
> records for
> +   the first block in that extent that do not have the same rmap
> owner for the
> +   data structure being repaired.
> +
> +   - If zero, the block has a single owner and can be freed.
> +
> +   - If not, the block is part of a crosslinked structure and must
> not be
> +     freed.
> +
> +5. Starting with the next block in the extent, figure out how many
> more blocks
> +   have the same zero/nonzero other owner status as that first
> block.
> +
> +6. If the region is crosslinked, delete the reverse mapping entry
> for the
> +   structure being repaired and move on to the next region.
> +
> +7. If the region is to be freed, mark any corresponding buffers in
> the buffer
> +   cache as stale to prevent log writeback.
> +
> +8. Free the region and move on.
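
To make the disunion in step 3 and the crosslink test in steps 4-6
concrete, here's a toy model that uses one uint64_t per bitmap for a
pretend 64-block AG and handles one block at a time (the real code works
on extent bitmaps, counts actual rmap records, and coalesces runs per
step 5).  Everything named example_* is invented:

    #include <stdint.h>

    static void example_remove_rmap(unsigned int bno)        { /* step 6 */ }
    static void example_invalidate_buffers(unsigned int bno) { /* step 7 */ }
    static void example_free_block(unsigned int bno)         { /* step 8 */ }

    struct example_reap {
            uint64_t owned_by_owner;   /* rmaps with the owner being rebuilt */
            uint64_t used_by_new_tree; /* blocks handed to the new structure */
            uint64_t crosslinked;      /* blocks that other owners also map */
    };

    static void example_reap_candidates(const struct example_reap *r)
    {
            /* step 3: disunion leaves only the old structure's blocks */
            uint64_t candidates = r->owned_by_owner & ~r->used_by_new_tree;
            unsigned int bno;

            for (bno = 0; bno < 64; bno++) {
                    if (!(candidates & (1ULL << bno)))
                            continue;
                    if (r->crosslinked & (1ULL << bno)) {
                            /* another owner exists: drop only our rmap */
                            example_remove_rmap(bno);
                    } else {
                            /* sole owner: stale any buffers, then free */
                            example_invalidate_buffers(bno);
                            example_free_block(bno);
                    }
            }
    }
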
I think this part is as straightforward as it can be.  I like links,
but they do have maintenance issues if the branch ever goes away.  It
may be worth it though, just while the code is going through review; I
think it really helps to be able to jump right into the code it's
trying to describe rather than trying to track it down based on the
description.

https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/tree/fs/xfs/scrub/reap.c?h=repair-ag-btrees&id=d866f0e470b077806c994f4434bbe64e4a3a8662#n471:~:text=xrep_reap_ag_metadata(

I think that's the right one?  Tiny links are nice for when steps are
buried in sub functions too.

> +
> +However, there is one complication to this procedure.
> +Transactions are of finite size, so the reaping process must be
> careful to roll
> +the transactions to avoid overruns.
> +Overruns come from two sources:
> +
> +a. EFIs logged on behalf of space that is no longer occupied
> +
> +b. Log items for buffer invalidations
> +
> +This is also a window in which a crash during the reaping process
> can leak
> +blocks.
> +As stated earlier, online repair functions use very large
> transactions to
> +minimize the chances of this occurring.
> +
> +The proposed patchset is the
> +`preparation for bulk loading btrees
> +<
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> log/?h=repair-prep-for-bulk-loading>`_
> +series.
> +
> +Case Study: Reaping After a Regular Btree Repair
> +````````````````````````````````````````````````
> +
> +Old reference count and inode btrees are the easiest to reap because
> they have
> +rmap records with special owner codes: ``XFS_RMAP_OWN_REFC`` for the
> refcount
> +btree, and ``XFS_RMAP_OWN_INOBT`` for the inode and free inode
> btrees.
> +Creating a list of extents to reap the old btree blocks is quite
> simple,
> +conceptually:
> +
> +1. Lock the relevant AGI/AGF header buffers to prevent allocation
> and frees.
> +
> +2. For each reverse mapping record with an rmap owner corresponding
> to the
> +   metadata structure being rebuilt, set the corresponding range in
> a bitmap.
> +
> +3. Walk the current data structures that have the same rmap owner.
> +   For each block visited, clear that range in the above bitmap.
> +
> +4. Each set bit in the bitmap represents a block that could be a
> block from the
> +   old data structures and hence is a candidate for reaping.
> +   In other words, ``(rmap_records_owned_by &
> ~blocks_reachable_by_walk)``
> +   are the blocks that might be freeable.
> +
> +If it is possible to maintain the AGF lock throughout the repair
> (which is the
> +common case), then step 2 can be performed at the same time as the
> reverse
> +mapping record walk that creates the records for the new btree.
> +
> +Case Study: Rebuilding the Free Space Indices
> +`````````````````````````````````````````````
> +
> +The high level process to rebuild the free space indices is:
Looks like this one
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=repair-ag-btrees&id=bf5f10a91ca58d883ef1231a406fa0646c4c4e50#:~:text=%2B%20*/-,%2BSTATIC%20int,-%2Bxrep_abt_build_new_trees(

> +
> +1. Walk the reverse mapping records to generate ``struct
> xfs_alloc_rec_incore``
> +   records from the gaps in the reverse mapping btree.
> +
> +2. Append the records to an xfarray.
> +
> +3. Use the ``xfs_btree_bload_compute_geometry`` function to compute
> the number
> +   of blocks needed for each new tree.
> +
> +4. Allocate the number of blocks computed in the previous step from
> the free
> +   space information collected.
> +
> +5. Use ``xfs_btree_bload`` to write the xfarray records to btree
> blocks and
> +   generate the internal node blocks for the free space by block
> index.
> +   Call it again for the free space by length index.
nit: these two loads are flipped

> +
> +6. Commit the locations of the new btree root blocks to the AGF.
> +
> +7. Reap the old btree blocks by looking for space that is not
> recorded by the
> +   reverse mapping btree, the new free space btrees, or the AGFL.
> +
> +Repairing the free space btrees has three key complications over a
> regular
> +btree repair:
> +
> +First, free space is not explicitly tracked in the reverse mapping
> records.
> +Hence, the new free space records must be inferred from gaps in the
> physical
> +space component of the keyspace of the reverse mapping btree.
> +
> +Second, free space repairs cannot use the common btree reservation
> code because
> +new blocks are reserved out of the free space btrees.
> +This is impossible when repairing the free space btrees themselves.
> +However, repair holds the AGF buffer lock for the duration of the
> free space
> +index reconstruction, so it can use the collected free space
> information to
> +supply the blocks for the new free space btrees.
> +It is not necessary to back each reserved extent with an EFI because
> the new
> +free space btrees are constructed in what the ondisk filesystem
> thinks is
> +unowned space.
> +However, if reserving blocks for the new btrees from the collected
> free space
> +information changes the number of free space records, repair must
> re-estimate
> +the new free space btree geometry with the new record count until
> the
> +reservation is sufficient.
> +As part of committing the new btrees, repair must ensure that
> reverse mappings
> +are created for the reserved blocks and that unused reserved blocks
> are
> +inserted into the free space btrees.
> +Deferred rmap and freeing operations are used to ensure that this
> transition
> +is atomic, similar to the other btree repair functions.
> +
> +Third, finding the blocks to reap after the repair is not overly
> +straightforward.
> +Blocks for the free space btrees and the reverse mapping btrees are
> supplied by
> +the AGFL.
> +Blocks put onto the AGFL have reverse mapping records with the owner
> +``XFS_RMAP_OWN_AG``.
> +This ownership is retained when blocks move from the AGFL into the
> free space
> +btrees or the reverse mapping btrees.
> +When repair walks reverse mapping records to synthesize free space
> records, it
> +creates a bitmap (``ag_owner_bitmap``) of all the space claimed by
> +``XFS_RMAP_OWN_AG`` records.
> +The repair context maintains a second bitmap corresponding to the
> rmap btree
> +blocks and the AGFL blocks (``rmap_agfl_bitmap``).
> +When the walk is complete, the bitmap disunion operation
> ``(ag_owner_bitmap &
> +~rmap_agfl_bitmap)`` computes the extents that are used by the old
> free space
> +btrees.
> +These blocks can then be reaped using the methods outlined above.
> +
> +The proposed patchset is the
> +`AG btree repair
> +<
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> log/?h=repair-ag-btrees>`_
> +series.
I think we've repeated this link a couple of times in the doc.  If you
like highlight links, we could clean out the duplicates.

> +
> +.. _rmap_reap:
> +
> +Case Study: Reaping After Repairing Reverse Mapping Btrees
> +``````````````````````````````````````````````````````````
> +
> +Old reverse mapping btrees are less difficult to reap after a
> repair.
> +As mentioned in the previous section, blocks on the AGFL, the two
> free space
> +btree blocks, and the reverse mapping btree blocks all have reverse
> mapping
> +records with ``XFS_RMAP_OWN_AG`` as the owner.
> +The full process of gathering reverse mapping records and building a
> new btree
> +are described in the case study of
> +:ref:`live rebuilds of rmap data <rmap_repair>`, but a crucial point
> from that
> +discussion is that the new rmap btree will not contain any records
> for the old
> +rmap btree, nor will the old btree blocks be tracked in the free
> space btrees.
> +The list of candidate reaping blocks is computed by setting the bits
> +corresponding to the gaps in the new rmap btree records, and then
> clearing the
> +bits corresponding to extents in the free space btrees and the
> current AGFL
> +blocks.
> +The result ``(new_rmapbt_gaps & ~(agfl | bnobt_records))`` are
> reaped using the
> +methods outlined above.
> +
> +The rest of the process of rebuilding the reverse mapping btree is
> discussed
> +in a separate :ref:`case study<rmap_repair>`.
> +
> +The proposed patchset is the
> +`AG btree repair
> +<
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> log/?h=repair-ag-btrees>`_
> +series.
> +
> +Case Study: Rebuilding the AGFL
> +```````````````````````````````
> +
> +The allocation group free block list (AGFL) is repaired as follows:
> +
> +1. Create a bitmap for all the space that the reverse mapping data
> claims is
> +   owned by ``XFS_RMAP_OWN_AG``.
> +
> +2. Subtract the space used by the two free space btrees and the rmap
> btree.
> +
> +3. Subtract any space that the reverse mapping data claims is owned
> by any
> +   other owner, to avoid re-adding crosslinked blocks to the AGFL.
> +
> +4. Once the AGFL is full, reap any blocks leftover.
> +
> +5. The next operation to fix the freelist will right-size the list.
> 
Branch link?  Looks like maybe it's missing.  In fact this logic looks
like it might have been cut off?

In any case, maybe give some thought to the highlight link suggestions.

Allison


^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 07/14] xfs: document pageable kernel memory
  2023-02-09  5:41         ` Allison Henderson
@ 2023-02-09 23:14           ` Darrick J. Wong
  2023-02-25  7:32             ` Allison Henderson
  0 siblings, 1 reply; 220+ messages in thread
From: Darrick J. Wong @ 2023-02-09 23:14 UTC (permalink / raw)
  To: Allison Henderson
  Cc: Catherine Hoang, david, willy, linux-xfs, Chandan Babu,
	linux-fsdevel, hch

On Thu, Feb 09, 2023 at 05:41:22AM +0000, Allison Henderson wrote:
> On Thu, 2023-02-02 at 15:14 -0800, Darrick J. Wong wrote:
> > On Thu, Feb 02, 2023 at 07:14:22AM +0000, Allison Henderson wrote:
> > > On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <djwong@kernel.org>
> > > > 
> > > > Add a discussion of pageable kernel memory, since online fsck
> > > > needs
> > > > quite a bit more memory than most other parts of the filesystem
> > > > to
> > > > stage
> > > > records and other information.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > > ---
> > > >  .../filesystems/xfs-online-fsck-design.rst         |  490
> > > > ++++++++++++++++++++
> > > >  1 file changed, 490 insertions(+)
> > > > 
> > > > 
> > > > diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst
> > > > b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > > index 419eb54ee200..9d7a2ef1d0dd 100644
> > > > --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> > > > +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > > @@ -383,6 +383,8 @@ Algorithms") of Srinivasan.
> > > >  However, any data structure builder that maintains a resource
> > > > lock
> > > > for the
> > > >  duration of the repair is *always* an offline algorithm.
> > > >  
> > > > +.. _secondary_metadata:
> > > > +
> > > >  Secondary Metadata
> > > >  ``````````````````
> > > >  
> > > > @@ -1746,3 +1748,491 @@ Scrub teardown disables all static keys
> > > > obtained by ``xchk_fshooks_enable``.
> > > >  
> > > >  For more information, please see the kernel documentation of
> > > >  Documentation/staging/static-keys.rst.
> > > > +
> > > > +.. _xfile:
> > > > +
> > > > +Pageable Kernel Memory
> > > > +----------------------
> > > > +
> > > > +Demonstrations of the first few prototypes of online repair
> > > > revealed
> > > > new
> > > > +technical requirements that were not originally identified.
> > > > +For the first demonstration, the code walked whatever filesystem
> > > > +metadata it needed to synthesize new records and inserted
> > > > records
> > > > into a new
> > > > +btree as it found them.
> > > > +This was subpar since any additional corruption or runtime
> > > > errors
> > > > encountered
> > > > +during the walk would shut down the filesystem.
> > > > +After remount, the blocks containing the half-rebuilt data
> > > > structure
> > > > would not
> > > > +be accessible until another repair was attempted.
> > > > +Solving the problem of half-rebuilt data structures will be
> > > > discussed in the
> > > > +next section.
> > > > +
> > > > +For the second demonstration, the synthesized records were
> > > > instead
> > > > stored in
> > > > +kernel slab memory.
> > > > +Doing so enabled online repair to abort without writing to the
> > > > filesystem if
> > > > +the metadata walk failed, which prevented online fsck from
> > > > making
> > > > things worse.
> > > > +However, even this approach needed improving upon.
> > > > +
> > > > +There are four reasons why traditional Linux kernel memory
> > > > management isn't
> > > > +suitable for storing large datasets:
> > > > +
> > > > +1. Although it is tempting to allocate a contiguous block of
> > > > memory
> > > > to create a
> > > > +   C array, this cannot easily be done in the kernel because it
> > > > cannot be
> > > > +   relied upon to allocate multiple contiguous memory pages.
> > > > +
> > > > +2. While disparate physical pages can be virtually mapped
> > > > together,
> > > > installed
> > > > +   memory might still not be large enough to stage the entire
> > > > record
> > > > set in
> > > > +   memory while constructing a new btree.
> > > > +
> > > > +3. To overcome these two difficulties, the implementation was
> > > > adjusted to use
> > > > +   doubly linked lists, which means every record object needed
> > > > two
> > > > 64-bit list
> > > > +   head pointers, which is a lot of overhead.
> > > > +
> > > > +4. Kernel memory is pinned, which can drive the system out of
> > > > memory, leading
> > > > +   to OOM kills of unrelated processes.
> > > > +
> > > I think I might just jump to whatever the current plan is
> > > instead of trying to keep a record of the dev history in the
> > > document.
> > > I'm sure we're not done yet, dev really never is, so in order for
> > > the
> > > documentation to be maintained, it would just get bigger and bigger
> > > to
> > > keep documenting it this way.  It's not that the above isn't
> > > valuable,
> > > but maybe a different kind of document really.
> > 
> > OK, I've shortened this introduction to outline the requirements, and
> > trimmed the historical information to a sidebar:
> > 
> > "Some online checking functions work by scanning the filesystem to
> > build
> > a shadow copy of an ondisk metadata structure in memory and comparing
> > the two copies. For online repair to rebuild a metadata structure, it
> > must compute the record set that will be stored in the new structure
> > before it can persist that new structure to disk. Ideally, repairs
> > complete with a single atomic commit that introduces a new data
> > structure. To meet these goals, the kernel needs to collect a large
> > amount of information in a place that doesn’t require the correct
> > operation of the filesystem.
> > 
> > "Kernel memory isn’t suitable because:
> > 
> > *   Allocating a contiguous region of memory to create a C array is
> > very
> >     difficult, especially on 32-bit systems.
> > 
> > *   Linked lists of records introduce double pointer overhead which
> > is
> >     very high and eliminate the possibility of indexed lookups.
> > 
> > *   Kernel memory is pinned, which can drive the system into OOM
> >     conditions.
> > 
> > *   The system might not have sufficient memory to stage all the
> >     information.
> > 
> > "At any given time, online fsck does not need to keep the entire
> > record
> > set in memory, which means that individual records can be paged out
> > if
> > necessary. Continued development of online fsck demonstrated that the
> > ability to perform indexed data storage would also be very useful.
> > Fortunately, the Linux kernel already has a facility for
> > byte-addressable and pageable storage: tmpfs. In-kernel graphics
> > drivers
> > (most notably i915) take advantage of tmpfs files to store
> > intermediate
> > data that doesn’t need to be in memory at all times, so that usage
> > precedent is already established. Hence, the xfile was born!
> > 
> > Historical Sidebar
> > ------------------
> > 
> > "The first edition of online repair inserted records into a new btree
> > as
> > it found them, which failed because the filesystem could shut down with a
> > half-built data structure, which would be live after recovery finished.
> > 
> > "The second edition solved the half-rebuilt structure problem by
> > storing
> > everything in memory, but frequently ran the system out of memory.
> > 
> > "The third edition solved the OOM problem by using linked lists, but
> > the
> > list overhead was extreme."
> Ok, I think that's cleaner
> 
> > 
> > > 
> > > 
> > > > +For the third iteration, attention swung back to the possibility
> > > > of
> > > > using
> > > 
> > > Due to the large volume of metadata that needs to be processed,
> > > ofsck
> > > uses...
> > > 
> > > > +byte-indexed array-like storage to reduce the overhead of in-
> > > > memory
> > > > records.
> > > > +At any given time, online repair does not need to keep the
> > > > entire
> > > > record set in
> > > > +memory, which means that individual records can be paged out.
> > > > +Creating new temporary files in the XFS filesystem to store
> > > > intermediate data
> > > > +was explored and rejected for some types of repairs because a
> > > > filesystem with
> > > > +compromised space and inode metadata should never be used to fix
> > > > compromised
> > > > +space or inode metadata.
> > > > +However, the kernel already has a facility for byte-addressable
> > > > and
> > > > pageable
> > > > +storage: shmfs.
> > > > +In-kernel graphics drivers (most notably i915) take advantage of
> > > > shmfs files
> > > > +to store intermediate data that doesn't need to be in memory at
> > > > all
> > > > times, so
> > > > +that usage precedent is already established.
> > > > +Hence, the ``xfile`` was born!
> > > > +
> > > > +xfile Access Models
> > > > +```````````````````
> > > > +
> > > > +A survey of the intended uses of xfiles suggested these use
> > > > cases:
> > > > +
> > > > +1. Arrays of fixed-sized records (space management btrees,
> > > > directory
> > > > and
> > > > +   extended attribute entries)
> > > > +
> > > > +2. Sparse arrays of fixed-sized records (quotas and link counts)
> > > > +
> > > > +3. Large binary objects (BLOBs) of variable sizes (directory and
> > > > extended
> > > > +   attribute names and values)
> > > > +
> > > > +4. Staging btrees in memory (reverse mapping btrees)
> > > > +
> > > > +5. Arbitrary contents (realtime space management)
> > > > +
> > > > +To support the first four use cases, high level data structures
> > > > wrap
> > > > the xfile
> > > > +to share functionality between online fsck functions.
> > > > +The rest of this section discusses the interfaces that the xfile
> > > > presents to
> > > > +four of those five higher level data structures.
> > > > +The fifth use case is discussed in the :ref:`realtime summary
> > > > <rtsummary>` case
> > > > +study.
> > > > +
> > > > +The most general storage interface supported by the xfile
> > > > enables
> > > > the reading
> > > > +and writing of arbitrary quantities of data at arbitrary offsets
> > > > in
> > > > the xfile.
> > > > +This capability is provided by ``xfile_pread`` and
> > > > ``xfile_pwrite``
> > > > functions,
> > > > +which behave similarly to their userspace counterparts.
> > > > +XFS is very record-based, which suggests that the ability to
> > > > load
> > > > and store
> > > > +complete records is important.
> > > > +To support these cases, a pair of ``xfile_obj_load`` and
> > > > ``xfile_obj_store``
> > > > +functions are provided to read and persist objects into an
> > > > xfile.
> > > > +They are internally the same as pread and pwrite, except that
> > > > they
> > > > treat any
> > > > +error as an out of memory error.
> > > > +For online repair, squashing error conditions in this manner is
> > > > an
> > > > acceptable
> > > > +behavior because the only reaction is to abort the operation
> > > > back to
> > > > userspace.
> > > > +All five xfile usecases can be serviced by these four functions.
> > > > +
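
For what it's worth, here's how I picture the object helpers relating to
the positional helpers, going only by the paragraph above.  The
xfile_pread/xfile_pwrite prototypes are assumed to mirror their
userspace namesakes, and the real patch may well differ:

    #include <linux/fs.h>

    struct xfile;

    /* Assumed signatures; see the caveat above. */
    ssize_t xfile_pread(struct xfile *xf, void *buf, size_t count,
                        loff_t pos);
    ssize_t xfile_pwrite(struct xfile *xf, const void *buf, size_t count,
                         loff_t pos);

    /* Load an object, squashing short reads and errors to ENOMEM. */
    static inline int xfile_obj_load(struct xfile *xf, void *buf,
                                     size_t count, loff_t pos)
    {
            ssize_t ret = xfile_pread(xf, buf, count, pos);

            if (ret < 0 || ret != (ssize_t)count)
                    return -ENOMEM;
            return 0;
    }

    /* Store an object, with the same error squashing. */
    static inline int xfile_obj_store(struct xfile *xf, const void *buf,
                                      size_t count, loff_t pos)
    {
            ssize_t ret = xfile_pwrite(xf, buf, count, pos);

            if (ret < 0 || ret != (ssize_t)count)
                    return -ENOMEM;
            return 0;
    }
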
> > > > +However, no discussion of file access idioms is complete without
> > > > answering the
> > > > +question, "But what about mmap?"
> > > I actually wouldn't spend too much time discussing solutions that
> > > didn't work for whatever reason, unless someone's really asking for
> > > it.
> > >  I think this section would read just fine with the last paragraph
> > > here trimmed off.
> > 
> > Since I wrote this, I've been experimenting with wiring up the tmpfs
> > file page cache folios to the xfs buffer cache.  Pinning the folios
> > in
> > this manner makes it so that online fsck can (more or less) directly
> > access the xfile contents.  Much to my surprise, this has actually
> > held
> > up in testing, so ... it's no longer a solution that "didn't really
> > work". :)
> > 
> > I also need to s/page/folio/ now that willy has finished that
> > conversion.  This section has been rewritten as such:
> > 
> > "However, no discussion of file access idioms is complete without
> > answering the question, “But what about mmap?” It is convenient to
> > access storage directly with pointers, just like userspace code does
> > with regular memory. Online fsck must not drive the system into OOM
> > conditions, which means that xfiles must be responsive to memory
> > reclamation. tmpfs can only push a pagecache folio to the swap cache
> > if
> > the folio is neither pinned nor locked, which means the xfile must
> > not
> > pin too many folios.
> > 
> > "Short term direct access to xfile contents is done by locking the
> > pagecache folio and mapping it into kernel address space.
> > Programmatic
> > access (e.g. pread and pwrite) uses this mechanism. Folio locks are
> > not
> > supposed to be held for long periods of time, so long term direct
> > access
> > to xfile contents is done by bumping the folio refcount, mapping it
> > into
> > kernel address space, and dropping the folio lock. These long term
> > users
> > must be responsive to memory reclaim by hooking into the shrinker
> > infrastructure to know when to release folios.
> > 
> > "The xfile_get_page and xfile_put_page functions are provided to
> > retrieve the (locked) folio that backs part of an xfile and to
> > release
> > it. The only code to use these folio lease functions are the xfarray
> > sorting algorithms and the in-memory btrees."
> Alrighty, sounds like a good update then
> 
> > 
> > > > +It would be *much* more convenient if kernel code could access
> > > > pageable kernel
> > > > +memory with pointers, just like userspace code does with regular
> > > > memory.
> > > > +Like any other filesystem that uses the page cache, reads and
> > > > writes
> > > > of xfile
> > > > +data lock the cache page and map it into the kernel address
> > > > space
> > > > for the
> > > > +duration of the operation.
> > > > +Unfortunately, shmfs can only write a file page to the swap
> > > > device
> > > > if the page
> > > > +is unmapped and unlocked, which means the xfile risks causing
> > > > OOM
> > > > problems
> > > > +unless it is careful not to pin too many pages.
> > > > +Therefore, the xfile steers most of its users towards
> > > > programmatic
> > > > access so
> > > > +that backing pages are not kept locked in memory for longer than
> > > > is
> > > > necessary.
> > > > +However, for callers performing quick linear scans of xfile
> > > > data,
> > > > +``xfile_get_page`` and ``xfile_put_page`` functions are provided
> > > > to
> > > > pin a page
> > > > +in memory.
> > > > +So far, the only code to use these functions are the xfarray
> > > > :ref:`sorting
> > > > +<xfarray_sort>` algorithms.
> > > > +
> > > > +xfile Access Coordination
> > > > +`````````````````````````
> > > > +
> > > > +For security reasons, xfiles must be owned privately by the
> > > > kernel.
> > > > +They are marked ``S_PRIVATE`` to prevent interference from the
> > > > security system,
> > > > +must never be mapped into process file descriptor tables, and
> > > > their
> > > > pages must
> > > > +never be mapped into userspace processes.
> > > > +
> > > > +To avoid locking recursion issues with the VFS, all accesses to
> > > > the
> > > > shmfs file
> > > > +are performed by manipulating the page cache directly.
> > > > +xfile writes call the ``->write_begin`` and ``->write_end``
> > > > functions of the
> > > > +xfile's address space to grab writable pages, copy the caller's
> > > > buffer into the
> > > > +page, and release the pages.
> > > > +xfile reads call ``shmem_read_mapping_page_gfp`` to grab pages
> > > xfile readers
> > 
> > OK.
> > 
> > > > directly before
> > > > +copying the contents into the caller's buffer.
> > > > +In other words, xfiles ignore the VFS read and write code paths
> > > > to
> > > > avoid
> > > > +having to create a dummy ``struct kiocb`` and to avoid taking
> > > > inode
> > > > and
> > > > +freeze locks.
> > > > +
> > > > +If an xfile is shared between threads to stage repairs, the
> > > > caller
> > > > must provide
> > > > +its own locks to coordinate access.
> > > Online fsck threads that share an xfile between stage repairs will
> > > use their own locks to coordinate access with each other.
> > > 
> > > ?
> > 
> > Hm.  I wonder if there's a misunderstanding here?
> > 
> > Online fsck functions themselves are single-threaded, which is to say
> > that they themselves neither queue workers nor start kthreads. 
> > However,
> > an xfile created by a running fsck function can be accessed from
> > other threads if the fsck function also hooks itself into filesystem
> > code.
> > 
> > The live update section has a nice diagram of how that works:
> > https://djwong.org/docs/xfs-online-fsck-design/#filesystem-hooks
> > 
> 
> Oh ok, I think I got hung up on who the callers were.  How about
> "xfiles shared between threads running from hooked filesystem functions
> will use their own locks to coordinate access with each other."

I don't want to mention filesystem hooks before the chapter that
introduces them.  How about:

"For example, if a scrub function stores scan results in an xfile and
needs other threads to provide updates to the scanned data, the scrub
function must provide a lock for all threads to share."
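
A minimal user-space sketch of that idea, with a pthread mutex standing
in for the kernel lock and a counter standing in for the shared xfile
contents (every name here is made up for illustration, this is not the
kernel interface):

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t scan_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long scan_records;	/* stand-in for the shared xfile */

/* another thread feeding a live update into the scan data */
static void *live_update_thread(void *arg)
{
	pthread_mutex_lock(&scan_lock);
	scan_records++;
	pthread_mutex_unlock(&scan_lock);
	return NULL;
}

int main(void)
{
	pthread_t tid;

	pthread_create(&tid, NULL, live_update_thread, NULL);

	/* the scrub function's own scan loop takes the same lock */
	pthread_mutex_lock(&scan_lock);
	scan_records++;
	pthread_mutex_unlock(&scan_lock);

	pthread_join(tid, NULL);
	printf("%lu records\n", scan_records);
	return 0;
}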

--D

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 08/14] xfs: document btree bulk loading
  2023-02-09  5:47     ` Allison Henderson
@ 2023-02-10  0:24       ` Darrick J. Wong
  2023-02-16 15:46         ` Allison Henderson
  0 siblings, 1 reply; 220+ messages in thread
From: Darrick J. Wong @ 2023-02-10  0:24 UTC (permalink / raw)
  To: Allison Henderson
  Cc: Catherine Hoang, david, willy, linux-xfs, Chandan Babu,
	linux-fsdevel, hch

On Thu, Feb 09, 2023 at 05:47:17AM +0000, Allison Henderson wrote:
> On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Add a discussion of the btree bulk loading code, which makes it easy
> > to
> > take an in-memory recordset and write it out to disk in an efficient
> > manner.  This also enables atomic switchover from the old to the new
> > structure with minimal potential for leaking the old blocks.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  .../filesystems/xfs-online-fsck-design.rst         |  632
> > ++++++++++++++++++++
> >  1 file changed, 632 insertions(+)
> > 
> > 
> > diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst
> > b/Documentation/filesystems/xfs-online-fsck-design.rst
> > index 9d7a2ef1d0dd..eb61d867e55c 100644
> > --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> > +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> > @@ -2236,3 +2236,635 @@ this functionality as follows:
> >  
> >  After removing xfile logged buffers from the transaction in this
> > manner, the
> >  transaction can be committed or cancelled.
> > +
> > +Bulk Loading of Ondisk B+Trees
> > +------------------------------
> > +
> > +As mentioned previously, early iterations of online repair built new
> > btree
> > +structures by creating a new btree and adding observations
> > individually.
> > +Loading a btree one record at a time had a slight advantage of not
> > requiring
> > +the incore records to be sorted prior to commit, but was very slow
> > and leaked
> > +blocks if the system went down during a repair.
> > +Loading records one at a time also meant that repair could not
> > control the
> > +loading factor of the blocks in the new btree.
> > +
> > +Fortunately, the venerable ``xfs_repair`` tool had a more efficient
> > means for
> > +rebuilding a btree index from a collection of records -- bulk btree
> > loading.
> > +This was implemented rather inefficiently code-wise, since
> > ``xfs_repair``
> > +had separate copy-pasted implementations for each btree type.
> > +
> > +To prepare for online fsck, each of the four bulk loaders were
> > studied, notes
> > +were taken, and the four were refactored into a single generic btree
> > bulk
> > +loading mechanism.
> > +Those notes in turn have been refreshed and are presented below.
> > +
> > +Geometry Computation
> > +````````````````````
> > +
> > +The zeroth step of bulk loading is to assemble the entire record set
> > that will
> > +be stored in the new btree, and sort the records.
> > +Next, call ``xfs_btree_bload_compute_geometry`` to compute the shape
> > of the
> > +btree from the record set, the type of btree, and any load factor
> > preferences.
> > +This information is required for resource reservation.
> > +
> > +First, the geometry computation computes the minimum and maximum
> > records that
> > +will fit in a leaf block from the size of a btree block and the size
> > of the
> > +block header.
> > +Roughly speaking, the maximum number of records is::
> > +
> > +        maxrecs = (block_size - header_size) / record_size
> > +
> > +The XFS design specifies that btree blocks should be merged when
> > possible,
> > +which means the minimum number of records is half of maxrecs::
> > +
> > +        minrecs = maxrecs / 2
> > +
> > +The next variable to determine is the desired loading factor.
> > +This must be at least minrecs and no more than maxrecs.
> > +Choosing minrecs is undesirable because it wastes half the block.
> > +Choosing maxrecs is also undesirable because adding a single record
> > to each
> > +newly rebuilt leaf block will cause a tree split, which causes a
> > noticeable
> > +drop in performance immediately afterwards.
> > +The default loading factor was chosen to be 75% of maxrecs, which
> > provides a
> > +reasonably compact structure without any immediate split penalties.
> 	default_lload_factor = (maxrecs + minrecs) / 2;
> > +If space is tight, the loading factor will be set to maxrecs to try
> > to avoid
> > +running out of space::
> > +
> > +        leaf_load_factor = enough space ? (maxrecs + minrecs) / 2 :
> > maxrecs
> 	leaf_load_factor = enough space ? default_lload_factor :
> maxrecs;
> 
> Just more readable i think

Ok, changed.

> 
> > +
> > +Load factor is computed for btree node blocks using the combined
> > size of the
> > +btree key and pointer as the record size::
> > +
> > +        maxrecs = (block_size - header_size) / (key_size + ptr_size)
> > +        minrecs = maxrecs / 2
> 	default_nload_factor = (maxrecs + minrecs) / 2;
> 
> > +        node_load_factor = enough space ? (maxrecs + minrecs) / 2 :
> > maxrecs
> 	node_load_factor = enough space ? default_nload_factor :
> maxrecs;

Here too.
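
To make the arithmetic concrete, here is a minimal user-space C sketch
of the load factor selection and per-level block count computation
described above.  The block, header, record, key, and pointer sizes are
made-up example numbers, and the helper names are illustrative; this is
not the actual xfs_btree_bload_compute_geometry code.

#include <stdbool.h>
#include <stdio.h>

/* ceil(a / b) for positive integers */
static unsigned long long howmany(unsigned long long a, unsigned long long b)
{
	return (a + b - 1) / b;
}

static unsigned int load_factor(unsigned int maxrecs, bool space_is_tight)
{
	unsigned int minrecs = maxrecs / 2;
	unsigned int dflt = (maxrecs + minrecs) / 2;	/* ~75% of maxrecs */

	return space_is_tight ? maxrecs : dflt;
}

int main(void)
{
	unsigned int block_size = 4096, header_size = 56;
	unsigned int record_size = 16, key_size = 8, ptr_size = 4;
	unsigned long long record_count = 1000000;

	unsigned int leaf_maxrecs = (block_size - header_size) / record_size;
	unsigned int node_maxrecs = (block_size - header_size) /
				    (key_size + ptr_size);

	/* Leaf level first, then node levels until one block suffices. */
	unsigned long long nblocks = howmany(record_count,
					     load_factor(leaf_maxrecs, false));
	unsigned long long total = nblocks;
	unsigned int height = 1;

	while (nblocks > 1) {
		nblocks = howmany(nblocks, load_factor(node_maxrecs, false));
		total += nblocks;
		height++;
	}

	printf("height %u, %llu blocks\n", height, total);
	return 0;
}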

> > +
> > +Once that's done, the number of leaf blocks required to store the
> > record set
> > +can be computed as::
> > +
> > +        leaf_blocks = ceil(record_count / leaf_load_factor)
> > +
> > +The number of node blocks needed to point to the next level down in
> > the tree
> > +is computed as::
> > +
> > +        n_blocks = (n == 0 ? leaf_blocks : node_blocks[n])
> > +        node_blocks[n + 1] = ceil(n_blocks / node_load_factor)
> > +
> > +The entire computation is performed recursively until the current
> > level only
> > +needs one block.
> > +The resulting geometry is as follows:
> > +
> > +- For AG-rooted btrees, this level is the root level, so the height
> > of the new
> > +  tree is ``level + 1`` and the space needed is the summation of the
> > number of
> > +  blocks on each level.
> > +
> > +- For inode-rooted btrees where the records in the top level do not
> > fit in the
> > +  inode fork area, the height is ``level + 2``, the space needed is
> > the
> > +  summation of the number of blocks on each level, and the inode
> > fork points to
> > +  the root block.
> > +
> > +- For inode-rooted btrees where the records in the top level can be
> > stored in
> > +  the inode fork area, then the root block can be stored in the
> > inode, the
> > +  height is ``level + 1``, and the space needed is one less than the
> > summation
> > +  of the number of blocks on each level.
> > +  This only becomes relevant when non-bmap btrees gain the ability
> > to root in
> > +  an inode, which is a future patchset and only included here for
> > completeness.
> > +
> > +.. _newbt:
> > +
> > +Reserving New B+Tree Blocks
> > +```````````````````````````
> > +
> > +Once repair knows the number of blocks needed for the new btree, it
> > allocates
> > +those blocks using the free space information.
> > +Each reserved extent is tracked separately by the btree builder
> > state data.
> > +To improve crash resilience, the reservation code also logs an
> > Extent Freeing
> > +Intent (EFI) item in the same transaction as each space allocation
> > and attaches
> > +its in-memory ``struct xfs_extent_free_item`` object to the space
> > reservation.
> > +If the system goes down, log recovery will use the unfinished EFIs
> > to free the
> > +unused space, leaving the filesystem unchanged.
> > +
> > +Each time the btree builder claims a block for the btree from a
> > reserved
> > +extent, it updates the in-memory reservation to reflect the claimed
> > space.
> > +Block reservation tries to allocate as much contiguous space as
> > possible to
> > +reduce the number of EFIs in play.
> > +
> > +While repair is writing these new btree blocks, the EFIs created for
> > the space
> > +reservations pin the tail of the ondisk log.
> > +It's possible that other parts of the system will remain busy and
> > push the head
> > +of the log towards the pinned tail.
> > +To avoid livelocking the filesystem, the EFIs must not pin the tail
> > of the log
> > +for too long.
> > +To alleviate this problem, the dynamic relogging capability of the
> > deferred ops
> > +mechanism is reused here to commit a transaction at the log head
> > containing an
> > +EFD for the old EFI and new EFI at the head.
> > +This enables the log to release the old EFI to keep the log moving
> > forwards.
> > +
> > +EFIs have a role to play during the commit and reaping phases;
> > please see the
> > +next section and the section about :ref:`reaping<reaping>` for more
> > details.
> > +
> > +Proposed patchsets are the
> > +`bitmap rework
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > log/?h=repair-bitmap-rework>`_
> > +and the
> > +`preparation for bulk loading btrees
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > log/?h=repair-prep-for-bulk-loading>`_.
> > +
> > +
> > +Writing the New Tree
> > +````````````````````
> > +
> > +This part is pretty simple -- the btree builder
> > (``xfs_btree_bulkload``) claims
> > +a block from the reserved list, writes the new btree block header,
> > fills the
> > +rest of the block with records, and adds the new leaf block to a
> > list of
> > +written blocks.
> > +Sibling pointers are set every time a new block is added to the
> > level.
> > +When it finishes writing the record leaf blocks, it moves on to the
> > node
> > +blocks.
> > +To fill a node block, it walks each block in the next level down in
> > the tree
> > +to compute the relevant keys and write them into the parent node.
> > +When it reaches the root level, it is ready to commit the new btree!
> I think most of this is as straightforward as it can be, but it's a
> lot of visualizing too, which makes me wonder if it would benefit from
> a simple illustration if possible.
> 
> On a side note: In a prior team I discovered power points, while a lot
> of work, were also really effective for quickly moving a crowd of people
> through connected graph navigation/manipulations, because each one of
> these steps was another slide that illustrated how the structure
> evolved through the updates.  I realize that's not something that fits
> in the scheme of a document like this, but maybe something supplemental
> to add later.  While it was a time eater, I noticed a lot of confused
> expressions just seemed to shake loose, so sometimes it was worth it.

That was ... surprisingly less bad than I feared it would be to cut and
paste unicode linedraw characters and arrows.

          ┌─────────┐
          │root     │
          │PP       │
          └─────────┘
          ↙         ↘
      ┌────┐       ┌────┐
      │node│──────→│node│
      │PP  │←──────│PP  │
      └────┘       └────┘
      ↙   ↘         ↙   ↘
  ┌────┐ ┌────┐ ┌────┐ ┌────┐
  │leaf│→│leaf│→│leaf│→│leaf│
  │RRR │←│RRR │←│RRR │←│RRR │
  └────┘ └────┘ └────┘ └────┘

(Does someone have a program that does this?)

> 
> > +
> > +The first step to commit the new btree is to persist the btree
> > blocks to disk
> > +synchronously.
> > +This is a little complicated because a new btree block could have
> > been freed
> > +in the recent past, so the builder must use
> > ``xfs_buf_delwri_queue_here`` to
> > +remove the (stale) buffer from the AIL list before it can write the
> > new blocks
> > +to disk.
> > +Blocks are queued for IO using a delwri list and written in one
> > large batch
> > +with ``xfs_buf_delwri_submit``.
> > +
> > +Once the new blocks have been persisted to disk, control returns to
> > the
> > +individual repair function that called the bulk loader.
> > +The repair function must log the location of the new root in a
> > transaction,
> > +clean up the space reservations that were made for the new btree,
> > and reap the
> > +old metadata blocks:
> > +
> > +1. Commit the location of the new btree root.
> > +
> > +2. For each incore reservation:
> > +
> > +   a. Log Extent Freeing Done (EFD) items for all the space that was
> > consumed
> > +      by the btree builder.  The new EFDs must point to the EFIs
> > attached to
> > +      the reservation to prevent log recovery from freeing the new
> > blocks.
> > +
> > +   b. For unclaimed portions of incore reservations, create a
> > regular deferred
> > +      extent free work item to free the unused space later in the
> > +      transaction chain.
> > +
> > +   c. The EFDs and EFIs logged in steps 2a and 2b must not overrun
> > the
> > +      reservation of the committing transaction.
> > +      If the btree loading code suspects this might be about to
> > happen, it must
> > +      call ``xrep_defer_finish`` to clear out the deferred work and
> > obtain a
> > +      fresh transaction.
> > +
> > +3. Clear out the deferred work a second time to finish the commit
> > and clean
> > +   the repair transaction.
> > +
> > +The transaction rolling in steps 2c and 3 represent a weakness in
> > the repair
> > +algorithm, because a log flush and a crash before the end of the
> > reap step can
> > +result in space leaking.
> > +Online repair functions minimize the chances of this occurring by
> > using very
> > +large transactions, which each can accommodate many thousands of
> > block freeing
> > +instructions.
> > +Repair moves on to reaping the old blocks, which will be presented
> > in a
> > +subsequent :ref:`section<reaping>` after a few case studies of bulk
> > loading.
> > +
> > +Case Study: Rebuilding the Inode Index
> > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > +
> > +The high level process to rebuild the inode index btree is:
> > +
> > +1. Walk the reverse mapping records to generate ``struct
> > xfs_inobt_rec``
> > +   records from the inode chunk information and a bitmap of the old
> > inode btree
> > +   blocks.
> > +
> > +2. Append the records to an xfarray in inode order.
> > +
> > +3. Use the ``xfs_btree_bload_compute_geometry`` function to compute
> > the number
> > +   of blocks needed for the inode btree.
> > +   If the free space inode btree is enabled, call it again to
> > estimate the
> > +   geometry of the finobt.
> > +
> > +4. Allocate the number of blocks computed in the previous step.
> > +
> > +5. Use ``xfs_btree_bload`` to write the xfarray records to btree
> > blocks and
> > +   generate the internal node blocks.
> > +   If the free space inode btree is enabled, call it again to load
> > the finobt.
> > +
> > +6. Commit the location of the new btree root block(s) to the AGI.
> > +
> > +7. Reap the old btree blocks using the bitmap created in step 1.
> > +
> > +Details are as follows.
> > +
> > +The inode btree maps inumbers to the ondisk location of the
> > associated
> > +inode records, which means that the inode btrees can be rebuilt from
> > the
> > +reverse mapping information.
> > +Reverse mapping records with an owner of ``XFS_RMAP_OWN_INOBT``
> > mark the
> > +location of the old inode btree blocks.
> > +Each reverse mapping record with an owner of ``XFS_RMAP_OWN_INODES``
> > marks the
> > +location of at least one inode cluster buffer.
> > +A cluster is the smallest number of ondisk inodes that can be
> > allocated or
> > +freed in a single transaction; it is never smaller than 1 fs block
> > or 4 inodes.
> > +
> > +For the space represented by each inode cluster, ensure that there
> > are no
> > +records in the free space btrees nor any records in the reference
> > count btree.
> > +If there are, the space metadata inconsistencies are reason enough
> > to abort the
> > +operation.
> > +Otherwise, read each cluster buffer to check that its contents
> > appear to be
> > +ondisk inodes and to decide if the file is allocated
> > +(``xfs_dinode.i_mode != 0``) or free (``xfs_dinode.i_mode == 0``).
> > +Accumulate the results of successive inode cluster buffer reads
> > until there is
> > +enough information to fill a single inode chunk record, which is 64
> > consecutive
> > +numbers in the inumber keyspace.
> > +If the chunk is sparse, the chunk record may include holes.
> > +
> > +Once the repair function accumulates one chunk's worth of data, it
> > calls
> > +``xfarray_append`` to add the inode btree record to the xfarray.
> > +This xfarray is walked twice during the btree creation step -- once
> > to populate
> > +the inode btree with all inode chunk records, and a second time to
> > populate the
> > +free inode btree with records for chunks that have free non-sparse
> > inodes.
> > +The number of records for the inode btree is the number of xfarray
> > records,
> > +but the record count for the free inode btree has to be computed as
> > inode chunk
> > +records are stored in the xfarray.
> > +
> > +The proposed patchset is the
> > +`AG btree repair
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > log/?h=repair-ag-btrees>`_
> > +series.
> > +
> > +Case Study: Rebuilding the Space Reference Counts
> > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > +
> > +The high level process to rebuild the reference count btree is:
> > +
> > +1. Walk the reverse mapping records to generate ``struct
> > xfs_refcount_irec``
> > +   records for any space having more than one reverse mapping and
> > add them to
> > +   the xfarray.
> > +   Any records owned by ``XFS_RMAP_OWN_COW`` are also added to the
> > xfarray.
> Any records owned by ``XFS_RMAP_OWN_COW`` are also added to the xfarray
> even if they only have one mapping
> 
> ?
> 
> You haven't mentioned any owners being disallowed, you've only stated
> that you're collecting records with more than one rmap, so that would
> be the inferred meaning.  
> 
> Also I think you also need to mention why.  The documentation is
> starting to read a little more like pseudo code, but if it's not
> explaining why it's doing things, we may as well just go to the code

"Any records owned by ``XFS_RMAP_OWN_COW`` are also added to the xfarray
because these are extents allocated to stage a copy on write operation
and are tracked in the refcount btree."

> > +   Use any records owned by ``XFS_RMAP_OWN_REFC`` to create a bitmap
> > of old
> > +   refcount btree blocks.
> > +
> > +2. Sort the records in physical extent order, putting the CoW
> > staging extents
> > +   at the end of the xfarray.
> Why?

"This matches the sorting order of records in the refcount btree."

> > +
> > +3. Use the ``xfs_btree_bload_compute_geometry`` function to compute
> > the number
> > +   of blocks needed for the new tree.
> > +
> > +4. Allocate the number of blocks computed in the previous step.
> > +
> > +5. Use ``xfs_btree_bload`` to write the xfarray records to btree
> > blocks and
> > +   generate the internal node blocks.
> > +
> > +6. Commit the location of new btree root block to the AGF.
> > +
> > +7. Reap the old btree blocks using the bitmap created in step 1.
> > +
> > +Details are as follows; the same algorithm is used by ``xfs_repair``
> > to
> > +generate refcount information from reverse mapping records.
> > +
> > +Reverse mapping records are used to rebuild the reference count
> > information.
> > +Reference counts are required for correct operation of copy on write
> > for shared
> > +file data.
> > +Imagine the reverse mapping entries as rectangles representing
> > extents of
> > +physical blocks, and that the rectangles can be laid down to allow
> > them to
> > +overlap each other.
> > +From the diagram below, it is apparent that a reference count record
> > must start
> > +or end wherever the height of the stack changes.
> > +In other words, the record emission stimulus is level-triggered::
> > +
> > +                        █    ███
> > +              ██      █████ ████   ███        ██████
> > +        ██   ████     ███████████ ████     █████████
> > +        ████████████████████████████████ ███████████
> > +        ^ ^  ^^ ^^    ^ ^^ ^^^  ^^^^  ^ ^^ ^  ^     ^
> > +        2 1  23 21    3 43 234  2123  1 01 2  3     0
> > +
> > +The ondisk reference count btree does not store the refcount == 0
> > cases because
> > +the free space btree already records which blocks are free.
> > +Extents being used to stage copy-on-write operations should be the
> > only records
> > +with refcount == 1.
> So here you explain it... I think maybe the pseudo code would read
> easier if you put it after the high level explanations of what we're
> doing

Good point, I'll flip these two.

> > +Single-owner file blocks aren't recorded in either the free space or
> > the
> > +reference count btrees.
> > +
> > +Given the reverse mapping btree which orders records by physical
> > block number,
> > +a starting physical block (``sp``), a bag-like data structure to
> > hold mappings
> > +that cover ``sp``, and the next physical block where the level
> > changes
> > +(``np``), reference count information is constructed from reverse
> > mapping data
> > +as follows:
> > +
> > +While there are still unprocessed mappings in the reverse mapping
> > btree:
> > +
> > +1. Set ``sp`` to the physical block of the next unprocessed reverse
> > mapping
> > +   record.
> > +
> > +2. Add to the bag all the reverse mappings where ``rm_startblock``
> > == ``sp``.
> Hmm, if this were code, I could tag the rm_startblock symbol, but that
> doesnt work for a document.  While I could go look at the code to
> answer this, you want your document to explain the code, not the other
> way around... further commentary below...
> 
> > +
> > +3. Set ``np`` to the physical block where the bag size will change.
> > +   This is the minimum of (``rm_startblock`` of the next unprocessed
> > mapping)
> > +   and (``rm_startblock`` + ``rm_blockcount`` of each mapping in the
> > bag).
> > +
> > +4. Record the bag size as ``old_bag_size``.
> > +
> > +5. While the bag isn't empty,
> > +
> > +   a. Remove from the bag all mappings where ``rm_startblock`` +
> > +      ``rm_blockcount`` == ``np``.
> > +
> > +   b. Add to the bag all reverse mappings where ``rm_startblock`` ==
> > ``np``.
> > +
> > +   c. If the bag size isn't ``old_bag_size``, store the refcount
> > record
> > +      ``(sp, np - sp, old_bag_size)`` in the refcount xfarray.
> > +
> > +   d. If the bag is empty, break out of this inner loop.
> > +
> > +   e. Set ``old_bag_size`` to ``bag_size``.
> > +
> > +   f. Set ``sp`` = ``np``.
> > +
> > +   g. Set ``np`` to the physical block where the bag size will
> > change.
> > +      Go to step 3 above.
> I don't think verbalizing literal lines of code is any more explanatory
> than the code.  I think it's easier to just give the high level
> description and then just go look at it.

Agreed.... (see below)

> I notice you have the exact same verbiage in the code, you could just
> link it:
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=repair-ag-btrees&id=771fa17dd5fd7d3d125c61232c4390e8f7ac0fb0#:~:text=*%20While%20there%20are%20still%20unprocessed%20rmaps%20in%20the%20array,and%20(startblock%20%2B%20len%20of%20each%20rmap%20in%20the%20bag)
> .

Eventually (aka once we merge this in the kernel) I intend to replace
*all* of these patchset links and whatnot with references to the actual
source code in the git repo.   I can't make those links at this time
because the design document is first in line ahead of the actual code.

> 
> Also that may cut down on future maintenance if this ever changes since
> people might not think to update the document along with the code
> 
> 
> Hmm, just thinking outside the box, what do you think of this method of
> presentation:
>  
>   - Iterate over btree records                                          tinyurl.com/4mp3j3pw
>      - Find the corresponding reverse mapping                           tinyurl.com/27n7h5fa
>      - Collect all shared mappings with the same starting block         tinyurl.com/mwdfy52b
>      - Advance to the next block with a ref count change                tinyurl.com/28689ufz
>        This position will either be the next unprocessed rmap, or the
>        combined length of all the collected mappings, whichever is smaller
>      - Iterate over the collected mappings                              tinyurl.com/ye673rwa
>         - Remove all mappings that start after this position            tinyurl.com/22yp7p6u
>         - Re-collect all mappings that start on this position           tinyurl.com/2p8vytmv
>         - If the size of the collection increased, update the ref count tinyurl.com/ecu7tud7
>         - If more mappings were found, advance to the next block with   tinyurl.com/47p4dfac
>           a ref count change.  Continue until no more mappings are found
> 
> It pulls the pseudo code up to a little higher level, plus the quick
> links to jump deeper if needed and then people have all the navigation
> utilities they are used to.  I just found a quick url shortener, so I'm
> not really sure how long they keep those, but maybe we can find an
> appropriate shortener

I really like your version!  Can I tweak it a bit?

- Until the reverse mapping btree runs out of records:

  - Retrieve the next record from the btree and put it in a bag.

  - Collect all records with the same starting block from the btree and
    put them in the bag.

  - While the bag isn't empty:

    - Among the mappings in the bag, compute the lowest block number
      where the reference count changes.
      This position will be either the starting block number of the next
      unprocessed reverse mapping or the next block after the shortest
      mapping in the bag.

    - Remove all mappings from the bag that end at this position.

    - Collect all reverse mappings that start at this position from the
      btree and put them in the bag.

    - If the size of the bag changed and is greater than one, create a
      new refcount record associating the block number range that we
      just walked to the size of the bag.


> > +
> > +The bag-like structure in this case is a type 2 xfarray as discussed
> > in the
> > +:ref:`xfarray access patterns<xfarray_access_patterns>` section.
> > +Reverse mappings are added to the bag using
> > ``xfarray_store_anywhere`` and
> > +removed via ``xfarray_unset``.
> > +Bag members are examined through ``xfarray_iter`` loops.
> > +
> > +The proposed patchset is the
> > +`AG btree repair
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > log/?h=repair-ag-btrees>`_
> > +series.
> > +
> > +Case Study: Rebuilding File Fork Mapping Indices
> > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > +
> > +The high level process to rebuild a data/attr fork mapping btree is:
> > +
> > +1. Walk the reverse mapping records to generate ``struct
> > xfs_bmbt_rec``
> > +   records from the reverse mapping records for that inode and fork.
> > +   Append these records to an xfarray.
> > +   Compute the bitmap of the old bmap btree blocks from the
> > ``BMBT_BLOCK``
> > +   records.
> > +
> > +2. Use the ``xfs_btree_bload_compute_geometry`` function to compute
> > the number
> > +   of blocks needed for the new tree.
> > +
> > +3. Sort the records in file offset order.
> > +
> > +4. If the extent records would fit in the inode fork immediate area,
> > commit the
> > +   records to that immediate area and skip to step 8.
> > +
> > +5. Allocate the number of blocks computed in the previous step.
> > +
> > +6. Use ``xfs_btree_bload`` to write the xfarray records to btree
> > blocks and
> > +   generate the internal node blocks.
> > +
> > +7. Commit the new btree root block to the inode fork immediate area.
> > +
> > +8. Reap the old btree blocks using the bitmap created in step 1.
> This description is not bad, but I had a hard time finding something
> that resembled the description in the link below.  Maybe its in a
> different branch?

Oops, sorry, that url should be:
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-file-mappings

> > +
> > +There are some complications here:
> > +First, it's possible to move the fork offset to adjust the sizes of
> > the
> > +immediate areas if the data and attr forks are not both in BMBT
> > format.
> > +Second, if there are sufficiently few fork mappings, it may be
> > possible to use
> > +EXTENTS format instead of BMBT, which may require a conversion.
> > +Third, the incore extent map must be reloaded carefully to avoid
> > disturbing
> > +any delayed allocation extents.
> > +
> > +The proposed patchset is the
> > +`file repair
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > log/?h=repair-inodes>`_
> > +series.
> So I'm assuming links to kernel.org are acceptable as it looks like you
> use them here, but it does imply that they need to sort of live
> forever, or at least as long as any document that uses them?

After all this gets merged I'll replace them with links to
fs/xfs/scrub/bmap_repair.c.

> > +
> > +.. _reaping:
> > +
> > +Reaping Old Metadata Blocks
> > +---------------------------
> > +
> > +Whenever online fsck builds a new data structure to replace one that
> > is
> > +suspect, there is a question of how to find and dispose of the
> > blocks that
> > +belonged to the old structure.
> > +The laziest method of course is not to deal with them at all, but
> > this slowly
> > +leads to service degradations as space leaks out of the filesystem.
> > +Hopefully, someone will schedule a rebuild of the free space
> > information to
> > +plug all those leaks.
> > +Offline repair rebuilds all space metadata after recording the usage
> > of
> > +the files and directories that it decides not to clear, hence it can
> > build new
> > +structures in the discovered free space and avoid the question of
> > reaping.
> > +
> > +As part of a repair, online fsck relies heavily on the reverse
> > mapping records
> > +to find space that is owned by the corresponding rmap owner yet
> > truly free.
> > +Cross referencing rmap records with other rmap records is necessary
> > because
> > +there may be other data structures that also think they own some of
> > those
> > +blocks (e.g. crosslinked trees).
> > +Permitting the block allocator to hand them out again will not push
> > the system
> > +towards consistency.
> > +
> > +For space metadata, the process of finding extents to dispose of
> > generally
> > +follows this format:
> > +
> > +1. Create a bitmap of space used by data structures that must be
> > preserved.
> > +   The space reservations used to create the new metadata can be
> > used here if
> > +   the same rmap owner code is used to denote all of the objects
> > being rebuilt.
> > +
> > +2. Survey the reverse mapping data to create a bitmap of space owned
> > by the
> > +   same ``XFS_RMAP_OWN_*`` number for the metadata that is being
> > preserved.
> > +
> > +3. Use the bitmap disunion operator to subtract (1) from (2).
> > +   The remaining set bits represent candidate extents that could be
> > freed.
> > +   The process moves on to step 4 below.
> > +
> > +Repairs for file-based metadata such as extended attributes,
> > directories,
> > +symbolic links, quota files and realtime bitmaps are performed by
> > building a
> > +new structure attached to a temporary file and swapping the forks.
> > +Afterward, the mappings in the old file fork are the candidate
> > blocks for
> > +disposal.
> > +
> > +The process for disposing of old extents is as follows:
> > +
> > +4. For each candidate extent, count the number of reverse mapping
> > records for
> > +   the first block in that extent that do not have the same rmap
> > owner for the
> > +   data structure being repaired.
> > +
> > +   - If zero, the block has a single owner and can be freed.
> > +
> > +   - If not, the block is part of a crosslinked structure and must
> > not be
> > +     freed.
> > +
> > +5. Starting with the next block in the extent, figure out how many
> > more blocks
> > +   have the same zero/nonzero other owner status as that first
> > block.
> > +
> > +6. If the region is crosslinked, delete the reverse mapping entry
> > for the
> > +   structure being repaired and move on to the next region.
> > +
> > +7. If the region is to be freed, mark any corresponding buffers in
> > the buffer
> > +   cache as stale to prevent log writeback.
> > +
> > +8. Free the region and move on.
> I think this part is as straightforward as it can be.  I like links,
> but they do have maintenance issues if the branch ever goes away.  It
> may be worth it though just while the code is going through review, I
> think it really helps to be able to just jump right into the code it's
> trying to describe rather than trying to track down based on the
> description.  
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/tree/fs/xfs/scrub/reap.c?h=repair-ag-btrees&id=d866f0e470b077806c994f4434bbe64e4a3a8662#n471:~:text=xrep_reap_ag_metadata(
> 
> I think that's the right one?  Tiny links are nice for when steps are
> buried in sub functions too

Maybe?  That didn't actually move to line 471 or highlight anything.

> > +
> > +However, there is one complication to this procedure.
> > +Transactions are of finite size, so the reaping process must be
> > careful to roll
> > +the transactions to avoid overruns.
> > +Overruns come from two sources:
> > +
> > +a. EFIs logged on behalf of space that is no longer occupied
> > +
> > +b. Log items for buffer invalidations
> > +
> > +This is also a window in which a crash during the reaping process
> > can leak
> > +blocks.
> > +As stated earlier, online repair functions use very large
> > transactions to
> > +minimize the chances of this occurring.
> > +
> > +The proposed patchset is the
> > +`preparation for bulk loading btrees
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > log/?h=repair-prep-for-bulk-loading>`_
> > +series.
> > +
> > +Case Study: Reaping After a Regular Btree Repair
> > +````````````````````````````````````````````````
> > +
> > +Old reference count and inode btrees are the easiest to reap because
> > they have
> > +rmap records with special owner codes: ``XFS_RMAP_OWN_REFC`` for the
> > refcount
> > +btree, and ``XFS_RMAP_OWN_INOBT`` for the inode and free inode
> > btrees.
> > +Creating a list of extents to reap the old btree blocks is quite
> > simple,
> > +conceptually:
> > +
> > +1. Lock the relevant AGI/AGF header buffers to prevent allocation
> > and frees.
> > +
> > +2. For each reverse mapping record with an rmap owner corresponding
> > to the
> > +   metadata structure being rebuilt, set the corresponding range in
> > a bitmap.
> > +
> > +3. Walk the current data structures that have the same rmap owner.
> > +   For each block visited, clear that range in the above bitmap.
> > +
> > +4. Each set bit in the bitmap represents a block that could be a
> > block from the
> > +   old data structures and hence is a candidate for reaping.
> > +   In other words, ``(rmap_records_owned_by &
> > ~blocks_reachable_by_walk)``
> > +   are the blocks that might be freeable.
> > +
> > +If it is possible to maintain the AGF lock throughout the repair
> > (which is the
> > +common case), then step 2 can be performed at the same time as the
> > reverse
> > +mapping record walk that creates the records for the new btree.
> > +
> > +Case Study: Rebuilding the Free Space Indices
> > +`````````````````````````````````````````````
> > +
> > +The high level process to rebuild the free space indices is:
> Looks like this one
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=repair-ag-btrees&id=bf5f10a91ca58d883ef1231a406fa0646c4c4e50#:~:text=%2B%20*/-,%2BSTATIC%20int,-%2Bxrep_abt_build_new_trees(
> 
> > +
> > +1. Walk the reverse mapping records to generate ``struct
> > xfs_alloc_rec_incore``
> > +   records from the gaps in the reverse mapping btree.
> > +
> > +2. Append the records to an xfarray.
> > +
> > +3. Use the ``xfs_btree_bload_compute_geometry`` function to compute
> > the number
> > +   of blocks needed for each new tree.
> > +
> > +4. Allocate the number of blocks computed in the previous step from
> > the free
> > +   space information collected.
> > +
> > +5. Use ``xfs_btree_bload`` to write the xfarray records to btree
> > blocks and
> > +   generate the internal node blocks for the free space by block
> > index.
> > +   Call it again for the free space by length index.
> nit: these two loads are flipped

Oops, fixed.
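
As an illustration of step 1 above, here's a tiny user-space sketch
that synthesizes free space records from the gaps between reverse
mappings.  A sorted array stands in for the rmap btree, the AG size and
all names are made up, and real repair code has to deal with AG
geometry and overlapping mappings more carefully than this does:

#include <stdio.h>

struct rmap { unsigned long long start, len; };

int main(void)
{
	/* reverse mappings sorted by starting block */
	struct rmap rmaps[] = { { 4, 8 }, { 10, 6 }, { 20, 4 } };
	unsigned long long ag_blocks = 64;	/* illustrative AG size */
	unsigned long long next_free = 0;	/* first block not known used */

	for (unsigned int i = 0; i < sizeof(rmaps) / sizeof(rmaps[0]); i++) {
		/* a gap before this mapping is a free extent */
		if (rmaps[i].start > next_free)
			printf("free: start %llu len %llu\n",
			       next_free, rmaps[i].start - next_free);
		if (rmaps[i].start + rmaps[i].len > next_free)
			next_free = rmaps[i].start + rmaps[i].len;
	}
	/* trailing gap at the end of the AG */
	if (ag_blocks > next_free)
		printf("free: start %llu len %llu\n",
		       next_free, ag_blocks - next_free);
	return 0;
}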

> > +
> > +6. Commit the locations of the new btree root blocks to the AGF.
> > +
> > +7. Reap the old btree blocks by looking for space that is not
> > recorded by the
> > +   reverse mapping btree, the new free space btrees, or the AGFL.
> > +
> > +Repairing the free space btrees has three key complications over a
> > regular
> > +btree repair:
> > +
> > +First, free space is not explicitly tracked in the reverse mapping
> > records.
> > +Hence, the new free space records must be inferred from gaps in the
> > physical
> > +space component of the keyspace of the reverse mapping btree.
> > +
> > +Second, free space repairs cannot use the common btree reservation
> > code because
> > +new blocks are reserved out of the free space btrees.
> > +This is impossible when repairing the free space btrees themselves.
> > +However, repair holds the AGF buffer lock for the duration of the
> > free space
> > +index reconstruction, so it can use the collected free space
> > information to
> > +supply the blocks for the new free space btrees.
> > +It is not necessary to back each reserved extent with an EFI because
> > the new
> > +free space btrees are constructed in what the ondisk filesystem
> > thinks is
> > +unowned space.
> > +However, if reserving blocks for the new btrees from the collected
> > free space
> > +information changes the number of free space records, repair must
> > re-estimate
> > +the new free space btree geometry with the new record count until
> > the
> > +reservation is sufficient.
> > +As part of committing the new btrees, repair must ensure that
> > reverse mappings
> > +are created for the reserved blocks and that unused reserved blocks
> > are
> > +inserted into the free space btrees.
> > +Deferred rmap and freeing operations are used to ensure that this
> > transition
> > +is atomic, similar to the other btree repair functions.
> > +
> > +Third, finding the blocks to reap after the repair is not overly
> > +straightforward.
> > +Blocks for the free space btrees and the reverse mapping btrees are
> > supplied by
> > +the AGFL.
> > +Blocks put onto the AGFL have reverse mapping records with the owner
> > +``XFS_RMAP_OWN_AG``.
> > +This ownership is retained when blocks move from the AGFL into the
> > free space
> > +btrees or the reverse mapping btrees.
> > +When repair walks reverse mapping records to synthesize free space
> > records, it
> > +creates a bitmap (``ag_owner_bitmap``) of all the space claimed by
> > +``XFS_RMAP_OWN_AG`` records.
> > +The repair context maintains a second bitmap corresponding to the
> > rmap btree
> > +blocks and the AGFL blocks (``rmap_agfl_bitmap``).
> > +When the walk is complete, the bitmap disunion operation
> > ``(ag_owner_bitmap &
> > +~rmap_agfl_bitmap)`` computes the extents that are used by the old
> > free space
> > +btrees.
> > +These blocks can then be reaped using the methods outlined above.
> > +
> > +The proposed patchset is the
> > +`AG btree repair
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > log/?h=repair-ag-btrees>`_
> > +series.
> I think we've repeated this link a couple times in the doc.  If you like
> highlight links, we could clean out the duplicates
> 
> > +
> > +.. _rmap_reap:
> > +
> > +Case Study: Reaping After Repairing Reverse Mapping Btrees
> > +``````````````````````````````````````````````````````````
> > +
> > +Old reverse mapping btrees are less difficult to reap after a
> > repair.
> > +As mentioned in the previous section, blocks on the AGFL, the two
> > free space
> > +btree blocks, and the reverse mapping btree blocks all have reverse
> > mapping
> > +records with ``XFS_RMAP_OWN_AG`` as the owner.
> > +The full process of gathering reverse mapping records and building a
> > new btree
> > +are described in the case study of
> > +:ref:`live rebuilds of rmap data <rmap_repair>`, but a crucial point
> > from that
> > +discussion is that the new rmap btree will not contain any records
> > for the old
> > +rmap btree, nor will the old btree blocks be tracked in the free
> > space btrees.
> > +The list of candidate reaping blocks is computed by setting the bits
> > +corresponding to the gaps in the new rmap btree records, and then
> > clearing the
> > +bits corresponding to extents in the free space btrees and the
> > current AGFL
> > +blocks.
> > +The result ``(new_rmapbt_gaps & ~(agfl | bnobt_records))`` are
> > reaped using the
> > +methods outlined above.
> > +
> > +The rest of the process of rebuilding the reverse mapping btree is
> > discussed
> > +in a separate :ref:`case study<rmap_repair>`.
> > +
> > +The proposed patchset is the
> > +`AG btree repair
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > log/?h=repair-ag-btrees>`_
> > +series.
> > +
> > +Case Study: Rebuilding the AGFL
> > +```````````````````````````````
> > +
> > +The allocation group free block list (AGFL) is repaired as follows:
> > +
> > +1. Create a bitmap for all the space that the reverse mapping data
> > claims is
> > +   owned by ``XFS_RMAP_OWN_AG``.
> > +
> > +2. Subtract the space used by the two free space btrees and the rmap
> > btree.
> > +
> > +3. Subtract any space that the reverse mapping data claims is owned
> > by any
> > +   other owner, to avoid re-adding crosslinked blocks to the AGFL.
> > +
> > +4. Once the AGFL is full, reap any blocks leftover.
> > +
> > +5. The next operation to fix the freelist will right-size the list.
> > 
> Branch link?  Looks like maybe it's missing.  In fact this logic looks
> like it might have been cut off?

OH, heh.  I forgot that we already merged the AGFL repair code.

"See `fs/xfs/scrub/agheader_repair.c
<https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/xfs/scrub/agheader_repair.c>`_
for more details."

> In any case, maybe give some thought to the highlight link suggestions.

Er... how do those work?  In principle I like them, but none of your
links actually highlighted anything here.  Could you send the link over
IRC so that urldefense crapola won't destroy it, please?

--D

> Allison
> 

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 08/14] xfs: document btree bulk loading
  2023-02-10  0:24       ` Darrick J. Wong
@ 2023-02-16 15:46         ` Allison Henderson
  2023-02-16 21:08           ` Darrick J. Wong
  0 siblings, 1 reply; 220+ messages in thread
From: Allison Henderson @ 2023-02-16 15:46 UTC (permalink / raw)
  To: djwong
  Cc: Catherine Hoang, david, willy, linux-xfs, Chandan Babu,
	linux-fsdevel, hch

On Thu, 2023-02-09 at 16:24 -0800, Darrick J. Wong wrote:
> On Thu, Feb 09, 2023 at 05:47:17AM +0000, Allison Henderson wrote:
> > On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > Add a discussion of the btree bulk loading code, which makes it
> > > easy
> > > to
> > > take an in-memory recordset and write it out to disk in an
> > > efficient
> > > manner.  This also enables atomic switchover from the old to the
> > > new
> > > structure with minimal potential for leaking the old blocks.
> > > 
> > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > ---
> > >  .../filesystems/xfs-online-fsck-design.rst         |  632
> > > ++++++++++++++++++++
> > >  1 file changed, 632 insertions(+)
> > > 
> > > 
> > > diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst
> > > b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > index 9d7a2ef1d0dd..eb61d867e55c 100644
> > > --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> > > +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > @@ -2236,3 +2236,635 @@ this functionality as follows:
> > >  
> > >  After removing xfile logged buffers from the transaction in this
> > > manner, the
> > >  transaction can be committed or cancelled.
> > > +
> > > +Bulk Loading of Ondisk B+Trees
> > > +------------------------------
> > > +
> > > +As mentioned previously, early iterations of online repair built
> > > new
> > > btree
> > > +structures by creating a new btree and adding observations
> > > individually.
> > > +Loading a btree one record at a time had a slight advantage of
> > > not
> > > requiring
> > > +the incore records to be sorted prior to commit, but was very
> > > slow
> > > and leaked
> > > +blocks if the system went down during a repair.
> > > +Loading records one at a time also meant that repair could not
> > > control the
> > > +loading factor of the blocks in the new btree.
> > > +
> > > +Fortunately, the venerable ``xfs_repair`` tool had a more
> > > efficient
> > > means for
> > > +rebuilding a btree index from a collection of records -- bulk
> > > btree
> > > loading.
> > > +This was implemented rather inefficiently code-wise, since
> > > ``xfs_repair``
> > > +had separate copy-pasted implementations for each btree type.
> > > +
> > > +To prepare for online fsck, each of the four bulk loaders were
> > > studied, notes
> > > +were taken, and the four were refactored into a single generic
> > > btree
> > > bulk
> > > +loading mechanism.
> > > +Those notes in turn have been refreshed and are presented below.
> > > +
> > > +Geometry Computation
> > > +````````````````````
> > > +
> > > +The zeroth step of bulk loading is to assemble the entire record
> > > set
> > > that will
> > > +be stored in the new btree, and sort the records.
> > > +Next, call ``xfs_btree_bload_compute_geometry`` to compute the
> > > shape
> > > of the
> > > +btree from the record set, the type of btree, and any load
> > > factor
> > > preferences.
> > > +This information is required for resource reservation.
> > > +
> > > +First, the geometry computation computes the minimum and maximum
> > > records that
> > > +will fit in a leaf block from the size of a btree block and the
> > > size
> > > of the
> > > +block header.
> > > +Roughly speaking, the maximum number of records is::
> > > +
> > > +        maxrecs = (block_size - header_size) / record_size
> > > +
> > > +The XFS design specifies that btree blocks should be merged when
> > > possible,
> > > +which means the minimum number of records is half of maxrecs::
> > > +
> > > +        minrecs = maxrecs / 2
> > > +
> > > +The next variable to determine is the desired loading factor.
> > > +This must be at least minrecs and no more than maxrecs.
> > > +Choosing minrecs is undesirable because it wastes half the
> > > block.
> > > +Choosing maxrecs is also undesirable because adding a single
> > > record
> > > to each
> > > +newly rebuilt leaf block will cause a tree split, which causes a
> > > noticeable
> > > +drop in performance immediately afterwards.
> > > +The default loading factor was chosen to be 75% of maxrecs,
> > > which
> > > provides a
> > > +reasonably compact structure without any immediate split
> > > penalties.
> >         default_lload_factor = (maxrecs + minrecs) / 2;
> > > +If space is tight, the loading factor will be set to maxrecs to
> > > try
> > > to avoid
> > > +running out of space::
> > > +
> > > +        leaf_load_factor = enough space ? (maxrecs + minrecs) /
> > > 2 :
> > > maxrecs
> >         leaf_load_factor = enough space ? default_lload_factor :
> > maxrecs;
> > 
> > Just more readable i think
> 
> Ok, changed.
> 
> > 
> > > +
> > > +Load factor is computed for btree node blocks using the combined
> > > size of the
> > > +btree key and pointer as the record size::
> > > +
> > > +        maxrecs = (block_size - header_size) / (key_size +
> > > ptr_size)
> > > +        minrecs = maxrecs / 2
> >         default_nload_factor = (maxrecs + minrecs) / 2;
> > 
> > > +        node_load_factor = enough space ? (maxrecs + minrecs) /
> > > 2 :
> > > maxrecs
> >         node_load_factor = enough space ? default_nload_factor :
> > maxrecs;
> 
> Here too.
> 
> > > +
> > > +Once that's done, the number of leaf blocks required to store
> > > the
> > > record set
> > > +can be computed as::
> > > +
> > > +        leaf_blocks = ceil(record_count / leaf_load_factor)
> > > +
> > > +The number of node blocks needed to point to the next level down
> > > in
> > > the tree
> > > +is computed as::
> > > +
> > > +        n_blocks = (n == 0 ? leaf_blocks : node_blocks[n])
> > > +        node_blocks[n + 1] = ceil(n_blocks / node_load_factor)
> > > +
> > > +The entire computation is performed recursively until the
> > > current
> > > level only
> > > +needs one block.
> > > +The resulting geometry is as follows:
> > > +
> > > +- For AG-rooted btrees, this level is the root level, so the
> > > height
> > > of the new
> > > +  tree is ``level + 1`` and the space needed is the summation of
> > > the
> > > number of
> > > +  blocks on each level.
> > > +
> > > +- For inode-rooted btrees where the records in the top level do
> > > not
> > > fit in the
> > > +  inode fork area, the height is ``level + 2``, the space needed
> > > is
> > > the
> > > +  summation of the number of blocks on each level, and the inode
> > > fork points to
> > > +  the root block.
> > > +
> > > +- For inode-rooted btrees where the records in the top level can
> > > be
> > > stored in
> > > +  the inode fork area, then the root block can be stored in the
> > > inode, the
> > > +  height is ``level + 1``, and the space needed is one less than
> > > the
> > > summation
> > > +  of the number of blocks on each level.
> > > +  This only becomes relevant when non-bmap btrees gain the
> > > ability
> > > to root in
> > > +  an inode, which is a future patchset and only included here
> > > for
> > > completeness.
> > > +
> > > +.. _newbt:
> > > +
> > > +Reserving New B+Tree Blocks
> > > +```````````````````````````
> > > +
> > > +Once repair knows the number of blocks needed for the new btree,
> > > it
> > > allocates
> > > +those blocks using the free space information.
> > > +Each reserved extent is tracked separately by the btree builder
> > > state data.
> > > +To improve crash resilience, the reservation code also logs an
> > > Extent Freeing
> > > +Intent (EFI) item in the same transaction as each space
> > > allocation
> > > and attaches
> > > +its in-memory ``struct xfs_extent_free_item`` object to the
> > > space
> > > reservation.
> > > +If the system goes down, log recovery will use the unfinished
> > > EFIs
> > > to free the
> > > +unused space, leaving the filesystem unchanged.
> > > +
> > > +Each time the btree builder claims a block for the btree from a
> > > reserved
> > > +extent, it updates the in-memory reservation to reflect the
> > > claimed
> > > space.
> > > +Block reservation tries to allocate as much contiguous space as
> > > possible to
> > > +reduce the number of EFIs in play.
> > > +
> > > +While repair is writing these new btree blocks, the EFIs created
> > > for
> > > the space
> > > +reservations pin the tail of the ondisk log.
> > > +It's possible that other parts of the system will remain busy
> > > and
> > > push the head
> > > +of the log towards the pinned tail.
> > > +To avoid livelocking the filesystem, the EFIs must not pin the
> > > tail
> > > of the log
> > > +for too long.
> > > +To alleviate this problem, the dynamic relogging capability of
> > > the
> > > deferred ops
> > > +mechanism is reused here to commit a transaction at the log head
> > > containing an
> > > +EFD for the old EFI and new EFI at the head.
> > > +This enables the log to release the old EFI to keep the log
> > > moving
> > > forwards.
> > > +
> > > +EFIs have a role to play during the commit and reaping phases;
> > > please see the
> > > +next section and the section about :ref:`reaping<reaping>` for
> > > more
> > > details.
> > > +
> > > +Proposed patchsets are the
> > > +`bitmap rework
> > > +<
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > > log/?h=repair-bitmap-rework>`_
> > > +and the
> > > +`preparation for bulk loading btrees
> > > +<
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > > log/?h=repair-prep-for-bulk-loading>`_.
> > > +
> > > +
> > > +Writing the New Tree
> > > +````````````````````
> > > +
> > > +This part is pretty simple -- the btree builder
> > > (``xfs_btree_bulkload``) claims
> > > +a block from the reserved list, writes the new btree block
> > > header,
> > > fills the
> > > +rest of the block with records, and adds the new leaf block to a
> > > list of
> > > +written blocks.
> > > +Sibling pointers are set every time a new block is added to the
> > > level.
> > > +When it finishes writing the record leaf blocks, it moves on to
> > > the
> > > node
> > > +blocks.
> > > +To fill a node block, it walks each block in the next level down
> > > in
> > > the tree
> > > +to compute the relevant keys and write them into the parent
> > > node.
> > > +When it reaches the root level, it is ready to commit the new
> > > btree!
> > I think most of this is as straightforward as it can be, but it's
> > a
> > lot of visualizing too, which makes me wonder if it would benefit from
> > a
> > simple illustration if possible.
> > 
> > On a side note: In a prior team I discovered power points, while a
> > lot of
> > work, were also really effective for quickly moving a crowd of
> > people
> > through connected graph navigation/manipulations.  Because each one
> > of
> > these steps was another slide that illustrated how the structure
> > evolved through the updates.  I realize that's not something that
> > fits
> > in the scheme of a document like this, but maybe something
> > supplemental
> > to add later.  While it was a time eater, I noticed a lot of
> > confused
> > expressions just seemed to shake loose, so sometimes it was worth
> > it.
> 
> That was ... surprisingly less bad than I feared it would be to cut
> and
> paste unicode linedraw characters and arrows.
> 
>           ┌─────────┐
>           │root     │
>           │PP       │
>           └─────────┘
>           ↙         ↘
>       ┌────┐       ┌────┐
>       │node│──────→│node│
>       │PP  │←──────│PP  │
>       └────┘       └────┘
>       ↙   ↘         ↙   ↘
>   ┌────┐ ┌────┐ ┌────┐ ┌────┐
>   │leaf│→│leaf│→│leaf│→│leaf│
>   │RRR │←│RRR │←│RRR │←│RRR │
>   └────┘ └────┘ └────┘ └────┘
> 
> (Does someone have a program that does this?)
I think Catherine mentioned she had used PlantUML for the larp diagram,
though for something this simple I think this is fine
> 
> > 
> > > +
> > > +The first step to commit the new btree is to persist the btree
> > > blocks to disk
> > > +synchronously.
> > > +This is a little complicated because a new btree block could
> > > have
> > > been freed
> > > +in the recent past, so the builder must use
> > > ``xfs_buf_delwri_queue_here`` to
> > > +remove the (stale) buffer from the AIL list before it can write
> > > the
> > > new blocks
> > > +to disk.
> > > +Blocks are queued for IO using a delwri list and written in one
> > > large batch
> > > +with ``xfs_buf_delwri_submit``.
> > > +
> > > +Once the new blocks have been persisted to disk, control returns
> > > to
> > > the
> > > +individual repair function that called the bulk loader.
> > > +The repair function must log the location of the new root in a
> > > transaction,
> > > +clean up the space reservations that were made for the new
> > > btree,
> > > and reap the
> > > +old metadata blocks:
> > > +
> > > +1. Commit the location of the new btree root.
> > > +
> > > +2. For each incore reservation:
> > > +
> > > +   a. Log Extent Freeing Done (EFD) items for all the space that
> > > was
> > > consumed
> > > +      by the btree builder.  The new EFDs must point to the EFIs
> > > attached to
> > > +      the reservation to prevent log recovery from freeing the
> > > new
> > > blocks.
> > > +
> > > +   b. For unclaimed portions of incore reservations, create a
> > > regular deferred
> > > +      extent free work item to free the unused space later in
> > > the
> > > +      transaction chain.
> > > +
> > > +   c. The EFDs and EFIs logged in steps 2a and 2b must not
> > > overrun
> > > the
> > > +      reservation of the committing transaction.
> > > +      If the btree loading code suspects this might be about to
> > > happen, it must
> > > +      call ``xrep_defer_finish`` to clear out the deferred work
> > > and
> > > obtain a
> > > +      fresh transaction.
> > > +
> > > +3. Clear out the deferred work a second time to finish the
> > > commit
> > > and clean
> > > +   the repair transaction.
> > > +
> > > +The transaction rolling in steps 2c and 3 represents a weakness
> > > in
> > > the repair
> > > +algorithm, because a log flush and a crash before the end of the
> > > reap step can
> > > +result in space leaking.
> > > +Online repair functions minimize the chances of this occurring by
> > > using very
> > > +large transactions, each of which can accommodate many thousands of
> > > block freeing
> > > +instructions.
> > > +Repair moves on to reaping the old blocks, which will be
> > > presented
> > > in a
> > > +subsequent :ref:`section<reaping>` after a few case studies of
> > > bulk
> > > loading.
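
The commit sequence in steps 2 and 3 above may be easier to follow as a
sketch.  ``struct xfs_scrub`` and ``xrep_defer_finish`` are real names from
the patchset; every other identifier here is an invented stand-in:

    static int
    newbt_commit(struct xfs_scrub *sc, struct list_head *resv_list)
    {
        struct newbt_resv       *resv;
        int                     error;

        list_for_each_entry(resv, resv_list, list) {
            /* 2a: log EFDs for the blocks that the builder consumed. */
            log_efds_for_used_blocks(sc->tp, resv);

            /* 2b: defer freeing of whatever was reserved but not used. */
            if (resv->used < resv->len)
                defer_free_unused_blocks(sc->tp, resv);

            /* 2c: roll to a fresh transaction before overrunning this one. */
            if (transaction_nearly_full(sc->tp)) {
                error = xrep_defer_finish(sc);
                if (error)
                    return error;
            }
        }

        /* 3: finish the remaining deferred work and clean the transaction. */
        return xrep_defer_finish(sc);
    }
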
> > > +
> > > +Case Study: Rebuilding the Inode Index
> > > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > +
> > > +The high level process to rebuild the inode index btree is:
> > > +
> > > +1. Walk the reverse mapping records to generate ``struct
> > > xfs_inobt_rec``
> > > +   records from the inode chunk information and a bitmap of the
> > > old
> > > inode btree
> > > +   blocks.
> > > +
> > > +2. Append the records to an xfarray in inode order.
> > > +
> > > +3. Use the ``xfs_btree_bload_compute_geometry`` function to
> > > compute
> > > the number
> > > +   of blocks needed for the inode btree.
> > > +   If the free space inode btree is enabled, call it again to
> > > estimate the
> > > +   geometry of the finobt.
> > > +
> > > +4. Allocate the number of blocks computed in the previous step.
> > > +
> > > +5. Use ``xfs_btree_bload`` to write the xfarray records to btree
> > > blocks and
> > > +   generate the internal node blocks.
> > > +   If the free space inode btree is enabled, call it again to
> > > load
> > > the finobt.
> > > +
> > > +6. Commit the location of the new btree root block(s) to the
> > > AGI.
> > > +
> > > +7. Reap the old btree blocks using the bitmap created in step 1.
> > > +
> > > +Details are as follows.
> > > +
> > > +The inode btree maps inumbers to the ondisk location of the
> > > associated
> > > +inode records, which means that the inode btrees can be rebuilt
> > > from
> > > the
> > > +reverse mapping information.
> > > +Reverse mapping records with an owner of ``XFS_RMAP_OWN_INOBT``
> > > mark the
> > > +location of the old inode btree blocks.
> > > +Each reverse mapping record with an owner of
> > > ``XFS_RMAP_OWN_INODES``
> > > marks the
> > > +location of at least one inode cluster buffer.
> > > +A cluster is the smallest number of ondisk inodes that can be
> > > allocated or
> > > +freed in a single transaction; it is never smaller than 1 fs
> > > block
> > > or 4 inodes.
> > > +
> > > +For the space represented by each inode cluster, ensure that
> > > there
> > > are no
> > > +records in the free space btrees nor any records in the
> > > reference
> > > count btree.
> > > +If there are, the space metadata inconsistencies are reason
> > > enough
> > > to abort the
> > > +operation.
> > > +Otherwise, read each cluster buffer to check that its contents
> > > appear to be
> > > +ondisk inodes and to decide if the file is allocated
> > > +(``xfs_dinode.i_mode != 0``) or free (``xfs_dinode.i_mode ==
> > > 0``).
> > > +Accumulate the results of successive inode cluster buffer reads
> > > until there is
> > > +enough information to fill a single inode chunk record, which is
> > > 64
> > > consecutive
> > > +numbers in the inumber keyspace.
> > > +If the chunk is sparse, the chunk record may include holes.
> > > +
> > > +Once the repair function accumulates one chunk's worth of data,
> > > it
> > > calls
> > > +``xfarray_append`` to add the inode btree record to the xfarray.
> > > +This xfarray is walked twice during the btree creation step --
> > > once
> > > to populate
> > > +the inode btree with all inode chunk records, and a second time
> > > to
> > > populate the
> > > +free inode btree with records for chunks that have free non-
> > > sparse
> > > inodes.
> > > +The number of records for the inode btree is the number of
> > > xfarray
> > > records,
> > > +but the record count for the free inode btree has to be computed
> > > as
> > > inode chunk
> > > +records are stored in the xfarray.
> > > +
> > > +The proposed patchset is the
> > > +`AG btree repair
> > > +<
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > > log/?h=repair-ag-btrees>`_
> > > +series.
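
To make steps 3-5 of the inode btree rebuild a bit more concrete, here is
roughly what the two staging calls look like.  ``xfs_btree_bload_compute_geometry``
and ``xfs_btree_bload`` are the interfaces named above; the cursor setup, the
record-feeding callbacks, and the reservation step are glossed over, and the
exact signatures may differ from what finally ships:

    struct xfs_btree_bload  bload = { 0 };  /* record/claim callbacks not shown */
    uint64_t                nr_recs;        /* filled in from the xfarray built in step 2 */
    int                     error;

    /* Step 3: how many blocks will the new inode btree need? */
    error = xfs_btree_bload_compute_geometry(ino_cur, &bload, nr_recs);
    if (error)
        return error;

    /* Step 4: reserve bload.nr_blocks blocks, logging an EFI per extent. */

    /* Step 5: write the leaf and node blocks into the reserved space. */
    error = xfs_btree_bload(ino_cur, &bload, &repair_ctx);
    if (error)
        return error;

    /* Step 6: commit the staged root to the AGI. */
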
> > > +
> > > +Case Study: Rebuilding the Space Reference Counts
> > > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > +
> > > +The high level process to rebuild the reference count btree is:
> > > +
> > > +1. Walk the reverse mapping records to generate ``struct
> > > xfs_refcount_irec``
> > > +   records for any space having more than one reverse mapping
> > > and
> > > add them to
> > > +   the xfarray.
> > > +   Any records owned by ``XFS_RMAP_OWN_COW`` are also added to
> > > the
> > > xfarray.
> > Any records owned by ``XFS_RMAP_OWN_COW`` are also added to the
> > xfarray
> > even if they only have one mapping
> > 
> > ?
> > 
> > You haven't mentioned any owners being disallowed, you've only
> > stated
> > that you're collecting records with more than one rmap, so that
> > would
> > be the inferred meaning.  
> > 
> > Also I think you also need to mention why.  The documentation is
> > starting to read a little more like pseudo code, but if it's not
> > explaining why it's doing things, we may as well just go to the
> > code
> 
> "Any records owned by ``XFS_RMAP_OWN_COW`` are also added to the
> xfarray
> because these are extents allocated to stage a copy on write
> operation
> and are tracked in the refcount btree."
> 
> > > +   Use any records owned by ``XFS_RMAP_OWN_REFC`` to create a
> > > bitmap
> > > of old
> > > +   refcount btree blocks.
> > > +
> > > +2. Sort the records in physical extent order, putting the CoW
> > > staging extents
> > > +   at the end of the xfarray.
> > Why?
> 
> "This matches the sorting order of records in the refcount btree."
> 
> > > +
> > > +3. Use the ``xfs_btree_bload_compute_geometry`` function to
> > > compute
> > > the number
> > > +   of blocks needed for the new tree.
> > > +
> > > +4. Allocate the number of blocks computed in the previous step.
> > > +
> > > +5. Use ``xfs_btree_bload`` to write the xfarray records to btree
> > > blocks and
> > > +   generate the internal node blocks.
> > > +
> > > +6. Commit the location of new btree root block to the AGF.
> > > +
> > > +7. Reap the old btree blocks using the bitmap created in step 1.
> > > +
> > > +Details are as follows; the same algorithm is used by
> > > ``xfs_repair``
> > > to
> > > +generate refcount information from reverse mapping records.
> > > +
> > > +Reverse mapping records are used to rebuild the reference count
> > > information.
> > > +Reference counts are required for correct operation of copy on
> > > write
> > > for shared
> > > +file data.
> > > +Imagine the reverse mapping entries as rectangles representing
> > > extents of
> > > +physical blocks, and that the rectangles can be laid down to
> > > allow
> > > them to
> > > +overlap each other.
> > > +From the diagram below, it is apparent that a reference count
> > > record
> > > must start
> > > +or end wherever the height of the stack changes.
> > > +In other words, the record emission stimulus is level-
> > > triggered::
> > > +
> > > +                        █    ███
> > > +              ██      █████ ████   ███        ██████
> > > +        ██   ████     ███████████ ████     █████████
> > > +        ████████████████████████████████ ███████████
> > > +        ^ ^  ^^ ^^    ^ ^^ ^^^  ^^^^  ^ ^^ ^  ^     ^
> > > +        2 1  23 21    3 43 234  2123  1 01 2  3     0
> > > +
> > > +The ondisk reference count btree does not store the refcount ==
> > > 0
> > > cases because
> > > +the free space btree already records which blocks are free.
> > > +Extents being used to stage copy-on-write operations should be
> > > the
> > > only records
> > > +with refcount == 1.
> > So here you explain it... I think maybe the pseudo code would read
> > easier if you put it after the high level explanations of what
> > we're
> > doing
> 
> Good point, I'll flip these two.
> 
> > > +Single-owner file blocks aren't recorded in either the free
> > > space or
> > > the
> > > +reference count btrees.
> > > +
> > > +Given the reverse mapping btree which orders records by physical
> > > block number,
> > > +a starting physical block (``sp``), a bag-like data structure to
> > > hold mappings
> > > +that cover ``sp``, and the next physical block where the level
> > > changes
> > > +(``np``), reference count information is constructed from
> > > reverse
> > > mapping data
> > > +as follows:
> > > +
> > > +While there are still unprocessed mappings in the reverse
> > > mapping
> > > btree:
> > > +
> > > +1. Set ``sp`` to the physical block of the next unprocessed
> > > reverse
> > > mapping
> > > +   record.
> > > +
> > > +2. Add to the bag all the reverse mappings where
> > > ``rm_startblock``
> > > == ``sp``.
> > Hmm, if this were code, I could tag the rm_startblock symbol, but
> > that
> > doesn't work for a document.  While I could go look at the code to
> > answer this, you want your document to explain the code, not the
> > other
> > way around... further commentary below...
> > 
> > > +
> > > +3. Set ``np`` to the physical block where the bag size will
> > > change.
> > > +   This is the minimum of (``rm_startblock`` of the next
> > > unprocessed
> > > mapping)
> > > +   and (``rm_startblock`` + ``rm_blockcount`` of each mapping in
> > > the
> > > bag).
> > > +
> > > +4. Record the bag size as ``old_bag_size``.
> > > +
> > > +5. While the bag isn't empty,
> > > +
> > > +   a. Remove from the bag all mappings where ``rm_startblock`` +
> > > +      ``rm_blockcount`` == ``np``.
> > > +
> > > +   b. Add to the bag all reverse mappings where
> > > ``rm_startblock`` ==
> > > ``np``.
> > > +
> > > +   c. If the bag size isn't ``old_bag_size``, store the refcount
> > > record
> > > +      ``(sp, np - sp, old_bag_size)`` in the refcount xfarray.
> > > +
> > > +   d. If the bag is empty, break out of this inner loop.
> > > +
> > > +   e. Set ``old_bag_size`` to ``bag_size``.
> > > +
> > > +   f. Set ``sp`` = ``np``.
> > > +
> > > +   g. Set ``np`` to the physical block where the bag size will
> > > change.
> > > +      Go to step 3 above.
> > I don't think verbalizing literal lines of code is any more
> > explanatory
> > than the code.  I think it's easier to just give the high level
> > description and then just go look at it.
> 
> Agreed.... (see below)
> 
> > I notice you have the exact same verbiage in the code, you could
> > just
> > link it:
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=repair-ag-btrees&id=771fa17dd5fd7d3d125c61232c4390e8f7ac0fb0#:~:text=*%20While%20there%20are%20still%20unprocessed%20rmaps%20in%20the%20array,and%20(startblock%20%2B%20len%20of%20each%20rmap%20in%20the%20bag)
> > .
> 
> Eventually (aka once we merge this in the kernel) I intend to replace
> *all* of these patchset links and whatnot with references to the
> actual
> source code in the git repo.   I can't make those links at this time
> because the design document is first in line ahead of the actual
> code.
> 
> > 
> > Also that may cut down on future maintenance if this ever changes
> > since
> > people might not think to update the document along with the code
> > 
> > 
> > Hmm, just thinking outside the box, what do you think of this
> > method of
> > presentation:
> >  
> >   - Iterate over btree records                                   tinyurl.com/4mp3j3pw
> >      - Find the corresponding reverse mapping                    tinyurl.com/27n7h5fa
> >      - Collect all shared mappings with the same starting block  tinyurl.com/mwdfy52b
> >      - Advance to the next block with a ref count change         tinyurl.com/28689ufz
> > 
> >        This position will either be the next unprocessed rmap, or the
> >        combined length of all the collected mappings, whichever is smaller
> >      - Iterate over the collected mappings,                      tinyurl.com/ye673rwa
> >         - Remove all mappings that start after this position     tinyurl.com/22yp7p6u
> >         - Re-collect all mappings that start on this position    tinyurl.com/2p8vytmv
> >         - If the size of the collection increased, update the ref count   tinyurl.com/ecu7tud7
> >         - If more mappings were found, advance to the next block with a   tinyurl.com/47p4dfac
> >           ref count change.  Continue until no more mappings are found
> > 
> > It pulls the pseudo code up to a little higher level, plus the
> > quick
> > links to jump deeper if needed and then people have all the
> > navigation
> > utilities they are used to.  I just found a quick url shortener, so
> > I'm
> > not really sure how long they keep those, but maybe we can find an
> > appropriate shortener
> 
> I really like your version!  Can I tweak it a bit?
> 
> - Until the reverse mapping btree runs out of records:
> 
>   - Retrieve the next record from the btree and put it in a bag.
> 
>   - Collect all records with the same starting block from the btree
> and
>     put them in the bag.
> 
>   - While the bag isn't empty:
> 
>     - Among the mappings in the bag, compute the lowest block number
>       where the reference count changes.
>       This position will be either the starting block number of the
> next
>       unprocessed reverse mapping or the next block after the
> shortest
>       mapping in the bag.
> 
>     - Remove all mappings from the bag that end at this position.
> 
>     - Collect all reverse mappings that start at this position from
> the
>       btree and put them in the bag.
> 
>     - If the size of the bag changed and is greater than one, create
> a
>       new refcount record associating the block number range that we
>       just walked to the size of the bag.
> 
> 
Sure, that looks fine to me

> > > +
> > > +The bag-like structure in this case is a type 2 xfarray as
> > > discussed
> > > in the
> > > +:ref:`xfarray access patterns<xfarray_access_patterns>` section.
> > > +Reverse mappings are added to the bag using
> > > ``xfarray_store_anywhere`` and
> > > +removed via ``xfarray_unset``.
> > > +Bag members are examined through ``xfarray_iter`` loops.
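
For readers who prefer code to prose, the bag walk sketched in the bullets
above might look roughly like this.  The helper names are invented (the real
implementation drives an rmap btree cursor and the xfarray calls mentioned in
the previous paragraph), and error handling is omitted:

    while (rmap_has_unprocessed_records(cur)) {
        xfs_agblock_t   sp, np;
        uint64_t        old_size;

        /* Start a new stack of overlapping mappings. */
        sp = next_unprocessed_startblock(cur);
        bag_add_all_starting_at(bag, cur, sp);
        old_size = bag_count(bag);

        while (!bag_is_empty(bag)) {
            /*
             * Lowest block where the stack height can change; an
             * exhausted cursor counts as "infinitely far away".
             */
            np = min(next_unprocessed_startblock(cur),
                     shortest_mapping_end(bag));

            bag_remove_mappings_ending_at(bag, np);
            bag_add_all_starting_at(bag, cur, np);

            /* Height changed: emit a refcount record for [sp, np). */
            if (bag_count(bag) != old_size)
                emit_refcount_record(sp, np - sp, old_size);

            old_size = bag_count(bag);
            sp = np;
        }
    }
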
> > > +
> > > +The proposed patchset is the
> > > +`AG btree repair
> > > +<
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > > log/?h=repair-ag-btrees>`_
> > > +series.
> > > +
> > > +Case Study: Rebuilding File Fork Mapping Indices
> > > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > +
> > > +The high level process to rebuild a data/attr fork mapping btree
> > > is:
> > > +
> > > +1. Walk the reverse mapping records to generate ``struct
> > > xfs_bmbt_rec``
> > > +   records from the reverse mapping records for that inode and
> > > fork.
> > > +   Append these records to an xfarray.
> > > +   Compute the bitmap of the old bmap btree blocks from the
> > > ``BMBT_BLOCK``
> > > +   records.
> > > +
> > > +2. Use the ``xfs_btree_bload_compute_geometry`` function to
> > > compute
> > > the number
> > > +   of blocks needed for the new tree.
> > > +
> > > +3. Sort the records in file offset order.
> > > +
> > > +4. If the extent records would fit in the inode fork immediate
> > > area,
> > > commit the
> > > +   records to that immediate area and skip to step 8.
> > > +
> > > +5. Allocate the number of blocks computed in the previous step.
> > > +
> > > +6. Use ``xfs_btree_bload`` to write the xfarray records to btree
> > > blocks and
> > > +   generate the internal node blocks.
> > > +
> > > +7. Commit the new btree root block to the inode fork immediate
> > > area.
> > > +
> > > +8. Reap the old btree blocks using the bitmap created in step 1.
> > This description is not bad, but I had a hard time finding
> > something
> > that resembled the description in the link below.  Maybe it's in a
> > different branch?
> 
> Oops, sorry, that url should be:
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-file-mappings
> 
> > > +
> > > +There are some complications here:
> > > +First, it's possible to move the fork offset to adjust the sizes
> > > of
> > > the
> > > +immediate areas if the data and attr forks are not both in BMBT
> > > format.
> > > +Second, if there are sufficiently few fork mappings, it may be
> > > possible to use
> > > +EXTENTS format instead of BMBT, which may require a conversion.
> > > +Third, the incore extent map must be reloaded carefully to avoid
> > > disturbing
> > > +any delayed allocation extents.
> > > +
> > > +The proposed patchset is the
> > > +`file repair
> > > +<
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > > log/?h=repair-inodes>`_
> > > +series.
> > So I'm assuming links to kernel.org are acceptable as it looks like
> > you
> > use them here, but it does imply that they need to sort of live
> > forever, or at least as long as any document that uses them?
> 
> After all this gets merged I'll replace them with links to
> fs/xfs/scrub/bmap_repair.c.
> 
> > > +
> > > +.. _reaping:
> > > +
> > > +Reaping Old Metadata Blocks
> > > +---------------------------
> > > +
> > > +Whenever online fsck builds a new data structure to replace one
> > > that
> > > is
> > > +suspect, there is a question of how to find and dispose of the
> > > blocks that
> > > +belonged to the old structure.
> > > +The laziest method of course is not to deal with them at all,
> > > but
> > > this slowly
> > > +leads to service degradations as space leaks out of the
> > > filesystem.
> > > +Hopefully, someone will schedule a rebuild of the free space
> > > information to
> > > +plug all those leaks.
> > > +Offline repair rebuilds all space metadata after recording the
> > > usage
> > > of
> > > +the files and directories that it decides not to clear, hence it
> > > can
> > > build new
> > > +structures in the discovered free space and avoid the question
> > > of
> > > reaping.
> > > +
> > > +As part of a repair, online fsck relies heavily on the reverse
> > > mapping records
> > > +to find space that is owned by the corresponding rmap owner yet
> > > truly free.
> > > +Cross referencing rmap records with other rmap records is
> > > necessary
> > > because
> > > +there may be other data structures that also think they own some
> > > of
> > > those
> > > +blocks (e.g. crosslinked trees).
> > > +Permitting the block allocator to hand them out again will not
> > > push
> > > the system
> > > +towards consistency.
> > > +
> > > +For space metadata, the process of finding extents to dispose of
> > > generally
> > > +follows this format:
> > > +
> > > +1. Create a bitmap of space used by data structures that must be
> > > preserved.
> > > +   The space reservations used to create the new metadata can be
> > > used here if
> > > +   the same rmap owner code is used to denote all of the objects
> > > being rebuilt.
> > > +
> > > +2. Survey the reverse mapping data to create a bitmap of space
> > > owned
> > > by the
> > > +   same ``XFS_RMAP_OWN_*`` number for the metadata that is being
> > > preserved.
> > > +
> > > +3. Use the bitmap disunion operator to subtract (1) from (2).
> > > +   The remaining set bits represent candidate extents that could
> > > be
> > > freed.
> > > +   The process moves on to step 4 below.
> > > +
> > > +Repairs for file-based metadata such as extended attributes,
> > > directories,
> > > +symbolic links, quota files and realtime bitmaps are performed
> > > by
> > > building a
> > > +new structure attached to a temporary file and swapping the
> > > forks.
> > > +Afterward, the mappings in the old file fork are the candidate
> > > blocks for
> > > +disposal.
> > > +
> > > +The process for disposing of old extents is as follows:
> > > +
> > > +4. For each candidate extent, count the number of reverse
> > > mapping
> > > records for
> > > +   the first block in that extent that do not have the same rmap
> > > owner as the
> > > +   data structure being repaired.
> > > +
> > > +   - If zero, the block has a single owner and can be freed.
> > > +
> > > +   - If not, the block is part of a crosslinked structure and
> > > must
> > > not be
> > > +     freed.
> > > +
> > > +5. Starting with the next block in the extent, figure out how
> > > many
> > > more blocks
> > > +   have the same zero/nonzero other owner status as that first
> > > block.
> > > +
> > > +6. If the region is crosslinked, delete the reverse mapping
> > > entry
> > > for the
> > > +   structure being repaired and move on to the next region.
> > > +
> > > +7. If the region is to be freed, mark any corresponding buffers
> > > in
> > > the buffer
> > > +   cache as stale to prevent log writeback.
> > > +
> > > +8. Free the region and move on.
> > I think this part is as straightforward as it can be.  I like
> > links,
> > but they do have maintenance issues if the branch ever goes away. 
> > It
> > may be worth it though just while the code is going through review,
> > I
> > think it really helps to be able to just jump right into the code
> > it's
> > trying to describe rather than trying to track down based on the
> > description.  
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/tree/fs/xfs/scrub/reap.c?h=repair-ag-btrees&id=d866f0e470b077806c994f4434bbe64e4a3a8662#n471:~:text=xrep_reap_ag_metadata(
> > 
> > I think that's the right one?  Tiny links are nice for when steps are
> > buried in sub functions too
> 
> Maybe?  That didn't actually move to line 471 or highlight anything.
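
For what it's worth, steps 1-3 of the reaping procedure boil down to two
bitmaps and a disunion.  A sketch, assuming helpers along the lines of the
scrub bitmap code (the xbitmap names and the two collect_* walks are
assumptions, not quotes from the patchset):

    struct xbitmap  old_space;      /* step 2: rmap space with this owner */
    struct xbitmap  keep_space;     /* step 1: blocks used by the new structure */
    int             error;

    xbitmap_init(&old_space);
    xbitmap_init(&keep_space);

    error = collect_rmap_space_by_owner(sc, XFS_RMAP_OWN_REFC, &old_space);
    if (!error)
        error = collect_new_structure_blocks(sc, &keep_space);

    /* Step 3: old_space &= ~keep_space; whatever remains is a candidate. */
    if (!error)
        error = xbitmap_disunion(&old_space, &keep_space);
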
> 
> > > +
> > > +However, there is one complication to this procedure.
> > > +Transactions are of finite size, so the reaping process must be
> > > careful to roll
> > > +the transactions to avoid overruns.
> > > +Overruns come from two sources:
> > > +
> > > +a. EFIs logged on behalf of space that is no longer occupied
> > > +
> > > +b. Log items for buffer invalidations
> > > +
> > > +This is also a window in which a crash during the reaping
> > > process
> > > can leak
> > > +blocks.
> > > +As stated earlier, online repair functions use very large
> > > transactions to
> > > +minimize the chances of this occurring.
> > > +
> > > +The proposed patchset is the
> > > +`preparation for bulk loading btrees
> > > +<
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > > log/?h=repair-prep-for-bulk-loading>`_
> > > +series.
> > > +
> > > +Case Study: Reaping After a Regular Btree Repair
> > > +````````````````````````````````````````````````
> > > +
> > > +Old reference count and inode btrees are the easiest to reap
> > > because
> > > they have
> > > +rmap records with special owner codes: ``XFS_RMAP_OWN_REFC`` for
> > > the
> > > refcount
> > > +btree, and ``XFS_RMAP_OWN_INOBT`` for the inode and free inode
> > > btrees.
> > > +Creating a list of extents to reap the old btree blocks is quite
> > > simple,
> > > +conceptually:
> > > +
> > > +1. Lock the relevant AGI/AGF header buffers to prevent
> > > allocation
> > > and frees.
> > > +
> > > +2. For each reverse mapping record with an rmap owner
> > > corresponding
> > > to the
> > > +   metadata structure being rebuilt, set the corresponding range
> > > in
> > > a bitmap.
> > > +
> > > +3. Walk the current data structures that have the same rmap
> > > owner.
> > > +   For each block visited, clear that range in the above bitmap.
> > > +
> > > +4. Each set bit in the bitmap represents a block that could be a
> > > block from the
> > > +   old data structures and hence is a candidate for reaping.
> > > +   In other words, ``(rmap_records_owned_by &
> > > ~blocks_reachable_by_walk)``
> > > +   are the blocks that might be freeable.
> > > +
> > > +If it is possible to maintain the AGF lock throughout the repair
> > > (which is the
> > > +common case), then step 2 can be performed at the same time as
> > > the
> > > reverse
> > > +mapping record walk that creates the records for the new btree.
> > > +
> > > +Case Study: Rebuilding the Free Space Indices
> > > +`````````````````````````````````````````````
> > > +
> > > +The high level process to rebuild the free space indices is:
> > Looks like this one
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=repair-ag-btrees&id=bf5f10a91ca58d883ef1231a406fa0646c4c4e50#:~:text=%2B%20*/-,%2BSTATIC%20int,-%2Bxrep_abt_build_new_trees(
> > 
> > > +
> > > +1. Walk the reverse mapping records to generate ``struct
> > > xfs_alloc_rec_incore``
> > > +   records from the gaps in the reverse mapping btree.
> > > +
> > > +2. Append the records to an xfarray.
> > > +
> > > +3. Use the ``xfs_btree_bload_compute_geometry`` function to
> > > compute
> > > the number
> > > +   of blocks needed for each new tree.
> > > +
> > > +4. Allocate the number of blocks computed in the previous step
> > > from
> > > the free
> > > +   space information collected.
> > > +
> > > +5. Use ``xfs_btree_bload`` to write the xfarray records to btree
> > > blocks and
> > > +   generate the internal node blocks for the free space by block
> > > index.
> > > +   Call it again for the free space by length index.
> > nit: these two loads are flipped
> 
> Oops, fixed.
> 
> > > +
> > > +6. Commit the locations of the new btree root blocks to the AGF.
> > > +
> > > +7. Reap the old btree blocks by looking for space that is not
> > > recorded by the
> > > +   reverse mapping btree, the new free space btrees, or the
> > > AGFL.
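
Step 1 of this procedure (inferring free space records from the gaps between
reverse mappings) can be sketched as follows.  The iterator and the
record_free_extent accumulator are invented; ``rm_startblock`` and
``rm_blockcount`` are the rmap record fields discussed earlier, and the max()
is needed because crosslinked mappings may overlap:

    struct xfs_rmap_irec    *rec;
    xfs_agblock_t           next_agbno = 0; /* first block not yet covered by an rmap */

    for_each_rmap_record(cur, rec) {
        if (rec->rm_startblock > next_agbno)
            record_free_extent(ra, next_agbno,
                               rec->rm_startblock - next_agbno);
        next_agbno = max(next_agbno,
                         rec->rm_startblock + rec->rm_blockcount);
    }

    /* The gap between the last mapping and the end of the AG is also free. */
    if (next_agbno < ag_block_count)
        record_free_extent(ra, next_agbno, ag_block_count - next_agbno);
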
> > > +
> > > +Repairing the free space btrees has three key complications over
> > > a
> > > regular
> > > +btree repair:
> > > +
> > > +First, free space is not explicitly tracked in the reverse
> > > mapping
> > > records.
> > > +Hence, the new free space records must be inferred from gaps in
> > > the
> > > physical
> > > +space component of the keyspace of the reverse mapping btree.
> > > +
> > > +Second, free space repairs cannot use the common btree
> > > reservation
> > > code because
> > > +new blocks are reserved out of the free space btrees.
> > > +This is impossible when repairing the free space btrees
> > > themselves.
> > > +However, repair holds the AGF buffer lock for the duration of
> > > the
> > > free space
> > > +index reconstruction, so it can use the collected free space
> > > information to
> > > +supply the blocks for the new free space btrees.
> > > +It is not necessary to back each reserved extent with an EFI
> > > because
> > > the new
> > > +free space btrees are constructed in what the ondisk filesystem
> > > thinks is
> > > +unowned space.
> > > +However, if reserving blocks for the new btrees from the
> > > collected
> > > free space
> > > +information changes the number of free space records, repair
> > > must
> > > re-estimate
> > > +the new free space btree geometry with the new record count
> > > until
> > > the
> > > +reservation is sufficient.
> > > +As part of committing the new btrees, repair must ensure that
> > > reverse mappings
> > > +are created for the reserved blocks and that unused reserved
> > > blocks
> > > are
> > > +inserted into the free space btrees.
> > > +Deferred rmap and freeing operations are used to ensure that
> > > this
> > > transition
> > > +is atomic, similar to the other btree repair functions.
> > > +
> > > +Third, finding the blocks to reap after the repair is not overly
> > > +straightforward.
> > > +Blocks for the free space btrees and the reverse mapping btrees
> > > are
> > > supplied by
> > > +the AGFL.
> > > +Blocks put onto the AGFL have reverse mapping records with the
> > > owner
> > > +``XFS_RMAP_OWN_AG``.
> > > +This ownership is retained when blocks move from the AGFL into
> > > the
> > > free space
> > > +btrees or the reverse mapping btrees.
> > > +When repair walks reverse mapping records to synthesize free
> > > space
> > > records, it
> > > +creates a bitmap (``ag_owner_bitmap``) of all the space claimed
> > > by
> > > +``XFS_RMAP_OWN_AG`` records.
> > > +The repair context maintains a second bitmap corresponding to
> > > the
> > > rmap btree
> > > +blocks and the AGFL blocks (``rmap_agfl_bitmap``).
> > > +When the walk is complete, the bitmap disunion operation
> > > ``(ag_owner_bitmap &
> > > +~rmap_agfl_bitmap)`` computes the extents that are used by the
> > > old
> > > free space
> > > +btrees.
> > > +These blocks can then be reaped using the methods outlined
> > > above.
> > > +
> > > +The proposed patchset is the
> > > +`AG btree repair
> > > +<
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > > log/?h=repair-ag-btrees>`_
> > > +series.
> > I think we've repeated this link a couple times in the doc.  If you
> > like
> > highlight links, we could clean out the duplicates
> > 
> > > +
> > > +.. _rmap_reap:
> > > +
> > > +Case Study: Reaping After Repairing Reverse Mapping Btrees
> > > +``````````````````````````````````````````````````````````
> > > +
> > > +Old reverse mapping btrees are less difficult to reap after a
> > > repair.
> > > +As mentioned in the previous section, blocks on the AGFL, the
> > > two
> > > free space
> > > +btree blocks, and the reverse mapping btree blocks all have
> > > reverse
> > > mapping
> > > +records with ``XFS_RMAP_OWN_AG`` as the owner.
> > > +The full process of gathering reverse mapping records and
> > > building a
> > > new btree
> > > +are described in the case study of
> > > +:ref:`live rebuilds of rmap data <rmap_repair>`, but a crucial
> > > point
> > > from that
> > > +discussion is that the new rmap btree will not contain any
> > > records
> > > for the old
> > > +rmap btree, nor will the old btree blocks be tracked in the free
> > > space btrees.
> > > +The list of candidate reaping blocks is computed by setting the
> > > bits
> > > +corresponding to the gaps in the new rmap btree records, and
> > > then
> > > clearing the
> > > +bits corresponding to extents in the free space btrees and the
> > > current AGFL
> > > +blocks.
> > > +The result ``(new_rmapbt_gaps & ~(agfl | bnobt_records))`` are
> > > reaped using the
> > > +methods outlined above.
> > > +
> > > +The rest of the process of rebuilding the reverse mapping btree
> > > is
> > > discussed
> > > +in a separate :ref:`case study<rmap_repair>`.
> > > +
> > > +The proposed patchset is the
> > > +`AG btree repair
> > > +<
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > > log/?h=repair-ag-btrees>`_
> > > +series.
> > > +
> > > +Case Study: Rebuilding the AGFL
> > > +```````````````````````````````
> > > +
> > > +The allocation group free block list (AGFL) is repaired as
> > > follows:
> > > +
> > > +1. Create a bitmap for all the space that the reverse mapping
> > > data
> > > claims is
> > > +   owned by ``XFS_RMAP_OWN_AG``.
> > > +
> > > +2. Subtract the space used by the two free space btrees and the
> > > rmap
> > > btree.
> > > +
> > > +3. Subtract any space that the reverse mapping data claims is
> > > owned
> > > by any
> > > +   other owner, to avoid re-adding crosslinked blocks to the
> > > AGFL.
> > > +
> > > +4. Once the AGFL is full, reap any blocks leftover.
> > > +
> > > +5. The next operation to fix the freelist will right-size the
> > > list.
> > > 
> > Branch link?  Looks like maybe it's missing.  In fact this logic
> > looks
> > like it might have been cut off?
> 
> OH, heh.  I forgot that we already merged the AGFL repair code.
> 
> "See `fs/xfs/scrub/agheader_repair.c
> <https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/xfs/scrub/agheader_repair.c>`_
> for more details."
> 
> > In any case, maybe give some thought to the highlight link
> > suggestions.
> 
> Er... how do those work?  In principle I like them, but none of your
> links actually highlighted anything here.  Could you send the link
> over
> IRC so that urldefense crapola won't destroy it, please?
> 
> --D
So I think the last time we talked about these, we realized they're a
Chrome-only format.  That's a shame, I think they really help people to
quickly navigate the code in question.  Otherwise I'm pretty much just
poking through the branches looking for code that resembles the
description.

I also poked around and found there was a firefox plugin that does the
same (link to text fragment addon).  Though it doesn't look like the
links generated are compatible between the browsers.

Maybe something to consider if we have a lot of chrome or ff users.  I
think if they help facilitate more discussion they're better than
nothing at least during review. 

> 
> > Allison
> > 


^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 10/14] xfs: document full filesystem scans for online fsck
  2022-12-30 22:10   ` [PATCH 10/14] xfs: document full filesystem scans for online fsck Darrick J. Wong
@ 2023-02-16 15:47     ` Allison Henderson
  2023-02-16 22:48       ` Darrick J. Wong
  0 siblings, 1 reply; 220+ messages in thread
From: Allison Henderson @ 2023-02-16 15:47 UTC (permalink / raw)
  To: djwong
  Cc: Catherine Hoang, david, willy, linux-xfs, Chandan Babu,
	linux-fsdevel, hch

On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Certain parts of the online fsck code need to scan every file in the
> entire filesystem.  It is not acceptable to block the entire
> filesystem
> while this happens, which means that we need to be clever in allowing
> scans to coordinate with ongoing filesystem updates.  We also need to
> hook the filesystem so that regular updates propagate to the staging
> records.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  .../filesystems/xfs-online-fsck-design.rst         |  677
> ++++++++++++++++++++
>  1 file changed, 677 insertions(+)
> 
> 
> diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst
> b/Documentation/filesystems/xfs-online-fsck-design.rst
> index a658da8fe4ae..c0f08a773f08 100644
> --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> @@ -3018,3 +3018,680 @@ The proposed patchset is the
>  `summary counter cleanup
>  <
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> log/?h=repair-fscounters>`_
>  series.
> +
> +Full Filesystem Scans
> +---------------------
> +
> +Certain types of metadata can only be checked by walking every file
> in the
> +entire filesystem to record observations and comparing the
> observations against
> +what's recorded on disk.
> +Like every other type of online repair, repairs are made by writing
> those
> +observations to disk in a replacement structure and committing it
> atomically.
> +However, it is not practical to shut down the entire filesystem to
> examine
> +hundreds of billions of files because the downtime would be
> excessive.
> +Therefore, online fsck must build the infrastructure to manage a
> live scan of
> +all the files in the filesystem.
> +There are two questions that need to be solved to perform a live
> walk:
> +
> +- How does scrub manage the scan while it is collecting data?
> +
> +- How does the scan keep abreast of changes being made to the system
> by other
> +  threads?
> +
> +.. _iscan:
> +
> +Coordinated Inode Scans
> +```````````````````````
> +
> +In the original Unix filesystems of the 1970s, each directory entry
> contained
> +an index number (*inumber*) which was used as an index into an
> ondisk array
> +(*itable*) of fixed-size records (*inodes*) describing a file's
> attributes and
> +its data block mapping.
> +This system is described by J. Lions, `"inode (5659)"
> +<http://www.lemis.com/grog/Documentation/Lions/>`_ in *Lions'
> Commentary on
> +UNIX, 6th Edition*, (Dept. of Computer Science, the University of
> New South
> +Wales, November 1977), pp. 18-2; and later by D. Ritchie and K.
> Thompson,
> +`"Implementation of the File System"
> +<https://archive.org/details/bstj57-6-1905/page/n8/mode/1up>`_, from
> *The UNIX
> +Time-Sharing System*, (The Bell System Technical Journal, July
> 1978), pp.
> +1913-4.
> +
> +XFS retains most of this design, except now inumbers are search keys
> over all
> +the space in the data section of the filesystem.
> +They form a continuous keyspace that can be expressed as a 64-bit
> integer,
> +though the inodes themselves are sparsely distributed within the
> keyspace.
> +Scans proceed in a linear fashion across the inumber keyspace,
> starting from
> +``0x0`` and ending at ``0xFFFFFFFFFFFFFFFF``.
> +Naturally, a scan through a keyspace requires a scan cursor object
> to track the
> +scan progress.
> +Because this keyspace is sparse, this cursor contains two parts.
> +The first part of this scan cursor object tracks the inode that will
> be
> +examined next; call this the examination cursor.
> +Somewhat less obviously, the scan cursor object must also track
> which parts of
> +the keyspace have already been visited, which is critical for
> deciding if a
> +concurrent filesystem update needs to be incorporated into the scan
> data.
> +Call this the visited inode cursor.
> +
> +Advancing the scan cursor is a multi-step process encapsulated in
> +``xchk_iscan_iter``:
> +
> +1. Lock the AGI buffer of the AG containing the inode pointed to by
> the visited
> +   inode cursor.
> +   This guarantees that inodes in this AG cannot be allocated or
> freed while
> +   advancing the cursor.
> +
> +2. Use the per-AG inode btree to look up the next inumber after the
> one that
> +   was just visited, since it may not be keyspace adjacent.
> +
> +3. If there are no more inodes left in this AG:
> +
> +   a. Move the examination cursor to the point of the inumber
> keyspace that
> +      corresponds to the start of the next AG.
> +
> +   b. Adjust the visited inode cursor to indicate that it has
> "visited" the
> +      last possible inode in the current AG's inode keyspace.
> +      XFS inumbers are segmented, so the cursor needs to be marked
> as having
> +      visited the entire keyspace up to just before the start of the
> next AG's
> +      inode keyspace.
> +
> +   c. Unlock the AGI and return to step 1 if there are unexamined
> AGs in the
> +      filesystem.
> +
> +   d. If there are no more AGs to examine, set both cursors to the
> end of the
> +      inumber keyspace.
> +      The scan is now complete.
> +
> +4. Otherwise, there is at least one more inode to scan in this AG:
> +
> +   a. Move the examination cursor ahead to the next inode marked as
> allocated
> +      by the inode btree.
> +
> +   b. Adjust the visited inode cursor to point to the inode just
> prior to where
> +      the examination cursor is now.
> +      Because the scanner holds the AGI buffer lock, no inodes could
> have been
> +      created in the part of the inode keyspace that the visited
> inode cursor
> +      just advanced.
> +
> +5. Get the incore inode for the inumber of the examination cursor.
> +   By maintaining the AGI buffer lock until this point, the scanner
> knows that
> +   it was safe to advance the examination cursor across the entire
> keyspace,
> +   and that it has stabilized this next inode so that it cannot
> disappear from
> +   the filesystem until the scan releases the incore inode.
> +
> +6. Drop the AGI lock and return the incore inode to the caller.
> +
> +Online fsck functions scan all files in the filesystem as follows:
> +
> +1. Start a scan by calling ``xchk_iscan_start``.
Hmm, I actually did not find xchk_iscan_start in the below branch, I
found xchk_iscan_iter in "xfs: implement live inode scan for scrub",
but it doesn't look like anything uses it yet, at least not in that
branch.

Also, it took me a bit to figure out that "initial user" meant "calling
function" 


> +
> +2. Advance the scan cursor (``xchk_iscan_iter``) to get the next
> inode.
> +   If one is provided:
> +
> +   a. Lock the inode to prevent updates during the scan.
> +
> +   b. Scan the inode.
> +
> +   c. While still holding the inode lock, adjust the visited inode
> cursor
> +      (``xchk_iscan_mark_visited``) to point to this inode.
> +
> +   d. Unlock and release the inode.
> +
> +3. Call ``xchk_iscan_finish`` to complete the scan.
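
A rough sketch of how a scrubber would drive this loop.  The ``xchk_iscan_*``
calls are the interfaces described here, but their exact signatures and return
conventions are assumptions (this treats a positive return from
``xchk_iscan_iter`` as "here is an inode"); the locking calls are the ordinary
XFS ones:

    struct xchk_iscan   iscan;
    struct xfs_inode    *ip;
    int                 error;

    xchk_iscan_start(sc, &iscan);
    while ((error = xchk_iscan_iter(&iscan, &ip)) == 1) {
        xfs_ilock(ip, XFS_ILOCK_EXCL);          /* 2a: block updates to this inode */

        error = scan_this_inode(sc, ip);        /* 2b: record observations */
        if (!error)
            xchk_iscan_mark_visited(&iscan, ip);    /* 2c */

        xfs_iunlock(ip, XFS_ILOCK_EXCL);        /* 2d: unlock... */
        xfs_irele(ip);                          /* ...and release */
        if (error)
            break;
    }
    xchk_iscan_finish(&iscan);                  /* final step: end the scan */
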
> +
> +There are subtleties with the inode cache that complicate grabbing
> the incore
> +inode for the caller.
> +Obviously, it is an absolute requirement that the inode metadata be
> consistent
> +enough to load it into the inode cache.
> +Second, if the incore inode is stuck in some intermediate state, the
> scan
> +coordinator must release the AGI and push the main filesystem to get
> the inode
> +back into a loadable state.
> +
> +The proposed patches are the
> +`inode scanner
> +<
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> log/?h=scrub-iscan>`_
> +series.
> +
> +Inode Management
> +````````````````
> +
> +In regular filesystem code, references to allocated XFS incore
> inodes are
> +always obtained (``xfs_iget``) outside of transaction context
> because the
> +creation of the incore context for ane xisting file does not require
an existing
> metadata
> +updates.
> +However, it is important to note that references to incore inodes
> obtained as
> +part of file creation must be performed in transaction context
> because the
> +filesystem must ensure the atomicity of the ondisk inode btree index
> updates
> +and the initialization of the actual ondisk inode.
> +
> +References to incore inodes are always released (``xfs_irele``)
> outside of
> +transaction context because there are a handful of activities that
> might
> +require ondisk updates:
> +
> +- The VFS may decide to kick off writeback as part of a
> ``DONTCACHE`` inode
> +  release.
> +
> +- Speculative preallocations need to be unreserved.
> +
> +- An unlinked file may have lost its last reference, in which case
> the entire
> +  file must be inactivated, which involves releasing all of its
> resources in
> +  the ondisk metadata and freeing the inode.
> +
> +These activities are collectively called inode inactivation.
> +Inactivation has two parts -- the VFS part, which initiates
> writeback on all
> +dirty file pages, and the XFS part, which cleans up XFS-specific
> information
> +and frees the inode if it was unlinked.
> +If the inode is unlinked (or unconnected after a file handle
> operation), the
> +kernel drops the inode into the inactivation machinery immediately.
> +
> +During normal operation, resource acquisition for an update follows
> this order
> +to avoid deadlocks:
> +
> +1. Inode reference (``iget``).
> +
> +2. Filesystem freeze protection, if repairing
> (``mnt_want_write_file``).
> +
> +3. Inode ``IOLOCK`` (VFS ``i_rwsem``) lock to control file IO.
> +
> +4. Inode ``MMAPLOCK`` (page cache ``invalidate_lock``) lock for
> operations that
> +   can update page cache mappings.
> +
> +5. Log feature enablement.
> +
> +6. Transaction log space grant.
> +
> +7. Space on the data and realtime devices for the transaction.
> +
> +8. Incore dquot references, if a file is being repaired.
> +   Note that they are not locked, merely acquired.
> +
> +9. Inode ``ILOCK`` for file metadata updates.
> +
> +10. AG header buffer locks / Realtime metadata inode ILOCK.
> +
> +11. Realtime metadata buffer locks, if applicable.
> +
> +12. Extent mapping btree blocks, if applicable.
> +
> +Resources are often released in the reverse order, though this is
> not required.
> +However, online fsck differs from regular XFS operations because it
> may examine
> +an object that normally is acquired in a later stage of the locking
> order, and
> +then decide to cross-reference the object with an object that is
> acquired
> +earlier in the order.
> +The next few sections detail the specific ways in which online fsck
> takes care
> +to avoid deadlocks.
> +
> +iget and irele During a Scrub
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +An inode scan performed on behalf of a scrub operation runs in
> transaction
> +context, and possibly with resources already locked and bound to it.
> +This isn't much of a problem for ``iget`` since it can operate in
> the context
> +of an existing transaction, as long as all of the bound resources
> are acquired
> +before the inode reference in the regular filesystem.
> +
> +When the VFS ``iput`` function is given a linked inode with no other
> +references, it normally puts the inode on an LRU list in the hope
> that it can
> +save time if another process re-opens the file before the system
> runs out
> +of memory and frees it.
> +Filesystem callers can short-circuit the LRU process by setting a
> ``DONTCACHE``
> +flag on the inode to cause the kernel to try to drop the inode into
> the
> +inactivation machinery immediately.
> +
> +In the past, inactivation was always done from the process that
> dropped the
> +inode, which was a problem for scrub because scrub may already hold
> a
> +transaction, and XFS does not support nesting transactions.
> +On the other hand, if there is no scrub transaction, it is desirable
> to drop
> +otherwise unused inodes immediately to avoid polluting caches.
> +To capture these nuances, the online fsck code has a separate
> ``xchk_irele``
> +function to set or clear the ``DONTCACHE`` flag to get the required
> release
> +behavior.
> +
> +Proposed patchsets include fixing
> +`scrub iget usage
> +<
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> log/?h=scrub-iget-fixes>`_ and
> +`dir iget usage
> +<
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> log/?h=scrub-dir-iget-fixes>`_.
> +
> +Locking Inodes
> +^^^^^^^^^^^^^^
> +
> +In regular filesystem code, the VFS and XFS will acquire multiple
> IOLOCK locks
> +in a well-known order: parent → child when updating the directory
> tree, and
> +``struct inode`` address order otherwise.
> +For regular files, the MMAPLOCK can be acquired after the IOLOCK to
> stop page
> +faults.
> +If two MMAPLOCKs must be acquired, they are acquired in 


> ``struct
> +address_space`` order.
the order of their memory address

?

> +Due to the structure of existing filesystem code, IOLOCKs and
> MMAPLOCKs must be
> +acquired before transactions are allocated.
> +If two ILOCKs must be acquired, they are acquired in inumber order.
> +
> +Inode lock acquisition must be done carefully during a coordinated
> inode scan.
> +Online fsck cannot abide these conventions, because for a directory
> tree
> +scanner, the scrub process holds the IOLOCK of the file being
> scanned and it
> +needs to take the IOLOCK of the file at the other end of the
> directory link.
> +If the directory tree is corrupt because it contains a cycle,
> ``xfs_scrub``
> +cannot use the regular inode locking functions and avoid becoming
> trapped in an
> +ABBA deadlock.
> +
> +Solving both of these problems is straightforward -- any time online
> fsck
> +needs to take a second lock of the same class, it uses trylock to
> avoid an ABBA
> +deadlock.
> +If the trylock fails, scrub drops all inode locks and uses trylock
> loops to
> +(re)acquire all necessary resources.
> +Trylock loops enable scrub to check for pending fatal signals, which
> is how
> +scrub avoids deadlocking the filesystem or becoming an unresponsive
> process.
> +However, trylock loops mean that online fsck must be prepared to
> measure the
> +resource being scrubbed before and after the lock cycle to detect
> changes and
> +react accordingly.
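
A minimal sketch of the trylock pattern, using the ordinary XFS inode locking
calls; the re-validation of whatever was observed before the locks cycled is
left to the caller:

    /* Take the second ILOCK without risking an ABBA deadlock. */
    while (!xfs_ilock_nowait(other_ip, XFS_ILOCK_SHARED)) {
        /* Drop what we hold so the other thread can make progress... */
        xfs_iunlock(ip, XFS_ILOCK_EXCL);

        /* ...bail out if a fatal signal is pending... */
        if (fatal_signal_pending(current))
            return -EINTR;

        /* ...then reacquire our lock and try the other inode again. */
        xfs_ilock(ip, XFS_ILOCK_EXCL);
    }
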
> +
> +.. _dirparent:
> +
> +Case Study: Finding a Directory Parent
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +Consider the directory parent pointer repair code as an example.
> +Online fsck must verify that the dotdot dirent of a directory points
> up to a
> +parent directory, and that the parent directory contains exactly one
> dirent
> +pointing down to the child directory.
> +Fully validating this relationship (and repairing it if possible)
> requires a
> +walk of every directory on the filesystem while holding the child
> locked, and
> +while updates to the directory tree are being made.
> +The coordinated inode scan provides a way to walk the filesystem
> without the
> +possibility of missing an inode.
> +The child directory is kept locked to prevent updates to the dotdot
> dirent, but
> +if the scanner fails to lock a parent, it can drop and relock both
> the child
> +and the prospective parent.
> +If the dotdot entry changes while the directory is unlocked, then a
> move or
> +rename operation must have changed the child's parentage, and the
> scan can
> +exit early.
> +
> +The proposed patchset is the
> +`directory repair
> +<
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> log/?h=repair-dirs>`_
> +series.
> +
> +.. _fshooks:
> +
> +Filesystem Hooks
> +`````````````````
> +
> +The second piece of support that online fsck functions need during a
> full
> +filesystem scan is the ability to stay informed about updates being
> made by
> +other threads in the filesystem, since comparisons against the past
> are useless
> +in a dynamic environment.
> +Two pieces of Linux kernel infrastructure enable online fsck to
> monitor regular
> +filesystem operations: filesystem hooks and :ref:`static
> keys<jump_labels>`.
> +
> +Filesystem hooks convey information about an ongoing filesystem
> operation to
> +a downstream consumer.
> +In this case, the downstream consumer is always an online fsck
> function.
> +Because multiple fsck functions can run in parallel, online fsck
> uses the Linux
> +notifier call chain facility to dispatch updates to any number of
> interested
> +fsck processes.
> +Call chains are a dynamic list, which means that they can be
> configured at
> +run time.
> +Because these hooks are private to the XFS module, the information
> passed along
> +contains exactly what the checking function needs to update its
> observations.
> +
> +The current implementation of XFS hooks uses SRCU notifier chains to
> reduce the
> +impact to highly threaded workloads.
> +Regular blocking notifier chains use a rwsem and seem to have a much
> lower
> +overhead for single-threaded applications.
> +However, it may turn out that the combination of blocking chains and
> static
> +keys are a more performant combination; more study is needed here.
> +
> +The following pieces are necessary to hook a certain point in the
> filesystem:
> +
> +- A ``struct xfs_hooks`` object must be embedded in a convenient
> place such as
> +  a well-known incore filesystem object.
> +
> +- Each hook must define an action code and a structure containing
> more context
> +  about the action.
> +
> +- Hook providers should provide appropriate wrapper functions and
> structs
> +  around the ``xfs_hooks`` and ``xfs_hook`` objects to take
> advantage of type
> +  checking to ensure correct usage.
> +
> +- A callsite in the regular filesystem code must be chosen to call
> +  ``xfs_hooks_call`` with the action code and data structure.
> +  This place should be adjacent to (and not earlier than) the place
> where
> +  the filesystem update is committed to the transaction.
> +  In general, when the filesystem calls a hook chain, it should be
> able to
> +  handle sleeping and should not be vulnerable to memory reclaim or
> locking
> +  recursion.
> +  However, the exact requirements are very dependent on the context
> of the hook
> +  caller and the callee.
> +
> +- The online fsck function should define a structure to hold scan
> data, a lock
> +  to coordinate access to the scan data, and a ``struct xfs_hook``
> object.
> +  The scanner function and the regular filesystem code must acquire
> resources
> +  in the same order; see the next section for details.
> +
> +- The online fsck code must contain a C function to catch the hook
> action code
> +  and data structure.
> +  If the object being updated has already been visited by the scan,
> then the
> +  hook information must be applied to the scan data.
> +
> +- Prior to unlocking inodes to start the scan, online fsck must call
> +  ``xfs_hooks_setup`` to initialize the ``struct xfs_hook``, and
> +  ``xfs_hooks_add`` to enable the hook.
> +
> +- Online fsck must call ``xfs_hooks_del`` to disable the hook once
> the scan is
> +  complete.
> +
> +The number of hooks should be kept to a minimum to reduce
> complexity.
> +Static keys are used to reduce the overhead of filesystem hooks to
> nearly
> +zero when online fsck is not running.
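
Pulling the pieces in the list above together, the wiring might look like the
fragments below.  This is only a sketch: the ``xfs_hooks_*`` calls are the
functions named above, but their signatures, the placement of the hooks object
in ``struct xfs_mount``, and all of the "example" names are assumptions:

    /* Context handed from the hooked filesystem function to any listeners. */
    struct xfs_example_update_params {
        xfs_ino_t   ino;
        int64_t     delta;
    };

    /* In the regular filesystem code, adjacent to the transactional update: */
    xfs_hooks_call(&mp->m_example_hooks, XFS_EXAMPLE_UPDATE, &params);

    /* In the scrub function, before dropping locks to start the scan: */
    xfs_hooks_setup(&scan_data->hook, example_hook_fn);
    error = xfs_hooks_add(&mp->m_example_hooks, &scan_data->hook);

    /* ...and once the scan is complete: */
    xfs_hooks_del(&mp->m_example_hooks, &scan_data->hook);
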
> +
> +.. _liveupdate:
> +
> +Live Updates During a Scan
> +``````````````````````````
> +
> +The code paths of the online fsck scanning code and the
> :ref:`hooked<fshooks>`
> +filesystem code look like this::
> +
> +            other program
> +                  ↓
> +            inode lock ←────────────────────┐
> +                  ↓                         │
> +            AG header lock                  │
> +                  ↓                         │
> +            filesystem function             │
> +                  ↓                         │
> +            notifier call chain             │    same
> +                  ↓                         ├─── inode
> +            scrub hook function             │    lock
> +                  ↓                         │
> +            scan data mutex ←──┐    same    │
> +                  ↓            ├─── scan    │
> +            update scan data   │    lock    │
> +                  ↑            │            │
> +            scan data mutex ←──┘            │
> +                  ↑                         │
> +            inode lock ←────────────────────┘
> +                  ↑
> +            scrub function
> +                  ↑
> +            inode scanner
> +                  ↑
> +            xfs_scrub
> +
> +These rules must be followed to ensure correct interactions between
> the
> +checking code and the code making an update to the filesystem:
> +
> +- Prior to invoking the notifier call chain, the filesystem function
> being
> +  hooked must acquire the same lock that the scrub scanning function
> acquires
> +  to scan the inode.
> +
> +- The scanning function and the scrub hook function must coordinate
> access to
> +  the scan data by acquiring a lock on the scan data.
> +
> +- The scrub hook function must not add the live update information to
> the scan
> +  observations unless the inode being updated has already been
> scanned.
> +  The scan coordinator has a helper predicate
> (``xchk_iscan_want_live_update``)
> +  for this.
> +
> +- Scrub hook functions must not change the caller's state, including
> the
> +  transaction that it is running.
> +  They must not acquire any resources that might conflict with the
> filesystem
> +  function being hooked.
> +
> +- The hook function can abort the inode scan to avoid breaking the
> other rules.
> +
> +The inode scan APIs are pretty simple:
> +
> +- ``xchk_iscan_start`` starts a scan
> +
> +- ``xchk_iscan_iter`` grabs a reference to the next inode in the
> scan or
> +  returns zero if there is nothing left to scan
> +
> +- ``xchk_iscan_want_live_update`` to decide if an inode has already
> been
> +  visited in the scan.
> +  This is critical for hook functions to decide if they need to
> update the
> +  in-memory scan information.
> +
> +- ``xchk_iscan_mark_visited`` to mark an inode as having been
> visited in the
> +  scan
> +
> +- ``xchk_iscan_finish`` to finish the scan
> +
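> +A rough sketch of a scan built from these calls follows; the argument
> +lists, the convention that ``xchk_iscan_iter`` returns 1 when it hands
> +back an inode, and the helper ``xchk_example_check_inode`` are all
> +assumptions for illustration::
> +
> +        xchk_iscan_start(sc, &iscan);
> +        while ((error = xchk_iscan_iter(&iscan, &ip)) == 1) {
> +                xfs_ilock(ip, XFS_ILOCK_EXCL);
> +                error = xchk_example_check_inode(sc, ip);
> +                /* mark it visited while the inode is still locked */
> +                xchk_iscan_mark_visited(&iscan, ip);
> +                xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +                xfs_irele(ip);
> +                if (error)
> +                        break;
> +        }
> +        xchk_iscan_finish(&iscan);
> +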
> +The proposed patches are at the start of the
> +`online quotacheck
> +<
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> log/?h=repair-quota>`_
> +series.
Wrong link?  This looks like it goes to the section below.

> +
> +.. _quotacheck:
> +
> +Case Study: Quota Counter Checking
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +It is useful to compare the mount time quotacheck code to the online
> repair
> +quotacheck code.
> +Mount time quotacheck does not have to contend with concurrent
> operations, so
> +it does the following:
> +
> +1. Make sure the ondisk dquots are in good enough shape that all the
> incore
> +   dquots will actually load, and zero the resource usage counters
> in the
> +   ondisk buffer.
> +
> +2. Walk every inode in the filesystem.
> +   Add each file's resource usage to the incore dquot.
> +
> +3. Walk each incore dquot.
> +   If the incore dquot is not being flushed, add the ondisk buffer
> backing the
> +   incore dquot to a delayed write (delwri) list.
> +
> +4. Write the buffer list to disk.
> +
> +Like most online fsck functions, online quotacheck can't write to
> regular
> +filesystem objects until the newly collected metadata reflect all
> filesystem
> +state.
> +Therefore, online quotacheck records file resource usage to a shadow
> dquot
> +index implemented with a sparse ``xfarray``, and only writes to the
> real dquots
> +once the scan is complete.
> +Handling transactional updates is tricky because quota resource
> usage updates
> +are handled in phases to minimize contention on dquots:
> +
> +1. The inodes involved are joined and locked to a transaction.
> +
> +2. For each dquot attached to the file:
> +
> +   a. The dquot is locked.
> +
> +   b. A quota reservation is added to the dquot's resource usage.
> +      The reservation is recorded in the transaction.
> +
> +   c. The dquot is unlocked.
> +
> +3. Changes in actual quota usage are tracked in the transaction.
> +
> +4. At transaction commit time, each dquot is examined again:
> +
> +   a. The dquot is locked again.
> +
> +   b. Quota usage changes are logged and unused reservation is given
> back to
> +      the dquot.
> +
> +   c. The dquot is unlocked.
> +
> +For online quotacheck, hooks are placed in steps 2 and 4.
> +The step 2 hook creates a shadow version of the transaction dquot
> context
> +(``dqtrx``) that operates in a similar manner to the regular code.
> +The step 4 hook commits the shadow ``dqtrx`` changes to the shadow
> dquots.
> +Notice that both hooks are called with the inode locked, which is
> how the
> +live update coordinates with the inode scanner.
> +
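> +A sketch of the step 4 hook follows; the shadow context ``xqc``, its
> +fields, and the helper that folds the shadow ``dqtrx`` deltas into the
> +shadow dquots are illustrative names, not the exact ones from the
> +patchset::
> +
> +        /* called with the inode locked, alongside the real commit */
> +        mutex_lock(&xqc->lock);
> +        if (xchk_iscan_want_live_update(&xqc->iscan, ip->i_ino))
> +                error = xqcheck_apply_shadow_dqtrx(xqc, shadow_dqtrx);
> +        mutex_unlock(&xqc->lock);
> +
> +Only updates to files that the scan has already visited are folded in;
> +the resource usage of everything else is counted when the scanner
> +reaches that inode.
> +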
> +The quotacheck scan looks like this:
> +
> +1. Set up a coordinated inode scan.
> +
> +2. For each inode returned by the inode scan iterator:
> +
> +   a. Grab and lock the inode.
> +
> +   b. Determine that inode's resource usage (data blocks, inode
> counts,
> +      realtime blocks) 
nit: move this list to the first appearance of "resource usage".  Step
2 of the first list I think

> and add that to the shadow dquots for the user, group,
> +      and project ids associated with the inode.
> +
> +   c. Unlock and release the inode.
> +
> +3. For each dquot in the system:
> +
> +   a. Grab and lock the dquot.
> +
> +   b. Check the dquot against the shadow dquots created by the scan
> and updated
> +      by the live hooks.
> +
> +Live updates are key to being able to walk every quota record
> without
> +needing to hold any locks for a long duration.
> +If repairs are desired, the real and shadow dquots are locked and
> their
> +resource counts are set to the values in the shadow dquot.
> +
> +The proposed patchset is the
> +`online quotacheck
> +<
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> log/?h=repair-quota>`_
> +series.
> +
> +.. _nlinks:
> +
> +Case Study: File Link Count Checking
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +File link count checking also uses live update hooks.
> +The coordinated inode scanner is used to visit all directories on
> the
> +filesystem, and per-file link count records are stored in a sparse
> ``xfarray``
> +indexed by inumber.
> +During the scanning phase, each entry in a directory generates
> observation
> +data as follows:
> +
> +1. If the entry is a dotdot (``'..'``) entry of the root directory,
> the
> +   directory's parent link count is bumped because the root
> directory's dotdot
> +   entry is self referential.
> +
> +2. If the entry is a dotdot entry of a subdirectory, the parent's
> backref
> +   count is bumped.
> +
> +3. If the entry is neither a dot nor a dotdot entry, the target
> file's parent
> +   count is bumped.
> +
> +4. If the target is a subdirectory, the parent's child link count is
> bumped.
> +
> +A crucial point to understand about how the link count inode scanner
> interacts
> +with the live update hooks is that the scan cursor tracks which
> *parent*
> +directories have been scanned.
> +In other words, the live updates ignore any update about ``A → B``
> when A has
> +not been scanned, even if B has been scanned.
> +Furthermore, a subdirectory A with a dotdot entry pointing back to B
> is
> +accounted as a backref counter in the shadow data for A, since child
> dotdot
> +entries affect the parent's link count.
> +Live update hooks are carefully placed in all parts of the
> filesystem that
> +create, change, or remove directory entries, since those operations
> involve
> +bumplink and droplink.
> +
> +For any file, the correct link count is the number of parents plus
> the number
> +of child subdirectories.
> +Non-directories never have children of any kind.
> +The backref information is used to detect inconsistencies in the
> number of
> +links pointing to child subdirectories and the number of dotdot
> entries
> +pointing back.
> +
> +After the scan completes, the link count of each file can be checked
> by locking
> +both the inode and the shadow data, and comparing the link counts.
> +A second coordinated inode scan cursor is used for comparisons.
> +Live updates are key to being able to walk every inode without
> needing to hold
> +any locks between inodes.
> +If repairs are desired, the inode's link count is set to the value
> in the
> +shadow information.
> +If no parents are found, the file must be :ref:`reparented
> <orphanage>` to the
> +orphanage to prevent the file from being lost forever.
> +
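> +Expressed as a sketch, with a hypothetical layout for the shadow
> +record::
> +
> +        /* correct link count: parents, plus children for a directory */
> +        expected = shadow->parents;
> +        if (S_ISDIR(VFS_I(ip)->i_mode)) {
> +                expected += shadow->children;
> +                /* each child should have contributed a dotdot backref */
> +                if (shadow->children != shadow->backrefs)
> +                        xchk_ino_set_corrupt(sc, ip->i_ino);
> +        }
> +        if (VFS_I(ip)->i_nlink != expected)
> +                xchk_ino_set_corrupt(sc, ip->i_ino);
> +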
> +The proposed patchset is the
> +`file link count repair
> +<
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> log/?h=scrub-nlinks>`_
> +series.
> +
> +.. _rmap_repair:
> +
> +Case Study: Rebuilding Reverse Mapping Records
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +Most repair functions follow the same pattern: lock filesystem
> resources,
> +walk the surviving ondisk metadata looking for replacement metadata
> records,
> +and use an :ref:`in-memory array <xfarray>` to store the gathered
> observations.
> +The primary advantage of this approach is the simplicity and
> modularity of the
> +repair code -- code and data are entirely contained within the scrub
> module,
> +do not require hooks in the main filesystem, and are usually the
> most efficient
> +in memory use.
> +A secondary advantage of this repair approach is atomicity -- once
> the kernel
> +decides a structure is corrupt, no other threads can access the
> metadata until
> +the kernel finishes repairing and revalidating the metadata.
> +
> +For repairs going on within a shard of the filesystem, these
> advantages
> +outweigh the delays inherent in locking the shard while repairing
> parts of the
> +shard.
> +Unfortunately, repairs to the reverse mapping btree cannot use the
> "standard"
> +btree repair strategy because it must scan every space mapping of
> every fork of
> +every file in the filesystem, and the filesystem cannot stop.
> +Therefore, rmap repair foregoes atomicity between scrub and repair.
> +It combines a :ref:`coordinated inode scanner <iscan>`, :ref:`live
> update hooks
> +<liveupdate>`, and an :ref:`in-memory rmap btree <xfbtree>` to
> complete the
> +scan for reverse mapping records.
> +
> +1. Set up an xfbtree to stage rmap records.
> +
> +2. While holding the locks on the AGI and AGF buffers acquired
> during the
> +   scrub, generate reverse mappings for all AG metadata: inodes,
> btrees, CoW
> +   staging extents, and the internal log.
> +
> +3. Set up an inode scanner.
> +
> +4. Hook into rmap updates for the AG being repaired so that the live
> scan data
> +   can receive updates to the rmap btree from the rest of the
> filesystem during
> +   the file scan.
> +
> +5. For each space mapping found in either fork of each file scanned,
> +   decide if the mapping matches the AG of interest.
> +   If so:
> +
> +   a. Create a btree cursor for the in-memory btree.
> +
> +   b. Use the rmap code to add the record to the in-memory btree.
> +
> +   c. Use the :ref:`special commit function <xfbtree_commit>` to
> write the
> +      xfbtree changes to the xfile.
> +
> +6. For each live update received via the hook, decide if the owner
> has already
> +   been scanned.
> +   If so, apply the live update into the scan data:
> +
> +   a. Create a btree cursor for the in-memory btree.
> +
> +   b. Replay the operation into the in-memory btree.
> +
> +   c. Use the :ref:`special commit function <xfbtree_commit>` to
> write the
> +      xfbtree changes to the xfile.
> +      This is performed with an empty transaction to avoid changing
> the
> +      caller's state.
> +
> +7. When the inode scan finishes, create a new scrub transaction and
> relock the
> +   two AG headers.
> +
> +8. Compute the new btree geometry using the number of rmap records
> in the
> +   shadow btree, like all other btree rebuilding functions.
> +
> +9. Allocate the number of blocks computed in the previous step.
> +
> +10. Perform the usual btree bulk loading and commit to install the
> new rmap
> +    btree.
> +
> +11. Reap the old rmap btree blocks as discussed in the case study
> about how
> +    to :ref:`reap after rmap btree repair <rmap_reap>`.
> +
> +12. Free the xfbtree now that it is not needed.
> +
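> +Steps 6a-6c might collapse into a hook body like this sketch, where
> +``update`` is the hook payload, ``rr`` is the repair context, and
> +``xrep_rmap_stash_update`` is an illustrative helper that opens the
> +in-memory btree cursor, replays the record, and commits the xfbtree
> +changes with an empty transaction::
> +
> +        mutex_lock(&rr->lock);
> +        if (xchk_iscan_want_live_update(&rr->iscan, update->owner))
> +                error = xrep_rmap_stash_update(rr, update);
> +        mutex_unlock(&rr->lock);
> +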
> +The proposed patchset is the
> +`rmap repair
> +<
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> log/?h=repair-rmap-btree>`_
> +series.
> 

Mostly looks good, nits aside.  I do sort of wonder if this patch would
do better to appear before patch 6 (or move 6 down), since it gets into
more challenges concerning locks and hooks, whereas here we are mostly
discussing what they are and how they work.  So it might build better
to move this patch up a little.

Allison


^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 08/14] xfs: document btree bulk loading
  2023-02-16 15:46         ` Allison Henderson
@ 2023-02-16 21:08           ` Darrick J. Wong
  0 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2023-02-16 21:08 UTC (permalink / raw)
  To: Allison Henderson
  Cc: Catherine Hoang, david, willy, linux-xfs, Chandan Babu,
	linux-fsdevel, hch

On Thu, Feb 16, 2023 at 03:46:02PM +0000, Allison Henderson wrote:

<snip to the relevant parts>

> > > > +Writing the New Tree
> > > > +````````````````````
> > > > +
> > > > +This part is pretty simple -- the btree builder
> > > > (``xfs_btree_bulkload``) claims
> > > > +a block from the reserved list, writes the new btree block
> > > > header,
> > > > fills the
> > > > +rest of the block with records, and adds the new leaf block to a
> > > > list of
> > > > +written blocks.
> > > > +Sibling pointers are set every time a new block is added to the
> > > > level.
> > > > +When it finishes writing the record leaf blocks, it moves on to
> > > > the
> > > > node
> > > > +blocks.
> > > > +To fill a node block, it walks each block in the next level down
> > > > in
> > > > the tree
> > > > +to compute the relevant keys and write them into the parent
> > > > node.
> > > > +When it reaches the root level, it is ready to commit the new
> > > > btree!
> > > I think most of this is as straight forward as it can be, but it's
> > > a
> > > lot visualizing too, which makes me wonder if it would benefit from
> > > an
> > > simple illustration if possible.
> > > 
> > > On a side note: In a prior team I discovered power points, while a
> > > lot
> > > work, were also really effective for quickly moving a crowd of
> > > people
> > > through connected graph navigation/manipulations.  Because each one
> > > of
> > > these steps was another slide that illustrated how the structure
> > > evolved through the updates.  I realize that's not something that
> > > fits
> > > in the scheme of a document like this, but maybe something
> > > supplemental
> > > to add later.  While it was a time eater, i noticed a lot of
> > > confused
> > > expressions just seemed to shake loose, so sometimes it was worth
> > > it.
> > 
> > That was ... surprisingly less bad than I feared it would be to cut
> > and
> > paste unicode linedraw characters and arrows.
> > 
> >           ┌─────────┐
> >           │root     │
> >           │PP       │
> >           └─────────┘
> >           ↙         ↘
> >       ┌────┐       ┌────┐
> >       │node│──────→│node│
> >       │PP  │←──────│PP  │
> >       └────┘       └────┘
> >       ↙   ↘         ↙   ↘
> >   ┌────┐ ┌────┐ ┌────┐ ┌────┐
> >   │leaf│→│leaf│→│leaf│→│leaf│
> >   │RRR │←│RRR │←│RRR │←│RRR │
> >   └────┘ └────┘ └────┘ └────┘
> > 
> > (Does someone have a program that does this?)
> I think Catherine mentioned she had used PlantUML for the larp diagram,
> though for something this simple I think this is fine

<nod>

> > I really like your version!  Can I tweak it a bit?
> > 
> > - Until the reverse mapping btree runs out of records:
> > 
> >   - Retrieve the next record from the btree and put it in a bag.
> > 
> >   - Collect all records with the same starting block from the btree
> > and
> >     put them in the bag.
> > 
> >   - While the bag isn't empty:
> > 
> >     - Among the mappings in the bag, compute the lowest block number
> >       where the reference count changes.
> >       This position will be either the starting block number of the
> > next
> >       unprocessed reverse mapping or the next block after the
> > shortest
> >       mapping in the bag.
> > 
> >     - Remove all mappings from the bag that end at this position.
> > 
> >     - Collect all reverse mappings that start at this position from
> > the
> >       btree and put them in the bag.
> > 
> >     - If the size of the bag changed and is greater than one, create
> > a
> >       new refcount record associating the block number range that we
> >       just walked to the size of the bag.
> > 
> > 
> Sure, that looks fine to me

Ok, will commit.
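
If it helps to see the same thing in code, here's a quick userspace
sketch of that sweep.  It's not the kernel implementation -- the real
thing streams records straight out of the rmap btree and merges
adjacent ranges with the same count -- but the output is otherwise
equivalent:

        #include <stdio.h>
        #include <stdlib.h>

        struct mapping { unsigned long long start, len; };
        struct event   { unsigned long long pos; int delta; };

        static int cmp_event(const void *a, const void *b)
        {
                const struct event *x = a, *y = b;

                if (x->pos != y->pos)
                        return x->pos < y->pos ? -1 : 1;
                return 0;
        }

        /* emit a refcount record for every range shared by 2+ mappings */
        static void build_refcounts(const struct mapping *m, size_t n)
        {
                struct event *ev = calloc(2 * n, sizeof(*ev));
                unsigned long long prev = 0;
                int bag = 0;
                size_t i;

                if (!ev)
                        return;
                for (i = 0; i < n; i++) {
                        ev[2 * i] = (struct event){ m[i].start, 1 };
                        ev[2 * i + 1] =
                                (struct event){ m[i].start + m[i].len, -1 };
                }
                qsort(ev, 2 * n, sizeof(*ev), cmp_event);

                for (i = 0; i < 2 * n; i++) {
                        if (bag > 1 && ev[i].pos > prev)
                                printf("refcount: [%llu, %llu) -> %d\n",
                                       prev, ev[i].pos, bag);
                        bag += ev[i].delta;
                        prev = ev[i].pos;
                }
                free(ev);
        }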

> > > Branch link?  Looks like maybe it's missing.  In fact this logic
> > > looks
> > > like it might have been cut off?
> > 
> > OH, heh.  I forgot that we already merged the AGFL repair code.
> > 
> > "See `fs/xfs/scrub/agheader_repair.c
> > <
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tre
> > e/fs/xfs/scrub/agheader_repair.c>`_
> > for more details."
> > 
> > > In any case, maybe give some thought to the highlight link
> > > suggestions.
> > 
> > Er... how do those work?  In principle I like them, but none of your
> > links actually highlighted anything here.  Could you send the link
> > over
> > IRC so that urldefense crapola won't destroy it, please?
> > 
> > --D
> So I think the last we talked about these, we realized they're a chrome
> only format.  That's a shame, I think they really help people to
> quickly navigate the code in question.  Otherwise I'm pretty much just
> poking through the branches looking for code that resembles the
> description.

Yep.  Back in 2020, Google was pushing a "link to text fragment"
proposal wherein they'd add some secret sauce to URL anchors:

#:~:text=[prefix-,]textStart[,textEnd][,-suffix]

Which would inspire web browsers to highlight all instances of "text" in
a document and autoscroll to the first occurrence.  They've since
integrated this into Chrome and persuaded Safari to pick it up, but
there are serious problems with this hack.

https://wicg.github.io/scroll-to-text-fragment/

The first and biggest problem is that none of the prefix characters here
":~:text=" are invalid characters for a url anchor, nor are they ever
invalid for an <a name> tag.  This is valid html:

<a name="dork:~:text=farts">cow</a>

And this is valid link to that html anchor:

file:///tmp/a.html#dork:~:text=farts

Web browsers that are unaware of this extension (Firefox, lynx, w3m,
etc.) will not know to ignore everything starting with ":~:" when
navigating, so they will actually try to find an anchor matching that
name.  That's why it didn't work for me but worked fine for Allison.

This is even worse if the document also contains:

<a name="dork">frogs</a>

Because now the url "file:///tmp/a.html#dork:~:text=farts" jumps to
"cow" on Chrome, and "frogs" on Firefox.

Embrace and extend [with proprietary bullsh*t].  Thanks Google.

> I also poked around and found there was a firefox plugin that does the
> same (link to text fragment addon).  Though it doesn't look like the
> links generated are compatible between the browsers.

No, they are not.

> Maybe something to consider if we have a lot of chrome or ff users.  I
> think if they help facilitate more discussion they're better than
> nothing at least during review.

I'll comb through these documents and add some suggestions of where to
navigate, e.g.

"For more details, see the function xrep_reap."

Simple and readable by anyone, albeit without the convenient mechanical
links.

For more fun reading, apparently terminals now support escape sequences
to inject url links too:
https://github.com/Alhadis/OSC8-Adoption

--D

> > 
> > > Allison
> > > 
> 

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 10/14] xfs: document full filesystem scans for online fsck
  2023-02-16 15:47     ` Allison Henderson
@ 2023-02-16 22:48       ` Darrick J. Wong
  2023-02-25  7:33         ` Allison Henderson
  0 siblings, 1 reply; 220+ messages in thread
From: Darrick J. Wong @ 2023-02-16 22:48 UTC (permalink / raw)
  To: Allison Henderson
  Cc: Catherine Hoang, david, willy, linux-xfs, Chandan Babu,
	linux-fsdevel, hch

On Thu, Feb 16, 2023 at 03:47:20PM +0000, Allison Henderson wrote:
> On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Certain parts of the online fsck code need to scan every file in the
> > entire filesystem.  It is not acceptable to block the entire
> > filesystem
> > while this happens, which means that we need to be clever in allowing
> > scans to coordinate with ongoing filesystem updates.  We also need to
> > hook the filesystem so that regular updates propagate to the staging
> > records.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  .../filesystems/xfs-online-fsck-design.rst         |  677
> > ++++++++++++++++++++
> >  1 file changed, 677 insertions(+)
> > 
> > 
> > diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst
> > b/Documentation/filesystems/xfs-online-fsck-design.rst
> > index a658da8fe4ae..c0f08a773f08 100644
> > --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> > +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> > @@ -3018,3 +3018,680 @@ The proposed patchset is the
> >  `summary counter cleanup
> >  <
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > log/?h=repair-fscounters>`_
> >  series.
> > +
> > +Full Filesystem Scans
> > +---------------------
> > +
> > +Certain types of metadata can only be checked by walking every file
> > in the
> > +entire filesystem to record observations and comparing the
> > observations against
> > +what's recorded on disk.
> > +Like every other type of online repair, repairs are made by writing
> > those
> > +observations to disk in a replacement structure and committing it
> > atomically.
> > +However, it is not practical to shut down the entire filesystem to
> > examine
> > +hundreds of billions of files because the downtime would be
> > excessive.
> > +Therefore, online fsck must build the infrastructure to manage a
> > live scan of
> > +all the files in the filesystem.
> > +There are two questions that need to be solved to perform a live
> > walk:
> > +
> > +- How does scrub manage the scan while it is collecting data?
> > +
> > +- How does the scan keep abreast of changes being made to the system
> > by other
> > +  threads?
> > +
> > +.. _iscan:
> > +
> > +Coordinated Inode Scans
> > +```````````````````````
> > +
> > +In the original Unix filesystems of the 1970s, each directory entry
> > contained
> > +an index number (*inumber*) which was used as an index into on
> > ondisk array
> > +(*itable*) of fixed-size records (*inodes*) describing a file's
> > attributes and
> > +its data block mapping.
> > +This system is described by J. Lions, `"inode (5659)"
> > +<http://www.lemis.com/grog/Documentation/Lions/>`_ in *Lions'
> > Commentary on
> > +UNIX, 6th Edition*, (Dept. of Computer Science, the University of
> > New South
> > +Wales, November 1977), pp. 18-2; and later by D. Ritchie and K.
> > Thompson,
> > +`"Implementation of the File System"
> > +<https://archive.org/details/bstj57-6-1905/page/n8/mode/1up>`_, from
> > *The UNIX
> > +Time-Sharing System*, (The Bell System Technical Journal, July
> > 1978), pp.
> > +1913-4.
> > +
> > +XFS retains most of this design, except now inumbers are search keys
> > over all
> > +the space in the data section filesystem.
> > +They form a continuous keyspace that can be expressed as a 64-bit
> > integer,
> > +though the inodes themselves are sparsely distributed within the
> > keyspace.
> > +Scans proceed in a linear fashion across the inumber keyspace,
> > starting from
> > +``0x0`` and ending at ``0xFFFFFFFFFFFFFFFF``.
> > +Naturally, a scan through a keyspace requires a scan cursor object
> > to track the
> > +scan progress.
> > +Because this keyspace is sparse, this cursor contains two parts.
> > +The first part of this scan cursor object tracks the inode that will
> > be
> > +examined next; call this the examination cursor.
> > +Somewhat less obviously, the scan cursor object must also track
> > which parts of
> > +the keyspace have already been visited, which is critical for
> > deciding if a
> > +concurrent filesystem update needs to be incorporated into the scan
> > data.
> > +Call this the visited inode cursor.
> > +
> > +Advancing the scan cursor is a multi-step process encapsulated in
> > +``xchk_iscan_iter``:
> > +
> > +1. Lock the AGI buffer of the AG containing the inode pointed to by
> > the visited
> > +   inode cursor.
> > +   This guarantees that inodes in this AG cannot be allocated or
> > freed while
> > +   advancing the cursor.
> > +
> > +2. Use the per-AG inode btree to look up the next inumber after the
> > one that
> > +   was just visited, since it may not be keyspace adjacent.
> > +
> > +3. If there are no more inodes left in this AG:
> > +
> > +   a. Move the examination cursor to the point of the inumber
> > keyspace that
> > +      corresponds to the start of the next AG.
> > +
> > +   b. Adjust the visited inode cursor to indicate that it has
> > "visited" the
> > +      last possible inode in the current AG's inode keyspace.
> > +      XFS inumbers are segmented, so the cursor needs to be marked
> > as having
> > +      visited the entire keyspace up to just before the start of the
> > next AG's
> > +      inode keyspace.
> > +
> > +   c. Unlock the AGI and return to step 1 if there are unexamined
> > AGs in the
> > +      filesystem.
> > +
> > +   d. If there are no more AGs to examine, set both cursors to the
> > end of the
> > +      inumber keyspace.
> > +      The scan is now complete.
> > +
> > +4. Otherwise, there is at least one more inode to scan in this AG:
> > +
> > +   a. Move the examination cursor ahead to the next inode marked as
> > allocated
> > +      by the inode btree.
> > +
> > +   b. Adjust the visited inode cursor to point to the inode just
> > prior to where
> > +      the examination cursor is now.
> > +      Because the scanner holds the AGI buffer lock, no inodes could
> > have been
> > +      created in the part of the inode keyspace that the visited
> > inode cursor
> > +      just advanced.
> > +
> > +5. Get the incore inode for the inumber of the examination cursor.
> > +   By maintaining the AGI buffer lock until this point, the scanner
> > knows that
> > +   it was safe to advance the examination cursor across the entire
> > keyspace,
> > +   and that it has stabilized this next inode so that it cannot
> > disappear from
> > +   the filesystem until the scan releases the incore inode.
> > +
> > +6. Drop the AGI lock and return the incore inode to the caller.
> > +
> > +Online fsck functions scan all files in the filesystem as follows:
> > +
> > +1. Start a scan by calling ``xchk_iscan_start``.
> Hmm, I actually did not find xchk_iscan_start in the below branch, I
> found xchk_iscan_iter in "xfs: implement live inode scan for scrub",
> but it doesnt look like anything uses it yet, at least not in that
> branch.

<nod> The topic branch linked below has the implementation, but no
users.  The first user is online quotacheck, which is in the next branch
after that:

https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quotacheck

Specifically, this patch:

https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=repair-quotacheck&id=3640515b9282514d91a407b6aa8d8b73caa123c5

I'll restate what you probably saw in the commit message for this
email discussion:

This "one branch to introduce a new infrastructure and a second branch
to actually use it" pattern is a result of reviewer requests for smaller
more focused branches.  This has turned out to be useful in practice
because it's easier to move just these pieces up and down in the branch
as needed.  The inode scan was originally developed for rmapbt repair
(which comes *much* later) and moved it up once I realized that
quotacheck has far fewer dependencies and hence all of this could come
earlier.

You're right that this section ought to point to an actual user of the
functionality.  Will fix. :)

> Also, it took me a bit to figure out that "initial user" meant "calling
> function"

Er... are you talking about the sentence "...new code is split out as a
separate patch from its initial user" in the patch commit message?

Maybe I should reword that:

"This new code is a separate patch from the patches adding callers for
the sake of enabling the author to move patches around his tree..."

> > +
> > +2. Advance the scan cursor (``xchk_iscan_iter``) to get the next
> > inode.
> > +   If one is provided:
> > +
> > +   a. Lock the inode to prevent updates during the scan.
> > +
> > +   b. Scan the inode.
> > +
> > +   c. While still holding the inode lock, adjust the visited inode
> > cursor
> > +      (``xchk_iscan_mark_visited``) to point to this inode.
> > +
> > +   d. Unlock and release the inode.
> > +
> > +3. Call ``xchk_iscan_finish`` to complete the scan.
> > +
> > +There are subtleties with the inode cache that complicate grabbing
> > the incore
> > +inode for the caller.
> > +Obviously, it is an absolute requirement that the inode metadata be
> > consistent
> > +enough to load it into the inode cache.
> > +Second, if the incore inode is stuck in some intermediate state, the
> > scan
> > +coordinator must release the AGI and push the main filesystem to get
> > the inode
> > +back into a loadable state.
> > +
> > +The proposed patches are the
> > +`inode scanner
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > log/?h=scrub-iscan>`_
> > +series.
> > +
> > +Inode Management
> > +````````````````
> > +
> > +In regular filesystem code, references to allocated XFS incore
> > inodes are
> > +always obtained (``xfs_iget``) outside of transaction context
> > because the
> > +creation of the incore context for ane xisting file does not require
> an existing

Corrected, thank you.

> > metadata
> > +updates.
> > +However, it is important to note that references to incore inodes
> > obtained as
> > +part of file creation must be performed in transaction context
> > because the
> > +filesystem must ensure the atomicity of the ondisk inode btree index
> > updates
> > +and the initialization of the actual ondisk inode.
> > +
> > +References to incore inodes are always released (``xfs_irele``)
> > outside of
> > +transaction context because there are a handful of activities that
> > might
> > +require ondisk updates:
> > +
> > +- The VFS may decide to kick off writeback as part of a
> > ``DONTCACHE`` inode
> > +  release.
> > +
> > +- Speculative preallocations need to be unreserved.
> > +
> > +- An unlinked file may have lost its last reference, in which case
> > the entire
> > +  file must be inactivated, which involves releasing all of its
> > resources in
> > +  the ondisk metadata and freeing the inode.
> > +
> > +These activities are collectively called inode inactivation.
> > +Inactivation has two parts -- the VFS part, which initiates
> > writeback on all
> > +dirty file pages, and the XFS part, which cleans up XFS-specific
> > information
> > +and frees the inode if it was unlinked.
> > +If the inode is unlinked (or unconnected after a file handle
> > operation), the
> > +kernel drops the inode into the inactivation machinery immediately.
> > +
> > +During normal operation, resource acquisition for an update follows
> > this order
> > +to avoid deadlocks:
> > +
> > +1. Inode reference (``iget``).
> > +
> > +2. Filesystem freeze protection, if repairing
> > (``mnt_want_write_file``).
> > +
> > +3. Inode ``IOLOCK`` (VFS ``i_rwsem``) lock to control file IO.
> > +
> > +4. Inode ``MMAPLOCK`` (page cache ``invalidate_lock``) lock for
> > operations that
> > +   can update page cache mappings.
> > +
> > +5. Log feature enablement.
> > +
> > +6. Transaction log space grant.
> > +
> > +7. Space on the data and realtime devices for the transaction.
> > +
> > +8. Incore dquot references, if a file is being repaired.
> > +   Note that they are not locked, merely acquired.
> > +
> > +9. Inode ``ILOCK`` for file metadata updates.
> > +
> > +10. AG header buffer locks / Realtime metadata inode ILOCK.
> > +
> > +11. Realtime metadata buffer locks, if applicable.
> > +
> > +12. Extent mapping btree blocks, if applicable.
> > +
> > +Resources are often released in the reverse order, though this is
> > not required.
> > +However, online fsck differs from regular XFS operations because it
> > may examine
> > +an object that normally is acquired in a later stage of the locking
> > order, and
> > +then decide to cross-reference the object with an object that is
> > acquired
> > +earlier in the order.
> > +The next few sections detail the specific ways in which online fsck
> > takes care
> > +to avoid deadlocks.
> > +
> > +iget and irele During a Scrub
> > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > +
> > +An inode scan performed on behalf of a scrub operation runs in
> > transaction
> > +context, and possibly with resources already locked and bound to it.
> > +This isn't much of a problem for ``iget`` since it can operate in
> > the context
> > +of an existing transaction, as long as all of the bound resources
> > are acquired
> > +before the inode reference in the regular filesystem.
> > +
> > +When the VFS ``iput`` function is given a linked inode with no other
> > +references, it normally puts the inode on an LRU list in the hope
> > that it can
> > +save time if another process re-opens the file before the system
> > runs out
> > +of memory and frees it.
> > +Filesystem callers can short-circuit the LRU process by setting a
> > ``DONTCACHE``
> > +flag on the inode to cause the kernel to try to drop the inode into
> > the
> > +inactivation machinery immediately.
> > +
> > +In the past, inactivation was always done from the process that
> > dropped the
> > +inode, which was a problem for scrub because scrub may already hold
> > a
> > +transaction, and XFS does not support nesting transactions.
> > +On the other hand, if there is no scrub transaction, it is desirable
> > to drop
> > +otherwise unused inodes immediately to avoid polluting caches.
> > +To capture these nuances, the online fsck code has a separate
> > ``xchk_irele``
> > +function to set or clear the ``DONTCACHE`` flag to get the required
> > release
> > +behavior.
> > +
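> > +A simplified sketch of that helper (the real logic has a few more
> > +cases deciding when to set or clear the flag)::
> > +
> > +        void xchk_irele(struct xfs_scrub *sc, struct xfs_inode *ip)
> > +        {
> > +                if (sc->tp) {
> > +                        /*
> > +                         * A transaction is bound to the scrub context,
> > +                         * so releasing this inode must not trigger
> > +                         * inactivation; keep it on the LRU by clearing
> > +                         * DONTCACHE.
> > +                         */
> > +                        spin_lock(&VFS_I(ip)->i_lock);
> > +                        VFS_I(ip)->i_state &= ~I_DONTCACHE;
> > +                        spin_unlock(&VFS_I(ip)->i_lock);
> > +                } else {
> > +                        /* no transaction; drop unused inodes eagerly */
> > +                        d_mark_dontcache(VFS_I(ip));
> > +                }
> > +                xfs_irele(ip);
> > +        }
> > +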
> > +Proposed patchsets include fixing
> > +`scrub iget usage
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > log/?h=scrub-iget-fixes>`_ and
> > +`dir iget usage
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > log/?h=scrub-dir-iget-fixes>`_.
> > +
> > +Locking Inodes
> > +^^^^^^^^^^^^^^
> > +
> > +In regular filesystem code, the VFS and XFS will acquire multiple
> > IOLOCK locks
> > +in a well-known order: parent → child when updating the directory
> > tree, and
> > +``struct inode`` address order otherwise.
> > +For regular files, the MMAPLOCK can be acquired after the IOLOCK to
> > stop page
> > +faults.
> > +If two MMAPLOCKs must be acquired, they are acquired in 
> 
> 
> > ``struct
> > +address_space`` order.
> the order of their memory address
> 
> ?

Urghg.  I think I need to clarify this more:

"...they are acquired in numerical order of the addresses of their
``struct address_space`` objects."

See filemap_invalidate_lock_two.
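
For reference, that helper is roughly:

        if (mapping1 > mapping2)
                swap(mapping1, mapping2);
        if (mapping1)
                down_write(&mapping1->invalidate_lock);
        if (mapping2 && mapping1 != mapping2)
                down_write_nested(&mapping2->invalidate_lock, 1);

i.e. the two address_space objects get sorted by pointer value before
their invalidate_lock rwsems are taken.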

> > +Due to the structure of existing filesystem code, IOLOCKs and
> > MMAPLOCKs must be
> > +acquired before transactions are allocated.
> > +If two ILOCKs must be acquired, they are acquired in inumber order.
> > +
> > +Inode lock acquisition must be done carefully during a coordinated
> > inode scan.
> > +Online fsck cannot abide these conventions, because for a directory
> > tree
> > +scanner, the scrub process holds the IOLOCK of the file being
> > scanned and it
> > +needs to take the IOLOCK of the file at the other end of the
> > directory link.
> > +If the directory tree is corrupt because it contains a cycle,
> > ``xfs_scrub``
> > +cannot use the regular inode locking functions and avoid becoming
> > trapped in an
> > +ABBA deadlock.
> > +
> > +Solving both of these problems is straightforward -- any time online
> > fsck
> > +needs to take a second lock of the same class, it uses trylock to
> > avoid an ABBA
> > +deadlock.
> > +If the trylock fails, scrub drops all inode locks and uses trylock
> > loops to
> > +(re)acquire all necessary resources.
> > +Trylock loops enable scrub to check for pending fatal signals, which
> > is how
> > +scrub avoids deadlocking the filesystem or becoming an unresponsive
> > process.
> > +However, trylock loops mean that online fsck must be prepared to
> > measure the
> > +resource being scrubbed before and after the lock cycle to detect
> > changes and
> > +react accordingly.
> > +
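> > +The shape of such a loop, sketched (the backoff and the error code are
> > +illustrative)::
> > +
> > +        while (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL)) {
> > +                if (fatal_signal_pending(current))
> > +                        return -EINTR;
> > +                /* back off so the lock holder can make progress */
> > +                delay(1);
> > +        }
> > +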
> > +.. _dirparent:
> > +
> > +Case Study: Finding a Directory Parent
> > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > +
> > +Consider the directory parent pointer repair code as an example.
> > +Online fsck must verify that the dotdot dirent of a directory points
> > up to a
> > +parent directory, and that the parent directory contains exactly one
> > dirent
> > +pointing down to the child directory.
> > +Fully validating this relationship (and repairing it if possible)
> > requires a
> > +walk of every directory on the filesystem while holding the child
> > locked, and
> > +while updates to the directory tree are being made.
> > +The coordinated inode scan provides a way to walk the filesystem
> > without the
> > +possibility of missing an inode.
> > +The child directory is kept locked to prevent updates to the dotdot
> > dirent, but
> > +if the scanner fails to lock a parent, it can drop and relock both
> > the child
> > +and the prospective parent.
> > +If the dotdot entry changes while the directory is unlocked, then a
> > move or
> > +rename operation must have changed the child's parentage, and the
> > scan can
> > +exit early.
> > +
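> > +For example, after cycling the locks, the scanner can re-read the
> > +dotdot entry and end the walk early if the parentage moved; as a
> > +sketch, with ``dp`` as the candidate parent and error handling
> > +trimmed::
> > +
> > +        error = xfs_dir_lookup(sc->tp, sc->ip, &xfs_name_dotdot,
> > +                        &dotdot_ino, NULL);
> > +        if (error)
> > +                return error;
> > +        if (dotdot_ino != dp->i_ino)
> > +                return -EAGAIN; /* renamed out from under us */
> > +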
> > +The proposed patchset is the
> > +`directory repair
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > log/?h=repair-dirs>`_
> > +series.
> > +
> > +.. _fshooks:
> > +
> > +Filesystem Hooks
> > +`````````````````
> > +
> > +The second piece of support that online fsck functions need during a
> > full
> > +filesystem scan is the ability to stay informed about updates being
> > made by
> > +other threads in the filesystem, since comparisons against the past
> > are useless
> > +in a dynamic environment.
> > +Two pieces of Linux kernel infrastructure enable online fsck to
> > monitor regular
> > +filesystem operations: filesystem hooks and :ref:`static
> > keys<jump_labels>`.
> > +
> > +Filesystem hooks convey information about an ongoing filesystem
> > operation to
> > +a downstream consumer.
> > +In this case, the downstream consumer is always an online fsck
> > function.
> > +Because multiple fsck functions can run in parallel, online fsck
> > uses the Linux
> > +notifier call chain facility to dispatch updates to any number of
> > interested
> > +fsck processes.
> > +Call chains are a dynamic list, which means that they can be
> > configured at
> > +run time.
> > +Because these hooks are private to the XFS module, the information
> > passed along
> > +contains exactly what the checking function needs to update its
> > observations.
> > +
> > +The current implementation of XFS hooks uses SRCU notifier chains to
> > reduce the
> > +impact to highly threaded workloads.
> > +Regular blocking notifier chains use a rwsem and seem to have a much
> > lower
> > +overhead for single-threaded applications.
> > +However, it may turn out that the combination of blocking chains and
> > +static keys is more performant; more study is needed here.
> > +
> > +The following pieces are necessary to hook a certain point in the
> > filesystem:
> > +
> > +- A ``struct xfs_hooks`` object must be embedded in a convenient
> > place such as
> > +  a well-known incore filesystem object.
> > +
> > +- Each hook must define an action code and a structure containing
> > more context
> > +  about the action.
> > +
> > +- Hook providers should provide appropriate wrapper functions and
> > structs
> > +  around the ``xfs_hooks`` and ``xfs_hook`` objects to take
> > advantage of type
> > +  checking to ensure correct usage.
> > +
> > +- A callsite in the regular filesystem code must be chosen to call
> > +  ``xfs_hooks_call`` with the action code and data structure.
> > +  This place should be adjacent to (and not earlier than) the place
> > where
> > +  the filesystem update is committed to the transaction.
> > +  In general, when the filesystem calls a hook chain, it should be
> > able to
> > +  handle sleeping and should not be vulnerable to memory reclaim or
> > locking
> > +  recursion.
> > +  However, the exact requirements are very dependent on the context
> > of the hook
> > +  caller and the callee.
> > +
> > +- The online fsck function should define a structure to hold scan
> > data, a lock
> > +  to coordinate access to the scan data, and a ``struct xfs_hook``
> > object.
> > +  The scanner function and the regular filesystem code must acquire
> > resources
> > +  in the same order; see the next section for details.
> > +
> > +- The online fsck code must contain a C function to catch the hook
> > action code
> > +  and data structure.
> > +  If the object being updated has already been visited by the scan,
> > then the
> > +  hook information must be applied to the scan data.
> > +
> > +- Prior to unlocking inodes to start the scan, online fsck must call
> > +  ``xfs_hooks_setup`` to initialize the ``struct xfs_hook``, and
> > +  ``xfs_hooks_add`` to enable the hook.
> > +
> > +- Online fsck must call ``xfs_hooks_del`` to disable the hook once
> > the scan is
> > +  complete.
> > +
> > +The number of hooks should be kept to a minimum to reduce
> > complexity.
> > +Static keys are used to reduce the overhead of filesystem hooks to
> > nearly
> > +zero when online fsck is not running.
> > +
> > +.. _liveupdate:
> > +
> > +Live Updates During a Scan
> > +``````````````````````````
> > +
> > +The code paths of the online fsck scanning code and the
> > :ref:`hooked<fshooks>`
> > +filesystem code look like this::
> > +
> > +            other program
> > +                  ↓
> > +            inode lock ←────────────────────┐
> > +                  ↓                         │
> > +            AG header lock                  │
> > +                  ↓                         │
> > +            filesystem function             │
> > +                  ↓                         │
> > +            notifier call chain             │    same
> > +                  ↓                         ├─── inode
> > +            scrub hook function             │    lock
> > +                  ↓                         │
> > +            scan data mutex ←──┐    same    │
> > +                  ↓            ├─── scan    │
> > +            update scan data   │    lock    │
> > +                  ↑            │            │
> > +            scan data mutex ←──┘            │
> > +                  ↑                         │
> > +            inode lock ←────────────────────┘
> > +                  ↑
> > +            scrub function
> > +                  ↑
> > +            inode scanner
> > +                  ↑
> > +            xfs_scrub
> > +
> > +These rules must be followed to ensure correct interactions between
> > the
> > +checking code and the code making an update to the filesystem:
> > +
> > +- Prior to invoking the notifier call chain, the filesystem function
> > being
> > +  hooked must acquire the same lock that the scrub scanning function
> > acquires
> > +  to scan the inode.
> > +
> > +- The scanning function and the scrub hook function must coordinate
> > access to
> > +  the scan data by acquiring a lock on the scan data.
> > +
> > +- The scrub hook function must not add the live update information to
> > the scan
> > +  observations unless the inode being updated has already been
> > scanned.
> > +  The scan coordinator has a helper predicate
> > (``xchk_iscan_want_live_update``)
> > +  for this.
> > +
> > +- Scrub hook functions must not change the caller's state, including
> > the
> > +  transaction that it is running.
> > +  They must not acquire any resources that might conflict with the
> > filesystem
> > +  function being hooked.
> > +
> > +- The hook function can abort the inode scan to avoid breaking the
> > other rules.
> > +
> > +The inode scan APIs are pretty simple:
> > +
> > +- ``xchk_iscan_start`` starts a scan
> > +
> > +- ``xchk_iscan_iter`` grabs a reference to the next inode in the
> > scan or
> > +  returns zero if there is nothing left to scan
> > +
> > +- ``xchk_iscan_want_live_update`` to decide if an inode has already
> > been
> > +  visited in the scan.
> > +  This is critical for hook functions to decide if they need to
> > update the
> > +  in-memory scan information.
> > +
> > +- ``xchk_iscan_mark_visited`` to mark an inode as having been
> > visited in the
> > +  scan
> > +
> > +- ``xchk_iscan_finish`` to finish the scan
> > +
> > +The proposed patches are at the start of the
> > +`online quotacheck
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > log/?h=repair-quota>`_
> > +series.
> Wrong link?  This looks like it goes to the section below.

Oops.  This one should link to scrub-iscan, and the next one should link
to repair-quotacheck.

> > +
> > +.. _quotacheck:
> > +
> > +Case Study: Quota Counter Checking
> > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > +
> > +It is useful to compare the mount time quotacheck code to the online
> > repair
> > +quotacheck code.
> > +Mount time quotacheck does not have to contend with concurrent
> > operations, so
> > +it does the following:
> > +
> > +1. Make sure the ondisk dquots are in good enough shape that all the
> > incore
> > +   dquots will actually load, and zero the resource usage counters
> > in the
> > +   ondisk buffer.
> > +
> > +2. Walk every inode in the filesystem.
> > +   Add each file's resource usage to the incore dquot.
> > +
> > +3. Walk each incore dquot.
> > +   If the incore dquot is not being flushed, add the ondisk buffer
> > backing the
> > +   incore dquot to a delayed write (delwri) list.
> > +
> > +4. Write the buffer list to disk.
> > +
> > +Like most online fsck functions, online quotacheck can't write to
> > regular
> > +filesystem objects until the newly collected metadata reflect all
> > filesystem
> > +state.
> > +Therefore, online quotacheck records file resource usage to a shadow
> > dquot
> > +index implemented with a sparse ``xfarray``, and only writes to the
> > real dquots
> > +once the scan is complete.
> > +Handling transactional updates is tricky because quota resource
> > usage updates
> > +are handled in phases to minimize contention on dquots:
> > +
> > +1. The inodes involved are joined and locked to a transaction.
> > +
> > +2. For each dquot attached to the file:
> > +
> > +   a. The dquot is locked.
> > +
> > +   b. A quota reservation is added to the dquot's resource usage.
> > +      The reservation is recorded in the transaction.
> > +
> > +   c. The dquot is unlocked.
> > +
> > +3. Changes in actual quota usage are tracked in the transaction.
> > +
> > +4. At transaction commit time, each dquot is examined again:
> > +
> > +   a. The dquot is locked again.
> > +
> > +   b. Quota usage changes are logged and unused reservation is given
> > back to
> > +      the dquot.
> > +
> > +   c. The dquot is unlocked.
> > +
> > +For online quotacheck, hooks are placed in steps 2 and 4.
> > +The step 2 hook creates a shadow version of the transaction dquot
> > context
> > +(``dqtrx``) that operates in a similar manner to the regular code.
> > +The step 4 hook commits the shadow ``dqtrx`` changes to the shadow
> > dquots.
> > +Notice that both hooks are called with the inode locked, which is
> > how the
> > +live update coordinates with the inode scanner.
> > +
> > +The quotacheck scan looks like this:
> > +
> > +1. Set up a coordinated inode scan.
> > +
> > +2. For each inode returned by the inode scan iterator:
> > +
> > +   a. Grab and lock the inode.
> > +
> > +   b. Determine that inode's resource usage (data blocks, inode
> > counts,
> > +      realtime blocks) 
> nit: move this list to the first appearance of "resource usage".  Step
> 2 of the first list I think

I don't understand this proposed change.  Are you talking about "2. For
each dquot attached to the file:" above?  That list describes the steps
taken by regular code wanting to allocate file space that's accounted to
quotas.  This list describes what online quotacheck does.  The two don't
mix.

> > and add that to the shadow dquots for the user, group,
> > +      and project ids associated with the inode.
> > +
> > +   c. Unlock and release the inode.
> > +
> > +3. For each dquot in the system:
> > +
> > +   a. Grab and lock the dquot.
> > +
> > +   b. Check the dquot against the shadow dquots created by the scan
> > and updated
> > +      by the live hooks.
> > +
> > +Live updates are key to being able to walk every quota record
> > without
> > +needing to hold any locks for a long duration.
> > +If repairs are desired, the real and shadow dquots are locked and
> > their
> > +resource counts are set to the values in the shadow dquot.
> > +
> > +The proposed patchset is the
> > +`online quotacheck
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > log/?h=repair-quota>`_

Changed from repair-quota to repair-quotacheck.

> > +series.
> > +
> > +.. _nlinks:
> > +
> > +Case Study: File Link Count Checking
> > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > +
> > +File link count checking also uses live update hooks.
> > +The coordinated inode scanner is used to visit all directories on
> > the
> > +filesystem, and per-file link count records are stored in a sparse
> > ``xfarray``
> > +indexed by inumber.
> > +During the scanning phase, each entry in a directory generates
> > observation
> > +data as follows:
> > +
> > +1. If the entry is a dotdot (``'..'``) entry of the root directory,
> > the
> > +   directory's parent link count is bumped because the root
> > directory's dotdot
> > +   entry is self referential.
> > +
> > +2. If the entry is a dotdot entry of a subdirectory, the parent's
> > backref
> > +   count is bumped.
> > +
> > +3. If the entry is neither a dot nor a dotdot entry, the target
> > file's parent
> > +   count is bumped.
> > +
> > +4. If the target is a subdirectory, the parent's child link count is
> > bumped.
> > +
> > +A crucial point to understand about how the link count inode scanner
> > interacts
> > +with the live update hooks is that the scan cursor tracks which
> > *parent*
> > +directories have been scanned.
> > +In other words, the live updates ignore any update about ``A → B``
> > when A has
> > +not been scanned, even if B has been scanned.
> > +Furthermore, a subdirectory A with a dotdot entry pointing back to B
> > is
> > +accounted as a backref counter in the shadow data for A, since child
> > dotdot
> > +entries affect the parent's link count.
> > +Live update hooks are carefully placed in all parts of the
> > filesystem that
> > +create, change, or remove directory entries, since those operations
> > involve
> > +bumplink and droplink.
> > +
> > +For any file, the correct link count is the number of parents plus
> > the number
> > +of child subdirectories.
> > +Non-directories never have children of any kind.
> > +The backref information is used to detect inconsistencies in the
> > number of
> > +links pointing to child subdirectories and the number of dotdot
> > entries
> > +pointing back.
> > +
> > +After the scan completes, the link count of each file can be checked
> > by locking
> > +both the inode and the shadow data, and comparing the link counts.
> > +A second coordinated inode scan cursor is used for comparisons.
> > +Live updates are key to being able to walk every inode without
> > needing to hold
> > +any locks between inodes.
> > +If repairs are desired, the inode's link count is set to the value
> > in the
> > +shadow information.
> > +If no parents are found, the file must be :ref:`reparented
> > <orphanage>` to the
> > +orphanage to prevent the file from being lost forever.
> > +
> > +The proposed patchset is the
> > +`file link count repair
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > log/?h=scrub-nlinks>`_
> > +series.
> > +
> > +.. _rmap_repair:
> > +
> > +Case Study: Rebuilding Reverse Mapping Records
> > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > +
> > +Most repair functions follow the same pattern: lock filesystem
> > resources,
> > +walk the surviving ondisk metadata looking for replacement metadata
> > records,
> > +and use an :ref:`in-memory array <xfarray>` to store the gathered
> > observations.
> > +The primary advantage of this approach is the simplicity and
> > modularity of the
> > +repair code -- code and data are entirely contained within the scrub
> > module,
> > +do not require hooks in the main filesystem, and are usually the
> > most efficient
> > +in memory use.
> > +A secondary advantage of this repair approach is atomicity -- once
> > the kernel
> > +decides a structure is corrupt, no other threads can access the
> > metadata until
> > +the kernel finishes repairing and revalidating the metadata.
> > +
> > +For repairs going on within a shard of the filesystem, these
> > advantages
> > +outweigh the delays inherent in locking the shard while repairing
> > parts of the
> > +shard.
> > +Unfortunately, repairs to the reverse mapping btree cannot use the
> > "standard"
> > +btree repair strategy because it must scan every space mapping of
> > every fork of
> > +every file in the filesystem, and the filesystem cannot stop.
> > +Therefore, rmap repair foregoes atomicity between scrub and repair.
> > +It combines a :ref:`coordinated inode scanner <iscan>`, :ref:`live
> > update hooks
> > +<liveupdate>`, and an :ref:`in-memory rmap btree <xfbtree>` to
> > complete the
> > +scan for reverse mapping records.
> > +
> > +1. Set up an xfbtree to stage rmap records.
> > +
> > +2. While holding the locks on the AGI and AGF buffers acquired
> > during the
> > +   scrub, generate reverse mappings for all AG metadata: inodes,
> > btrees, CoW
> > +   staging extents, and the internal log.
> > +
> > +3. Set up an inode scanner.
> > +
> > +4. Hook into rmap updates for the AG being repaired so that the live
> > scan data
> > +   can receive updates to the rmap btree from the rest of the
> > filesystem during
> > +   the file scan.
> > +
> > +5. For each space mapping found in either fork of each file scanned,
> > +   decide if the mapping matches the AG of interest.
> > +   If so:
> > +
> > +   a. Create a btree cursor for the in-memory btree.
> > +
> > +   b. Use the rmap code to add the record to the in-memory btree.
> > +
> > +   c. Use the :ref:`special commit function <xfbtree_commit>` to
> > write the
> > +      xfbtree changes to the xfile.
> > +
> > +6. For each live update received via the hook, decide if the owner
> > has already
> > +   been scanned.
> > +   If so, apply the live update into the scan data:
> > +
> > +   a. Create a btree cursor for the in-memory btree.
> > +
> > +   b. Replay the operation into the in-memory btree.
> > +
> > +   c. Use the :ref:`special commit function <xfbtree_commit>` to
> > write the
> > +      xfbtree changes to the xfile.
> > +      This is performed with an empty transaction to avoid changing
> > the
> > +      caller's state.
> > +
> > +7. When the inode scan finishes, create a new scrub transaction and
> > relock the
> > +   two AG headers.
> > +
> > +8. Compute the new btree geometry using the number of rmap records
> > in the
> > +   shadow btree, like all other btree rebuilding functions.
> > +
> > +9. Allocate the number of blocks computed in the previous step.
> > +
> > +10. Perform the usual btree bulk loading and commit to install the
> > new rmap
> > +    btree.
> > +
> > +11. Reap the old rmap btree blocks as discussed in the case study
> > about how
> > +    to :ref:`reap after rmap btree repair <rmap_reap>`.
> > +
> > +12. Free the xfbtree now that it is not needed.
> > +
> > +The proposed patchset is the
> > +`rmap repair
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > log/?h=repair-rmap-btree>`_
> > +series.
> > 
> 
> Mostly looks good, nits aside.  I do sort of wonder if this patch would
> do better to appear before patch 6 (or move 6 down), since it gets into
> more challenges concerning locks and hooks, whereas here we are mostly
> discussing what they are and how they work.  So it might build better
> to move this patch up a little.

(I might be a tad confused here, bear with me.)

Patch 6, the section about eventual consistency?

Hmm.  The intent drains exist to quiesce intent chains targeting
specific AGs.  It briefly mentions "fshooks" in the context of using
jump labels to avoid the overhead of calling notify_all on the drain
waitqueue when scrub isn't running.  That's perhaps bad naming on my
part, since the other "fshooks" are jump labels to avoid bouncing
through the notifier chain code when scrub isn't running.  The jump
labels themselves are not hooks, they're structured dynamic code
patching.
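
To illustrate, the gate pattern is nothing fancier than this -- hypothetical
names, not the actual scrub code:

#include <linux/jump_label.h>
#include <linux/wait.h>

/* Hypothetical names -- not the actual scrub code. */
DEFINE_STATIC_KEY_FALSE(xchk_example_gate);

struct example_drain {
        wait_queue_head_t       waitq;
};

/* Hot path: compiles to a no-op branch until the key is enabled. */
static inline void example_drain_notify(struct example_drain *dr)
{
        if (static_branch_unlikely(&xchk_example_gate))
                wake_up_all(&dr->waitq);
}

/* Scrub setup/teardown flip the branch with runtime code patching. */
void example_gate_enable(void)
{
        static_branch_inc(&xchk_example_gate);
}

void example_gate_disable(void)
{
        static_branch_dec(&xchk_example_gate);
}

When the key is off, the notify call costs a single nop in the hot path;
enabling the key patches the branch back in.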

I probably should've named those something else.  fsgates?

Or maybe you were talking specifically about "Case Study: Rebuilding
Reverse Mapping Records"?  In which case I remark that the case study
needs both the intent drains to quiesce the AG and the live scans to
work properly, which is why the case study of it couldn't come earlier.
The intent drains section still ought to come before the refcountbt
section, because it's the refcountbt scrubber that first hit the
coordination problem.

Things are getting pretty awkward like this because there are sooo many
interdependent pieces. :(

Regardless, thank you very much for slogging through.

--D

> Allison
> 

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 07/14] xfs: document pageable kernel memory
  2023-02-09 23:14           ` Darrick J. Wong
@ 2023-02-25  7:32             ` Allison Henderson
  0 siblings, 0 replies; 220+ messages in thread
From: Allison Henderson @ 2023-02-25  7:32 UTC (permalink / raw)
  To: djwong
  Cc: Catherine Hoang, david, willy, linux-xfs, Chandan Babu,
	linux-fsdevel, hch

On Thu, 2023-02-09 at 15:14 -0800, Darrick J. Wong wrote:
> On Thu, Feb 09, 2023 at 05:41:22AM +0000, Allison Henderson wrote:
> > On Thu, 2023-02-02 at 15:14 -0800, Darrick J. Wong wrote:
> > > On Thu, Feb 02, 2023 at 07:14:22AM +0000, Allison Henderson
> > > wrote:
> > > > On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote:
> > > > > From: Darrick J. Wong <djwong@kernel.org>
> > > > > 
> > > > > Add a discussion of pageable kernel memory, since online fsck
> > > > > needs
> > > > > quite a bit more memory than most other parts of the
> > > > > filesystem
> > > > > to
> > > > > stage
> > > > > records and other information.
> > > > > 
> > > > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > > > ---
> > > > >  .../filesystems/xfs-online-fsck-design.rst         |  490
> > > > > ++++++++++++++++++++
> > > > >  1 file changed, 490 insertions(+)
> > > > > 
> > > > > 
> > > > > diff --git a/Documentation/filesystems/xfs-online-fsck-
> > > > > design.rst
> > > > > b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > > > index 419eb54ee200..9d7a2ef1d0dd 100644
> > > > > --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> > > > > +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > > > @@ -383,6 +383,8 @@ Algorithms") of Srinivasan.
> > > > >  However, any data structure builder that maintains a
> > > > > resource
> > > > > lock
> > > > > for the
> > > > >  duration of the repair is *always* an offline algorithm.
> > > > >  
> > > > > +.. _secondary_metadata:
> > > > > +
> > > > >  Secondary Metadata
> > > > >  ``````````````````
> > > > >  
> > > > > @@ -1746,3 +1748,491 @@ Scrub teardown disables all static
> > > > > keys
> > > > > obtained by ``xchk_fshooks_enable``.
> > > > >  
> > > > >  For more information, please see the kernel documentation of
> > > > >  Documentation/staging/static-keys.rst.
> > > > > +
> > > > > +.. _xfile:
> > > > > +
> > > > > +Pageable Kernel Memory
> > > > > +----------------------
> > > > > +
> > > > > +Demonstrations of the first few prototypes of online repair
> > > > > revealed
> > > > > new
> > > > > +technical requirements that were not originally identified.
> > > > > +For the first demonstration, the code walked whatever
> > > > > filesystem
> > > > > +metadata it needed to synthesize new records and inserted
> > > > > records
> > > > > into a new
> > > > > +btree as it found them.
> > > > > +This was subpar since any additional corruption or runtime
> > > > > errors
> > > > > encountered
> > > > > +during the walk would shut down the filesystem.
> > > > > +After remount, the blocks containing the half-rebuilt data
> > > > > structure
> > > > > would not
> > > > > +be accessible until another repair was attempted.
> > > > > +Solving the problem of half-rebuilt data structures will be
> > > > > discussed in the
> > > > > +next section.
> > > > > +
> > > > > +For the second demonstration, the synthesized records were
> > > > > instead
> > > > > stored in
> > > > > +kernel slab memory.
> > > > > +Doing so enabled online repair to abort without writing to
> > > > > the
> > > > > filesystem if
> > > > > +the metadata walk failed, which prevented online fsck from
> > > > > making
> > > > > things worse.
> > > > > +However, even this approach needed improving upon.
> > > > > +
> > > > > +There are four reasons why traditional Linux kernel memory
> > > > > management isn't
> > > > > +suitable for storing large datasets:
> > > > > +
> > > > > +1. Although it is tempting to allocate a contiguous block of
> > > > > memory
> > > > > to create a
> > > > > +   C array, this cannot easily be done in the kernel because
> > > > > it
> > > > > cannot be
> > > > > +   relied upon to allocate multiple contiguous memory pages.
> > > > > +
> > > > > +2. While disparate physical pages can be virtually mapped
> > > > > together,
> > > > > installed
> > > > > +   memory might still not be large enough to stage the
> > > > > entire
> > > > > record
> > > > > set in
> > > > > +   memory while constructing a new btree.
> > > > > +
> > > > > +3. To overcome these two difficulties, the implementation
> > > > > was
> > > > > adjusted to use
> > > > > +   doubly linked lists, which means every record object
> > > > > needed
> > > > > two
> > > > > 64-bit list
> > > > > +   head pointers, which is a lot of overhead.
> > > > > +
> > > > > +4. Kernel memory is pinned, which can drive the system out
> > > > > of
> > > > > memory, leading
> > > > > +   to OOM kills of unrelated processes.
> > > > > +
> > > > I think I maybe might just jump to whatever the current plan
> > > > is
> > > > instead of trying to keep a record of the dev history in the
> > > > document.
> > > > I'm sure we're not done yet, dev really never is, so in order
> > > > for
> > > > the
> > > > documentation to be maintained, it would just get bigger and
> > > > bigger
> > > > to
> > > > keep documenting it this way.  It's not that the above isn't
> > > > valuable,
> > > > but maybe a different kind of document really.
> > > 
> > > OK, I've shortened this introduction to outline the requirements,
> > > and
> > > trimmed the historical information to a sidebar:
> > > 
> > > "Some online checking functions work by scanning the filesystem
> > > to
> > > build
> > > a shadow copy of an ondisk metadata structure in memory and
> > > comparing
> > > the two copies. For online repair to rebuild a metadata
> > > structure, it
> > > must compute the record set that will be stored in the new
> > > structure
> > > before it can persist that new structure to disk. Ideally,
> > > repairs
> > > complete with a single atomic commit that introduces a new data
> > > structure. To meet these goals, the kernel needs to collect a
> > > large
> > > amount of information in a place that doesn’t require the correct
> > > operation of the filesystem.
> > > 
> > > "Kernel memory isn’t suitable because:
> > > 
> > > *   Allocating a contiguous region of memory to create a C array
> > > is
> > > very
> > >     difficult, especially on 32-bit systems.
> > > 
> > > *   Linked lists of records introduce double pointer overhead
> > > which
> > > is
> > >     very high and eliminate the possibility of indexed lookups.
> > > 
> > > *   Kernel memory is pinned, which can drive the system into OOM
> > >     conditions.
> > > 
> > > *   The system might not have sufficient memory to stage all the
> > >     information.
> > > 
> > > "At any given time, online fsck does not need to keep the entire
> > > record
> > > set in memory, which means that individual records can be paged
> > > out
> > > if
> > > necessary. Continued development of online fsck demonstrated that
> > > the
> > > ability to perform indexed data storage would also be very
> > > useful.
> > > Fortunately, the Linux kernel already has a facility for
> > > byte-addressable and pageable storage: tmpfs. In-kernel graphics
> > > drivers
> > > (most notably i915) take advantage of tmpfs files to store
> > > intermediate
> > > data that doesn’t need to be in memory at all times, so that
> > > usage
> > > precedent is already established. Hence, the xfile was born!
> > > 
> > > Historical Sidebar
> > > ------------------
> > > 
> > > "The first edition of online repair inserted records into a new
> > > btree
> > > as
> > > it found them, which failed because the filesystem could shut down
> > > with a
> > > half-built data structure, which would be live after recovery
> > > finished.
> > > 
> > > "The second edition solved the half-rebuilt structure problem by
> > > storing
> > > everything in memory, but frequently ran the system out of
> > > memory.
> > > 
> > > "The third edition solved the OOM problem by using linked lists,
> > > but
> > > the
> > > list overhead was extreme."
> > Ok, I think that's cleaner
> > 
> > > 
> > > > 
> > > > 
> > > > > +For the third iteration, attention swung back to the
> > > > > possibility
> > > > > of
> > > > > using
> > > > 
> > > > Due to the large volume of metadata that needs to be processed,
> > > > ofsck
> > > > uses...
> > > > 
> > > > > +byte-indexed array-like storage to reduce the overhead of
> > > > > in-
> > > > > memory
> > > > > records.
> > > > > +At any given time, online repair does not need to keep the
> > > > > entire
> > > > > record set in
> > > > > +memory, which means that individual records can be paged
> > > > > out.
> > > > > +Creating new temporary files in the XFS filesystem to store
> > > > > intermediate data
> > > > > +was explored and rejected for some types of repairs because
> > > > > a
> > > > > filesystem with
> > > > > +compromised space and inode metadata should never be used to
> > > > > fix
> > > > > compromised
> > > > > +space or inode metadata.
> > > > > +However, the kernel already has a facility for byte-
> > > > > addressable
> > > > > and
> > > > > pageable
> > > > > +storage: shmfs.
> > > > > +In-kernel graphics drivers (most notably i915) take
> > > > > advantage of
> > > > > shmfs files
> > > > > +to store intermediate data that doesn't need to be in memory
> > > > > at
> > > > > all
> > > > > times, so
> > > > > +that usage precedent is already established.
> > > > > +Hence, the ``xfile`` was born!
> > > > > +
> > > > > +xfile Access Models
> > > > > +```````````````````
> > > > > +
> > > > > +A survey of the intended uses of xfiles suggested these use
> > > > > cases:
> > > > > +
> > > > > +1. Arrays of fixed-sized records (space management btrees,
> > > > > directory
> > > > > and
> > > > > +   extended attribute entries)
> > > > > +
> > > > > +2. Sparse arrays of fixed-sized records (quotas and link
> > > > > counts)
> > > > > +
> > > > > +3. Large binary objects (BLOBs) of variable sizes (directory
> > > > > and
> > > > > extended
> > > > > +   attribute names and values)
> > > > > +
> > > > > +4. Staging btrees in memory (reverse mapping btrees)
> > > > > +
> > > > > +5. Arbitrary contents (realtime space management)
> > > > > +
> > > > > +To support the first four use cases, high level data
> > > > > structures
> > > > > wrap
> > > > > the xfile
> > > > > +to share functionality between online fsck functions.
> > > > > +The rest of this section discusses the interfaces that the
> > > > > xfile
> > > > > presents to
> > > > > +four of those five higher level data structures.
> > > > > +The fifth use case is discussed in the :ref:`realtime
> > > > > summary
> > > > > <rtsummary>` case
> > > > > +study.
> > > > > +
> > > > > +The most general storage interface supported by the xfile
> > > > > enables
> > > > > the reading
> > > > > +and writing of arbitrary quantities of data at arbitrary
> > > > > offsets
> > > > > in
> > > > > the xfile.
> > > > > +This capability is provided by ``xfile_pread`` and
> > > > > ``xfile_pwrite``
> > > > > functions,
> > > > > +which behave similarly to their userspace counterparts.
> > > > > +XFS is very record-based, which suggests that the ability to
> > > > > load
> > > > > and store
> > > > > +complete records is important.
> > > > > +To support these cases, a pair of ``xfile_obj_load`` and
> > > > > ``xfile_obj_store``
> > > > > +functions are provided to read and persist objects into an
> > > > > xfile.
> > > > > +They are internally the same as pread and pwrite, except
> > > > > that
> > > > > they
> > > > > treat any
> > > > > +error as an out of memory error.
> > > > > +For online repair, squashing error conditions in this manner
> > > > > is
> > > > > an
> > > > > acceptable
> > > > > +behavior because the only reaction is to abort the operation
> > > > > back to
> > > > > userspace.
> > > > > +All five xfile usecases can be serviced by these four
> > > > > functions.
> > > > > +
> > > > > +However, no discussion of file access idioms is complete
> > > > > without
> > > > > answering the
> > > > > +question, "But what about mmap?"
> > > > I actually wouldn't spend too much time discussing solutions
> > > > that
> > > > didn't work for whatever reason, unless someone's really asking
> > > > for
> > > > it.
> > > >  I think this section would read just fine to trim off the last
> > > > paragraph here
> > > 
> > > Since I wrote this, I've been experimenting with wiring up the
> > > tmpfs
> > > file page cache folios to the xfs buffer cache.  Pinning the
> > > folios
> > > in
> > > this manner makes it so that online fsck can (more or less)
> > > directly
> > > access the xfile contents.  Much to my surprise, this has
> > > actually
> > > held
> > > up in testing, so ... it's no longer a solution that "didn't
> > > really
> > > work". :)
> > > 
> > > I also need to s/page/folio/ now that willy has finished that
> > > conversion.  This section has been rewritten as such:
> > > 
> > > "However, no discussion of file access idioms is complete without
> > > answering the question, “But what about mmap?” It is convenient
> > > to
> > > access storage directly with pointers, just like userspace code
> > > does
> > > with regular memory. Online fsck must not drive the system into
> > > OOM
> > > conditions, which means that xfiles must be responsive to memory
> > > reclamation. tmpfs can only push a pagecache folio to the swap
> > > cache
> > > if
> > > the folio is neither pinned nor locked, which means the xfile
> > > must
> > > not
> > > pin too many folios.
> > > 
> > > "Short term direct access to xfile contents is done by locking
> > > the
> > > pagecache folio and mapping it into kernel address space.
> > > Programmatic
> > > access (e.g. pread and pwrite) uses this mechanism. Folio locks
> > > are
> > > not
> > > supposed to be held for long periods of time, so long term direct
> > > access
> > > to xfile contents is done by bumping the folio refcount, mapping
> > > it
> > > into
> > > kernel address space, and dropping the folio lock. These long
> > > term
> > > users
> > > must be responsive to memory reclaim by hooking into the shrinker
> > > infrastructure to know when to release folios.
> > > 
> > > "The xfile_get_page and xfile_put_page functions are provided to
> > > retrieve the (locked) folio that backs part of an xfile and to
> > > release
> > > it. The only code to use these folio lease functions are the
> > > xfarray
> > > sorting algorithms and the in-memory btrees."
> > Alrighty, sounds like a good update then
> > 
> > > 
> > > > > +It would be *much* more convenient if kernel code could
> > > > > access
> > > > > pageable kernel
> > > > > +memory with pointers, just like userspace code does with
> > > > > regular
> > > > > memory.
> > > > > +Like any other filesystem that uses the page cache, reads
> > > > > and
> > > > > writes
> > > > > of xfile
> > > > > +data lock the cache page and map it into the kernel address
> > > > > space
> > > > > for the
> > > > > +duration of the operation.
> > > > > +Unfortunately, shmfs can only write a file page to the swap
> > > > > device
> > > > > if the page
> > > > > +is unmapped and unlocked, which means the xfile risks
> > > > > causing
> > > > > OOM
> > > > > problems
> > > > > +unless it is careful not to pin too many pages.
> > > > > +Therefore, the xfile steers most of its users towards
> > > > > programmatic
> > > > > access so
> > > > > +that backing pages are not kept locked in memory for longer
> > > > > than
> > > > > is
> > > > > necessary.
> > > > > +However, for callers performing quick linear scans of xfile
> > > > > data,
> > > > > +``xfile_get_page`` and ``xfile_put_page`` functions are
> > > > > provided
> > > > > to
> > > > > pin a page
> > > > > +in memory.
> > > > > +So far, the only code to use these functions are the xfarray
> > > > > :ref:`sorting
> > > > > +<xfarray_sort>` algorithms.
> > > > > +
> > > > > +xfile Access Coordination
> > > > > +`````````````````````````
> > > > > +
> > > > > +For security reasons, xfiles must be owned privately by the
> > > > > kernel.
> > > > > +They are marked ``S_PRIVATE`` to prevent interference from
> > > > > the
> > > > > security system,
> > > > > +must never be mapped into process file descriptor tables,
> > > > > and
> > > > > their
> > > > > pages must
> > > > > +never be mapped into userspace processes.
> > > > > +
> > > > > +To avoid locking recursion issues with the VFS, all accesses
> > > > > to
> > > > > the
> > > > > shmfs file
> > > > > +are performed by manipulating the page cache directly.
> > > > > +xfile writes call the ``->write_begin`` and ``->write_end``
> > > > > functions of the
> > > > > +xfile's address space to grab writable pages, copy the
> > > > > caller's
> > > > > buffer into the
> > > > > +page, and release the pages.
> > > > > +xfile reads call ``shmem_read_mapping_page_gfp`` to grab
> > > > > pages
> > > > xfile readers
> > > 
> > > OK.
> > > 
> > > > > directly before
> > > > > +copying the contents into the caller's buffer.
> > > > > +In other words, xfiles ignore the VFS read and write code
> > > > > paths
> > > > > to
> > > > > avoid
> > > > > +having to create a dummy ``struct kiocb`` and to avoid
> > > > > taking
> > > > > inode
> > > > > and
> > > > > +freeze locks.
> > > > > +
> > > > > +If an xfile is shared between threads to stage repairs, the
> > > > > caller
> > > > > must provide
> > > > > +its own locks to coordinate access.
> > > > Ofsck threads that share an xfile between stage repairs will
> > > > use
> > > > their
> > > > own locks to coordinate access with each other.
> > > > 
> > > > ?
> > > 
> > > Hm.  I wonder if there's a misunderstanding here?
> > > 
> > > Online fsck functions themselves are single-threaded, which is to
> > > say
> > > that they themselves neither queue workers nor start kthreads. 
> > > However,
> > > an xfile created by a running fsck function can be accessed from
> > > other
> > > threads if the fsck function also hooks itself into filesystem
> > > code.
> > > 
> > > The live update section has a nice diagram of how that works:
> > > https://djwong.org/docs/xfs-online-fsck-design/#filesystem-hooks
> > > 
> > 
> > Oh ok, I think I got hung up on who the callers were.  How about
> > "xfiles shared between threads running from hooked filesystem
> > functions
> > will use their own locks to coordinate access with each other."
> 
> I don't want to mention filesystem hooks before the chapter that
> introduces them.  How about:
> 
> "For example, if a scrub function stores scan results in an xfile and
> needs other threads to provide updates to the scanned data, the scrub
> function must provide a lock for all threads to share."
Oh, I didn't see this response....

Ok, I think that sounds fine.  Alternatively, I think if patch 10 were to
move up, then it would have sounded fine since we introduce hooks
there, but I think either way works
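
FWIW, this is roughly the picture I have in my head now -- completely
made-up names, just to check that we're reading it the same way:

#include <linux/mutex.h>

struct example_scan_data {
        struct mutex    lock;           /* provided by the scrub function */
        struct xfarray  *records;       /* staged scan results */
};

/*
 * The scrub thread and any other thread feeding it updates take the
 * same lock before touching the staged records.
 */
static int example_apply_update(struct example_scan_data *sd)
{
        mutex_lock(&sd->lock);
        /* ...apply the update to sd->records here... */
        mutex_unlock(&sd->lock);
        return 0;
}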

Allison
> 
> --D


^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 11/14] xfs: document metadata file repair
  2022-12-30 22:10   ` [PATCH 11/14] xfs: document metadata file repair Darrick J. Wong
@ 2023-02-25  7:33     ` Allison Henderson
  2023-03-01  2:42       ` Darrick J. Wong
  0 siblings, 1 reply; 220+ messages in thread
From: Allison Henderson @ 2023-02-25  7:33 UTC (permalink / raw)
  To: djwong
  Cc: Catherine Hoang, david, willy, linux-xfs, Chandan Babu,
	linux-fsdevel, hch

On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> File-based metadata (such as xattrs and directories) can be extremely
> large.  To reduce the memory requirements and maximize code reuse, it
> is
> very convenient to create a temporary file, use the regular dir/attr
> code to store salvaged information, and then atomically swap the
> extents
> between the file being repaired and the temporary file.  Record the
> high
> level concepts behind how temporary files and atomic content swapping
> should work, and then present some case studies of what the actual
> repair functions do.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  .../filesystems/xfs-online-fsck-design.rst         |  574
> ++++++++++++++++++++
>  1 file changed, 574 insertions(+)
> 
> 
> diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst
> b/Documentation/filesystems/xfs-online-fsck-design.rst
> index c0f08a773f08..e32506acb66f 100644
> --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> @@ -3252,6 +3252,8 @@ Proposed patchsets include fixing
>  `dir iget usage
>  <
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> log/?h=scrub-dir-iget-fixes>`_.
>  
> +.. _ilocking:
> +
hmm, this little part looks like maybe it was supposed to go in the
last patch?

>  Locking Inodes
>  ^^^^^^^^^^^^^^
>  
> @@ -3695,3 +3697,575 @@ The proposed patchset is the
>  `rmap repair
>  <
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> log/?h=repair-rmap-btree>`_
>  series.
> +
> +Staging Repairs with Temporary Files on Disk
> +--------------------------------------------
> +
> +XFS stores a substantial amount of metadata in file forks:
> directories,
> +extended attributes, symbolic link targets, free space bitmaps and
> summary
> +information for the realtime volume, and quota records.
> +File forks map 64-bit logical file fork space extents to physical
> storage space
> +extents, similar to how a memory management unit maps 64-bit virtual
> addresses
> +to physical memory addresses.
> +Therefore, file-based tree structures (such as directories and
> extended
> +attributes) use blocks mapped in the file fork offset address space
> that point
> +to other blocks mapped within that same address space, and file-
> based linear
> +structures (such as bitmaps and quota records) compute array element
> offsets in
> +the file fork offset address space.
> +


> +In the initial iteration of file metadata repair, the damaged
> metadata blocks
> +would be scanned for salvageable data; the extents in the file fork
> would be
> +reaped; and then a new structure would be built in its place.
> +This strategy did not survive the introduction of the atomic repair
> requirement
> +expressed earlier in this document.
> +The second iteration explored building a second structure at a high
> offset
> +in the fork from the salvage data, reaping the old extents, and
> using a
> +``COLLAPSE_RANGE`` operation to slide the new extents into place.
> +This had many drawbacks:
> +
> +- Array structures are linearly addressed, and the regular
> filesystem codebase
> +  does not have the concept of a linear offset that could be applied
> to the
> +  record offset computation to build an alternate copy.
> +
> +- Extended attributes are allowed to use the entire attr fork offset
> address
> +  space.
> +
> +- Even if repair could build an alternate copy of a data structure
> in a
> +  different part of the fork address space, the atomic repair commit
> +  requirement means that online repair would have to be able to
> perform a log
> +  assisted ``COLLAPSE_RANGE`` operation to ensure that the old
> structure was
> +  completely replaced.
> +
> +- A crash after construction of the secondary tree but before the
> range
> +  collapse would leave unreachable blocks in the file fork.
> +  This would likely confuse things further.
> +
> +- Reaping blocks after a repair is not a simple operation, and
> initiating a
> +  reap operation from a restarted range collapse operation during
> log recovery
> +  is daunting.
> +
> +- Directory entry blocks and quota records record the file fork
> offset in the
> +  header area of each block.
> +  An atomic range collapse operation would have to rewrite this part
> of each
> +  block header.
> +  Rewriting a single field in block headers is not a huge problem,
> but it's
> +  something to be aware of.
> +
> +- Each block in a directory or extended attributes btree index
> contains sibling
> +  and child block pointers.
> +  Were the atomic commit to use a range collapse operation, each
> block would
> +  have to be rewritten very carefully to preserve the graph
> structure.
> +  Doing this as part of a range collapse means rewriting a large
> number of
> +  blocks repeatedly, which is not conducive to quick repairs.
> +
> +The third iteration of the design for file metadata repair went for
> a totally
> +new strategy -- 
All the above looks like something that could be culled or sidebarred.
I know you really like these, but I think the extra dialog is why
people are having a hard time getting through it. 

> create a temporary file in the XFS filesystem, write a new
"The current design for metadata repair creates a temporary file..."

> +structure at the correct offsets into the temporary file, and
> atomically swap
> +the fork mappings (and hence the fork contents) to commit the
> repair.
> +Once the repair is complete, the old fork can be reaped as
> necessary; if the
> +system goes down during the reap, the iunlink code will delete the
> blocks
> +during log recovery.
> +
> +**Note**: All space usage and inode indices in the filesystem *must*
> be
> +consistent to use a temporary file safely!
> +This dependency is the reason why online repair can only use
> pageable kernel
> +memory to stage ondisk space usage information.
> +
> +Swapping extents with a temporary file still requires a rewrite of
> the owner
> +field of the block headers, but this is *much* simpler than moving
> tree blocks
> +individually.
> +Furthermore, the buffer verifiers do not verify owner fields (since
> they are
> +not aware of the inode that owns the block), which makes reaping of
> old file
> +blocks much simpler.
> +Extent swapping requires that AG space metadata and the file fork
> metadata of
> +the file being repaired are all consistent with respect to each
> other, but
> +that's already a requirement for correct operation of files in
> general.
> +There is, however, a slight downside -- if the system crashes during
> the reap
> +phase and the fork extents are crosslinked, the iunlink processing
> will fail
> +because freeing space will find the extra reverse mappings and
> abort.
> +
> +Temporary files created for repair are similar to ``O_TMPFILE``
> files created
> +by userspace.
> +They are not linked into a directory and the entire file will be
> reaped when
> +the last reference to the file is lost.
> +The key differences are that these files must have no access
> permission outside
> +the kernel at all, they must be specially marked to prevent them
> from being
> +opened by handle, and they must never be linked into the directory
> tree.
> +
> +Using a Temporary File
> +``````````````````````
> +
> +Online repair code should use the ``xrep_tempfile_create`` function
> to create a
> +temporary file inside the filesystem.
> +This allocates an inode, marks the in-core inode private, and
> attaches it to
> +the scrub context.
> +These files are hidden from userspace, may not be added to the
> directory tree,
> +and must be kept private.
> +
> +Temporary files only use two inode locks: the IOLOCK and the ILOCK.
> +The MMAPLOCK is not needed here, because there must not be page
> faults from
> +userspace for data fork blocks.
> +The usage patterns of these two locks are the same as for any other
> XFS file --
> +access to file data are controlled via the IOLOCK, and access to
> file metadata
> +are controlled via the ILOCK.
> +Locking helpers are provided so that the temporary file and its lock
> state can
> +be cleaned up by the scrub context.
> +To comply with the nested locking strategy laid out in the
> :ref:`inode
> +locking<ilocking>` section, it is recommended that scrub functions
> use the
> +xrep_tempfile_ilock*_nowait lock helpers.
> +
> +Data can be written to a temporary file by two means:
> +
> +1. ``xrep_tempfile_copyin`` can be used to set the contents of a
> regular
> +   temporary file from an xfile.
> +
> +2. The regular directory, symbolic link, and extended attribute
> functions can
> +   be used to write to the temporary file.
> +
> +Once a good copy of a data file has been constructed in a temporary
> file, it
> +must be conveyed to the file being repaired, which is the topic of
> the next
> +section.
> +
> +The proposed patches are in the
> +`realtime summary repair
> +<
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> log/?h=repair-rtsummary>`_
> +series.
> +
> +Atomic Extent Swapping
> +----------------------
> +
> +Once repair builds a temporary file with a new data structure
> written into
> +it, it must commit the new changes into the existing file.
> +It is not possible to swap the inumbers of two files, so instead the
> new
> +metadata must replace the old.
> +This suggests the need for the ability to swap extents, but the
> existing extent
> +swapping code used by the file defragmenting tool ``xfs_fsr`` is not
> sufficient
> +for online repair because:
> +
> +a. When the reverse-mapping btree is enabled, the swap code must
> keep the
> +   reverse mapping information up to date with every exchange of
> mappings.
> +   Therefore, it can only exchange one mapping per transaction, and
> each
> +   transaction is independent.
> +
> +b. Reverse-mapping is critical for the operation of online fsck, so
> the old
> +   defragmentation code (which swapped entire extent forks in a
> single
> +   operation) is not useful here.
> +
> +c. Defragmentation is assumed to occur between two files with
> identical
> +   contents.
> +   For this use case, an incomplete exchange will not result in a
> user-visible
> +   change in file contents, even if the operation is interrupted.
> +
> +d. Online repair needs to swap the contents of two files that are by
> definition
> +   *not* identical.
> +   For directory and xattr repairs, the user-visible contents might
> be the
> +   same, but the contents of individual blocks may be very
> different.
> +
> +e. Old blocks in the file may be cross-linked with another structure
> and must
> +   not reappear if the system goes down mid-repair.
> +
> +These problems are overcome by creating a new deferred operation and
> a new type
> +of log intent item to track the progress of an operation to exchange
> two file
> +ranges.
> +The new deferred operation type chains together the same
> transactions used by
> +the reverse-mapping extent swap code.
> +The new log item records the progress of the exchange to ensure that
> once an
> +exchange begins, it will always run to completion, even if there are
> +interruptions.
> +
> +The proposed patchset is the
> +`atomic extent swap
> +<
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> log/?h=atomic-file-updates>`_
> +series.
> +
> +Using Log-Incompatible Feature Flags
> +````````````````````````````````````
> +
> +Starting with XFS v5, the superblock contains a
> ``sb_features_log_incompat``
> +field to indicate that the log contains records that might not be
> readable by all
> +kernels that could mount this filesystem.


> +In short, log incompat features protect the log contents against
> kernels that
> +will not understand the contents.
> +Unlike the other superblock feature bits, log incompat bits are
> ephemeral
> +because an empty (clean) log does not need protection.
> +The log cleans itself after its contents have been committed into
> the
> +filesystem, either as part of an unmount or because the system is
> otherwise
> +idle.
> +Because upper level code can be working on a transaction at the same
> time that
> +the log cleans itself, it is necessary for upper level code to
> communicate to
> +the log when it is going to use a log incompatible feature.
> +
> +The log coordinates access to incompatible features through the use
> of one
> +``struct rw_semaphore`` for each feature.
> +The log cleaning code tries to take this rwsem in exclusive mode to
> clear the
> +bit; if the lock attempt fails, the feature bit remains set.
> +Filesystem code signals its intention to use a log incompat feature
> in a
> +transaction by calling ``xlog_use_incompat_feat``, which takes the
> rwsem in
> +shared mode.
> +The code supporting a log incompat feature should create wrapper
> functions to
> +obtain the log feature and call ``xfs_add_incompat_log_feature`` to
> set the
> +feature bits in the primary superblock.
> +The superblock update is performed transactionally, so the wrapper
> to obtain
> +log assistance must be called just prior to the creation of the
> transaction
> +that uses the functionality.
> +For a file operation, this step must happen after taking the IOLOCK
> and the
> +MMAPLOCK, but before allocating the transaction.
> +When the transaction is complete, the ``xlog_drop_incompat_feat``
> function
> +is called to release the feature.
> +The feature bit will not be cleared from the superblock until the
> log becomes
> +clean.
While this section does make sense, it doesn't really seem like it's
specific to ofsck either.  Pptrs and possibly other future features use
the same incompat bit logic, but the implementation is pretty disjoint
and I wouldn't really consider it part of that feature.  So I would
either remove this part, or move it to its own section.  Then I would
just give a quick blurb here about how ofsck uses it:

"Since atomic extent swap will introduce a new type of log item, it
will also add a new XFS_SB_FEAT_INCOMPAT_LOG_ATOMIC_SWAP bit"

> +
> +Log-assisted extended attribute updates and atomic extent swaps both
> use log
> +incompat features and provide convenience wrappers around the
> functionality.

"For more information on incompat bits, see...."

> +
> +Mechanics of an Atomic Extent Swap
> +``````````````````````````````````
> +
> +Swapping entire file forks is a complex task.
> +The goal is to exchange all file fork mappings between two file fork
> offset
> +ranges.
> +There are likely to be many extent mappings in each fork, and the
> edges of
> +the mappings aren't necessarily aligned.
> +Furthermore, there may be other updates that need to happen after
> the swap,
> +such as exchanging file sizes, inode flags, or conversion of fork
> data to local
> +format.
> +This is roughly the format of the new deferred extent swap work
> item:
> +
> +.. code-block:: c
> +
> +       struct xfs_swapext_intent {
> +           /* Inodes participating in the operation. */
> +           struct xfs_inode    *sxi_ip1;
> +           struct xfs_inode    *sxi_ip2;
> +
> +           /* File offset range information. */
> +           xfs_fileoff_t       sxi_startoff1;
> +           xfs_fileoff_t       sxi_startoff2;
> +           xfs_filblks_t       sxi_blockcount;
> +
> +           /* Set these file sizes after the operation, unless
> negative. */
> +           xfs_fsize_t         sxi_isize1;
> +           xfs_fsize_t         sxi_isize2;
> +
> +           /* XFS_SWAP_EXT_* log operation flags */
> +           uint64_t            sxi_flags;
> +       };
> +
> +The new log intent item contains enough information to track two
> logical fork
> +offset ranges: ``(inode1, startoff1, blockcount)`` and ``(inode2,
> startoff2,
> +blockcount)``.
> +Each step of a swap operation exchanges the largest file range
> mapping possible
> +from one file to the other.
> +After each step in the swap operation, the two startoff fields are
> incremented
> +and the blockcount field is decremented to reflect the progress
> made.
> +The flags field captures behavioral parameters such as swapping the
> attr fork
> +instead of the data fork and other work to be done after the extent
> swap.
> +The two isize fields are used to swap the file size at the end of
> the operation
> +if the file data fork is the target of the swap operation.
> +
> +When the extent swap is initiated, the sequence of operations is as
> follows:
> +
> +1. Create a deferred work item for the extent swap.
> +   At the start, it should contain the entirety of the file ranges
> to be
> +   swapped.
> +
> +2. Call ``xfs_defer_finish`` 
This seems like it should be some sort of defer start wrapper, not
finish.  It would also help to have a link or function name to see the
code it is trying to describe.

> to start processing of the exchange.
> +   This will log an extent swap intent item to the transaction for
> the deferred
> +   extent swap work item.
> +
> +3. Until ``sxi_blockcount`` of the deferred extent swap work item is
> zero,
> +
> +   a. Read the block maps of both file ranges starting at
> ``sxi_startoff1`` and
> +      ``sxi_startoff2``, respectively, and compute the longest
> extent that can
> +      be swapped in a single step.
> +      This is the minimum of the two ``br_blockcount`` s in the
> mappings.
> +      Keep advancing through the file forks until at least one of
> the mappings
> +      contains written blocks.
> +      Mutual holes, unwritten extents, and extent mappings to the
> same physical
> +      space are not exchanged.
> +
> +      For the next few steps, this document will refer to the
> mapping that came
> +      from file 1 as "map1", and the mapping that came from file 2
> as "map2".
> +
> +   b. Create a deferred block mapping update to unmap map1 from file
> 1.
> +
> +   c. Create a deferred block mapping update to unmap map2 from file
> 2.
> +
> +   d. Create a deferred block mapping update to map map1 into file
> 2.
> +
> +   e. Create a deferred block mapping update to map map2 into file
> 1.
> +
> +   f. Log the block, quota, and extent count updates for both files.
> +
> +   g. Extend the ondisk size of either file if necessary.
> +
> +   h. Log an extent swap done log item for the extent swap intent
> log item
> +      that was read at the start of step 3.
> +
> +   i. Compute the amount of file range that has just been covered.
> +      This quantity is ``(map1.br_startoff + map1.br_blockcount -
> +      sxi_startoff1)``, because step 3a could have skipped holes.
> +
> +   j. Increase the starting offsets of ``sxi_startoff1`` and
> ``sxi_startoff2``
> +      by the number of blocks computed in the previous step, and
> decrease
> +      ``sxi_blockcount`` by the same quantity.
> +      This advances the cursor.
> +
> +   k. Log a new extent swap intent log item reflecting the advanced
> state of
> +      the work item.
> +
> +   l. Return the proper error code (EAGAIN) to the deferred
> operation manager
> +      to inform it that there is more work to be done.
> +      The operation manager completes the deferred work in steps 3b-
> 3e before
> +      moving back to the start of step 3.
> +
> +4. Perform any post-processing.
> +   This will be discussed in more detail in subsequent sections.
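
The cursor-advance arithmetic in steps 3i and 3j only clicked for me
once I wrote it out.  Using the intent structure shown above, my own
sketch of it (not the actual code) is:

/* Sketch only; uses the intent structure quoted above. */
static void
example_advance_swapext_cursor(struct xfs_swapext_intent *sxi,
                               const struct xfs_bmbt_irec *map1)
{
        xfs_filblks_t   covered;

        /* Step 3i: measure from the cursor, since 3a may have skipped holes. */
        covered = map1->br_startoff + map1->br_blockcount - sxi->sxi_startoff1;

        /* Step 3j: advance both offsets and shrink the remaining count. */
        sxi->sxi_startoff1 += covered;
        sxi->sxi_startoff2 += covered;
        sxi->sxi_blockcount -= covered;
}
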
> +
> +If the filesystem goes down in the middle of an operation, log
> recovery will
> +find the most recent unfinished extent swap log intent item and
> restart from
> +there.
> +This is how extent swapping guarantees that an outside observer will
> either see
> +the old broken structure or the new one, and never a mishmash of
> both.
> +
> +Extent Swapping with Regular User Files
> +```````````````````````````````````````
> +
> +As mentioned earlier, XFS has long had the ability to swap extents
> between
> +files, which is used almost exclusively by ``xfs_fsr`` to defragment
> files.
> +The earliest form of this was the fork swap mechanism, where the
> entire
> +contents of data forks could be exchanged between two files by
> exchanging the
> +raw bytes in each inode fork's immediate area.
> +When XFS v5 came along with self-describing metadata, this old
> mechanism grew
> +some log support to continue rewriting the owner fields of BMBT
> blocks during
> +log recovery.
> +When the reverse mapping btree was later added to XFS, the only way
> to maintain
> +the consistency of the fork mappings with the reverse mapping index
> was to
> +develop an iterative mechanism that used deferred bmap and rmap
> operations to
> +swap mappings one at a time.
> +This mechanism is identical to steps 2-3 from the procedure above
> except for
> +the new tracking items, because the atomic extent swap mechanism is
> an
> +iteration of an existing mechanism and not something totally novel.
> +For the narrow case of file defragmentation, the file contents must
> be
> +identical, so the recovery guarantees are not much of a gain.
> +
> +Atomic extent swapping is much more flexible than the existing
> swapext
> +implementations because it can guarantee that the caller never sees
> a mix of
> +old and new contents even after a crash, and it can operate on two
> arbitrary
> +file fork ranges.
> +The extra flexibility enables several new use cases:
> +
> +- **Atomic commit of file writes**: A userspace process opens a file
> that it
> +  wants to update.
> +  Next, it opens a temporary file and calls the file clone operation
> to reflink
> +  the first file's contents into the temporary file.
> +  Writes to the original file should instead be written to the
> temporary file.
> +  Finally, the process calls the atomic extent swap system call
> +  (``FIEXCHANGE_RANGE``) to exchange the file contents, thereby
> committing all
> +  of the updates to the original file, or none of them.
> +
> +- **Transactional file updates**: The same mechanism as above, but
> the caller
> +  only wants the commit to occur if the original file's contents
> have not
> +  changed.
> +  To make this happen, the calling process snapshots the file
> modification and
> +  change timestamps of the original file before reflinking its data
> to the
> +  temporary file.
> +  When the program is ready to commit the changes, it passes the
> timestamps
> +  into the kernel as arguments to the atomic extent swap system
> call.
> +  The kernel only commits the changes if the provided timestamps
> match the
> +  original file.
> +
> +- **Emulation of atomic block device writes**: Export a block device
> with a
> +  logical sector size matching the filesystem block size to force
> all writes
> +  to be aligned to the filesystem block size.
> +  Stage all writes to a temporary file, and when that is complete,
> call the
> +  atomic extent swap system call with a flag to indicate that holes
> in the
> +  temporary file should be ignored.
> +  This emulates an atomic device write in software, and can support
> arbitrary
> +  scattered writes.
Mmm, this section here I would either let go or move.  Since we're not
really talking about ofsck anymore, it's more like an "extra use case"
section.  Side uses are great and all, but they're generally not worth
the implementation on their own, so I think we want to keep readers
focused on the main ofsck feature and its mechanics.  Once we get that
out of the way, we can come back and touch on goodies later at the end
of the document.  

> +
> +Preparation for Extent Swapping
> +```````````````````````````````
> +
> +There are a few things that need to be taken care of before
> initiating an
> +atomic extent swap operation.
> +First, regular files require the page cache to be flushed to disk
> before the
> +operation begins, and directio writes to be quiesced.
> +Like any filesystem operation, extent swapping must determine the
> maximum
> +amount of disk space and quota that can be consumed on behalf of
> both files in
> +the operation, and reserve that quantity of resources to avoid an
> unrecoverable
> +out of space failure once it starts dirtying metadata.
> +The preparation step scans the ranges of both files to estimate:
> +
> +- Data device blocks needed to handle the repeated updates to the
> fork
> +  mappings.
> +- Change in data and realtime block counts for both files.
> +- Increase in quota usage for both files, if the two files do not
> share the
> +  same set of quota ids.
> +- The number of extent mappings that will be added to each file.
> +- Whether or not there are partially written realtime extents.
> +  User programs must never be able to access a realtime file extent
> that maps
> +  to different extents on the realtime volume, which could happen if
> the
> +  operation fails to run to completion.
> +
> +The need for precise estimation increases the run time of the swap
> operation,
> +but it is very important to maintain correct accounting.
> +The filesystem must not run completely out of free space, nor can
> the extent
> +swap ever add more extent mappings to a fork than it can support.
> +Regular users are required to abide by the quota limits, though
> metadata repairs
> +may exceed quota to resolve inconsistent metadata elsewhere.
> +
> +Special Features for Swapping Metadata File Extents
> +```````````````````````````````````````````````````
> +
> +Extended attributes, symbolic links, and directories can set the
> fork format to
> +"local" and treat the fork as a literal area for data storage.
> +Metadata repairs must take extra steps to support these cases:
> +
> +- If both forks are in local format and the fork areas are large
> enough, the
> +  swap is performed by copying the incore fork contents, logging
> both forks,
> +  and committing.
> +  The atomic extent swap mechanism is not necessary, since this can
> be done
> +  with a single transaction.
> +
> +- If both forks map blocks, then the regular atomic extent swap is
> used.
> +
> +- Otherwise, only one fork is in local format.
> +  The contents of the local format fork are converted to a block to
> perform the
> +  swap.
> +  The conversion to block format must be done in the same
> transaction that
> +  logs the initial extent swap intent log item.
> +  The regular atomic extent swap is used to exchange the mappings.
> +  Special flags are set on the swap operation so that the
> transaction can be
> +  rolled one more time to convert the second file's fork back to
> local format
> +  if possible.
I feel like there's probably a function name or link that could go with
this

> +
> +Extended attributes and directories stamp the owning inode into
> every block,
> +but the buffer verifiers do not actually check the inode number!
> +Although there is no verification, it is still important to maintain
> +referential integrity, so prior to performing the extent swap,
> online repair
> +walks every block in the new data structure to update the owner
> field and flush
> +the buffer to disk.
> +
> +After a successful swap operation, the repair operation must reap
> the old fork
> +blocks by processing each fork mapping through the standard
> :ref:`file extent
> +reaping <reaping>` mechanism that is done post-repair.
> +If the filesystem should go down during the reap part of the repair,
> the
> +iunlink processing at the end of recovery will free both the
> temporary file and
> +whatever blocks were not reaped.
> +However, this iunlink processing omits the cross-link detection of
> online
> +repair, and is not completely foolproof.
> +
> +Swapping Temporary File Extents
> +```````````````````````````````
> +
> +To repair a metadata file, online repair proceeds as follows:
> +
> +1. Create a temporary repair file.
> +
> +2. Use the staging data to write out new contents into the temporary
> repair
> +   file.
> +   The same fork must be written to as is being repaired.
> +
> +3. Commit the scrub transaction, since the swap estimation step must
> be
> +   completed before transaction reservations are made.
> +
> +4. Call ``xrep_tempswap_trans_alloc`` to allocate a new scrub
> transaction with
> +   the appropriate resource reservations, locks, and fill out a
> ``struct
> +   xfs_swapext_req`` with the details of the swap operation.
> +
> +5. Call ``xrep_tempswap_contents`` to swap the contents.
> +
> +6. Commit the transaction to complete the repair.
Here too.  A reference to the code would help, to be able to see it side
by side
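
Even a rough inline sketch would do.  Going only by the names above, I
imagine steps 4-6 boil down to something like this -- the argument lists
are my guesses, not the actual code:

/* Guessed argument lists; "sc" is the scrub context from the steps above. */
static int example_commit_repair(struct xfs_scrub *sc)
{
        struct xfs_swapext_req  req = { };
        int                     error;

        /* Step 4: new transaction, reservations, locks, and the swap request. */
        error = xrep_tempswap_trans_alloc(sc, XFS_DATA_FORK, &req);
        if (error)
                return error;

        /* Step 5: exchange the fork contents with the temporary file. */
        error = xrep_tempswap_contents(sc, &req);
        if (error)
                return error;

        /* Step 6: commit the transaction to complete the repair. */
        return xfs_trans_commit(sc->tp);
}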

> +
> +.. _rtsummary:
> +
> +Case Study: Repairing the Realtime Summary File
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +In the "realtime" section of an XFS filesystem, free space is
> tracked via a
> +bitmap, similar to Unix FFS.
> +Each bit in the bitmap represents one realtime extent, which is a
> multiple of
> +the filesystem block size between 4KiB and 1GiB in size.
> +The realtime summary file indexes the number of free extents of a
> given size to
> +the offset of the block within the realtime free space bitmap where
> those free
> +extents begin.
> +In other words, the summary file helps the allocator find free
> extents by
> +length, similar to what the free space by count (cntbt) btree does
> for the data
> +section.
> +
> +The summary file itself is a flat file (with no block headers or
> checksums!)
> +partitioned into ``log2(total rt extents)`` sections containing
> enough 32-bit
> +counters to match the number of blocks in the rt bitmap.
> +Each counter records the number of free extents that start in that
> bitmap block
> +and can satisfy a power-of-two allocation request.
> +
> +To check the summary file against the bitmap:
> +
> +1. Take the ILOCK of both the realtime bitmap and summary files.
> +
> +2. For each free space extent recorded in the bitmap:
> +
> +   a. Compute the position in the summary file that contains a
> counter that
> +      represents this free extent.
> +
> +   b. Read the counter from the xfile.
> +
> +   c. Increment it, and write it back to the xfile.
> +
> +3. Compare the contents of the xfile against the ondisk file.
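
As an aside, my mental model of step 2 is roughly the following; the
helper names and signatures are guesses on my part, not the actual code:

/* One row of counters per log2(extent length), one per rtbitmap block. */
static loff_t
example_suminfo_pos(unsigned int log2_len, xfs_fileoff_t rbmblock,
                    xfs_fileoff_t rbmblocks)
{
        return (log2_len * rbmblocks + rbmblock) * sizeof(uint32_t);
}

/* Steps 2a-2c: bump the in-memory counter for one free extent. */
static int
example_account_free_extent(struct xfile *xf, unsigned int log2_len,
                            xfs_fileoff_t rbmblock, xfs_fileoff_t rbmblocks)
{
        loff_t          pos = example_suminfo_pos(log2_len, rbmblock, rbmblocks);
        uint32_t        val;
        int             error;

        error = xfile_obj_load(xf, &val, sizeof(val), pos);    /* assumed signature */
        if (error)
                return error;
        val++;
        return xfile_obj_store(xf, &val, sizeof(val), pos);    /* assumed signature */
}
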
> +
> +To repair the summary file, write the xfile contents into the
> temporary file
> +and use atomic extent swap to commit the new contents.
> +The temporary file is then reaped.
> +
> +The proposed patchset is the
> +`realtime summary repair
> +<
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> log/?h=repair-rtsummary>`_
I think this is the same link as the last.  Did you mean to have a
different link here?

> +series.
> +
> +Case Study: Salvaging Extended Attributes
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +In XFS, extended attributes are implemented as a namespaced name-
> value store.
> +Values are limited in size to 64KiB, but there is no limit in the
> number of
> +names.
> +The attribute fork is unpartitioned, which means that the root of
> the attribute
> +structure is always in logical block zero, but attribute leaf
> blocks, dabtree
> +index blocks, and remote value blocks are intermixed.
> +Attribute leaf blocks contain variable-sized records that associate
> +user-provided names with the user-provided values.
> +Values larger than a block are allocated separate extents and
> written there.
> +If the leaf information expands beyond a single block, a
> directory/attribute
> +btree (``dabtree``) is created to map hashes of attribute names to
> entries
> +for fast lookup.
> +
> +Salvaging extended attributes is done as follows:
> +
> +1. Walk the attr fork mappings of the file being repaired to find
> the attribute
> +   leaf blocks.
> +   When one is found,
> +
> +   a. Walk the attr leaf block to find candidate keys.
> +      When one is found,
> +
> +      1. Check the name for problems, and ignore the name if there
> are.
> +
> +      2. Retrieve the value.
> +         If that succeeds, add the name and value to the staging
> xfarray and
> +         xfblob.
> +
> +2. If the memory usage of the xfarray and xfblob exceed a certain
> amount of
> +   memory or there are no more attr fork blocks to examine, unlock
> the file and
> +   add the staged extended attributes to the temporary file.
> +
> +3. Use atomic extent swapping to exchange the new and old extended
> attribute
> +   structures.
> +   The old attribute blocks are now attached to the temporary file.
> +
> +4. Reap the temporary file.
> +
> +The proposed patchset is the
> +`extended attribute repair
> +<
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> log/?h=repair-xattrs>`_
> +series.

I think it mostly looks good, culling and link suggestions aside.

Allison

> 


^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH v24.3 12/14] xfs: document directory tree repairs
  2023-02-03  2:12     ` [PATCH v24.3 " Darrick J. Wong
@ 2023-02-25  7:33       ` Allison Henderson
  2023-03-02  0:14         ` Darrick J. Wong
  0 siblings, 1 reply; 220+ messages in thread
From: Allison Henderson @ 2023-02-25  7:33 UTC (permalink / raw)
  To: david, linux-fsdevel, hch, djwong, linux-xfs, willy,
	Catherine Hoang, Chandan Babu

On Thu, 2023-02-02 at 18:12 -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Directory tree repairs are the least complete part of online fsck,
> due
> to the lack of directory parent pointers.  However, even without that
> feature, we can still make some corrections to the directory tree --
> we
> can salvage as many directory entries as we can from a damaged
> directory, and we can reattach orphaned inodes to the lost+found,
> just
> as xfs_repair does now.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
> v24.2: updated with my latest thoughts about how to use parent
> pointers
> v24.3: updated to reflect the online fsck code I built for parent
> pointers
> ---
>  .../filesystems/xfs-online-fsck-design.rst         |  410
> ++++++++++++++++++++
>  1 file changed, 410 insertions(+)
> 
> diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst
> b/Documentation/filesystems/xfs-online-fsck-design.rst
> index af7755fe0107..51d040e4a2d0 100644
> --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> @@ -4359,3 +4359,413 @@ The proposed patchset is the
>  `extended attribute repair
>  <
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> log/?h=repair-xattrs>`_
>  series.
> +
> +Fixing Directories
> +------------------
> +
> +Fixing directories is difficult with currently available filesystem
> features,
> +since directory entries are not redundant.
> +The offline repair tool scans all inodes to find files with nonzero
> link count,
> +and then it scans all directories to establish parentage of those
> linked files.
> +Damaged files and directories are zapped, and files with no parent
> are
> +moved to the ``/lost+found`` directory.
> +It does not try to salvage anything.
> +
> +The best that online repair can do at this time is to read directory
> data
> +blocks and salvage any dirents that look plausible, correct link
> counts, and
> +move orphans back into the directory tree.
> +The salvage process is discussed in the case study at the end of
> this section.
> +The :ref:`file link count fsck <nlinks>` code takes care of fixing
> link counts
> +and moving orphans to the ``/lost+found`` directory.
> +
> +Case Study: Salvaging Directories
> +`````````````````````````````````
> +
> +Unlike extended attributes, directory blocks are all the same size,
> so
> +salvaging directories is straightforward:
> +
> +1. Find the parent of the directory.
> +   If the dotdot entry is not unreadable, try to confirm that the
> alleged
> +   parent has a child entry pointing back to the directory being
> repaired.
> +   Otherwise, walk the filesystem to find it.
> +
> +2. Walk the first partition of the data fork of the directory to find
> the directory
> +   entry data blocks.
> +   When one is found,
> +
> +   a. Walk the directory data block to find candidate entries.
> +      When an entry is found:
> +
> +      i. Check the name for problems, and ignore the name if there
> are.
> +
> +      ii. Retrieve the inumber and grab the inode.
> +          If that succeeds, add the name, inode number, and file
> type to the
> +          staging xfarray and xfblob.
> +
> +3. If the memory usage of the xfarray and xfblob exceed a certain
> amount of
> +   memory or there are no more directory data blocks to examine,
> unlock the
> +   directory and add the staged dirents into the temporary
> directory.
> +   Truncate the staging files.
> +
> +4. Use atomic extent swapping to exchange the new and old directory
> structures.
> +   The old directory blocks are now attached to the temporary file.
> +
> +5. Reap the temporary file.
> +
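Just to check my reading of steps 2-3, here's roughly the kind of record
I picture the salvage loop stashing.  Everything below (struct layout,
stash_name()/stash_append(), salvage_ino_exists()) is invented for
illustration and stands in for the xfarray/xfblob staging interfaces;
it's not the actual scrub code:

/* Illustrative only -- a stashed candidate dirent awaiting replay. */
struct salvage_dirent {
        uint64_t        ino;            /* inumber from the candidate entry */
        uint64_t        name_cookie;    /* where the name lives in the name blob */
        uint16_t        namelen;
        uint8_t         ftype;          /* file type from the dirent */
};

/* Step 2a: validate one candidate entry and stash it for later replay. */
static int salvage_stash_dirent(struct salvage_ctx *sv, const char *name,
                                uint16_t namelen, uint64_t ino, uint8_t ftype)
{
        struct salvage_dirent   sd = { .ino = ino, .ftype = ftype };

        /* 2a.i: skip names that cannot possibly be valid. */
        if (namelen == 0 || namelen > 255 ||
            memchr(name, '/', namelen) || memchr(name, 0, namelen))
                return 0;

        /* 2a.ii: skip entries whose inumber does not resolve to an inode. */
        if (!salvage_ino_exists(sv, ino))
                return 0;

        sd.namelen = namelen;
        sd.name_cookie = stash_name(sv, name, namelen);
        return stash_append(sv, &sd);
}
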



> +**Future Work Question**: Should repair revalidate the dentry cache
> when
> +rebuilding a directory?
> +
> +*Answer*: Yes, though the current dentry cache code doesn't provide
> a means
> +to walk every dentry of a specific directory.
> +If the cache contains an entry that the salvaging code does not
> find, the
> +repair cannot proceed.
> +
> +**Future Work Question**: Can the dentry cache know about a
> directory entry
> +that cannot be salvaged?
> +
> +*Answer*: In theory, the dentry cache should be a subset of the
> directory
> +entries on disk because there's no way to load a dentry without
> having
> +something to read in the directory.
> +However, it is possible for a coherency problem to be introduced if
> the ondisk
> +structures become corrupt *after* the cache loads.
> +In theory it is necessary to scan all dentry cache entries for a
> directory to
> +ensure that one of the following apply:

"Currently the dentry cache code doesn't provide a means to walk every
dentry of a specific directory.  This makes validation of the rebuilt
directory difficult, and it is possible for an ondisk structure to
become corrupt *after* the cache loads.  Walking the dentry cache is
currently being considered as a future improvement.  This will also
enable the ability to report which entries were not salvageable since
these will be the subset of entries that are absent after the walk. 
This improvement will ensure that one of the following apply:"

?

I just think it reads cleaner.  I realize this is an area that is still
sort of in flux, but definitely before we call the document done we
should probably strip out the Q's and just document the A's.  If
someone re-raises the Q's we can always refer to the archives and then
have the discussion on the mailing list.  But I think the document
should maintain the goal of making clear whatever the current plan is
just to keep it reading cleanly. 


> +
> +1. The cached dentry reflects an ondisk dirent in the new directory.
> +
> +2. The cached dentry no longer has a corresponding ondisk dirent in
> the new
> +   directory and the dentry can be purged from the cache.
> +
> +3. The cached dentry no longer has an ondisk dirent but the dentry
> cannot be
> +   purged.

> +   This is bad.
These entries are irrecoverable, but can now be reported.



> +
> +As mentioned above, the dentry cache does not have a means to walk
> all the
> +dentries with a particular directory as a parent.
> +This makes detecting situations #2 and #3 impossible, and remains an
> +interesting question for research.
I think the above paraphrase makes this last bit redundant.

> +
> +The proposed patchset is the
> +`directory repair
> +<
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> log/?h=repair-dirs>`_
> +series.
> +
> +Parent Pointers
> +```````````````
> +
"Generally speaking, a parent pointer is any kind of metadata that
enables an inode to locate its parent without having to traverse the
directory tree from the root."

> +The lack of secondary directory metadata hinders directory tree
"Without them, the lack of secondary..." 

> reconstruction
> +in much the same way that the historic lack of reverse space mapping
> +information once hindered reconstruction of filesystem space
> metadata.
> +The parent pointer feature, however, makes total directory
> reconstruction
> +possible.
> +

History side bar the below chunk...
> +Directory parent pointers were first proposed as an XFS feature more
> than a
> +decade ago by SGI.
> +Each link from a parent directory to a child file is mirrored with
> an extended
> +attribute in the child that could be used to identify the parent
> directory.
> +Unfortunately, this early implementation had major shortcomings and
> was never
> +merged into Linux XFS:
> +
> +1. The XFS codebase of the late 2000s did not have the
> infrastructure to
> +   enforce strong referential integrity in the directory tree.
> +   It did not guarantee that a change in a forward link would always
> be
> +   followed up with the corresponding change to the reverse links.
> +
> +2. Referential integrity was not integrated into offline repair.
> +   Checking and repairs were performed on mounted filesystems
> without taking
> +   any kernel or inode locks to coordinate access.
> +   It is not clear how this actually worked properly.
> +
> +3. The extended attribute did not record the name of the directory
> entry in the
> +   parent, so the SGI parent pointer implementation cannot be used
> to reconnect
> +   the directory tree.
> +
> +4. Extended attribute forks only support 65,536 extents, which means
> that
> +   parent pointer attribute creation is likely to fail at some point
> before the
> +   maximum file link count is achieved.


"The original parent pointer design was too unstable for something like
a file system repair to depend on."

> +
> +Allison Henderson, Chandan Babu, and Catherine Hoang are working on
> a second
> +implementation that solves all shortcomings of the first.
> +During 2022, Allison introduced log intent items to track physical
> +manipulations of the extended attribute structures.
> +This solves the referential integrity problem by making it possible
> to commit
> +a dirent update and a parent pointer update in the same transaction.
> +Chandan increased the maximum extent counts of both data and
> attribute forks,

> +thereby addressing the fourth problem.
which ensures that parent pointer creation will not fail before the
maximum file link count is reached.

> +
> +To solve the third problem, parent pointers include the dirent name
"Lastly, the new design includes the dirent name..."

> and
> +location of the entry within the parent directory.
> +In other words, child files use extended attributes to store
> pointers to
> +parents in the form ``(parent_inum, parent_gen, dirent_pos) →
> (dirent_name)``.
This part is still in flux, so probably this will have to get updated
later...

> +
> +On a filesystem with parent pointers, the directory checking process
> can be
> +strengthened to ensure that the target of each dirent also contains
> a parent
> +pointer pointing back to the dirent.
> +Likewise, each parent pointer can be checked by ensuring that the
> target of
> +each parent pointer is a directory and that it contains a dirent
> matching
> +the parent pointer.
> +Both online and offline repair can use this strategy.
> +
> +Case Study: Repairing Directories with Parent Pointers
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +Directory rebuilding uses a :ref:`coordinated inode scan <iscan>`
> and
> +a :ref:`directory entry live update hook <liveupdate>` as follows:
> +
> +1. Set up a temporary directory for generating the new directory
> structure,
> +   an xfblob for storing entry names, and an xfarray for stashing
> directory
> +   updates.
> +
> +2. Set up an inode scanner and hook into the directory entry code to
> receive
> +   updates on directory operations.
> +
> +3. For each parent pointer found in each file scanned, decide if the
> parent
> +   pointer references the directory of interest.
> +   If so:
> +
> +   a. Stash an addname entry for this dirent in the xfarray for
> later.
> +
> +   b. When finished scanning that file, flush the stashed updates to
> the
> +      temporary directory.
> +
> +4. For each live directory update received via the hook, decide if
> the child
> +   has already been scanned.
> +   If so:
> +
> +   a. Stash an addname or removename entry for this dirent update in
> the
> +      xfarray for later.
> +      We cannot write directly to the temporary directory because
> hook
> +      functions are not allowed to modify filesystem metadata.
> +      Instead, we stash updates in the xfarray and rely on the
> scanner thread
> +      to apply the stashed updates to the temporary directory.
> +
> +5. When the scan is complete, atomically swap the contents of the
> temporary
> +   directory and the directory being repaired.
> +   The temporary directory now contains the damaged directory
> structure.
> +
> +6. Reap the temporary directory.
> +
> +7. Update the dirent position field of parent pointers as necessary.
> +   This may require the queuing of a substantial number of xattr log
> intent
> +   items.
> +
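The stash-and-replay split in steps 3 and 4 might be easier to picture
with a sketch.  The record layout and every helper below are my own
invention for illustration, not the actual scrub structures:

/* Illustrative stashed-update record kept in the xfarray. */
enum stash_op {
        STASH_ADDNAME,
        STASH_REMOVENAME,
};

struct stashed_dirent {
        enum stash_op   op;
        uint64_t        ino;            /* child inumber */
        uint64_t        name_cookie;    /* name stored in the name blob */
        uint16_t        namelen;
        uint8_t         ftype;
};

/*
 * Replay loop run only by the scanner thread (steps 3b and 4a): hook
 * functions merely append records, so no filesystem metadata is ever
 * modified from hook context.
 */
static int replay_stashed_dirents(struct tempdir *tmp, struct stash *stash)
{
        struct stashed_dirent   sd;
        int                     error;

        for (uint64_t i = 0; i < stash_count(stash); i++) {
                stash_load(stash, i, &sd);
                if (sd.op == STASH_ADDNAME)
                        error = tempdir_addname(tmp, &sd);
                else
                        error = tempdir_removename(tmp, &sd);
                if (error)
                        return error;
        }
        stash_truncate(stash);          /* empty the staging structures */
        return 0;
}
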
> +The proposed patchset is the
> +`parent pointers directory repair
> +<
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> log/?h=pptrs-online-dir-repair>`_
> +series.
> +
> +**Unresolved Question**: How will repair ensure that the
> ``dirent_pos`` fields
> +match in the reconstructed directory?
> +
> +*Answer*: There are a few ways to solve this problem:
> +
> +1. The field could be designated advisory, since the other three
> values are
> +   sufficient to find the entry in the parent.
> +   However, this makes indexed key lookup impossible while repairs
> are ongoing.
> +
> +2. We could allow creating directory entries at specified offsets,
> which solves
> +   the referential integrity problem but runs the risk that dirent
> creation
> +   will fail due to conflicts with the free space in the directory.
> +
> +   These conflicts could be resolved by appending the directory
> entry and
> +   amending the xattr code to support updating an xattr key and
> reindexing the
> +   dabtree, though this would have to be performed with the parent
> directory
> +   still locked.
> +
> +3. Same as above, but remove the old parent pointer entry and add a
> new one
> +   atomically.
> +
> +4. Change the ondisk xattr format to ``(parent_inum, name) →
> (parent_gen)``,
> +   which would provide the attr name uniqueness that we require,
> without
> +   forcing repair code to update the dirent position.
> +   Unfortunately, this requires changes to the xattr code to support
> attr
> +   names as long as 263 bytes.
> +
> +5. Change the ondisk xattr format to ``(parent_inum, hash(name)) →
> +   (name, parent_gen)``.
> +   If the hash is sufficiently resistant to collisions (e.g. sha256)
> then
> +   this should provide the attr name uniqueness that we require.
> +   Names shorter than 247 bytes could be stored directly.
I think the RFC deluge is the same question but with more context, so
probably this section will follow what we decide there.  I will save my
commentary to keep the discussion in the same thread...

I'll just link it here for anyone else following this for now...
https://www.spinics.net/lists/linux-xfs/msg69397.html

> +
> +Case Study: Repairing Parent Pointers
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +Online reconstruction of a file's parent pointer information works
> similarly to
> +directory reconstruction:
> +
> +1. Set up a temporary file for generating a new extended attribute
> structure,
> +   an xfblob for storing parent pointer names, and an xfarray for
> stashing
> +   parent pointer updates.
we did talk about blobs in patch 6 though it took me a moment to
remember... if there's a way to link or tag it, that would be helpful
for a quick refresh.  kinda like wikipedia hyperlinks, you really only
need like the first line or two to get it to snap back

> +
> +2. Set up an inode scanner and hook into the directory entry code to
> receive
> +   updates on directory operations.
> +
> +3. For each directory entry found in each directory scanned, decide
> if the
> +   dirent references the file of interest.
> +   If so:
> +
> +   a. Stash an addpptr entry for this parent pointer in the xfblob
> and xfarray
> +      for later.
> +
> +   b. When finished scanning the directory, flush the stashed
> updates to the
> +      temporary file.
> +
> +4. For each live directory update received via the hook, decide if
> the parent
> +   has already been scanned.
> +   If so:
> +
> +   a. Stash an addpptr or removepptr entry for this dirent update in
> the
> +      xfarray for later.
> +      We cannot write parent pointers directly to the temporary file
> because
> +      hook functions are not allowed to modify filesystem metadata.
> +      Instead, we stash updates in the xfarray and rely on the
> scanner thread
> +      to apply the stashed parent pointer updates to the temporary
> file.
> +
> +5. Copy all non-parent pointer extended attributes to the temporary
> file.
> +
> +6. When the scan is complete, atomically swap the attribute fork of
> the
> +   temporary file and the file being repaired.
> +   The temporary file now contains the damaged extended attribute
> structure.
> +
> +7. Reap the temporary file.
Seems like it should work
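For step 5 I'm picturing a filter loop like the one below, i.e.
everything that isn't a parent pointer gets carried over verbatim while
the pptrs themselves come from the scan.  All of the names here are
hypothetical, including the "returns 1 for another attr, 0 when done"
convention:

/* Step 5, roughly: copy every non-pptr attr into the temporary file. */
static int copy_non_pptr_attrs(struct repair_ctx *rx)
{
        struct attr_cursor      cur = { };
        int                     ret;

        while ((ret = attr_next(rx->file, &cur)) == 1) {
                if (attr_is_parent_pointer(&cur))
                        continue;       /* rebuilt from the scan instead */
                ret = attr_copy_to(rx->tempfile, &cur);
                if (ret)
                        return ret;
        }
        return ret;
}
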

> +
> +The proposed patchset is the
> +`parent pointers repair
> +<
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> log/?h=pptrs-online-parent-repair>`_
> +series.
> +
> +Digression: Offline Checking of Parent Pointers
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +Examining parent pointers in offline repair works differently
> because corrupt
> +files are erased long before directory tree connectivity checks are
> performed.
> +Parent pointer checks are therefore a second pass to be added to the
> existing
> +connectivity checks:
> +
> +1. After the set of surviving files has been established (i.e. phase
> 6),
> +   walk the surviving directories of each AG in the filesystem.
> +   This is already performed as part of the connectivity checks.
> +
> +2. For each directory entry found, record the name in an xfblob, and
> store
> +   ``(child_ag_inum, parent_inum, parent_gen, dirent_pos)`` tuples
> in a
> +   per-AG in-memory slab.
> +
> +3. For each AG in the filesystem,
> +
> +   a. Sort the per-AG tuples in order of child_ag_inum, parent_inum,
> and
> +      dirent_pos.
> +
> +   b. For each inode in the AG,
> +
> +      1. Scan the inode for parent pointers.
> +         Record the names in a per-file xfblob, and store
> ``(parent_inum,
> +         parent_gen, dirent_pos)`` tuples in a per-file slab.
> +
> +      2. Sort the per-file tuples in order of parent_inum, and
> dirent_pos.
> +
> +      3. Position one slab cursor at the start of the inode's
> records in the
> +         per-AG tuple slab.
> +         This should be trivial since the per-AG tuples are in child
> inumber
> +         order.
> +
> +      4. Position a second slab cursor at the start of the per-file
> tuple slab.
> +
> +      5. Iterate the two cursors in lockstep, comparing the
> parent_ino and
> +         dirent_pos fields of the records under each cursor.
> +
> +         a. Tuples in the per-AG list but not the per-file list are
> missing and
> +            need to be written to the inode.
> +
> +         b. Tuples in the per-file list but not the per-AG list are
> dangling
> +            and need to be removed from the inode.
> +
> +         c. For tuples in both lists, update the parent_gen and name
> components
> +            of the parent pointer if necessary.
> +
> +4. Move on to examining link counts, as we do today.
> +
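The lockstep comparison in step 3.b.5 is basically a merge of two sorted
lists.  Here's a self-contained sketch of the classification so I can
make sure I have it right; the types and the printf placeholders are
mine, not xfs_repair code:

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* One record from either sorted list, as built in steps 2 and 3.b.1. */
struct pptr_tuple {
        uint64_t        parent_inum;
        uint32_t        parent_gen;
        uint32_t        dirent_pos;
};

static int tuple_cmp(const struct pptr_tuple *a, const struct pptr_tuple *b)
{
        if (a->parent_inum != b->parent_inum)
                return a->parent_inum < b->parent_inum ? -1 : 1;
        if (a->dirent_pos != b->dirent_pos)
                return a->dirent_pos < b->dirent_pos ? -1 : 1;
        return 0;
}

/* Step 3.b.5: classify records by walking both sorted lists in lockstep. */
static void compare_pptrs(const struct pptr_tuple *ag, size_t nag,
                          const struct pptr_tuple *file, size_t nfile)
{
        size_t  i = 0, j = 0;

        while (i < nag || j < nfile) {
                int     cmp;

                if (i == nag)
                        cmp = 1;        /* only per-file records remain */
                else if (j == nfile)
                        cmp = -1;       /* only per-AG records remain */
                else
                        cmp = tuple_cmp(&ag[i], &file[j]);

                if (cmp < 0) {
                        printf("missing pptr, add it to the inode\n");
                        i++;
                } else if (cmp > 0) {
                        printf("dangling pptr, remove it from the inode\n");
                        j++;
                } else {
                        if (ag[i].parent_gen != file[j].parent_gen)
                                printf("stale parent_gen, update it\n");
                        i++;
                        j++;
                }
        }
}
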
> +The proposed patchset is the
> +`offline parent pointers repair
> +<
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.g
> it/log/?h=pptrs-repair>`_
> +series.
> +
> +Rebuilding directories from parent pointers in offline repair is
> very
> +challenging because it currently uses a single-pass scan of the
> filesystem
> +during phase 3 to decide which files are corrupt enough to be
> zapped.
> +This scan would have to be converted into a multi-pass scan:
> +
> +1. The first pass of the scan zaps corrupt inodes, forks, and
> attributes
> +   much as it does now.
> +   Corrupt directories are noted but not zapped.
> +
> +2. The next pass records parent pointers pointing to the directories
> noted
> +   as being corrupt in the first pass.
> +   This second pass may have to happen after the phase 4 scan for
> duplicate
> +   blocks, if phase 4 is also capable of zapping directories.
> +
> +3. The third pass resets corrupt directories to an empty shortform
> directory.
> +   Free space metadata has not been ensured yet, so repair cannot
> yet use the
> +   directory building code in libxfs.
> +
> +4. At the start of phase 6, space metadata have been rebuilt.
> +   Use the parent pointer information recorded during step 2 to
> reconstruct
> +   the dirents and add them to the now-empty directories.
> +
> +This code has not yet been constructed.
> +
> +.. _orphanage:
> +
> +The Orphanage
> +-------------
> +
> +Filesystems present files as a directed, and hopefully acyclic,
> graph.
> +In other words, a tree.
> +The root of the filesystem is a directory, and each entry in a
> directory points
> +downwards either to more subdirectories or to non-directory files.
> +Unfortunately, a disruption in the directory graph pointers results
> in a
> +disconnected graph, which makes files impossible to access via
> regular path
> +resolution.
> +The directory parent pointer online scrub code can detect a dotdot
> entry
> +pointing to a parent directory that doesn't have a link back to the
> child
> +directory, and the file link count checker can detect a file that
> isn't pointed
> +to by any directory in the filesystem.
> +If the file in question has a positive link count, the file is an
> +orphan.

Hmm, I kinda felt like this should have flowed into something like:
"now that we have parent pointers, we can reparent them instead of
putting them in the orphanage..."

?
> +
> +When orphans are found, they should be reconnected to the directory
> tree.
> +Offline fsck solves the problem by creating a directory
> ``/lost+found`` to
> +serve as an orphanage, and linking orphan files into the orphanage
> by using the
> +inumber as the name.
> +Reparenting a file to the orphanage does not reset any of its
> permissions or
> +ACLs.
> +
> +This process is more involved in the kernel than it is in userspace.
> +The directory and file link count repair setup functions must use
> the regular
> +VFS mechanisms to create the orphanage directory with all the
> necessary
> +security attributes and dentry cache entries, just like a regular
> directory
> +tree modification.
> +
> +Orphaned files are adopted by the orphanage as follows:
> +
> +1. Call ``xrep_orphanage_try_create`` at the start of the scrub
> setup function
> +   to try to ensure that the lost and found directory actually
> exists.
> +   This also attaches the orphanage directory to the scrub context.
> +
> +2. If the decision is made to reconnect a file, take the IOLOCK of
> both the
> +   orphanage and the file being reattached.
> +   The ``xrep_orphanage_iolock_two`` function follows the inode
> locking
> +   strategy discussed earlier.
> +
> +3. Call ``xrep_orphanage_compute_blkres`` and
> ``xrep_orphanage_compute_name``
> +   to compute the new name in the orphanage and the block
> reservation required.
> +
> +4. Use ``xrep_orphanage_adoption_prep`` to reserve resources to the
> repair
> +   transaction.
> +
> +5. Call ``xrep_orphanage_adopt`` to reparent the orphaned file into
> the lost
> +   and found, and update the kernel dentry cache.
> +
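For anyone else reading along, the adoption sequence strings together
like this.  The xrep_orphanage_* names are the ones above from the
patchset, but the calling convention and the unlock helper shown here
are guesses on my part:

/* Sketch of the adoption sequence; prototypes are illustrative only. */
static int xrep_adopt_orphan(struct xfs_scrub *sc)
{
        int     error;

        /* step 1 (xrep_orphanage_try_create) already ran at setup time */

        error = xrep_orphanage_iolock_two(sc);          /* step 2 */
        if (error)
                return error;

        error = xrep_orphanage_compute_name(sc);        /* step 3 */
        if (error)
                goto out_unlock;
        xrep_orphanage_compute_blkres(sc);

        error = xrep_orphanage_adoption_prep(sc);       /* step 4 */
        if (error)
                goto out_unlock;

        error = xrep_orphanage_adopt(sc);               /* step 5 */
out_unlock:
        xrep_orphanage_iounlock(sc);    /* hypothetical; drops both IOLOCKs */
        return error;
}
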
> +The proposed patches are in the
> +`orphanage adoption
> +<
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> log/?h=repair-orphanage>`_
> +series.

Certainly we'll need to come back and update all the parts that would
be affected by the RFC, but otherwise looks ok.  It seems trying to
document code before it's written tends to cause things to go around
for a while, since we really just can't know how stable a design is
until it's been through at least a few prototypes.

Allison

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 10/14] xfs: document full filesystem scans for online fsck
  2023-02-16 22:48       ` Darrick J. Wong
@ 2023-02-25  7:33         ` Allison Henderson
  2023-03-01 22:09           ` Darrick J. Wong
  0 siblings, 1 reply; 220+ messages in thread
From: Allison Henderson @ 2023-02-25  7:33 UTC (permalink / raw)
  To: djwong
  Cc: Catherine Hoang, david, willy, linux-xfs, Chandan Babu,
	linux-fsdevel, hch

On Thu, 2023-02-16 at 14:48 -0800, Darrick J. Wong wrote:
> On Thu, Feb 16, 2023 at 03:47:20PM +0000, Allison Henderson wrote:
> > On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > Certain parts of the online fsck code need to scan every file in
> > > the
> > > entire filesystem.  It is not acceptable to block the entire
> > > filesystem
> > > while this happens, which means that we need to be clever in
> > > allowing
> > > scans to coordinate with ongoing filesystem updates.  We also
> > > need to
> > > hook the filesystem so that regular updates propagate to the
> > > staging
> > > records.
> > > 
> > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > ---
> > >  .../filesystems/xfs-online-fsck-design.rst         |  677
> > > ++++++++++++++++++++
> > >  1 file changed, 677 insertions(+)
> > > 
> > > 
> > > diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst
> > > b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > index a658da8fe4ae..c0f08a773f08 100644
> > > --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> > > +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > @@ -3018,3 +3018,680 @@ The proposed patchset is the
> > >  `summary counter cleanup
> > >  <
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > > log/?h=repair-fscounters>`_
> > >  series.
> > > +
> > > +Full Filesystem Scans
> > > +---------------------
> > > +
> > > +Certain types of metadata can only be checked by walking every
> > > file
> > > in the
> > > +entire filesystem to record observations and comparing the
> > > observations against
> > > +what's recorded on disk.
> > > +Like every other type of online repair, repairs are made by
> > > writing
> > > those
> > > +observations to disk in a replacement structure and committing
> > > it
> > > atomically.
> > > +However, it is not practical to shut down the entire filesystem
> > > to
> > > examine
> > > +hundreds of billions of files because the downtime would be
> > > excessive.
> > > +Therefore, online fsck must build the infrastructure to manage a
> > > live scan of
> > > +all the files in the filesystem.
> > > +There are two questions that need to be solved to perform a live
> > > walk:
> > > +
> > > +- How does scrub manage the scan while it is collecting data?
> > > +
> > > +- How does the scan keep abreast of changes being made to the
> > > system
> > > by other
> > > +  threads?
> > > +
> > > +.. _iscan:
> > > +
> > > +Coordinated Inode Scans
> > > +```````````````````````
> > > +
> > > +In the original Unix filesystems of the 1970s, each directory
> > > entry
> > > contained
> > > +an index number (*inumber*) which was used as an index into an
> > > ondisk array
> > > +(*itable*) of fixed-size records (*inodes*) describing a file's
> > > attributes and
> > > +its data block mapping.
> > > +This system is described by J. Lions, `"inode (5659)"
> > > +<http://www.lemis.com/grog/Documentation/Lions/>`_ in *Lions'
> > > Commentary on
> > > +UNIX, 6th Edition*, (Dept. of Computer Science, the University
> > > of
> > > New South
> > > +Wales, November 1977), pp. 18-2; and later by D. Ritchie and K.
> > > Thompson,
> > > +`"Implementation of the File System"
> > > +<https://archive.org/details/bstj57-6-1905/page/n8/mode/1up>`_,
> > > from
> > > *The UNIX
> > > +Time-Sharing System*, (The Bell System Technical Journal, July
> > > 1978), pp.
> > > +1913-4.
> > > +
> > > +XFS retains most of this design, except now inumbers are search
> > > keys
> > > over all
> > > +the space in the data section of the filesystem.
> > > +They form a continuous keyspace that can be expressed as a 64-
> > > bit
> > > integer,
> > > +though the inodes themselves are sparsely distributed within the
> > > keyspace.
> > > +Scans proceed in a linear fashion across the inumber keyspace,
> > > starting from
> > > +``0x0`` and ending at ``0xFFFFFFFFFFFFFFFF``.
> > > +Naturally, a scan through a keyspace requires a scan cursor
> > > object
> > > to track the
> > > +scan progress.
> > > +Because this keyspace is sparse, this cursor contains two parts.
> > > +The first part of this scan cursor object tracks the inode that
> > > will
> > > be
> > > +examined next; call this the examination cursor.
> > > +Somewhat less obviously, the scan cursor object must also track
> > > which parts of
> > > +the keyspace have already been visited, which is critical for
> > > deciding if a
> > > +concurrent filesystem update needs to be incorporated into the
> > > scan
> > > data.
> > > +Call this the visited inode cursor.
> > > +
> > > +Advancing the scan cursor is a multi-step process encapsulated
> > > in
> > > +``xchk_iscan_iter``:
> > > +
> > > +1. Lock the AGI buffer of the AG containing the inode pointed to
> > > by
> > > the visited
> > > +   inode cursor.
> > > +   This guarantees that inodes in this AG cannot be allocated or
> > > freed while
> > > +   advancing the cursor.
> > > +
> > > +2. Use the per-AG inode btree to look up the next inumber after
> > > the
> > > one that
> > > +   was just visited, since it may not be keyspace adjacent.
> > > +
> > > +3. If there are no more inodes left in this AG:
> > > +
> > > +   a. Move the examination cursor to the point of the inumber
> > > keyspace that
> > > +      corresponds to the start of the next AG.
> > > +
> > > +   b. Adjust the visited inode cursor to indicate that it has
> > > "visited" the
> > > +      last possible inode in the current AG's inode keyspace.
> > > +      XFS inumbers are segmented, so the cursor needs to be
> > > marked
> > > as having
> > > +      visited the entire keyspace up to just before the start of
> > > the
> > > next AG's
> > > +      inode keyspace.
> > > +
> > > +   c. Unlock the AGI and return to step 1 if there are
> > > unexamined
> > > AGs in the
> > > +      filesystem.
> > > +
> > > +   d. If there are no more AGs to examine, set both cursors to
> > > the
> > > end of the
> > > +      inumber keyspace.
> > > +      The scan is now complete.
> > > +
> > > +4. Otherwise, there is at least one more inode to scan in this
> > > AG:
> > > +
> > > +   a. Move the examination cursor ahead to the next inode marked
> > > as
> > > allocated
> > > +      by the inode btree.
> > > +
> > > +   b. Adjust the visited inode cursor to point to the inode just
> > > prior to where
> > > +      the examination cursor is now.
> > > +      Because the scanner holds the AGI buffer lock, no inodes
> > > could
> > > have been
> > > +      created in the part of the inode keyspace that the visited
> > > inode cursor
> > > +      just advanced.
> > > +
> > > +5. Get the incore inode for the inumber of the examination
> > > cursor.
> > > +   By maintaining the AGI buffer lock until this point, the
> > > scanner
> > > knows that
> > > +   it was safe to advance the examination cursor across the
> > > entire
> > > keyspace,
> > > +   and that it has stabilized this next inode so that it cannot
> > > disappear from
> > > +   the filesystem until the scan releases the incore inode.
> > > +
> > > +6. Drop the AGI lock and return the incore inode to the caller.
> > > +
> > > +Online fsck functions scan all files in the filesystem as
> > > follows:
> > > +
> > > +1. Start a scan by calling ``xchk_iscan_start``.
> > Hmm, I actually did not find xchk_iscan_start in the below branch,
> > I
> > found xchk_iscan_iter in "xfs: implement live inode scan for
> > scrub",
> > but it doesn't look like anything uses it yet, at least not in that
> > branch.
> 
> <nod> The topic branch linked below has the implementation, but no
> users.  The first user is online quotacheck, which is in the next
> branch
> after that:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quotacheck
> 
> Specifically, this patch:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=repair-quotacheck&id=3640515b9282514d91a407b6aa8d8b73caa123c5
> 
> I'll restate what you probably saw in the commit message for this
> email discussion:
> 
> This "one branch to introduce a new infrastructure and a second
> branch
> to actually use it" pattern is a result of reviewer requests for
> smaller
> more focused branches.  This has turned out to be useful in practice
> because it's easier to move just these pieces up and down in the
> branch
> as needed.  The inode scan was originally developed for rmapbt repair
> (which comes *much* later) and I moved it up once I realized that
> quotacheck has far fewer dependencies and hence all of this could
> come
> earlier.
> 
> You're right that this section ought to point to an actual user of
> the
> functionality.  Will fix. :)

Alrighty then, sounds good

> 
> > Also, it took me a bit to figure out that "initial user" meant
> > "calling
> > function"
> 
> Er... are you talking about the sentence "...new code is split out as
> a
> separate patch from its initial user" in the patch commit message?
> 
> Maybe I should reword that:
> 
> "This new code is a separate patch from the patches adding callers
> for
> the sake of enabling the author to move patches around his tree..."
Yes, I think that's clearer :-)

> 
> > > +
> > > +2. Advance the scan cursor (``xchk_iscan_iter``) to get the next
> > > inode.
> > > +   If one is provided:
> > > +
> > > +   a. Lock the inode to prevent updates during the scan.
> > > +
> > > +   b. Scan the inode.
> > > +
> > > +   c. While still holding the inode lock, adjust the visited
> > > inode
> > > cursor
> > > +      (``xchk_iscan_mark_visited``) to point to this inode.
> > > +
> > > +   d. Unlock and release the inode.
> > > +
> > > +8. Call ``xchk_iscan_finish`` to complete the scan.
> > > +
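To restate the loop described above as code: the xchk_iscan_* names are
from the patchset, but the prototypes, the lock flavor, and the "1 means
here's an inode, 0 means done" iterator convention are inferred from the
description, and xchk_example_scan_inode() is made up:

static int xchk_example_full_scan(struct xfs_scrub *sc)
{
        struct xchk_iscan       iscan;
        struct xfs_inode        *ip;
        int                     error;

        xchk_iscan_start(sc, &iscan);                           /* step 1 */

        while ((error = xchk_iscan_iter(&iscan, &ip)) == 1) {   /* step 2 */
                xfs_ilock(ip, XFS_ILOCK_EXCL);                  /* 2a */
                error = xchk_example_scan_inode(sc, ip);        /* 2b */
                if (!error)
                        xchk_iscan_mark_visited(&iscan, ip);    /* 2c */
                xfs_iunlock(ip, XFS_ILOCK_EXCL);                /* 2d */
                xfs_irele(ip);
                if (error)
                        break;
        }

        xchk_iscan_finish(&iscan);                              /* step 8 */
        return error;
}
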
> > > +There are subtleties with the inode cache that complicate
> > > grabbing
> > > the incore
> > > +inode for the caller.
> > > +Obviously, it is an absolute requirement that the inode metadata
> > > be
> > > consistent
> > > +enough to load it into the inode cache.
> > > +Second, if the incore inode is stuck in some intermediate state,
> > > the
> > > scan
> > > +coordinator must release the AGI and push the main filesystem to
> > > get
> > > the inode
> > > +back into a loadable state.
> > > +
> > > +The proposed patches are the
> > > +`inode scanner
> > > +<
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > > log/?h=scrub-iscan>`_
> > > +series.
> > > +
> > > +Inode Management
> > > +````````````````
> > > +
> > > +In regular filesystem code, references to allocated XFS incore
> > > inodes are
> > > +always obtained (``xfs_iget``) outside of transaction context
> > > because the
> > > +creation of the incore context for ane xisting file does not
> > > require
> > an existing
> 
> Corrected, thank you.
> 
> > > metadata
> > > +updates.
> > > +However, it is important to note that references to incore
> > > inodes
> > > obtained as
> > > +part of file creation must be performed in transaction context
> > > because the
> > > +filesystem must ensure the atomicity of the ondisk inode btree
> > > index
> > > updates
> > > +and the initialization of the actual ondisk inode.
> > > +
> > > +References to incore inodes are always released (``xfs_irele``)
> > > outside of
> > > +transaction context because there are a handful of activities
> > > that
> > > might
> > > +require ondisk updates:
> > > +
> > > +- The VFS may decide to kick off writeback as part of a
> > > ``DONTCACHE`` inode
> > > +  release.
> > > +
> > > +- Speculative preallocations need to be unreserved.
> > > +
> > > +- An unlinked file may have lost its last reference, in which
> > > case
> > > the entire
> > > +  file must be inactivated, which involves releasing all of its
> > > resources in
> > > +  the ondisk metadata and freeing the inode.
> > > +
> > > +These activities are collectively called inode inactivation.
> > > +Inactivation has two parts -- the VFS part, which initiates
> > > writeback on all
> > > +dirty file pages, and the XFS part, which cleans up XFS-specific
> > > information
> > > +and frees the inode if it was unlinked.
> > > +If the inode is unlinked (or unconnected after a file handle
> > > operation), the
> > > +kernel drops the inode into the inactivation machinery
> > > immediately.
> > > +
> > > +During normal operation, resource acquisition for an update
> > > follows
> > > this order
> > > +to avoid deadlocks:
> > > +
> > > +1. Inode reference (``iget``).
> > > +
> > > +2. Filesystem freeze protection, if repairing
> > > (``mnt_want_write_file``).
> > > +
> > > +3. Inode ``IOLOCK`` (VFS ``i_rwsem``) lock to control file IO.
> > > +
> > > +4. Inode ``MMAPLOCK`` (page cache ``invalidate_lock``) lock for
> > > operations that
> > > +   can update page cache mappings.
> > > +
> > > +5. Log feature enablement.
> > > +
> > > +6. Transaction log space grant.
> > > +
> > > +7. Space on the data and realtime devices for the transaction.
> > > +
> > > +8. Incore dquot references, if a file is being repaired.
> > > +   Note that they are not locked, merely acquired.
> > > +
> > > +9. Inode ``ILOCK`` for file metadata updates.
> > > +
> > > +10. AG header buffer locks / Realtime metadata inode ILOCK.
> > > +
> > > +11. Realtime metadata buffer locks, if applicable.
> > > +
> > > +12. Extent mapping btree blocks, if applicable.
> > > +
> > > +Resources are often released in the reverse order, though this
> > > is
> > > not required.
> > > +However, online fsck differs from regular XFS operations because
> > > it
> > > may examine
> > > +an object that normally is acquired in a later stage of the
> > > locking
> > > order, and
> > > +then decide to cross-reference the object with an object that is
> > > acquired
> > > +earlier in the order.
> > > +The next few sections detail the specific ways in which online
> > > fsck
> > > takes care
> > > +to avoid deadlocks.
> > > +
> > > +iget and irele During a Scrub
> > > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > +
> > > +An inode scan performed on behalf of a scrub operation runs in
> > > transaction
> > > +context, and possibly with resources already locked and bound to
> > > it.
> > > +This isn't much of a problem for ``iget`` since it can operate
> > > in
> > > the context
> > > +of an existing transaction, as long as all of the bound
> > > resources
> > > are acquired
> > > +before the inode reference in the regular filesystem.
> > > +
> > > +When the VFS ``iput`` function is given a linked inode with no
> > > other
> > > +references, it normally puts the inode on an LRU list in the
> > > hope
> > > that it can
> > > +save time if another process re-opens the file before the system
> > > runs out
> > > +of memory and frees it.
> > > +Filesystem callers can short-circuit the LRU process by setting
> > > a
> > > ``DONTCACHE``
> > > +flag on the inode to cause the kernel to try to drop the inode
> > > into
> > > the
> > > +inactivation machinery immediately.
> > > +
> > > +In the past, inactivation was always done from the process that
> > > dropped the
> > > +inode, which was a problem for scrub because scrub may already
> > > hold
> > > a
> > > +transaction, and XFS does not support nesting transactions.
> > > +On the other hand, if there is no scrub transaction, it is
> > > desirable
> > > to drop
> > > +otherwise unused inodes immediately to avoid polluting caches.
> > > +To capture these nuances, the online fsck code has a separate
> > > ``xchk_irele``
> > > +function to set or clear the ``DONTCACHE`` flag to get the
> > > required
> > > release
> > > +behavior.
> > > +
> > > +Proposed patchsets include fixing
> > > +`scrub iget usage
> > > +<
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > > log/?h=scrub-iget-fixes>`_ and
> > > +`dir iget usage
> > > +<
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > > log/?h=scrub-dir-iget-fixes>`_.
> > > +
> > > +Locking Inodes
> > > +^^^^^^^^^^^^^^
> > > +
> > > +In regular filesystem code, the VFS and XFS will acquire
> > > multiple
> > > IOLOCK locks
> > > +in a well-known order: parent → child when updating the
> > > directory
> > > tree, and
> > > +``struct inode`` address order otherwise.
> > > +For regular files, the MMAPLOCK can be acquired after the IOLOCK
> > > to
> > > stop page
> > > +faults.
> > > +If two MMAPLOCKs must be acquired, they are acquired in 
> > 
> > 
> > > ``struct
> > > +address_space`` order.
> > the order of their memory address
> > 
> > ?
> 
> Urghg.  I think I need to clarify this more:
> 
> "...they are acquired in numerical order of the addresses of their
> ``struct address_space`` objects."
> 
> See filemap_invalidate_lock_two.
> 
Yep, I think that works

> > > +Due to the structure of existing filesystem code, IOLOCKs and
> > > MMAPLOCKs must be
> > > +acquired before transactions are allocated.
> > > +If two ILOCKs must be acquired, they are acquired in inumber
> > > order.
> > > +
> > > +Inode lock acquisition must be done carefully during a
> > > coordinated
> > > inode scan.
> > > +Online fsck cannot abide these conventions, because for a
> > > directory
> > > tree
> > > +scanner, the scrub process holds the IOLOCK of the file being
> > > scanned and it
> > > +needs to take the IOLOCK of the file at the other end of the
> > > directory link.
> > > +If the directory tree is corrupt because it contains a cycle,
> > > ``xfs_scrub``
> > > +cannot use the regular inode locking functions and avoid
> > > becoming
> > > trapped in an
> > > +ABBA deadlock.
> > > +
> > > +Solving both of these problems is straightforward -- any time
> > > online
> > > fsck
> > > +needs to take a second lock of the same class, it uses trylock
> > > to
> > > avoid an ABBA
> > > +deadlock.
> > > +If the trylock fails, scrub drops all inode locks and uses
> > > trylock
> > > loops to
> > > +(re)acquire all necessary resources.
> > > +Trylock loops enable scrub to check for pending fatal signals,
> > > which
> > > is how
> > > +scrub avoids deadlocking the filesystem or becoming an
> > > unresponsive
> > > process.
> > > +However, trylock loops mean that online fsck must be prepared
> > > to
> > > measure the
> > > +resource being scrubbed before and after the lock cycle to
> > > detect
> > > changes and
> > > +react accordingly.
> > > +
> > > +.. _dirparent:
> > > +
> > > +Case Study: Finding a Directory Parent
> > > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > +
> > > +Consider the directory parent pointer repair code as an example.
> > > +Online fsck must verify that the dotdot dirent of a directory
> > > points
> > > up to a
> > > +parent directory, and that the parent directory contains exactly
> > > one
> > > dirent
> > > +pointing down to the child directory.
> > > +Fully validating this relationship (and repairing it if
> > > possible)
> > > requires a
> > > +walk of every directory on the filesystem while holding the
> > > child
> > > locked, and
> > > +while updates to the directory tree are being made.
> > > +The coordinated inode scan provides a way to walk the filesystem
> > > without the
> > > +possibility of missing an inode.
> > > +The child directory is kept locked to prevent updates to the
> > > dotdot
> > > dirent, but
> > > +if the scanner fails to lock a parent, it can drop and relock
> > > both
> > > the child
> > > +and the prospective parent.
> > > +If the dotdot entry changes while the directory is unlocked,
> > > then a
> > > move or
> > > +rename operation must have changed the child's parentage, and
> > > the
> > > scan can
> > > +exit early.
> > > +
> > > +The proposed patchset is the
> > > +`directory repair
> > > +<
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > > log/?h=repair-dirs>`_
> > > +series.
> > > +
> > > +.. _fshooks:
> > > +
> > > +Filesystem Hooks
> > > +`````````````````
> > > +
> > > +The second piece of support that online fsck functions need
> > > during a
> > > full
> > > +filesystem scan is the ability to stay informed about updates
> > > being
> > > made by
> > > +other threads in the filesystem, since comparisons against the
> > > past
> > > are useless
> > > +in a dynamic environment.
> > > +Two pieces of Linux kernel infrastructure enable online fsck to
> > > monitor regular
> > > +filesystem operations: filesystem hooks and :ref:`static
> > > keys<jump_labels>`.
> > > +
> > > +Filesystem hooks convey information about an ongoing filesystem
> > > operation to
> > > +a downstream consumer.
> > > +In this case, the downstream consumer is always an online fsck
> > > function.
> > > +Because multiple fsck functions can run in parallel, online fsck
> > > uses the Linux
> > > +notifier call chain facility to dispatch updates to any number
> > > of
> > > interested
> > > +fsck processes.
> > > +Call chains are a dynamic list, which means that they can be
> > > configured at
> > > +run time.
> > > +Because these hooks are private to the XFS module, the
> > > information
> > > passed along
> > > +contains exactly what the checking function needs to update its
> > > observations.
> > > +
> > > +The current implementation of XFS hooks uses SRCU notifier
> > > chains to
> > > reduce the
> > > +impact to highly threaded workloads.
> > > +Regular blocking notifier chains use a rwsem and seem to have a
> > > much
> > > lower
> > > +overhead for single-threaded applications.
> > > +However, it may turn out that the combination of blocking chains
> > > and
> > > static
> > > +keys are a more performant combination; more study is needed
> > > here.
> > > +
> > > +The following pieces are necessary to hook a certain point in
> > > the
> > > filesystem:
> > > +
> > > +- A ``struct xfs_hooks`` object must be embedded in a convenient
> > > place such as
> > > +  a well-known incore filesystem object.
> > > +
> > > +- Each hook must define an action code and a structure
> > > containing
> > > more context
> > > +  about the action.
> > > +
> > > +- Hook providers should provide appropriate wrapper functions
> > > and
> > > structs
> > > +  around the ``xfs_hooks`` and ``xfs_hook`` objects to take
> > > advantage of type
> > > +  checking to ensure correct usage.
> > > +
> > > +- A callsite in the regular filesystem code must be chosen to
> > > call
> > > +  ``xfs_hooks_call`` with the action code and data structure.
> > > +  This place should be adjacent to (and not earlier than) the
> > > place
> > > where
> > > +  the filesystem update is committed to the transaction.
> > > +  In general, when the filesystem calls a hook chain, it should
> > > be
> > > able to
> > > +  handle sleeping and should not be vulnerable to memory reclaim
> > > or
> > > locking
> > > +  recursion.
> > > +  However, the exact requirements are very dependent on the
> > > context
> > > of the hook
> > > +  caller and the callee.
> > > +
> > > +- The online fsck function should define a structure to hold
> > > scan
> > > data, a lock
> > > +  to coordinate access to the scan data, and a ``struct
> > > xfs_hook``
> > > object.
> > > +  The scanner function and the regular filesystem code must
> > > acquire
> > > resources
> > > +  in the same order; see the next section for details.
> > > +
> > > +- The online fsck code must contain a C function to catch the
> > > hook
> > > action code
> > > +  and data structure.
> > > +  If the object being updated has already been visited by the
> > > scan,
> > > then the
> > > +  hook information must be applied to the scan data.
> > > +
> > > +- Prior to unlocking inodes to start the scan, online fsck must
> > > call
> > > +  ``xfs_hooks_setup`` to initialize the ``struct xfs_hook``, and
> > > +  ``xfs_hooks_add`` to enable the hook.
> > > +
> > > +- Online fsck must call ``xfs_hooks_del`` to disable the hook
> > > once
> > > the scan is
> > > +  complete.
> > > +
> > > +The number of hooks should be kept to a minimum to reduce
> > > complexity.
> > > +Static keys are used to reduce the overhead of filesystem hooks
> > > to
> > > nearly
> > > +zero when online fsck is not running.
> > > +
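Putting the bullets above together, I imagine a hook ends up looking
roughly like the sketch below.  The action code, the params struct, and
the m_dirent_hooks field are invented for the example, and the
xfs_hooks_* prototypes shown are guesses:

/* Hypothetical directory-entry hook, just to show the moving parts. */
enum xfs_dirent_hook_action {
        XFS_DIRENT_HOOK_ADD,
        XFS_DIRENT_HOOK_REMOVE,
};

struct xfs_dirent_hook_params {
        struct xfs_inode        *dp;    /* parent directory */
        struct xfs_inode        *ip;    /* child */
        const struct xfs_name   *name;
};

/* Callsite in the regular code, adjacent to where the update commits: */
static void xfs_dir_hook_add(struct xfs_mount *mp,
                             struct xfs_dirent_hook_params *p)
{
        xfs_hooks_call(&mp->m_dirent_hooks, XFS_DIRENT_HOOK_ADD, p);
}

/*
 * Scrub side, before unlocking inodes to start the scan:
 *      xfs_hooks_setup(&scan->dhook, xchk_example_dirent_hook);
 *      xfs_hooks_add(&mp->m_dirent_hooks, &scan->dhook);
 * and when the scan is done:
 *      xfs_hooks_del(&mp->m_dirent_hooks, &scan->dhook);
 */
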
> > > +.. _liveupdate:
> > > +
> > > +Live Updates During a Scan
> > > +``````````````````````````
> > > +
> > > +The code paths of the online fsck scanning code and the
> > > :ref:`hooked<fshooks>`
> > > +filesystem code look like this::
> > > +
> > > +            other program
> > > +                  ↓
> > > +            inode lock ←────────────────────┐
> > > +                  ↓                         │
> > > +            AG header lock                  │
> > > +                  ↓                         │
> > > +            filesystem function             │
> > > +                  ↓                         │
> > > +            notifier call chain             │    same
> > > +                  ↓                         ├─── inode
> > > +            scrub hook function             │    lock
> > > +                  ↓                         │
> > > +            scan data mutex ←──┐    same    │
> > > +                  ↓            ├─── scan    │
> > > +            update scan data   │    lock    │
> > > +                  ↑            │            │
> > > +            scan data mutex ←──┘            │
> > > +                  ↑                         │
> > > +            inode lock ←────────────────────┘
> > > +                  ↑
> > > +            scrub function
> > > +                  ↑
> > > +            inode scanner
> > > +                  ↑
> > > +            xfs_scrub
> > > +
> > > +These rules must be followed to ensure correct interactions
> > > between
> > > the
> > > +checking code and the code making an update to the filesystem:
> > > +
> > > +- Prior to invoking the notifier call chain, the filesystem
> > > function
> > > being
> > > +  hooked must acquire the same lock that the scrub scanning
> > > function
> > > acquires
> > > +  to scan the inode.
> > > +
> > > +- The scanning function and the scrub hook function must
> > > coordinate
> > > access to
> > > +  the scan data by acquiring a lock on the scan data.
> > > +
> > > +- Scrub hook functions must not add the live update information
> > > to
> > > the scan
> > > +  observations unless the inode being updated has already been
> > > scanned.
> > > +  The scan coordinator has a helper predicate
> > > (``xchk_iscan_want_live_update``)
> > > +  for this.
> > > +
> > > +- Scrub hook functions must not change the caller's state,
> > > including
> > > the
> > > +  transaction that it is running.
> > > +  They must not acquire any resources that might conflict with
> > > the
> > > filesystem
> > > +  function being hooked.
> > > +
> > > +- The hook function can abort the inode scan to avoid breaking
> > > the
> > > other rules.
> > > +
> > > +The inode scan APIs are pretty simple:
> > > +
> > > +- ``xchk_iscan_start`` starts a scan
> > > +
> > > +- ``xchk_iscan_iter`` grabs a reference to the next inode in the
> > > scan or
> > > +  returns zero if there is nothing left to scan
> > > +
> > > +- ``xchk_iscan_want_live_update`` to decide if an inode has
> > > already
> > > been
> > > +  visited in the scan.
> > > +  This is critical for hook functions to decide if they need to
> > > update the
> > > +  in-memory scan information.
> > > +
> > > +- ``xchk_iscan_mark_visited`` to mark an inode as having been
> > > visited in the
> > > +  scan
> > > +
> > > +- ``xchk_iscan_finish`` to finish the scan
> > > +
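And a consumer-side sketch, continuing the hypothetical dirent hook from
my earlier comment, so I can check that I have the rules straight.  Only
the xchk_iscan_want_live_update call is from the patchset, and its
prototype here is a guess; the notifier signature, the
xchk_example_scan layout, and the apply helper are mine:

static int xchk_example_dirent_hook(struct notifier_block *nb,
                                    unsigned long action, void *data)
{
        struct xfs_dirent_hook_params   *p = data;
        struct xchk_example_scan        *scan;

        scan = container_of(nb, struct xchk_example_scan, dhook.nb);

        /* Only fold in updates for files the scan has already visited. */
        if (!xchk_iscan_want_live_update(&scan->iscan, p->dp->i_ino))
                return NOTIFY_DONE;

        /* Touch only the scan data, and only under the scan data mutex. */
        mutex_lock(&scan->lock);
        xchk_example_apply_update(scan, action, p);     /* hypothetical */
        mutex_unlock(&scan->lock);

        return NOTIFY_DONE;
}
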
> > > +The proposed patches are at the start of the
> > > +`online quotacheck
> > > +<
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > > log/?h=repair-quota>`_
> > > +series.
> > Wrong link?  This looks like it goes to the section below.
> 
> Oops.  This one should link to scrub-iscan, and the next one should
> link
> to repair-quotacheck.
> 
> > > +
> > > +.. _quotacheck:
> > > +
> > > +Case Study: Quota Counter Checking
> > > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > +
> > > +It is useful to compare the mount time quotacheck code to the
> > > online
> > > repair
> > > +quotacheck code.
> > > +Mount time quotacheck does not have to contend with concurrent
> > > operations, so
> > > +it does the following:
> > > +
> > > +1. Make sure the ondisk dquots are in good enough shape that all
> > > the
> > > incore
> > > +   dquots will actually load, and zero the resource usage
> > > counters
> > > in the
> > > +   ondisk buffer.
> > > +
> > > +2. Walk every inode in the filesystem.
> > > +   Add each file's resource usage to the incore dquot.
> > > +
> > > +3. Walk each incore dquot.
> > > +   If the incore dquot is not being flushed, add the ondisk
> > > buffer
> > > backing the
> > > +   incore dquot to a delayed write (delwri) list.
> > > +
> > > +4. Write the buffer list to disk.
> > > +
> > > +Like most online fsck functions, online quotacheck can't write
> > > to
> > > regular
> > > +filesystem objects until the newly collected metadata reflect
> > > all
> > > filesystem
> > > +state.
> > > +Therefore, online quotacheck records file resource usage to a
> > > shadow
> > > dquot
> > > +index implemented with a sparse ``xfarray``, and only writes to
> > > the
> > > real dquots
> > > +once the scan is complete.
> > > +Handling transactional updates is tricky because quota resource
> > > usage updates
> > > +are handled in phases to minimize contention on dquots:
> > > +
> > > +1. The inodes involved are joined and locked to a transaction.
> > > +
> > > +2. For each dquot attached to the file:
> > > +
> > > +   a. The dquot is locked.
> > > +
> > > +   b. A quota reservation is added to the dquot's resource
> > > usage.
> > > +      The reservation is recorded in the transaction.
> > > +
> > > +   c. The dquot is unlocked.
> > > +
> > > +3. Changes in actual quota usage are tracked in the transaction.
> > > +
> > > +4. At transaction commit time, each dquot is examined again:
> > > +
> > > +   a. The dquot is locked again.
> > > +
> > > +   b. Quota usage changes are logged and unused reservation is
> > > given
> > > back to
> > > +      the dquot.
> > > +
> > > +   c. The dquot is unlocked.
> > > +
> > > +For online quotacheck, hooks are placed in steps 2 and 4.
> > > +The step 2 hook creates a shadow version of the transaction
> > > dquot
> > > context
> > > +(``dqtrx``) that operates in a similar manner to the regular
> > > code.
> > > +The step 4 hook commits the shadow ``dqtrx`` changes to the
> > > shadow
> > > dquots.
> > > +Notice that both hooks are called with the inode locked, which
> > > is
> > > how the
> > > +live update coordinates with the inode scanner.
> > > +
> > > +The quotacheck scan looks like this:
> > > +
> > > +1. Set up a coordinated inode scan.
> > > +
> > > +2. For each inode returned by the inode scan iterator:
> > > +
> > > +   a. Grab and lock the inode.
> > > +
> > > +   b. Determine that inode's resource usage (data blocks, inode
> > > counts,
> > > +      realtime blocks) 
> > nit: move this list to the first appearance of "resource usage". 
> > Step
> > 2 of the first list I think
> 
> I don't understand this proposed change.  Are you talking about "2.
> For
> each dquot attached to the file:" above?  That list describes the
> steps
> taken by regular code wanting to allocate file space that's accounted
> to
> quotas.  This list describes what online quotacheck does.  The two
> don't
> mix.
Oh, you're right, disregard this one

> 
> > > and add that to the shadow dquots for the user, group,
> > > +      and project ids associated with the inode.
> > > +
> > > +   c. Unlock and release the inode.
> > > +
> > > +3. For each dquot in the system:
> > > +
> > > +   a. Grab and lock the dquot.
> > > +
> > > +   b. Check the dquot against the shadow dquots created by the
> > > scan
> > > and updated
> > > +      by the live hooks.
> > > +
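Step 2b boils down to bucketed accumulation; a standalone sketch of how
I read it, with the struct names and the shadow_get() lookup invented
(shadow_get stands in for the sparse xfarray lookup):

#include <stdint.h>

/* Shadow resource usage for one dquot id, kept in the sparse xfarray. */
struct shadow_dquot {
        uint64_t        bcount;         /* data device blocks */
        uint64_t        icount;         /* inodes */
        uint64_t        rtbcount;       /* realtime blocks */
};

enum owner_type { OWNER_USER, OWNER_GROUP, OWNER_PROJ };

/* Hypothetical lookup into the shadow array, one array per quota type. */
struct shadow_dquot *shadow_get(enum owner_type type, uint32_t id);

/* Fold one scanned inode's usage into all three of its shadow dquots. */
static void quotacheck_observe_inode(uint32_t uid, uint32_t gid, uint32_t prid,
                                     uint64_t bcount, uint64_t icount,
                                     uint64_t rtbcount)
{
        const struct { enum owner_type type; uint32_t id; } owners[] = {
                { OWNER_USER,  uid  },
                { OWNER_GROUP, gid  },
                { OWNER_PROJ,  prid },
        };

        for (int i = 0; i < 3; i++) {
                struct shadow_dquot *sdq = shadow_get(owners[i].type,
                                                      owners[i].id);

                sdq->bcount += bcount;
                sdq->icount += icount;
                sdq->rtbcount += rtbcount;
        }
}
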
> > > +Live updates are key to being able to walk every quota record
> > > without
> > > +needing to hold any locks for a long duration.
> > > +If repairs are desired, the real and shadow dquots are locked
> > > and
> > > their
> > > +resource counts are set to the values in the shadow dquot.
> > > +
> > > +The proposed patchset is the
> > > +`online quotacheck
> > > +<
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > > log/?h=repair-quota>`_
> 
> Changed from repair-quota to repair-quotacheck.
> 
> > > +series.
> > > +
> > > +.. _nlinks:
> > > +
> > > +Case Study: File Link Count Checking
> > > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > +
> > > +File link count checking also uses live update hooks.
> > > +The coordinated inode scanner is used to visit all directories
> > > on
> > > the
> > > +filesystem, and per-file link count records are stored in a
> > > sparse
> > > ``xfarray``
> > > +indexed by inumber.
> > > +During the scanning phase, each entry in a directory generates
> > > observation
> > > +data as follows:
> > > +
> > > +1. If the entry is a dotdot (``'..'``) entry of the root
> > > directory,
> > > the
> > > +   directory's parent link count is bumped because the root
> > > directory's dotdot
> > > +   entry is self referential.
> > > +
> > > +2. If the entry is a dotdot entry of a subdirectory, the
> > > parent's
> > > backref
> > > +   count is bumped.
> > > +
> > > +3. If the entry is neither a dot nor a dotdot entry, the target
> > > file's parent
> > > +   count is bumped.
> > > +
> > > +4. If the target is a subdirectory, the parent's child link
> > > count is
> > > bumped.
> > > +
> > > +A crucial point to understand about how the link count inode
> > > scanner
> > > interacts
> > > +with the live update hooks is that the scan cursor tracks which
> > > *parent*
> > > +directories have been scanned.
> > > +In other words, the live updates ignore any update about ``A →
> > > B``
> > > when A has
> > > +not been scanned, even if B has been scanned.
> > > +Furthermore, a subdirectory A with a dotdot entry pointing back
> > > to B
> > > is
> > > +accounted as a backref counter in the shadow data for A, since
> > > child
> > > dotdot
> > > +entries affect the parent's link count.
> > > +Live update hooks are carefully placed in all parts of the
> > > filesystem that
> > > +create, change, or remove directory entries, since those
> > > operations
> > > involve
> > > +bumplink and droplink.
> > > +
> > > +For any file, the correct link count is the number of parents
> > > plus
> > > the number
> > > +of child subdirectories.
> > > +Non-directories never have children of any kind.
> > > +The backref information is used to detect inconsistencies in the
> > > number of
> > > +links pointing to child subdirectories and the number of dotdot
> > > entries
> > > +pointing back.
> > > +
> > > +After the scan completes, the link count of each file can be
> > > checked
> > > by locking
> > > +both the inode and the shadow data, and comparing the link
> > > counts.
> > > +A second coordinated inode scan cursor is used for comparisons.
> > > +Live updates are key to being able to walk every inode without
> > > needing to hold
> > > +any locks between inodes.
> > > +If repairs are desired, the inode's link count is set to the
> > > value
> > > in the
> > > +shadow information.
> > > +If no parents are found, the file must be :ref:`reparented
> > > <orphanage>` to the
> > > +orphanage to prevent the file from being lost forever.
> > > +
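If I've got the four observation rules right, the per-dirent step is
roughly the following.  This is a self-contained sketch; obs_for()
stands in for the sparse xfarray lookup and the struct layout is mine:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Shadow link count observations for one file, one record per inumber. */
struct nlink_obs {
        uint64_t        parents;        /* dirents pointing at this file */
        uint64_t        children;       /* subdirectories inside this dir */
        uint64_t        backrefs;       /* dotdot entries naming this dir */
};

/* Hypothetical lookup into the sparse xfarray of observations. */
struct nlink_obs *obs_for(uint64_t ino);

/* Apply one observed dirent (dir_ino -> target_ino) using rules 1-4. */
static void observe_dirent(uint64_t dir_ino, uint64_t target_ino,
                           const char *name, bool target_is_dir,
                           uint64_t root_ino)
{
        if (!strcmp(name, "."))
                return;
        if (!strcmp(name, "..")) {
                if (dir_ino == root_ino)
                        obs_for(dir_ino)->parents++;    /* rule 1 */
                else
                        obs_for(target_ino)->backrefs++; /* rule 2 */
                return;
        }
        obs_for(target_ino)->parents++;                 /* rule 3 */
        if (target_is_dir)
                obs_for(dir_ino)->children++;           /* rule 4 */

        /* Expected ondisk link count for any file: parents + children. */
}
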
> > > +The proposed patchset is the
> > > +`file link count repair
> > > +<
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > > log/?h=scrub-nlinks>`_
> > > +series.
> > > +
> > > +.. _rmap_repair:
> > > +
> > > +Case Study: Rebuilding Reverse Mapping Records
> > > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > +
> > > +Most repair functions follow the same pattern: lock filesystem
> > > resources,
> > > +walk the surviving ondisk metadata looking for replacement
> > > metadata
> > > records,
> > > +and use an :ref:`in-memory array <xfarray>` to store the
> > > gathered
> > > observations.
> > > +The primary advantage of this approach is the simplicity and
> > > modularity of the
> > > +repair code -- code and data are entirely contained within the
> > > scrub
> > > module,
> > > +do not require hooks in the main filesystem, and are usually the
> > > most efficient
> > > +in memory use.
> > > +A secondary advantage of this repair approach is atomicity --
> > > once
> > > the kernel
> > > +decides a structure is corrupt, no other threads can access the
> > > metadata until
> > > +the kernel finishes repairing and revalidating the metadata.
> > > +
> > > +For repairs going on within a shard of the filesystem, these
> > > advantages
> > > +outweigh the delays inherent in locking the shard while
> > > repairing
> > > parts of the
> > > +shard.
> > > +Unfortunately, repairs to the reverse mapping btree cannot use
> > > the
> > > "standard"
> > > +btree repair strategy because it must scan every space mapping
> > > of
> > > every fork of
> > > +every file in the filesystem, and the filesystem cannot stop.
> > > +Therefore, rmap repair foregoes atomicity between scrub and
> > > repair.
> > > +It combines a :ref:`coordinated inode scanner <iscan>`,
> > > :ref:`live
> > > update hooks
> > > +<liveupdate>`, and an :ref:`in-memory rmap btree <xfbtree>` to
> > > complete the
> > > +scan for reverse mapping records.
> > > +
> > > +1. Set up an xfbtree to stage rmap records.
> > > +
> > > +2. While holding the locks on the AGI and AGF buffers acquired
> > > during the
> > > +   scrub, generate reverse mappings for all AG metadata: inodes,
> > > btrees, CoW
> > > +   staging extents, and the internal log.
> > > +
> > > +3. Set up an inode scanner.
> > > +
> > > +4. Hook into rmap updates for the AG being repaired so that the
> > > live
> > > scan data
> > > +   can receive updates to the rmap btree from the rest of the
> > > filesystem during
> > > +   the file scan.
> > > +
> > > +5. For each space mapping found in either fork of each file
> > > scanned,
> > > +   decide if the mapping matches the AG of interest.
> > > +   If so:
> > > +
> > > +   a. Create a btree cursor for the in-memory btree.
> > > +
> > > +   b. Use the rmap code to add the record to the in-memory
> > > btree.
> > > +
> > > +   c. Use the :ref:`special commit function <xfbtree_commit>` to
> > > write the
> > > +      xfbtree changes to the xfile.
> > > +
> > > +6. For each live update received via the hook, decide if the
> > > owner
> > > has already
> > > +   been scanned.
> > > +   If so, apply the live update into the scan data:
> > > +
> > > +   a. Create a btree cursor for the in-memory btree.
> > > +
> > > +   b. Replay the operation into the in-memory btree.
> > > +
> > > +   c. Use the :ref:`special commit function <xfbtree_commit>` to
> > > write the
> > > +      xfbtree changes to the xfile.
> > > +      This is performed with an empty transaction to avoid
> > > changing
> > > the
> > > +      caller's state.
> > > +
> > > +7. When the inode scan finishes, create a new scrub transaction
> > > and
> > > relock the
> > > +   two AG headers.
> > > +
> > > +8. Compute the new btree geometry using the number of rmap
> > > records
> > > in the
> > > +   shadow btree, like all other btree rebuilding functions.
> > > +
> > > +9. Allocate the number of blocks computed in the previous step.
> > > +
> > > +10. Perform the usual btree bulk loading and commit to install
> > > the
> > > new rmap
> > > +    btree.
> > > +
> > > +11. Reap the old rmap btree blocks as discussed in the case
> > > study
> > > about how
> > > +    to :ref:`reap after rmap btree repair <rmap_reap>`.
> > > +
> > > +12. Free the xfbtree now that it is no longer needed.
> > > +
> > > +The proposed patchset is the
> > > +`rmap repair
> > > +<
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > > log/?h=repair-rmap-btree>`_
> > > +series.
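
A rough sketch of the live update filtering described in step 6 of the
procedure above, with hypothetical names standing in for the actual scrub
code:

.. code-block:: c

    /*
     * Apply an rmap update from elsewhere in the filesystem to the
     * in-memory btree, but only if the owner has already been visited by
     * the inode scan; updates for unscanned owners will be observed when
     * the scan reaches them.
     */
    static int
    rmap_scan_live_update(
        struct rmap_scan            *rs,
        const struct rmap_update    *upd)
    {
        int                         error;

        if (!iscan_has_visited(&rs->iscan, upd->owner))
            return 0;

        mutex_lock(&rs->lock);
        /* Replay the operation into the in-memory rmap btree... */
        error = rmap_replay_into_xfbtree(rs->rmap_btree, upd);
        /* ...then write the xfbtree changes back to the xfile. */
        if (!error)
            error = xfbtree_commit_to_xfile(rs->rmap_btree);
        mutex_unlock(&rs->lock);
        return error;
    }
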
> > > 
> > 
> > Mostly looks good nits aside, I do sort of wonder if this patch
> > would
> > do better to appear before patch 6 (or move 6 down), since it gets
> > into
> > more challenges concerning locks and hooks, where as here we are
> > mostly
> > discussing what they are and how they work.  So it might build
> > better
> > to move this patch up a little.
> 
> (I might be a tad confused here, bear with me.)
> 
> Patch 6, the section about eventual consistency?
> 
> Hmm.  The intent drains exist to quiesce intent chains targeting
> specific AGs.  It briefly mentions "fshooks" in the context of using
> jump labels to avoid the overhead of calling notify_all on the drain
> waitqueue when scrub isn't running.  That's perhaps bad naming on my
> part, since the other "fshooks" are jump labels to avoid bouncing
> through the notifier chain code when scrub isn't running.  The jump
> labels themselves are not hooks, they're structured dynamic code
> patching.
> 
> I probably should've named those something else.  fsgates?
Oh, I see, yes I did sort of try to correlate them, so maybe the
different name would help.
> 
> Or maybe you were talking specifically about "Case Study: Rebuilding
> Reverse Mapping Records"?  In which case I remark that the case study
> needs both the intent drains to quiesce the AG and the live scans to
> work properly, which is why the case study of it couldn't come
> earlier.
> The intent drains section still ought to come before the refcountbt
> section, because it's the refcountbt scrubber that first hit the
> coordination problem.
> 
> Things are getting pretty awkward like this because there are sooo
> many
> interdependent pieces. :(

I see, ok no worries then, I think people will figure it out either
way.  I mostly look for ways to make the presentation easier but it is
getting harder to move stuff with chicken and egg dependencies.

> 
> Regardless, thank you very much for slogging through.
> 
> --D
> 
> > Allison
> > 


^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 11/14] xfs: document metadata file repair
  2023-02-25  7:33     ` Allison Henderson
@ 2023-03-01  2:42       ` Darrick J. Wong
  0 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2023-03-01  2:42 UTC (permalink / raw)
  To: Allison Henderson
  Cc: Catherine Hoang, david, willy, linux-xfs, Chandan Babu,
	linux-fsdevel, hch

On Sat, Feb 25, 2023 at 07:33:13AM +0000, Allison Henderson wrote:
> On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > File-based metadata (such as xattrs and directories) can be extremely
> > large.  To reduce the memory requirements and maximize code reuse, it
> > is
> > very convenient to create a temporary file, use the regular dir/attr
> > code to store salvaged information, and then atomically swap the
> > extents
> > between the file being repaired and the temporary file.  Record the
> > high
> > level concepts behind how temporary files and atomic content swapping
> > should work, and then present some case studies of what the actual
> > repair functions do.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  .../filesystems/xfs-online-fsck-design.rst         |  574
> > ++++++++++++++++++++
> >  1 file changed, 574 insertions(+)
> > 
> > 
> > diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst
> > b/Documentation/filesystems/xfs-online-fsck-design.rst
> > index c0f08a773f08..e32506acb66f 100644
> > --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> > +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> > @@ -3252,6 +3252,8 @@ Proposed patchsets include fixing
> >  `dir iget usage
> >  <
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > log/?h=scrub-dir-iget-fixes>`_.
> >  
> > +.. _ilocking:
> > +
> hmm, this little  part look like maybe it was supposed to go in the
> last patch?

It's a link target for the section header that comes after it.  There
weren't any links pointing to the target until this patch, so I didn't
introduce the target until now.

(I wish that unused link targets would be benign, but the build system
complains about them.  OTOH there are plenty of other link target
warnings until you get to the final patch in this series, so...)

(So I don't really feel like fixing this.)

> >  Locking Inodes
> >  ^^^^^^^^^^^^^^
> >  
> > @@ -3695,3 +3697,575 @@ The proposed patchset is the
> >  `rmap repair
> >  <
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > log/?h=repair-rmap-btree>`_
> >  series.
> > +
> > +Staging Repairs with Temporary Files on Disk
> > +--------------------------------------------
> > +
> > +XFS stores a substantial amount of metadata in file forks:
> > directories,
> > +extended attributes, symbolic link targets, free space bitmaps and
> > summary
> > +information for the realtime volume, and quota records.
> > +File forks map 64-bit logical file fork space extents to physical
> > storage space
> > +extents, similar to how a memory management unit maps 64-bit virtual
> > addresses
> > +to physical memory addresses.
> > +Therefore, file-based tree structures (such as directories and
> > extended
> > +attributes) use blocks mapped in the file fork offset address space
> > that point
> > +to other blocks mapped within that same address space, and file-
> > based linear
> > +structures (such as bitmaps and quota records) compute array element
> > offsets in
> > +the file fork offset address space.
> > +
> 
> 
> > +In the initial iteration of file metadata repair, the damaged
> > metadata blocks
> > +would be scanned for salvageable data; the extents in the file fork
> > would be
> > +reaped; and then a new structure would be built in its place.
> > +This strategy did not survive the introduction of the atomic repair
> > requirement
> > +expressed earlier in this document.
> > +The second iteration explored building a second structure at a high
> > offset
> > +in the fork from the salvage data, reaping the old extents, and
> > using a
> > +``COLLAPSE_RANGE`` operation to slide the new extents into place.
> > +This had many drawbacks:
> > +
> > +- Array structures are linearly addressed, and the regular
> > filesystem codebase
> > +  does not have the concept of a linear offset that could be applied
> > to the
> > +  record offset computation to build an alternate copy.
> > +
> > +- Extended attributes are allowed to use the entire attr fork offset
> > address
> > +  space.
> > +
> > +- Even if repair could build an alternate copy of a data structure
> > in a
> > +  different part of the fork address space, the atomic repair commit
> > +  requirement means that online repair would have to be able to
> > perform a log
> > +  assisted ``COLLAPSE_RANGE`` operation to ensure that the old
> > structure was
> > +  completely replaced.
> > +
> > +- A crash after construction of the secondary tree but before the
> > range
> > +  collapse would leave unreachable blocks in the file fork.
> > +  This would likely confuse things further.
> > +
> > +- Reaping blocks after a repair is not a simple operation, and
> > initiating a
> > +  reap operation from a restarted range collapse operation during
> > log recovery
> > +  is daunting.
> > +
> > +- Directory entry blocks and quota records record the file fork
> > offset in the
> > +  header area of each block.
> > +  An atomic range collapse operation would have to rewrite this part
> > of each
> > +  block header.
> > +  Rewriting a single field in block headers is not a huge problem,
> > but it's
> > +  something to be aware of.
> > +
> > +- Each block in a directory or extended attributes btree index
> > contains sibling
> > +  and child block pointers.
> > +  Were the atomic commit to use a range collapse operation, each
> > block would
> > +  have to be rewritten very carefully to preserve the graph
> > structure.
> > +  Doing this as part of a range collapse means rewriting a large
> > number of
> > +  blocks repeatedly, which is not conducive to quick repairs.
> > +
> > +The third iteration of the design for file metadata repair went for
> > a totally
> > +new strategy -- 
> All the above looks like something that could be culled or side bared.
> I know you really like these, but I think the extra dialog is why
> people are having a hard time getting through it. 

<nod> I'll sidebar all the historical data.

> > create a temporary file in the XFS filesystem, write a new
> "The current design for metadata repair creates a temporary file..."

This paragraph now reads:

"Because file forks can consume as much space as the entire filesystem,
repairs cannot be staged in memory, even when a paging scheme is
available.  Therefore, online repair of file-based metadata creates a
temporary file in the XFS filesystem, writes a new structure at the
correct offsets into the temporary file, and atomically swaps the fork
mappings (and hence the fork contents) to commit the repair..."


> > +structure at the correct offsets into the temporary file, and
> > atomically swap
> > +the fork mappings (and hence the fork contents) to commit the
> > repair.
> > +Once the repair is complete, the old fork can be reaped as
> > necessary; if the
> > +system goes down during the reap, the iunlink code will delete the
> > blocks
> > +during log recovery.
> > +
> > +**Note**: All space usage and inode indices in the filesystem *must*
> > be
> > +consistent to use a temporary file safely!
> > +This dependency is the reason why online repair can only use
> > pageable kernel
> > +memory to stage ondisk space usage information.
> > +
> > +Swapping extents with a temporary file still requires a rewrite of
> > the owner
> > +field of the block headers, but this is *much* simpler than moving
> > tree blocks
> > +individually.
> > +Furthermore, the buffer verifiers do not verify owner fields (since
> > they are
> > +not aware of the inode that owns the block), which makes reaping of
> > old file
> > +blocks much simpler.
> > +Extent swapping requires that AG space metadata and the file fork
> > metadata of
> > +the file being repaired are all consistent with respect to each
> > other, but
> > +that's already a requirement for correct operation of files in
> > general.
> > +There is, however, a slight downside -- if the system crashes during
> > the reap
> > +phase and the fork extents are crosslinked, the iunlink processing
> > will fail
> > +because freeing space will find the extra reverse mappings and
> > abort.
> > +
> > +Temporary files created for repair are similar to ``O_TMPFILE``
> > files created
> > +by userspace.
> > +They are not linked into a directory and the entire file will be
> > reaped when
> > +the last reference to the file is lost.
> > +The key differences are that these files must have no access
> > permission outside
> > +the kernel at all, they must be specially marked to prevent them
> > from being
> > +opened by handle, and they must never be linked into the directory
> > tree.
> > +
> > +Using a Temporary File
> > +``````````````````````
> > +
> > +Online repair code should use the ``xrep_tempfile_create`` function
> > to create a
> > +temporary file inside the filesystem.
> > +This allocates an inode, marks the in-core inode private, and
> > attaches it to
> > +the scrub context.
> > +These files are hidden from userspace, may not be added to the
> > directory tree,
> > +and must be kept private.
> > +
> > +Temporary files only use two inode locks: the IOLOCK and the ILOCK.
> > +The MMAPLOCK is not needed here, because there must not be page
> > faults from
> > +userspace for data fork blocks.
> > +The usage patterns of these two locks are the same as for any other
> > XFS file --
> > +access to file data are controlled via the IOLOCK, and access to
> > file metadata
> > +are controlled via the ILOCK.
> > +Locking helpers are provided so that the temporary file and its lock
> > state can
> > +be cleaned up by the scrub context.
> > +To comply with the nested locking strategy laid out in the
> > :ref:`inode
> > +locking<ilocking>` section, it is recommended that scrub functions
> > use the
> > +xrep_tempfile_ilock*_nowait lock helpers.
> > +
> > +Data can be written to a temporary file by two means:
> > +
> > +1. ``xrep_tempfile_copyin`` can be used to set the contents of a
> > regular
> > +   temporary file from an xfile.
> > +
> > +2. The regular directory, symbolic link, and extended attribute
> > functions can
> > +   be used to write to the temporary file.
> > +
> > +Once a good copy of a data file has been constructed in a temporary
> > file, it
> > +must be conveyed to the file being repaired, which is the topic of
> > the next
> > +section.
> > +
> > +The proposed patches are in the
> > +`realtime summary repair
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > log/?h=repair-rtsummary>`_
> > +series.
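
A sketch of how a repair function might use these helpers; only the helper
names come from the text above, and the argument lists shown here are
assumptions for illustration:

.. code-block:: c

    STATIC int
    xrep_example_stage_contents(
        struct xfs_scrub    *sc,
        struct xfile        *staged,
        loff_t              length)
    {
        int                 error;

        /* Create the private, unlinked temporary file. */
        error = xrep_tempfile_create(sc, S_IFREG);
        if (error)
            return error;

        /*
         * Take the temporary file's ILOCK without violating the nested
         * locking order; a real caller would back off and retry here.
         */
        if (!xrep_tempfile_ilock_nowait(sc))
            return -EDEADLOCK;

        /* Copy the staged xfile contents into the temporary file. */
        return xrep_tempfile_copyin(sc, staged, 0, length);
    }
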
> > +
> > +Atomic Extent Swapping
> > +----------------------
> > +
> > +Once repair builds a temporary file with a new data structure
> > written into
> > +it, it must commit the new changes into the existing file.
> > +It is not possible to swap the inumbers of two files, so instead the
> > new
> > +metadata must replace the old.
> > +This suggests the need for the ability to swap extents, but the
> > existing extent
> > +swapping code used by the file defragmenting tool ``xfs_fsr`` is not
> > sufficient
> > +for online repair because:
> > +
> > +a. When the reverse-mapping btree is enabled, the swap code must
> > keep the
> > +   reverse mapping information up to date with every exchange of
> > mappings.
> > +   Therefore, it can only exchange one mapping per transaction, and
> > each
> > +   transaction is independent.
> > +
> > +b. Reverse-mapping is critical for the operation of online fsck, so
> > the old
> > +   defragmentation code (which swapped entire extent forks in a
> > single
> > +   operation) is not useful here.
> > +
> > +c. Defragmentation is assumed to occur between two files with
> > identical
> > +   contents.
> > +   For this use case, an incomplete exchange will not result in a
> > user-visible
> > +   change in file contents, even if the operation is interrupted.
> > +
> > +d. Online repair needs to swap the contents of two files that are by
> > definition
> > +   *not* identical.
> > +   For directory and xattr repairs, the user-visible contents might
> > be the
> > +   same, but the contents of individual blocks may be very
> > different.
> > +
> > +e. Old blocks in the file may be cross-linked with another structure
> > and must
> > +   not reappear if the system goes down mid-repair.
> > +
> > +These problems are overcome by creating a new deferred operation and
> > a new type
> > +of log intent item to track the progress of an operation to exchange
> > two file
> > +ranges.
> > +The new deferred operation type chains together the same
> > transactions used by
> > +the reverse-mapping extent swap code.
> > +The new log item records the progress of the exchange to ensure that
> > once an
> > > +exchange begins, it will always run to completion, even if there are
> > +interruptions.
> > +
> > +The proposed patchset is the
> > +`atomic extent swap
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > log/?h=atomic-file-updates>`_
> > +series.
> > +
> > +Using Log-Incompatible Feature Flags
> > +````````````````````````````````````
> > +
> > +Starting with XFS v5, the superblock contains a
> > ``sb_features_log_incompat``
> > > +field to indicate that the log contains records that might not be
> > readable by all
> > +kernels that could mount this filesystem.
> 
> 
> > +In short, log incompat features protect the log contents against
> > kernels that
> > +will not understand the contents.
> > +Unlike the other superblock feature bits, log incompat bits are
> > ephemeral
> > +because an empty (clean) log does not need protection.
> > +The log cleans itself after its contents have been committed into
> > the
> > +filesystem, either as part of an unmount or because the system is
> > otherwise
> > +idle.
> > +Because upper level code can be working on a transaction at the same
> > time that
> > +the log cleans itself, it is necessary for upper level code to
> > communicate to
> > +the log when it is going to use a log incompatible feature.
> > +
> > +The log coordinates access to incompatible features through the use
> > of one
> > +``struct rw_semaphore`` for each feature.
> > +The log cleaning code tries to take this rwsem in exclusive mode to
> > clear the
> > +bit; if the lock attempt fails, the feature bit remains set.
> > +Filesystem code signals its intention to use a log incompat feature
> > in a
> > +transaction by calling ``xlog_use_incompat_feat``, which takes the
> > rwsem in
> > +shared mode.
> > +The code supporting a log incompat feature should create wrapper
> > functions to
> > +obtain the log feature and call ``xfs_add_incompat_log_feature`` to
> > set the
> > +feature bits in the primary superblock.
> > +The superblock update is performed transactionally, so the wrapper
> > to obtain
> > +log assistance must be called just prior to the creation of the
> > transaction
> > +that uses the functionality.
> > +For a file operation, this step must happen after taking the IOLOCK
> > and the
> > +MMAPLOCK, but before allocating the transaction.
> > +When the transaction is complete, the ``xlog_drop_incompat_feat``
> > function
> > +is called to release the feature.
> > +The feature bit will not be cleared from the superblock until the
> > log becomes
> > +clean.
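
Putting those pieces together, a wrapper for a log incompat feature might
look roughly like the sketch below.  The function names come from the text
above; the argument lists are assumptions, and the feature flag name is
illustrative.

.. code-block:: c

    int
    xfs_example_use_log_feature(
        struct xfs_mount    *mp)
    {
        int                 error;

        /* Take the per-feature rwsem in shared mode... */
        xlog_use_incompat_feat(mp->m_log);

        /* ...then transactionally set the primary superblock bit. */
        error = xfs_add_incompat_log_feature(mp,
                XFS_SB_FEAT_INCOMPAT_LOG_ATOMIC_SWAP);
        if (error)
            xlog_drop_incompat_feat(mp->m_log);
        return error;
    }

The caller would invoke such a wrapper after taking the IOLOCK and the
MMAPLOCK but before allocating the transaction, and would call
``xlog_drop_incompat_feat`` once the transaction completes.
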
> While this section does make sense, it doesn't really seem like it's
> specific to ofsck either.  Pptrs and possibly other future features use
> the same incompat bit logic, but the implementation is pretty disjoint
> and I wouldn't really consider it part of that feature.  So I would
> either remove this part, or move it to its own section.  Then I would
> just give a quick blurb here about how ofsck uses it:
> 
> "Since atomic extent swap will introduce a new type of log item, it
> will also add a new XFS_SB_FEAT_INCOMPAT_LOG_ATOMIC_SWAP bit"

Ok, I'll add that at the end of the section above, and turn this into a
sidebar.

> > +
> > +Log-assisted extended attribute updates and atomic extent swaps both
> > use log
> > +incompat features and provide convenience wrappers around the
> > functionality.
> 
> "For more information on incompat bits, see...."
> 
> > +
> > +Mechanics of an Atomic Extent Swap
> > +``````````````````````````````````
> > +
> > +Swapping entire file forks is a complex task.
> > +The goal is to exchange all file fork mappings between two file fork
> > offset
> > +ranges.
> > +There are likely to be many extent mappings in each fork, and the
> > edges of
> > +the mappings aren't necessarily aligned.
> > +Furthermore, there may be other updates that need to happen after
> > the swap,
> > +such as exchanging file sizes, inode flags, or conversion of fork
> > data to local
> > +format.
> > +This is roughly the format of the new deferred extent swap work
> > item:
> > +
> > +.. code-block:: c
> > +
> > +       struct xfs_swapext_intent {
> > +           /* Inodes participating in the operation. */
> > +           struct xfs_inode    *sxi_ip1;
> > +           struct xfs_inode    *sxi_ip2;
> > +
> > +           /* File offset range information. */
> > +           xfs_fileoff_t       sxi_startoff1;
> > +           xfs_fileoff_t       sxi_startoff2;
> > +           xfs_filblks_t       sxi_blockcount;
> > +
> > +           /* Set these file sizes after the operation, unless
> > negative. */
> > +           xfs_fsize_t         sxi_isize1;
> > +           xfs_fsize_t         sxi_isize2;
> > +
> > +           /* XFS_SWAP_EXT_* log operation flags */
> > +           uint64_t            sxi_flags;
> > +       };
> > +
> > +The new log intent item contains enough information to track two
> > logical fork
> > +offset ranges: ``(inode1, startoff1, blockcount)`` and ``(inode2,
> > startoff2,
> > +blockcount)``.
> > +Each step of a swap operation exchanges the largest file range
> > mapping possible
> > +from one file to the other.
> > +After each step in the swap operation, the two startoff fields are
> > incremented
> > +and the blockcount field is decremented to reflect the progress
> > made.
> > +The flags field captures behavioral parameters such as swapping the
> > attr fork
> > +instead of the data fork and other work to be done after the extent
> > swap.
> > +The two isize fields are used to swap the file size at the end of
> > the operation
> > +if the file data fork is the target of the swap operation.
> > +
> > +When the extent swap is initiated, the sequence of operations is as
> > follows:
> > +
> > +1. Create a deferred work item for the extent swap.
> > +   At the start, it should contain the entirety of the file ranges
> > to be
> > +   swapped.
> > +
> > +2. Call ``xfs_defer_finish`` 
> This seems like this should be some sort of defer start wrapper, not
> finish.  It would also help to have a link or function name to see the
> code it is trying to describe

It's encapsulated in xrep_tempswap_contents in tempfile.c; I'll make a
note of that here.

> > to start processing of the exchange.
> > +   This will log an extent swap intent item to the transaction for
> > the deferred
> > +   extent swap work item.
> > +
> > +3. Until ``sxi_blockcount`` of the deferred extent swap work item is
> > zero,
> > +
> > +   a. Read the block maps of both file ranges starting at
> > ``sxi_startoff1`` and
> > +      ``sxi_startoff2``, respectively, and compute the longest
> > extent that can
> > +      be swapped in a single step.
> > > +      This is the minimum of the two ``br_blockcount`` values in the
> > mappings.
> > +      Keep advancing through the file forks until at least one of
> > the mappings
> > +      contains written blocks.
> > +      Mutual holes, unwritten extents, and extent mappings to the
> > same physical
> > +      space are not exchanged.
> > +
> > +      For the next few steps, this document will refer to the
> > mapping that came
> > +      from file 1 as "map1", and the mapping that came from file 2
> > as "map2".
> > +
> > +   b. Create a deferred block mapping update to unmap map1 from file
> > 1.
> > +
> > +   c. Create a deferred block mapping update to unmap map2 from file
> > 2.
> > +
> > +   d. Create a deferred block mapping update to map map1 into file
> > 2.
> > +
> > +   e. Create a deferred block mapping update to map map2 into file
> > 1.
> > +
> > +   f. Log the block, quota, and extent count updates for both files.
> > +
> > +   g. Extend the ondisk size of either file if necessary.
> > +
> > +   h. Log an extent swap done log item for the extent swap intent
> > log item
> > +      that was read at the start of step 3.
> > +
> > +   i. Compute the amount of file range that has just been covered.
> > +      This quantity is ``(map1.br_startoff + map1.br_blockcount -
> > +      sxi_startoff1)``, because step 3a could have skipped holes.
> > +
> > +   j. Increase the starting offsets of ``sxi_startoff1`` and
> > ``sxi_startoff2``
> > +      by the number of blocks computed in the previous step, and
> > decrease
> > +      ``sxi_blockcount`` by the same quantity.
> > +      This advances the cursor.
> > +
> > +   k. Log a new extent swap intent log item reflecting the advanced
> > state of
> > +      the work item.
> > +
> > +   l. Return the proper error code (EAGAIN) to the deferred
> > operation manager
> > +      to inform it that there is more work to be done.
> > +      The operation manager completes the deferred work in steps 3b-
> > 3e before
> > +      moving back to the start of step 3.
> > +
> > +4. Perform any post-processing.
> > +   This will be discussed in more detail in subsequent sections.
> > +
> > +If the filesystem goes down in the middle of an operation, log
> > recovery will
> > +find the most recent unfinished extent swap log intent item and
> > restart from
> > +there.
> > +This is how extent swapping guarantees that an outside observer will
> > either see
> > > +the old broken structure or the new one, and never a mishmash of
> > both.
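
For reference, steps 3i and 3j amount to the following arithmetic, using
the ``xfs_swapext_intent`` fields shown earlier ("map1" being the mapping
taken from file 1); the helper itself is illustrative:

.. code-block:: c

    static void
    swapext_advance_cursor(
        struct xfs_swapext_intent      *sxi,
        const struct xfs_bmbt_irec     *map1)
    {
        xfs_filblks_t                  covered;

        /* Step 3a may have skipped holes, so measure from sxi_startoff1. */
        covered = map1->br_startoff + map1->br_blockcount -
                  sxi->sxi_startoff1;

        sxi->sxi_startoff1 += covered;
        sxi->sxi_startoff2 += covered;
        sxi->sxi_blockcount -= covered;
    }
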
> > +
> > +Extent Swapping with Regular User Files
> > +```````````````````````````````````````
> > +
> > +As mentioned earlier, XFS has long had the ability to swap extents
> > between
> > +files, which is used almost exclusively by ``xfs_fsr`` to defragment
> > files.
> > +The earliest form of this was the fork swap mechanism, where the
> > entire
> > +contents of data forks could be exchanged between two files by
> > exchanging the
> > +raw bytes in each inode fork's immediate area.
> > +When XFS v5 came along with self-describing metadata, this old
> > mechanism grew
> > +some log support to continue rewriting the owner fields of BMBT
> > blocks during
> > +log recovery.
> > +When the reverse mapping btree was later added to XFS, the only way
> > to maintain
> > +the consistency of the fork mappings with the reverse mapping index
> > was to
> > +develop an iterative mechanism that used deferred bmap and rmap
> > operations to
> > +swap mappings one at a time.
> > +This mechanism is identical to steps 2-3 from the procedure above
> > except for
> > +the new tracking items, because the atomic extent swap mechanism is
> > an
> > +iteration of an existing mechanism and not something totally novel.
> > +For the narrow case of file defragmentation, the file contents must
> > be
> > +identical, so the recovery guarantees are not much of a gain.
> > +
> > +Atomic extent swapping is much more flexible than the existing
> > swapext
> > +implementations because it can guarantee that the caller never sees
> > a mix of
> > +old and new contents even after a crash, and it can operate on two
> > arbitrary
> > +file fork ranges.
> > +The extra flexibility enables several new use cases:
> > +
> > +- **Atomic commit of file writes**: A userspace process opens a file
> > that it
> > +  wants to update.
> > +  Next, it opens a temporary file and calls the file clone operation
> > to reflink
> > +  the first file's contents into the temporary file.
> > +  Writes to the original file should instead be written to the
> > temporary file.
> > +  Finally, the process calls the atomic extent swap system call
> > +  (``FIEXCHANGE_RANGE``) to exchange the file contents, thereby
> > committing all
> > +  of the updates to the original file, or none of them.
> > +
> > +- **Transactional file updates**: The same mechanism as above, but
> > the caller
> > +  only wants the commit to occur if the original file's contents
> > have not
> > +  changed.
> > +  To make this happen, the calling process snapshots the file
> > modification and
> > +  change timestamps of the original file before reflinking its data
> > to the
> > +  temporary file.
> > +  When the program is ready to commit the changes, it passes the
> > timestamps
> > +  into the kernel as arguments to the atomic extent swap system
> > call.
> > +  The kernel only commits the changes if the provided timestamps
> > match the
> > +  original file.
> > +
> > +- **Emulation of atomic block device writes**: Export a block device
> > with a
> > +  logical sector size matching the filesystem block size to force
> > all writes
> > +  to be aligned to the filesystem block size.
> > +  Stage all writes to a temporary file, and when that is complete,
> > call the
> > +  atomic extent swap system call with a flag to indicate that holes
> > in the
> > +  temporary file should be ignored.
> > +  This emulates an atomic device write in software, and can support
> > arbitrary
> > +  scattered writes.
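
As a userspace illustration of the first use case (atomic commit of file
writes), the flow might look like the sketch below.  Only the
``FIEXCHANGE_RANGE`` name and the overall sequence come from the text; the
argument structure and ioctl definition shown here are placeholders, not
the proposed UAPI.

.. code-block:: c

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdint.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>        /* FICLONE */

    /* Placeholder exchange-range argument structure and ioctl number. */
    struct xchg_range_args {
        int64_t     file1_fd;
        int64_t     file1_offset;
        int64_t     file2_offset;
        int64_t     length;
        uint64_t    flags;
    };
    #define FIEXCHANGE_RANGE    _IOWR('X', 129, struct xchg_range_args)

    /*
     * Replace the contents of fd with buf, atomically or not at all.
     * Assumes buf is the complete new file contents and the sizes match.
     */
    int
    atomically_replace(int dirfd, int fd, const void *buf, size_t len)
    {
        struct xchg_range_args  args = { .file1_fd = fd, .length = len };
        int                     tmpfd;

        /* Stage the update in an unlinked temporary file... */
        tmpfd = openat(dirfd, ".", O_TMPFILE | O_RDWR, 0600);
        if (tmpfd < 0)
            return -1;

        /* ...that begins life as a reflinked copy of the original... */
        if (ioctl(tmpfd, FICLONE, fd) < 0)
            goto fail;

        /* ...write the new contents to the temporary file... */
        if (pwrite(tmpfd, buf, len, 0) != (ssize_t)len)
            goto fail;

        /* ...and exchange the contents to commit all of the updates. */
        if (ioctl(tmpfd, FIEXCHANGE_RANGE, &args) < 0)
            goto fail;

        close(tmpfd);
        return 0;
    fail:
        close(tmpfd);
        return -1;
    }
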
> Mmm, this section here I would either let go or move.  Since we're not
> really talking about ofsck anymore, it's more like an "extra use case"
> section.  Side uses are great and all, but they're generally not worth
> the implementation on their own, so I think we want to keep readers
> focused on the main ofsck feature and it's mechanics.  Once we get that
> out of the way, we can come back and touch on goodies later at the end
> of the document.  

Good point.  I'll chop this out and put it in the future work section.

> > +
> > +Preparation for Extent Swapping
> > +```````````````````````````````
> > +
> > +There are a few things that need to be taken care of before
> > initiating an
> > +atomic extent swap operation.
> > +First, regular files require the page cache to be flushed to disk
> > before the
> > +operation begins, and directio writes to be quiesced.
> > +Like any filesystem operation, extent swapping must determine the
> > maximum
> > +amount of disk space and quota that can be consumed on behalf of
> > both files in
> > +the operation, and reserve that quantity of resources to avoid an
> > unrecoverable
> > +out of space failure once it starts dirtying metadata.
> > +The preparation step scans the ranges of both files to estimate:
> > +
> > +- Data device blocks needed to handle the repeated updates to the
> > fork
> > +  mappings.
> > +- Change in data and realtime block counts for both files.
> > +- Increase in quota usage for both files, if the two files do not
> > share the
> > +  same set of quota ids.
> > +- The number of extent mappings that will be added to each file.
> > +- Whether or not there are partially written realtime extents.
> > +  User programs must never be able to access a realtime file extent
> > that maps
> > +  to different extents on the realtime volume, which could happen if
> > the
> > +  operation fails to run to completion.
> > +
> > +The need for precise estimation increases the run time of the swap
> > operation,
> > +but it is very important to maintain correct accounting.
> > +The filesystem must not run completely out of free space, nor can
> > the extent
> > +swap ever add more extent mappings to a fork than it can support.
> > > +Regular users are required to abide by the quota limits, though
> > metadata repairs
> > +may exceed quota to resolve inconsistent metadata elsewhere.
> > +
> > +Special Features for Swapping Metadata File Extents
> > +```````````````````````````````````````````````````
> > +
> > +Extended attributes, symbolic links, and directories can set the
> > fork format to
> > +"local" and treat the fork as a literal area for data storage.
> > +Metadata repairs must take extra steps to support these cases:
> > +
> > +- If both forks are in local format and the fork areas are large
> > enough, the
> > +  swap is performed by copying the incore fork contents, logging
> > both forks,
> > +  and committing.
> > +  The atomic extent swap mechanism is not necessary, since this can
> > be done
> > +  with a single transaction.
> > +
> > +- If both forks map blocks, then the regular atomic extent swap is
> > used.
> > +
> > +- Otherwise, only one fork is in local format.
> > +  The contents of the local format fork are converted to a block to
> > perform the
> > +  swap.
> > +  The conversion to block format must be done in the same
> > transaction that
> > +  logs the initial extent swap intent log item.
> > +  The regular atomic extent swap is used to exchange the mappings.
> > +  Special flags are set on the swap operation so that the
> > transaction can be
> > +  rolled one more time to convert the second file's fork back to
> > local format
> > +  if possible.
> I feel like there's probably a function name or link that could go with
> this

It's ... scattered everywhere.  For example, the directory repair code
converts the temporary file from shortform to block format if necessary.
Then it calls xrep_tempswap_contents.  The last step of the atomic swap
is to convert block metadata back to shortform on the file being
scrubbed, which happens before control is returned to the directory
repair code.

Before the repair, we don't care if the temporary file could have been
shortform, and after the exchange, everything in the file being scrubbed
/must/ be correct.   That's why the responsibilities are split the way
they are.

> > +
> > +Extended attributes and directories stamp the owning inode into
> > every block,
> > +but the buffer verifiers do not actually check the inode number!
> > +Although there is no verification, it is still important to maintain
> > +referential integrity, so prior to performing the extent swap,
> > online repair
> > +walks every block in the new data structure to update the owner
> > field and flush
> > +the buffer to disk.
> > +
> > +After a successful swap operation, the repair operation must reap
> > the old fork
> > +blocks by processing each fork mapping through the standard
> > :ref:`file extent
> > +reaping <reaping>` mechanism that is done post-repair.
> > +If the filesystem should go down during the reap part of the repair,
> > the
> > +iunlink processing at the end of recovery will free both the
> > temporary file and
> > +whatever blocks were not reaped.
> > +However, this iunlink processing omits the cross-link detection of
> > online
> > +repair, and is not completely foolproof.
> > +
> > +Swapping Temporary File Extents
> > +```````````````````````````````
> > +
> > +To repair a metadata file, online repair proceeds as follows:
> > +
> > +1. Create a temporary repair file.
> > +
> > +2. Use the staging data to write out new contents into the temporary
> > repair
> > +   file.
> > +   The same fork must be written to as is being repaired.
> > +
> > +3. Commit the scrub transaction, since the swap estimation step must
> > be
> > +   completed before transaction reservations are made.
> > +
> > +4. Call ``xrep_tempswap_trans_alloc`` to allocate a new scrub
> > transaction with
> > +   the appropriate resource reservations, locks, and fill out a
> > ``struct
> > +   xfs_swapext_req`` with the details of the swap operation.
> > +
> > +5. Call ``xrep_tempswap_contents`` to swap the contents.
> > +
> > +6. Commit the transaction to complete the repair.
> Here too.  A reference to the code would help to be able to see it side
> by side

It's the xfs_trans_commit in xchk_teardown, same as any other repair
function.

> > +
> > +.. _rtsummary:
> > +
> > +Case Study: Repairing the Realtime Summary File
> > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > +
> > +In the "realtime" section of an XFS filesystem, free space is
> > tracked via a
> > +bitmap, similar to Unix FFS.
> > +Each bit in the bitmap represents one realtime extent, which is a
> > multiple of
> > +the filesystem block size between 4KiB and 1GiB in size.
> > +The realtime summary file indexes the number of free extents of a
> > given size to
> > +the offset of the block within the realtime free space bitmap where
> > those free
> > +extents begin.
> > +In other words, the summary file helps the allocator find free
> > extents by
> > +length, similar to what the free space by count (cntbt) btree does
> > for the data
> > +section.
> > +
> > +The summary file itself is a flat file (with no block headers or
> > checksums!)
> > +partitioned into ``log2(total rt extents)`` sections containing
> > enough 32-bit
> > +counters to match the number of blocks in the rt bitmap.
> > +Each counter records the number of free extents that start in that
> > bitmap block
> > +and can satisfy a power-of-two allocation request.
> > +
> > +To check the summary file against the bitmap:
> > +
> > +1. Take the ILOCK of both the realtime bitmap and summary files.
> > +
> > +2. For each free space extent recorded in the bitmap:
> > +
> > +   a. Compute the position in the summary file that contains a
> > counter that
> > +      represents this free extent.
> > +
> > +   b. Read the counter from the xfile.
> > +
> > +   c. Increment it, and write it back to the xfile.
> > +
> > +3. Compare the contents of the xfile against the ondisk file.
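
For reference, the counter position computed in step 2a follows directly
from the layout described above: the summary file is effectively a
two-dimensional array of 32-bit counters indexed by the log2 of the free
extent length and by the bitmap block in which the free extent starts.
A sketch, with illustrative names:

.. code-block:: c

    /*
     * Offset (in counters) into the realtime summary file for a free
     * extent of length 2^log2_len realtime extents that starts in
     * realtime bitmap block bbno, on a filesystem whose bitmap occupies
     * rbmblocks blocks.
     */
    static inline uint64_t
    rtsummary_counter_offset(
        uint64_t        rbmblocks,
        unsigned int    log2_len,
        uint64_t        bbno)
    {
        return (uint64_t)log2_len * rbmblocks + bbno;
    }
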
> > +
> > +To repair the summary file, write the xfile contents into the
> > temporary file
> > +and use atomic extent swap to commit the new contents.
> > +The temporary file is then reaped.
> > +
> > +The proposed patchset is the
> > +`realtime summary repair
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > log/?h=repair-rtsummary>`_
> I think this is the same link as the last.  Did you mean to have a
> different link here?

Ooh, you're right, it's the previous link that should have been to the
repair-tempfiles branch.  Thank you for catching this.

> > +series.
> > +
> > +Case Study: Salvaging Extended Attributes
> > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > +
> > +In XFS, extended attributes are implemented as a namespaced name-
> > value store.
> > +Values are limited in size to 64KiB, but there is no limit in the
> > number of
> > +names.
> > +The attribute fork is unpartitioned, which means that the root of
> > the attribute
> > +structure is always in logical block zero, but attribute leaf
> > blocks, dabtree
> > +index blocks, and remote value blocks are intermixed.
> > +Attribute leaf blocks contain variable-sized records that associate
> > +user-provided names with the user-provided values.
> > +Values larger than a block are allocated separate extents and
> > written there.
> > +If the leaf information expands beyond a single block, a
> > directory/attribute
> > +btree (``dabtree``) is created to map hashes of attribute names to
> > entries
> > +for fast lookup.
> > +
> > +Salvaging extended attributes is done as follows:
> > +
> > +1. Walk the attr fork mappings of the file being repaired to find
> > the attribute
> > +   leaf blocks.
> > +   When one is found,
> > +
> > +   a. Walk the attr leaf block to find candidate keys.
> > +      When one is found,
> > +
> > +      1. Check the name for problems, and ignore the name if there
> > are any.
> > +
> > +      2. Retrieve the value.
> > +         If that succeeds, add the name and value to the staging
> > xfarray and
> > +         xfblob.
> > +
> > > +2. If the memory usage of the xfarray and xfblob exceeds a certain
> > amount of
> > +   memory or there are no more attr fork blocks to examine, unlock
> > the file and
> > +   add the staged extended attributes to the temporary file.
> > +
> > +3. Use atomic extent swapping to exchange the new and old extended
> > attribute
> > +   structures.
> > +   The old attribute blocks are now attached to the temporary file.
> > +
> > +4. Reap the temporary file.
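
The per-attribute step (1a) might look roughly like this sketch, with
hypothetical helper names standing in for the real salvage code:

.. code-block:: c

    static int
    salvage_one_attr(
        struct attr_salvage    *as,
        const unsigned char    *name,
        unsigned int           namelen)
    {
        void                   *value;
        unsigned int           valuelen;
        int                    error;

        /* Ignore names with embedded problems. */
        if (!attr_name_looks_ok(name, namelen))
            return 0;

        /* Retrieve the value; skip this attribute if that fails. */
        error = attr_retrieve_value(as->sc, name, namelen, &value,
                &valuelen);
        if (error)
            return 0;

        /* Stage the name in the xfarray and the value in the xfblob. */
        return salvage_stage_attr(as->stage, name, namelen, value,
                valuelen);
    }
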
> > +
> > +The proposed patchset is the
> > +`extended attribute repair
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > log/?h=repair-xattrs>`_
> > +series.
> 
> I think it mostly looks good, culling and link suggestions aside.

Thanks!

--D

> Allison
> 
> > 
> 

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 13/14] xfs: document the userspace fsck driver program
  2022-12-30 22:10   ` [PATCH 13/14] xfs: document the userspace fsck driver program Darrick J. Wong
@ 2023-03-01  5:36     ` Allison Henderson
  2023-03-02  0:27       ` Darrick J. Wong
  0 siblings, 1 reply; 220+ messages in thread
From: Allison Henderson @ 2023-03-01  5:36 UTC (permalink / raw)
  To: djwong
  Cc: Catherine Hoang, david, willy, linux-xfs, Chandan Babu,
	linux-fsdevel, hch

On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Add the sixth chapter of the online fsck design documentation, where
> we discuss the details of the data structures and algorithms used by
> the
> driver program xfs_scrub.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  .../filesystems/xfs-online-fsck-design.rst         |  313
> ++++++++++++++++++++
>  1 file changed, 313 insertions(+)
> 
> 
> diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst
> b/Documentation/filesystems/xfs-online-fsck-design.rst
> index 2e20314f1831..05b9411fac7f 100644
> --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> @@ -300,6 +300,9 @@ The seven phases are as follows:
>  7. Re-check the summary counters and presents the caller with a
> summary of
>     space usage and file counts.
>  
> +This allocation of responsibilities will be :ref:`revisited
> <scrubcheck>`
> +later in this document.
> +
>  Steps for Each Scrub Item
>  -------------------------
>  
> @@ -4505,3 +4508,313 @@ The proposed patches are in the
>  `orphanage adoption
>  <
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> log/?h=repair-orphanage>`_
>  series.
> +
> +6. Userspace Algorithms and Data Structures
> +===========================================
> +
> +This section discusses the key algorithms and data structures of the
> userspace
> +program, ``xfs_scrub``, that provide the ability to drive metadata
> checks and
> +repairs in the kernel, verify file data, and look for other
> potential problems.
> +
> +.. _scrubcheck:
> +
> +Checking Metadata
> +-----------------
> +
> +Recall the :ref:`phases of fsck work<scrubphases>` outlined earlier.
> +That structure follows naturally from the data dependencies designed
> into the
> +filesystem from its beginnings in 1993.
> +In XFS, there are several groups of metadata dependencies:
> +
> +a. Filesystem summary counts depend on consistency within the inode
> indices,
> +   the allocation group space btrees, and the realtime volume space
> +   information.
> +
> +b. Quota resource counts depend on consistency within the quota file
> data
> +   forks, inode indices, inode records, and the forks of every file
> on the
> +   system.
> +
> +c. The naming hierarchy depends on consistency within the directory
> and
> +   extended attribute structures.
> +   This includes file link counts.
> +
> +d. Directories, extended attributes, and file data depend on
> consistency within
> +   the file forks that map directory and extended attribute data to
> physical
> +   storage media.
> +
> +e. The file forks depend on consistency within inode records and
> the space
> +   metadata indices of the allocation groups and the realtime
> volume.
> +   This includes quota and realtime metadata files.
> +
> +f. Inode records depend on consistency within the inode metadata
> indices.
> +
> +g. Realtime space metadata depend on the inode records and data
> forks of the
> +   realtime metadata inodes.
> +
> +h. The allocation group metadata indices (free space, inodes,
> reference count,
> +   and reverse mapping btrees) depend on consistency within the AG
> headers and
> +   between all the AG metadata btrees.
> +
> +i. ``xfs_scrub`` depends on the filesystem being mounted and kernel
> support
> +   for online fsck functionality.
> +
> +Therefore, a metadata dependency graph is a convenient way to
> schedule checking
> +operations in the ``xfs_scrub`` program:
> +
> +- Phase 1 checks that the provided path maps to an XFS filesystem
> and detects
> +  the kernel's scrubbing abilities, which validates group (i).
> +
> +- Phase 2 scrubs groups (g) and (h) in parallel using a threaded
> workqueue.
> +
> +- Phase 3 checks groups (f), (e), and (d), in that order.
> +  These groups are all file metadata, which means that inodes are
> scanned in
> +  parallel.
...When things are done in order, then they are done in serial, right?
Things done in parallel are done at the same time.  Either the phrase
"in that order" needs to go away, or the last line needs to be dropped.

> +
> +- Phase 4 repairs everything in groups (i) through (d) so that
> phases 5 and 6
> +  may run reliably.
> +
> +- Phase 5 starts by checking groups (b) and (c) in parallel before
> moving on
> +  to checking names.
> +
> +- Phase 6 depends on groups (i) through (b) to find file data blocks
> to verify,
> +  to read them, and to report which blocks of which files are
> affected.
> +
> +- Phase 7 checks group (a), having validated everything else.
> +
> +Notice that the data dependencies between groups are enforced by the
> structure
> +of the program flow.
> +
> +Parallel Inode Scans
> +--------------------
> +
> +An XFS filesystem can easily contain hundreds of millions of inodes.
> +Given that XFS targets installations with large high-performance
> storage,
> +it is desirable to scrub inodes in parallel to minimize runtime,
> particularly
> +if the program has been invoked manually from a command line.
> +This requires careful scheduling to keep the threads as evenly
> loaded as
> +possible.
> +
> +Early iterations of the ``xfs_scrub`` inode scanner naïvely created
> a single
> +workqueue and scheduled a single workqueue item per AG.
> +Each workqueue item walked the inode btree (with
> ``XFS_IOC_INUMBERS``) to find
> +inode chunks and then called bulkstat (``XFS_IOC_BULKSTAT``) to
> gather enough
> +information to construct file handles.
> +The file handle was then passed to a function to generate scrub
> items for each
> +metadata object of each inode.
> +This simple algorithm leads to thread balancing problems in phase 3
> if the
> +filesystem contains one AG with a few large sparse files and the
> rest of the
> +AGs contain many smaller files.
> +The inode scan dispatch function was not sufficiently granular; it
> should have
> +been dispatching at the level of individual inodes, or, to constrain
> memory
> +consumption, inode btree records.
> +
> +Thanks to Dave Chinner, bounded workqueues in userspace enable
> ``xfs_scrub`` to
> +avoid this problem with ease by adding a second workqueue.
> +Just like before, the first workqueue is seeded with one workqueue
> item per AG,
> +and it uses INUMBERS to find inode btree chunks.
> +The second workqueue, however, is configured with an upper bound on
> the number
> +of items that can be waiting to be run.
> +Each inode btree chunk found by the first workqueue's workers is
> queued to the
> +second workqueue, and it is this second workqueue that queries
> BULKSTAT,
> +creates a file handle, and passes it to a function to generate scrub
> items for
> +each metadata object of each inode.
> +If the second workqueue is too full, the workqueue add function
> blocks the
> +first workqueue's workers until the backlog eases.
> +This doesn't completely solve the balancing problem, but reduces it
> enough to
> +move on to more pressing issues.
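
In outline, the two-workqueue scheme looks like the sketch below; the
workqueue calls shown are illustrative stand-ins, not the actual xfsprogs
workqueue interface.

.. code-block:: c

    /* Second workqueue: BULKSTAT a chunk and emit per-inode scrub items. */
    static void
    scan_inode_chunk(
        struct workqueue    *wq,
        uint32_t            agno,
        void                *arg)
    {
        struct inode_chunk  *chunk = arg;

        bulkstat_chunk_and_queue_scrub_items(chunk);
    }

    /* First workqueue: one item per AG, walking INUMBERS. */
    static void
    scan_ag_inumbers(
        struct workqueue    *wq,
        uint32_t            agno,
        void                *arg)
    {
        struct scrub_ctx    *ctx = arg;
        struct inode_chunk  *chunk;

        while ((chunk = next_inumbers_chunk(ctx, agno)) != NULL) {
            /*
             * The second workqueue is bounded, so this call blocks
             * whenever too many chunks are already queued, throttling
             * the per-AG workers until the backlog eases.
             */
            bounded_workqueue_add(ctx->bulkstat_wq,
                    scan_inode_chunk, agno, chunk);
        }
    }
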
> +
> +The proposed patchsets are the scrub
> +`performance tweaks
> +<
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.g
> it/log/?h=scrub-performance-tweaks>`_
> +and the
> +`inode scan rebalance
> +<
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.g
> it/log/?h=scrub-iscan-rebalance>`_
> +series.
> +
> +.. _scrubrepair:
> +
> +Scheduling Repairs
> +------------------
> +
> +During phase 2, corruptions and inconsistencies reported in any AGI
> header or
> +inode btree are repaired immediately, because phase 3 relies on
> proper
> +functioning of the inode indices to find inodes to scan.
> +Failed repairs are rescheduled to phase 4.
> +Problems reported in any other space metadata are deferred to phase
> 4.
> +Optimization opportunities are always deferred to phase 4, no matter
> their
> +origin.
> +
> +During phase 3, corruptions and inconsistencies reported in any part
> of a
> +file's metadata are repaired immediately if all space metadata were
> validated
> +during phase 2.
> +Repairs that fail or cannot be repaired immediately are scheduled
> for phase 4.
> +
> +In the original design of ``xfs_scrub``, it was thought that repairs
> would be
> +so infrequent that the ``struct xfs_scrub_metadata`` objects used to
> +communicate with the kernel could also be used as the primary object
> to
> +schedule repairs.
> +With recent increases in the number of optimizations possible for a
> given
> +filesystem object, it became much more memory-efficient to track all
> eligible
> +repairs for a given filesystem object with a single repair item.
> +Each repair item represents a single lockable object -- AGs,
> metadata files,
> +individual inodes, or a class of summary information.
> +
> +Phase 4 is responsible for scheduling a lot of repair work in as
> quick a
> +manner as is practical.
> +The :ref:`data dependencies <scrubcheck>` outlined earlier still
> apply, which
> +means that ``xfs_scrub`` must try to complete the repair work
> scheduled by
> +phase 2 before trying repair work scheduled by phase 3.
> +The repair process is as follows:
> +
> +1. Start a round of repair with a workqueue and enough workers to
> keep the CPUs
> +   as busy as the user desires.
> +
> +   a. For each repair item queued by phase 2,
> +
> +      i.   Ask the kernel to repair everything listed in the repair
> item for a
> +           given filesystem object.
> +
> +      ii.  Make a note if the kernel made any progress in reducing
> the number
> +           of repairs needed for this object.
> +
> +      iii. If the object no longer requires repairs, revalidate all
> metadata
> +           associated with this object.
> +           If the revalidation succeeds, drop the repair item.
> +           If not, requeue the item for more repairs.
> +
> +   b. If any repairs were made, jump back to 1a to retry all the
> phase 2 items.
> +
> +   c. For each repair item queued by phase 3,
> +
> +      i.   Ask the kernel to repair everything listed in the repair
> item for a
> +           given filesystem object.
> +
> +      ii.  Make a note if the kernel made any progress in reducing
> the number
> +           of repairs needed for this object.
> +
> +      iii. If the object no longer requires repairs, revalidate all
> metadata
> +           associated with this object.
> +           If the revalidation succeeds, drop the repair item.
> +           If not, requeue the item for more repairs.
> +
> +   d. If any repairs were made, jump back to 1c to retry all the
> phase 3 items.
> +
> +2. If step 1 made any repair progress of any kind, jump back to step
> 1 to start
> +   another round of repair.
> +
> +3. If there are items left to repair, run them all serially one more
> time.
> +   Complain if the repairs were not successful, since this is the
> last chance
> +   to repair anything.
> +
> +Corruptions and inconsistencies encountered during phases 5 and 7
> are repaired
> +immediately.
> +Corrupt file data blocks reported by phase 6 cannot be recovered by
> the
> +filesystem.
> +
> +The proposed patchsets are the
> +`repair warning improvements
> +<
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.g
> it/log/?h=scrub-better-repair-warnings>`_,
> +refactoring of the
> +`repair data dependency
> +<
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.g
> it/log/?h=scrub-repair-data-deps>`_
> +and
> +`object tracking
> +<
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.g
> it/log/?h=scrub-object-tracking>`_,
> +and the
> +`repair scheduling
> +<
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.g
> it/log/?h=scrub-repair-scheduling>`_
> +improvement series.
> +
> +Checking Names for Confusable Unicode Sequences
> +-----------------------------------------------
> +
> +If ``xfs_scrub`` succeeds in validating the filesystem metadata by
> the end of
> +phase 4, it moves on to phase 5, which checks for suspicious looking
> names in
> +the filesystem.
> +These names consist of the filesystem label, names in directory
> entries, and
> +the names of extended attributes.
> +Like most Unix filesystems, XFS imposes the sparest of constraints
> on the
> +contents of a name -- slashes and null bytes are not allowed in
> directory
> +entries; and null bytes are not allowed in extended attributes and
maybe say "standard user accessible extended attributes"
> the
> +filesystem label.
> +Directory entries and attribute keys store the length of the name
> explicitly
> +ondisk, which means that nulls are not name terminators.
> +For this section, the term "naming domain" refers to any place where
> names are
> +presented together -- all the names in a directory, or all the
> attributes of a
> +file.
> +
> +Although the Unix naming constraints are very permissive, the
> reality of most
> +modern-day Linux systems is that programs work with Unicode
> character code
> +points to support international languages.
> +These programs typically encode those code points in UTF-8 when
> interfacing
> +with the C library because the kernel expects null-terminated names.
> +In the common case, therefore, names found in an XFS filesystem are
> actually
> +UTF-8 encoded Unicode data.
> +
> +To maximize its expressiveness, the Unicode standard defines
> separate code
> +points for various characters that render similarly or identically
> in writing
> +systems around the world.
> +For example, the character "Cyrillic Small Letter A" U+0430 "а"
> often renders
> +identically to "Latin Small Letter A" U+0061 "a".


> +
> +The standard also permits characters to be constructed in multiple
> ways --
> +either by using a defined code point, or by combining one code point
> with
> +various combining marks.
> +For example, the character "Angstrom Sign" U+212B "Å" can also be
> expressed
> +as "Latin Capital Letter A" U+0041 "A" followed by "Combining Ring
> Above"
> +U+030A "◌̊".
> +Both sequences render identically.
> +
> +Like the standards that preceded it, Unicode also defines various
> control
> +characters to alter the presentation of text.
> +For example, the character "Right-to-Left Override" U+202E can trick
> some
> +programs into rendering "moo\\xe2\\x80\\xaegnp.txt" as "mootxt.png".
> +A second category of rendering problems involves whitespace
> characters.
> +If the character "Zero Width Space" U+200B is encountered in a file
> name, the
> +name will render identically to a name that does not have the zero
> width
> +space.
> +
> +If two names within a naming domain have different byte sequences
> but render
> +identically, a user may be confused by it.
> +The kernel, in its indifference to upper level encoding schemes,
> permits this.
> +Most filesystem drivers persist the byte sequence names that are
> given to them
> +by the VFS.
> +
> +Techniques for detecting confusable names are explained in great
> detail in
> +sections 4 and 5 of the
> +`Unicode Security Mechanisms <https://unicode.org/reports/tr39/>`_
> +document.
I don't know that we need this much detail on character rendering.  I
think the example above is enough to make the point that character
strings can differ in binary, but render the same, so we need to deal
with that.  So I think that's really all the justification we need for
the NFD usage

> +``xfs_scrub``, when it detects UTF-8 encoding in use on a system,
> uses the
When ``xfs_scrub`` detects UTF-8 encoding, it uses the...

> +Unicode normalization form NFD in conjunction with the confusable
> name
> +detection component of
> +`libicu <https://github.com/unicode-org/icu>`_
> +to identify names within a directory or within a file's extended
> attributes that
> +could be confused for each other.
> +Names are also checked for control characters, non-rendering
> characters, and
> +mixing of bidirectional characters.
> +All of these potential issues are reported to the system
> administrator during
> +phase 5.
> +
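
For readers who haven't used libicu's confusable detection, here's a rough
sketch (mine, not the xfs_scrub source; error handling and the per-domain
bookkeeping are omitted) of the core idea: two names are confusable if
their "skeleton" strings compare equal.  The real code also normalizes to
NFD and checks for the other problem characters mentioned above.

/*
 * Minimal sketch: compare two names in the same naming domain by their
 * ICU confusability skeletons.  Link with -licuuc -licui18n.
 */
#include <stdio.h>
#include <string.h>
#include <unicode/uspoof.h>

static int names_confusable(USpoofChecker *sc, const char *a, const char *b)
{
	char		skel_a[256], skel_b[256];
	UErrorCode	err = U_ZERO_ERROR;

	uspoof_getSkeletonUTF8(sc, 0, a, -1, skel_a, sizeof(skel_a), &err);
	uspoof_getSkeletonUTF8(sc, 0, b, -1, skel_b, sizeof(skel_b), &err);
	if (U_FAILURE(err))
		return 0;
	return strcmp(skel_a, skel_b) == 0;
}

int main(void)
{
	UErrorCode	err = U_ZERO_ERROR;
	USpoofChecker	*sc = uspoof_open(&err);

	if (U_FAILURE(err))
		return 1;
	/* Latin "paypal" vs. the same name with a Cyrillic а (U+0430) */
	printf("confusable? %d\n",
	       names_confusable(sc, "paypal", "p\xd0\xb0ypal"));
	uspoof_close(sc);
	return 0;
}
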
> +Media Verification of File Data Extents
> +---------------------------------------
> +
> +The system administrator can elect to initiate a media scan of all
> file data
> +blocks.
> +This scan occurs after validation of all filesystem metadata (except for
> the summary
> +counters) as phase 6.
> +The scan starts by calling ``FS_IOC_GETFSMAP`` to scan the
> filesystem space map
> +to find areas that are allocated to file data fork extents.
> +Gaps between data fork extents that are smaller than 64k are
> treated as if
> +they were data fork extents to reduce the command setup overhead.
> +When the space map scan accumulates a region larger than 32MB, a
> media
> +verification request is sent to the disk as a directio read of the
> raw block
> +device.
> +
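
As an aside, a bare-bones illustration of one such verification read from
userspace (my sketch, not the xfs_scrub code; it assumes the offset,
length, and buffer alignment have already been taken care of):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Issue one media verification read against the raw block device.
 * Returns 0 if the whole region read back, -1 otherwise.  The caller
 * is assumed to have coalesced an aligned region of up to 32MB.
 */
static int media_verify(const char *bdev, off_t start, size_t len)
{
	void	*buf;
	ssize_t	ret;
	int	fd = open(bdev, O_RDONLY | O_DIRECT);

	if (fd < 0)
		return -1;
	if (posix_memalign(&buf, 4096, len)) {
		close(fd);
		return -1;
	}
	ret = pread(fd, buf, len, start);
	free(buf);
	close(fd);
	return ret == (ssize_t)len ? 0 : -1;
}

On failure, the real program re-walks the same region one block at a time,
which is what the next paragraph describes.
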
> +If the verification read fails, ``xfs_scrub`` retries with single-
> block reads
> +to narrow the failure down to the specific region of the media and
> records it.
> +When it has finished issuing verification requests, it again uses
> the space
> +mapping ioctl to map the recorded media errors back to metadata
> structures
> +and report what has been lost.
> +For media errors in blocks owned by files, the lack of parent
> pointers means
> +that the entire filesystem must be walked to report the file paths
> and offsets
> +corresponding to the media error.
> 
This last bit will need to be updated after we come to a decision with
the rfc

Other than that, I think it looks pretty good.

Allison


^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 14/14] xfs: document future directions of online fsck
  2022-12-30 22:10   ` [PATCH 14/14] xfs: document future directions of online fsck Darrick J. Wong
@ 2023-03-01  5:37     ` Allison Henderson
  2023-03-02  0:39       ` Darrick J. Wong
  0 siblings, 1 reply; 220+ messages in thread
From: Allison Henderson @ 2023-03-01  5:37 UTC (permalink / raw)
  To: djwong
  Cc: Catherine Hoang, david, willy, linux-xfs, Chandan Babu,
	linux-fsdevel, hch

On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Add the seventh and final chapter of the online fsck documentation,
> where we talk about future functionality that can tie in with the
> functionality provided by the online fsck patchset.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  .../filesystems/xfs-online-fsck-design.rst         |  155
> ++++++++++++++++++++
>  1 file changed, 155 insertions(+)
> 
> 
> diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst
> b/Documentation/filesystems/xfs-online-fsck-design.rst
> index 05b9411fac7f..41291edb02b9 100644
> --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> @@ -4067,6 +4067,8 @@ The extra flexibility enables several new use
> cases:
>    (``FIEXCHANGE_RANGE``) to exchange the file contents, thereby
> committing all
>    of the updates to the original file, or none of them.
>  
> +.. _swapext_if_unchanged:
> +
>  - **Transactional file updates**: The same mechanism as above, but
> the caller
>    only wants the commit to occur if the original file's contents
> have not
>    changed.
> @@ -4818,3 +4820,156 @@ and report what has been lost.
>  For media errors in blocks owned by files, the lack of parent
> pointers means
>  that the entire filesystem must be walked to report the file paths
> and offsets
>  corresponding to the media error.
> +
> +7. Conclusion and Future Work
> +=============================
> +
> +It is hoped that the reader of this document has followed the
> designs laid out
> +in this document and now has some familiarity with how XFS performs
> online
> +rebuilding of its metadata indices, and how filesystem users can
> interact with
> +that functionality.
> +Although the scope of this work is daunting, it is hoped that this
> guide will
> +make it easier for code readers to understand what has been built,
> for whom it
> +has been built, and why.
> +Please feel free to contact the XFS mailing list with questions.
> +
> +FIEXCHANGE_RANGE
> +----------------
> +
> +As discussed earlier, a second frontend to the atomic extent swap
> mechanism is
> +a new ioctl call that userspace programs can use to commit updates
> to files
> +atomically.
> +This frontend has been out for review for several years now, though
> the
> +necessary refinements to online repair and lack of customer demand
> mean that
> +the proposal has not been pushed very hard.
> +
> +Vectorized Scrub
> +----------------
> +
> +As it turns out, the :ref:`refactoring <scrubrepair>` of repair
> items mentioned
> +earlier was a catalyst for enabling a vectorized scrub system call.
> +Since 2018, the cost of making a kernel call has increased
> considerably on some
> +systems to mitigate the effects of speculative execution attacks.
> +This incentivizes program authors to make as few system calls as
> possible to
> +reduce the number of times an execution path crosses a security
> boundary.
> +
> +With vectorized scrub, userspace pushes to the kernel the identity
> of a
> +filesystem object, a list of scrub types to run against that object,
> and a
> +simple representation of the data dependencies between the selected
> scrub
> +types.
> +The kernel executes as much of the caller's plan as it can until it
> hits a
> +dependency that cannot be satisfied due to a corruption, and tells
> userspace
> +how much was accomplished.
> +It is hoped that ``io_uring`` will pick up enough of this
> functionality that
> +online fsck can use that instead of adding a separate vectored scrub
> system
> +call to XFS.
> +
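
Not an ABI proposal, but to make the idea concrete: a vectored request is
roughly "one object handle plus an array of scrub types", and a
barrier-style entry is one simple way to express the dependencies.  The
struct names and layout below are invented purely for illustration; the
actual interface is in the vectorized-scrub branches linked below.

#include <linux/types.h>

/* Hypothetical shapes for illustration only -- not the proposed ABI. */
struct example_scrub_vec {
	__u32	sv_type;	/* which scrub type to run */
	__u32	sv_flags;	/* in: control flags; out: results */
	__s32	sv_ret;		/* out: errno-style outcome */
	__u32	sv_barrier;	/* nonzero: stop if an earlier vec failed */
};

struct example_scrub_vec_head {
	__u64	svh_ino;	/* object identity: inode number... */
	__u32	svh_gen;	/* ...and generation, or... */
	__u32	svh_agno;	/* ...an AG number */
	__u32	svh_nr;		/* number of vectors that follow */
	__u32	svh_reserved;
	struct example_scrub_vec	svh_vecs[];
};
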
> +The relevant patchsets are the
> +`kernel vectorized scrub
> +<
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> log/?h=vectorized-scrub>`_
> +and
> +`userspace vectorized scrub
> +<
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.g
> it/log/?h=vectorized-scrub>`_
> +series.
> +
> +Quality of Service Targets for Scrub
> +------------------------------------
> +
> +One serious shortcoming of the online fsck code is that the amount
> of time that
> +it can spend in the kernel holding resource locks is basically
> unbounded.
> +Userspace is allowed to send a fatal signal to the process which
> will cause
> +``xfs_scrub`` to exit when it reaches a good stopping point, but
> there's no way
> +for userspace to provide a time budget to the kernel.
> +Given that the scrub codebase has helpers to detect fatal signals,
> it shouldn't
> +be too much work to allow userspace to specify a timeout for a
> scrub/repair
> +operation and abort the operation if it exceeds budget.
> +However, most repair functions have the property that once they
> begin to touch
> +ondisk metadata, the operation cannot be cancelled cleanly, after
> which a QoS
> +timeout is no longer useful.
> +
> +Defragmenting Free Space
> +------------------------
> +
> +Over the years, many XFS users have requested the creation of a
> program to
> +clear a portion of the physical storage underlying a filesystem so
> that it
> +becomes a contiguous chunk of free space.
> +Call this free space defragmenter ``clearspace`` for short.
> +
> +The first piece the ``clearspace`` program needs is the ability to
> read the
> +reverse mapping index from userspace.
> +This already exists in the form of the ``FS_IOC_GETFSMAP`` ioctl.
> +The second piece it needs is a new fallocate mode
> +(``FALLOC_FL_MAP_FREE_SPACE``) that allocates the free space in a
> region and
> +maps it to a file.
> +Call this file the "space collector" file.
> +The third piece is the ability to force an online repair.
> +
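
The first piece already works today.  A trivial consumer of
``FS_IOC_GETFSMAP`` (my sketch, not clearspace, which doesn't exist yet)
that dumps the first batch of mappings for a mounted filesystem could look
like this; continuing the query past the first batch is left out for
brevity:

#include <fcntl.h>
#include <limits.h>
#include <linux/fsmap.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <sys/sysmacros.h>

#define NR_RECS	128

int main(int argc, char *argv[])
{
	struct fsmap_head	*head;
	struct fsmap		*hi;
	unsigned int		i;
	int			fd;

	if (argc != 2)
		return 1;
	fd = open(argv[1], O_RDONLY);	/* any file on the filesystem */
	if (fd < 0)
		return 1;

	head = calloc(1, sizeof(*head) + NR_RECS * sizeof(struct fsmap));
	if (!head)
		return 1;
	head->fmh_count = NR_RECS;

	/* Low key stays zeroed; high key covers the whole keyspace. */
	hi = &head->fmh_keys[1];
	hi->fmr_device = UINT_MAX;
	hi->fmr_physical = ULLONG_MAX;
	hi->fmr_owner = ULLONG_MAX;
	hi->fmr_offset = ULLONG_MAX;

	if (ioctl(fd, FS_IOC_GETFSMAP, head) < 0) {
		perror("getfsmap");
		return 1;
	}

	for (i = 0; i < head->fmh_entries; i++) {
		struct fsmap	*rec = &head->fmh_recs[i];

		printf("dev %u:%u phys %llu owner %lld len %llu\n",
		       major(rec->fmr_device), minor(rec->fmr_device),
		       (unsigned long long)rec->fmr_physical,
		       (long long)rec->fmr_owner,
		       (unsigned long long)rec->fmr_length);
	}
	free(head);
	return 0;
}
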
> +To clear all the metadata out of a portion of physical storage,
> clearspace
> +uses the new fallocate map-freespace call to map any free space in
> that region
> +to the space collector file.
> +Next, clearspace finds all metadata blocks in that region by way of
> +``GETFSMAP`` and issues forced repair requests on the data
> structure.
> +This often results in the metadata being rebuilt somewhere that is
> not being
> +cleared.
> +After each relocation, clearspace calls the "map free space"
> function again to
> +collect any newly freed space in the region being cleared.
> +
> +To clear all the file data out of a portion of the physical storage,
> clearspace
> +uses the FSMAP information to find relevant file data blocks.
> +Having identified a good target, it uses the ``FICLONERANGE`` call
> on that part
> +of the file to try to share the physical space with a dummy file.
> +Cloning the extent means that the original owners cannot overwrite
> the
> +contents; any changes will be written somewhere else via copy-on-
> write.
> +Clearspace makes its own copy of the frozen extent in an area that
> is not being
> +cleared, and uses ``FIDEDUPERANGE`` (or the :ref:`atomic extent swap
> +<swapext_if_unchanged>` feature) to change the target file's data
> extent
> +mapping away from the area being cleared.
> +When all other mappings have been moved, clearspace reflinks the
> space into the
> +space collector file so that it becomes unavailable.
> +
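
The "freeze the extent with reflink" step above also only needs facilities
that exist today.  A minimal sketch (mine; again, clearspace itself is
hypothetical) of pinning a candidate extent by sharing it with a scratch
file:

#include <linux/fs.h>
#include <sys/ioctl.h>

/*
 * Share [src_off, src_off + len) of src_fd with scratch_fd so that any
 * later write to the original file goes elsewhere via copy-on-write.
 * Both files must live on the same reflink-capable filesystem and the
 * ranges must be block-aligned.
 */
static int freeze_extent(int src_fd, __u64 src_off, __u64 len,
			 int scratch_fd, __u64 dst_off)
{
	struct file_clone_range	fcr = {
		.src_fd		= src_fd,
		.src_offset	= src_off,
		.src_length	= len,
		.dest_offset	= dst_off,
	};

	return ioctl(scratch_fd, FICLONERANGE, &fcr);
}
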
> +There are further optimizations that could apply to the above
> algorithm.
> +To clear a piece of physical storage that has a high sharing factor,
> it is
> +strongly desirable to retain this sharing factor.
> +In fact, these extents should be moved first to maximize sharing
> factor after
> +the operation completes.
> +To make this work smoothly, clearspace needs a new ioctl
> +(``FS_IOC_GETREFCOUNTS``) to report reference count information to
> userspace.
> +With the refcount information exposed, clearspace can quickly find
> the longest,
> +most shared data extents in the filesystem, and target them first.
> +


> +**Question**: How might the filesystem move inode chunks?
> +
> +*Answer*: 
"In order to move inode chunks.."

> Dave Chinner has a prototype that creates a new file with the old
> +contents and then locklessly runs around the filesystem updating
> directory
> +entries.
> +The operation cannot complete if the filesystem goes down.
> +That problem isn't totally insurmountable: create an inode remapping
> table
> +hidden behind a jump label, and a log item that tracks the kernel
> walking the
> +filesystem to update directory entries.
> +The trouble is, the kernel can't do anything about open files, since
> it cannot
> +revoke them.
> +


> +**Question**: Can static keys be used to add a revoke bailout return
> to
> +*every* code path coming in from userspace?
> +
> +*Answer*: In principle, yes.
> +This 

"It is also possible to use static keys to add a revoke bailout return
to each code path coming in from userspace.  This..."

> would eliminate the overhead of the check until a revocation happens.
> +It's not clear what we do to a revoked file after all the callers
> are finished
> +with it, however.
> +
> +The relevant patchsets are the
> +`kernel freespace defrag
> +<
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> log/?h=defrag-freespace>`_
> +and
> +`userspace freespace defrag
> +<
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.g
> it/log/?h=defrag-freespace>`_
> +series.

I guess since they're just future ideas just light documentation is
fine.  Other than cleaning out the Q & A's, I think it looks pretty
good.

Allison

> +
> +Shrinking Filesystems
> +---------------------
> +
> +Removing the end of the filesystem ought to be a simple matter of
> evacuating
> +the data and metadata at the end of the filesystem, and handing the
> freed space
> +to the shrink code.
> +That requires an evacuation of the space at the end of the filesystem,
> which is a
> +use of free space defragmentation!
> 


^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 10/14] xfs: document full filesystem scans for online fsck
  2023-02-25  7:33         ` Allison Henderson
@ 2023-03-01 22:09           ` Darrick J. Wong
  0 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2023-03-01 22:09 UTC (permalink / raw)
  To: Allison Henderson
  Cc: Catherine Hoang, david, willy, linux-xfs, Chandan Babu,
	linux-fsdevel, hch

On Sat, Feb 25, 2023 at 07:33:38AM +0000, Allison Henderson wrote:

<snip>

> > > Mostly looks good nits aside, I do sort of wonder if this patch
> > > would
> > > do better to appear before patch 6 (or move 6 down), since it gets
> > > into
> > > more challenges concerning locks and hooks, where as here we are
> > > mostly
> > > discussing what they are and how they work.  So it might build
> > > better
> > > to move this patch up a little.
> > 
> > (I might be a tad confused here, bear with me.)
> > 
> > Patch 6, the section about eventual consistency?
> > 
> > Hmm.  The intent drains exist to quiesce intent chains targeting
> > specific AGs.  It briefly mentions "fshooks" in the context of using
> > jump labels to avoid the overhead of calling notify_all on the drain
> > waitqueue when scrub isn't running.  That's perhaps bad naming on my
> > part, since the other "fshooks" are jump labels to avoid bouncing
> > through the notifier chain code when scrub isn't running.  The jump
> > labels themselves are not hooks, they're structured dynamic code
> > patching.
> > 
> > I probably should've named those something else.  fsgates?
> Oh, i see, yes I did sort of try to correlate them, so maybe the
> different name would help.

Done.

> > Or maybe you were talking specifically about "Case Study: Rebuilding
> > Reverse Mapping Records"?  In which case I remark that the case study
> > needs both the intent drains to quiesce the AG and the live scans to
> > work properly, which is why the case study of it couldn't come
> > earlier.
> > The intent drains section still ought to come before the refcountbt
> > section, because it's the refcountbt scrubber that first hit the
> > coordination problem.
> > 
> > Things are getting pretty awkward like this because there are sooo
> > many
> > interdependent pieces. :(
> 
> I see, ok no worries then, I think people will figure it out either
> way.  I mostly look for ways to make the presentation easier but it is
> getting harder to move stuff with chicken and egg dependencies.

Indeed.  Thank you so much for your patience. :)

--D

> > 
> > Regardless, thank you very much for slogging through.
> > 
> > --D
> > 
> > > Allison
> > > 
> 

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH v24.3 12/14] xfs: document directory tree repairs
  2023-02-25  7:33       ` Allison Henderson
@ 2023-03-02  0:14         ` Darrick J. Wong
  2023-03-03 23:50           ` Allison Henderson
  0 siblings, 1 reply; 220+ messages in thread
From: Darrick J. Wong @ 2023-03-02  0:14 UTC (permalink / raw)
  To: Allison Henderson
  Cc: david, linux-fsdevel, hch, linux-xfs, willy, Catherine Hoang,
	Chandan Babu

On Sat, Feb 25, 2023 at 07:33:23AM +0000, Allison Henderson wrote:
> On Thu, 2023-02-02 at 18:12 -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Directory tree repairs are the least complete part of online fsck,
> > due
> > to the lack of directory parent pointers.  However, even without that
> > feature, we can still make some corrections to the directory tree --
> > we
> > can salvage as many directory entries as we can from a damaged
> > directory, and we can reattach orphaned inodes to the lost+found,
> > just
> > as xfs_repair does now.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> > v24.2: updated with my latest thoughts about how to use parent
> > pointers
> > v24.3: updated to reflect the online fsck code I built for parent
> > pointers
> > ---
> >  .../filesystems/xfs-online-fsck-design.rst         |  410
> > ++++++++++++++++++++
> >  1 file changed, 410 insertions(+)
> > 
> > diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst
> > b/Documentation/filesystems/xfs-online-fsck-design.rst
> > index af7755fe0107..51d040e4a2d0 100644
> > --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> > +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> > @@ -4359,3 +4359,413 @@ The proposed patchset is the
> >  `extended attribute repair
> >  <
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > log/?h=repair-xattrs>`_
> >  series.
> > +
> > +Fixing Directories
> > +------------------
> > +
> > +Fixing directories is difficult with currently available filesystem
> > features,
> > +since directory entries are not redundant.
> > +The offline repair tool scans all inodes to find files with nonzero
> > link count,
> > +and then it scans all directories to establish parentage of those
> > linked files.
> > +Damaged files and directories are zapped, and files with no parent
> > are
> > +moved to the ``/lost+found`` directory.
> > +It does not try to salvage anything.
> > +
> > +The best that online repair can do at this time is to read directory
> > data
> > +blocks and salvage any dirents that look plausible, correct link
> > counts, and
> > +move orphans back into the directory tree.
> > +The salvage process is discussed in the case study at the end of
> > this section.
> > +The :ref:`file link count fsck <nlinks>` code takes care of fixing
> > link counts
> > +and moving orphans to the ``/lost+found`` directory.
> > +
> > +Case Study: Salvaging Directories
> > +`````````````````````````````````
> > +
> > +Unlike extended attributes, directory blocks are all the same size,
> > so
> > +salvaging directories is straightforward:
> > +
> > +1. Find the parent of the directory.
> > +   If the dotdot entry is readable, try to confirm that the
> > alleged
> > +   parent has a child entry pointing back to the directory being
> > repaired.
> > +   Otherwise, walk the filesystem to find it.
> > +
> > +2. Walk the first partition of the data fork of the directory to find
> > the directory
> > +   entry data blocks.
> > +   When one is found,
> > +
> > +   a. Walk the directory data block to find candidate entries.
> > +      When an entry is found:
> > +
> > +      i. Check the name for problems, and ignore the name if there
> > are any.
> > +
> > +      ii. Retrieve the inumber and grab the inode.
> > +          If that succeeds, add the name, inode number, and file
> > type to the
> > +          staging xfarray and xblob.
> > +
> > +3. If the memory usage of the xfarray and xfblob exceeds a certain
> > amount of
> > +   memory or there are no more directory data blocks to examine,
> > unlock the
> > +   directory and add the staged dirents into the temporary
> > directory.
> > +   Truncate the staging files.
> > +
> > +4. Use atomic extent swapping to exchange the new and old directory
> > structures.
> > +   The old directory blocks are now attached to the temporary file.
> > +
> > +5. Reap the temporary file.
> > +
> 
> 
> 
> > +**Future Work Question**: Should repair revalidate the dentry cache
> > when
> > +rebuilding a directory?
> > +
> > +*Answer*: Yes, though the current dentry cache code doesn't provide
> > a means
> > +to walk every dentry of a specific directory.
> > +If the cache contains an entry that the salvaging code does not
> > find, the
> > +repair cannot proceed.
> > +
> > +**Future Work Question**: Can the dentry cache know about a
> > directory entry
> > +that cannot be salvaged?
> > +
> > +*Answer*: In theory, the dentry cache should be a subset of the
> > directory
> > +entries on disk because there's no way to load a dentry without
> > having
> > +something to read in the directory.
> > +However, it is possible for a coherency problem to be introduced if
> > the ondisk
> > +structures become corrupt *after* the cache loads.
> > +In theory it is necessary to scan all dentry cache entries for a
> > directory to
> > +ensure that one of the following apply:
> 
> "Currently the dentry cache code doesn't provide a means to walk every
> dentry of a specific directory.  This makes validation of the rebuilt
> directory difficult, and it is possible that an ondisk structure to
> become corrupt *after* the cache loads.  Walking the dentry cache is
> currently being considered as a future improvement.  This will also
> enable the ability to report which entries were not salvageable since
> these will be the subset of entries that are absent after the walk. 
> This improvement will ensure that one of the following apply:"

The thing is -- I'm not considering restructuring the dentry cache.  The
cache key is a one-way hash function of the parent_ino and the dirent
name, and I can't even imagine how one would support using that for
arbitrary lookups or walks.

This is the giant hole in all of the online repair code -- the design of
the dentry cache is such that we can't invalidate the entire cache.  We
also cannot walk it to perform targeted invalidation of just the pieces
we want.  If after a repair the cache contains a dentry that isn't
backed by an actual ondisk directory entry ... kaboom.

The one thing I'll grant you is that I don't think it's likely that the
dentry cache will get populated with some information and later the
ondisk directory bitrots undetectably.

> ?
> 
> I just think it reads cleaner.  I realize this is an area that still
> sort of in flux, but definitely before we call the document done we
> should probably strip out the Q's and just document the A's.  If
> someone re-raises the Q's we can always refer to the archives and then
> have the discussion on the mailing list.  But I think the document
> should maintain the goal of making clear whatever the current plan is
> just to keep it reading cleanly. 

Yeah, I'll shorten this section so that it only mentions these things
once and clearly states that I have no solution.

> > +
> > +1. The cached dentry reflects an ondisk dirent in the new directory.
> > +
> > +2. The cached dentry no longer has a corresponding ondisk dirent in
> > the new
> > +   directory and the dentry can be purged from the cache.
> > +
> > +3. The cached dentry no longer has an ondisk dirent but the dentry
> > cannot be
> > +   purged.
> 
> > +   This is bad.
> These entries are irrecoverable, but can now be reported.
> 
> 
> 
> > +
> > +As mentioned above, the dentry cache does not have a means to walk
> > all the
> > +dentries with a particular directory as a parent.
> > +This makes detecting situations #2 and #3 impossible, and remains an
> > +interesting question for research.
> I think the above paraphrase makes this last bit redundant.

N

> > +
> > +The proposed patchset is the
> > +`directory repair
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > log/?h=repair-dirs>`_
> > +series.
> > +
> > +Parent Pointers
> > +```````````````
> > +
> "Generally speaking, a parent pointer is any kind of metadata that
> enables an inode to locate its parent with out having to traverse the
> directory tree from the root."
> 
> > +The lack of secondary directory metadata hinders directory tree
> "Without them, the lack of secondary..." 

Ok.  I want to reword the first sentence slightly, yielding this:

"A parent pointer is a piece of file metadata that enables a user to
locate the file's parent directory without having to traverse the
directory tree from the root.  Without them, reconstruction of directory
trees is hindered in much the same way that the historic lack of reverse
space mapping information once hindered reconstruction of filesystem
space metadata.  The parent pointer feature, however, makes total
directory reconstruction
possible."

But that's a much better start to the paragraph, thank you.

> > reconstruction
> > +in much the same way that the historic lack of reverse space mapping
> > +information once hindered reconstruction of filesystem space
> > metadata.
> > +The parent pointer feature, however, makes total directory
> > reconstruction
> > +possible.
> > +
> 
> History side bar the below chunk...

Done.

> > +Directory parent pointers were first proposed as an XFS feature more
> > than a
> > +decade ago by SGI.
> > +Each link from a parent directory to a child file is mirrored with
> > an extended
> > +attribute in the child that could be used to identify the parent
> > directory.
> > +Unfortunately, this early implementation had major shortcomings and
> > was never
> > +merged into Linux XFS:
> > +
> > +1. The XFS codebase of the late 2000s did not have the
> > infrastructure to
> > +   enforce strong referential integrity in the directory tree.
> > +   It did not guarantee that a change in a forward link would always
> > be
> > +   followed up with the corresponding change to the reverse links.
> > +
> > +2. Referential integrity was not integrated into offline repair.
> > +   Checking and repairs were performed on mounted filesystems
> > without taking
> > +   any kernel or inode locks to coordinate access.
> > +   It is not clear how this actually worked properly.
> > +
> > +3. The extended attribute did not record the name of the directory
> > entry in the
> > +   parent, so the SGI parent pointer implementation cannot be used
> > to reconnect
> > +   the directory tree.
> > +
> > +4. Extended attribute forks only support 65,536 extents, which means
> > that
> > +   parent pointer attribute creation is likely to fail at some point
> > before the
> > +   maximum file link count is achieved.
> 
> 
> "The original parent pointer design was too unstable for something like
> a file system repair to depend on."

Er... I think this is addressed by #2 above?

> > +
> > +Allison Henderson, Chandan Babu, and Catherine Hoang are working on
> > a second
> > +implementation that solves all shortcomings of the first.
> > +During 2022, Allison introduced log intent items to track physical
> > +manipulations of the extended attribute structures.
> > +This solves the referential integrity problem by making it possible
> > to commit
> > +a dirent update and a parent pointer update in the same transaction.
> > +Chandan increased the maximum extent counts of both data and
> > attribute forks,
> 
> > +thereby addressing the fourth problem.
> which ensures the parent pointer creation will succeed even if the max
> extent count is reached.

The max extent count cannot be exceeded, but the nrext64 feature ensures
that the xattr structure can grow enough to handle maximal hardlinking.

"Chandan increased the maximum extent counts of both data and attribute
forks, thereby ensuring that the extended attribute structure can grow
to handle the maximum hardlink count of any file."

> > +
> > +To solve the third problem, parent pointers include the dirent name
> "Lastly, the new design includes the dirent name..."

<nod>

> > and
> > +location of the entry within the parent directory.
> > +In other words, child files use extended attributes to store
> > pointers to
> > +parents in the form ``(parent_inum, parent_gen, dirent_pos) →
> > (dirent_name)``.
> This parts still in flux, so probably this will have to get updated
> later...

Yep, I'll add a note about that.

> > +
> > +On a filesystem with parent pointers, the directory checking process
> > can be
> > +strengthened to ensure that the target of each dirent also contains
> > a parent
> > +pointer pointing back to the dirent.
> > +Likewise, each parent pointer can be checked by ensuring that the
> > target of
> > +each parent pointer is a directory and that it contains a dirent
> > matching
> > +the parent pointer.
> > +Both online and offline repair can use this strategy.

I moved this paragraph up to become the second paragraph, and now it
reads:

"XFS parent pointers include the dirent name and location of the entry
within the parent directory.  In other words, child files use extended
attributes to store pointers to parents in the form ``(parent_inum,
parent_gen, dirent_pos) → (dirent_name)``.  The directory checking
process can be strengthened to ensure that the target of each dirent
also contains a parent pointer pointing back to the dirent.  Likewise,
each parent pointer can be checked by ensuring that the target of each
parent pointer is a directory and that it contains a dirent matching the
parent pointer.  Both online and offline repair can use this strategy.

Note: The ondisk format of parent pointers is not yet finalized."

After which comes the historical sidebar.

> > +
> > +Case Study: Repairing Directories with Parent Pointers
> > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > +
> > +Directory rebuilding uses a :ref:`coordinated inode scan <iscan>`
> > and
> > +a :ref:`directory entry live update hook <liveupdate>` as follows:
> > +
> > +1. Set up a temporary directory for generating the new directory
> > structure,
> > +   an xfblob for storing entry names, and an xfarray for stashing
> > directory
> > +   updates.
> > +
> > +2. Set up an inode scanner and hook into the directory entry code to
> > receive
> > +   updates on directory operations.
> > +
> > +3. For each parent pointer found in each file scanned, decide if the
> > parent
> > +   pointer references the directory of interest.
> > +   If so:
> > +
> > +   a. Stash an addname entry for this dirent in the xfarray for
> > later.
> > +
> > +   b. When finished scanning that file, flush the stashed updates to
> > the
> > +      temporary directory.
> > +
> > +4. For each live directory update received via the hook, decide if
> > the child
> > +   has already been scanned.
> > +   If so:
> > +
> > +   a. Stash an addname or removename entry for this dirent update in
> > the
> > +      xfarray for later.
> > +      We cannot write directly to the temporary directory because
> > hook
> > +      functions are not allowed to modify filesystem metadata.
> > +      Instead, we stash updates in the xfarray and rely on the
> > scanner thread
> > +      to apply the stashed updates to the temporary directory.
> > +
> > +5. When the scan is complete, atomically swap the contents of the
> > temporary
> > +   directory and the directory being repaired.
> > +   The temporary directory now contains the damaged directory
> > structure.
> > +
> > +6. Reap the temporary directory.
> > +
> > +7. Update the dirent position field of parent pointers as necessary.
> > +   This may require the queuing of a substantial number of xattr log
> > intent
> > +   items.
> > +
> > +The proposed patchset is the
> > +`parent pointers directory repair
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > log/?h=pptrs-online-dir-repair>`_
> > +series.
> > +
> > +**Unresolved Question**: How will repair ensure that the
> > ``dirent_pos`` fields
> > +match in the reconstructed directory?
> > +
> > +*Answer*: There are a few ways to solve this problem:
> > +
> > +1. The field could be designated advisory, since the other three
> > values are
> > +   sufficient to find the entry in the parent.
> > +   However, this makes indexed key lookup impossible while repairs
> > are ongoing.
> > +
> > +2. We could allow creating directory entries at specified offsets,
> > which solves
> > +   the referential integrity problem but runs the risk that dirent
> > creation
> > +   will fail due to conflicts with the free space in the directory.
> > +
> > +   These conflicts could be resolved by appending the directory
> > entry and
> > +   amending the xattr code to support updating an xattr key and
> > reindexing the
> > +   dabtree, though this would have to be performed with the parent
> > directory
> > +   still locked.
> > +
> > +3. Same as above, but remove the old parent pointer entry and add a
> > new one
> > +   atomically.
> > +
> > +4. Change the ondisk xattr format to ``(parent_inum, name) →
> > (parent_gen)``,
> > +   which would provide the attr name uniqueness that we require,
> > without
> > +   forcing repair code to update the dirent position.
> > +   Unfortunately, this requires changes to the xattr code to support
> > attr
> > +   names as long as 263 bytes.
> > +
> > +5. Change the ondisk xattr format to ``(parent_inum, hash(name)) →
> > +   (name, parent_gen)``.
> > +   If the hash is sufficiently resistant to collisions (e.g. sha256)
> > then
> > +   this should provide the attr name uniqueness that we require.
> > +   Names shorter than 247 bytes could be stored directly.
> I think the RFC deluge is the same question but more context, so
> probably this section will follow what we decide there.  I will save
> commentary to keep the discussion in the same thread...
> 
> I'll just link it here for anyone else following this for now...
> https://www.spinics.net/lists/linux-xfs/msg69397.html

Yes, the deluge has much more detailed information.  I'll add this link
(for now) to the doc.

> > +
> > +Case Study: Repairing Parent Pointers
> > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > +
> > +Online reconstruction of a file's parent pointer information works
> > similarly to
> > +directory reconstruction:
> > +
> > +1. Set up a temporary file for generating a new extended attribute
> > structure,
> > +   an xfblob for storing parent pointer names, and an xfarray for
> > stashing
> > +   parent pointer updates.
> we did talk about blobs in patch 6 though it took me a moment to
> remember... if there's a way to link or tag it, that would be helpful
> for with the quick refresh.  kinda like wikipedia hyperlinks, you
> really only need like the first line or two to get it snap back

There is; I'll put in a backreference.

> > +
> > +2. Set up an inode scanner and hook into the directory entry code to
> > receive
> > +   updates on directory operations.
> > +
> > +3. For each directory entry found in each directory scanned, decide
> > if the
> > +   dirent references the file of interest.
> > +   If so:
> > +
> > +   a. Stash an addpptr entry for this parent pointer in the xfblob
> > and xfarray
> > +      for later.
> > +
> > +   b. When finished scanning the directory, flush the stashed
> > updates to the
> > +      temporary directory.
> > +
> > +4. For each live directory update received via the hook, decide if
> > the parent
> > +   has already been scanned.
> > +   If so:
> > +
> > +   a. Stash an addpptr or removepptr entry for this dirent update in
> > the
> > +      xfarray for later.
> > +      We cannot write parent pointers directly to the temporary file
> > because
> > +      hook functions are not allowed to modify filesystem metadata.
> > +      Instead, we stash updates in the xfarray and rely on the
> > scanner thread
> > +      to apply the stashed parent pointer updates to the temporary
> > file.
> > +
> > +5. Copy all non-parent pointer extended attributes to the temporary
> > file.
> > +
> > +6. When the scan is complete, atomically swap the attribute fork of
> > the
> > +   temporary file and the file being repaired.
> > +   The temporary file now contains the damaged extended attribute
> > structure.
> > +
> > +7. Reap the temporary file.
> Seems like it should work

Let's hope so!

> > +
> > +The proposed patchset is the
> > +`parent pointers repair
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > log/?h=pptrs-online-parent-repair>`_
> > +series.
> > +
> > +Digression: Offline Checking of Parent Pointers
> > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > +
> > +Examining parent pointers in offline repair works differently
> > because corrupt
> > +files are erased long before directory tree connectivity checks are
> > performed.
> > +Parent pointer checks are therefore a second pass to be added to the
> > existing
> > +connectivity checks:
> > +
> > +1. After the set of surviving files has been established (i.e. phase
> > 6),
> > +   walk the surviving directories of each AG in the filesystem.
> > +   This is already performed as part of the connectivity checks.
> > +
> > +2. For each directory entry found, record the name in an xfblob, and
> > store
> > +   ``(child_ag_inum, parent_inum, parent_gen, dirent_pos)`` tuples
> > in a
> > +   per-AG in-memory slab.
> > +
> > +3. For each AG in the filesystem,
> > +
> > +   a. Sort the per-AG tuples in order of child_ag_inum, parent_inum,
> > and
> > +      dirent_pos.
> > +
> > +   b. For each inode in the AG,
> > +
> > +      1. Scan the inode for parent pointers.
> > +         Record the names in a per-file xfblob, and store
> > ``(parent_inum,
> > +         parent_gen, dirent_pos)`` tuples in a per-file slab.
> > +
> > +      2. Sort the per-file tuples in order of parent_inum, and
> > dirent_pos.
> > +
> > +      3. Position one slab cursor at the start of the inode's
> > records in the
> > +         per-AG tuple slab.
> > +         This should be trivial since the per-AG tuples are in child
> > inumber
> > +         order.
> > +
> > +      4. Position a second slab cursor at the start of the per-file
> > tuple slab.
> > +
> > +      5. Iterate the two cursors in lockstep, comparing the
> > parent_ino and
> > +         dirent_pos fields of the records under each cursor.
> > +
> > +         a. Tuples in the per-AG list but not the per-file list are
> > missing and
> > +            need to be written to the inode.
> > +
> > +         b. Tuples in the per-file list but not the per-AG list are
> > dangling
> > +            and need to be removed from the inode.
> > +
> > +         c. For tuples in both lists, update the parent_gen and name
> > components
> > +            of the parent pointer if necessary.
> > +
> > +4. Move on to examining link counts, as we do today.
> > +
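
The lockstep cursor walk in step 3b is just a merge of two sorted lists.
In case it helps to see it in miniature, here is a toy standalone version
(mine, not xfs_repair code) that classifies records as missing, dangling,
or common:

#include <stdio.h>

struct tuple {
	unsigned long long	parent_ino;
	unsigned int		dirent_pos;
};

static int cmp(const struct tuple *a, const struct tuple *b)
{
	if (a->parent_ino != b->parent_ino)
		return a->parent_ino < b->parent_ino ? -1 : 1;
	if (a->dirent_pos != b->dirent_pos)
		return a->dirent_pos < b->dirent_pos ? -1 : 1;
	return 0;
}

/* ag[] comes from scanning directories; file[] from the inode's parent
 * pointers.  Both lists must already be sorted. */
static void lockstep(const struct tuple *ag, int nr_ag,
		     const struct tuple *file, int nr_file)
{
	int	i = 0, j = 0;

	while (i < nr_ag || j < nr_file) {
		int	c;

		if (i >= nr_ag)
			c = 1;		/* only per-file records remain */
		else if (j >= nr_file)
			c = -1;		/* only per-AG records remain */
		else
			c = cmp(&ag[i], &file[j]);

		if (c < 0) {
			/* dirent exists, pptr missing: write it */
			printf("missing pptr: dir %llu pos %u\n",
			       ag[i].parent_ino, ag[i].dirent_pos);
			i++;
		} else if (c > 0) {
			/* pptr exists, dirent missing: dangling, remove it */
			printf("dangling pptr: dir %llu pos %u\n",
			       file[j].parent_ino, file[j].dirent_pos);
			j++;
		} else {
			/* present in both: compare parent_gen and name */
			i++;
			j++;
		}
	}
}

int main(void)
{
	struct tuple	ag[] = { { 128, 1 }, { 128, 3 }, { 256, 2 } };
	struct tuple	file[] = { { 128, 1 }, { 200, 7 }, { 256, 2 } };

	lockstep(ag, 3, file, 3);
	return 0;
}
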
> > +The proposed patchset is the
> > +`offline parent pointers repair
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.g
> > it/log/?h=pptrs-repair>`_
> > +series.
> > +
> > +Rebuilding directories from parent pointers in offline repair is
> > very
> > +challenging because it currently uses a single-pass scan of the
> > filesystem
> > +during phase 3 to decide which files are corrupt enough to be
> > zapped.
> > +This scan would have to be converted into a multi-pass scan:
> > +
> > +1. The first pass of the scan zaps corrupt inodes, forks, and
> > attributes
> > +   much as it does now.
> > +   Corrupt directories are noted but not zapped.
> > +
> > +2. The next pass records parent pointers pointing to the directories
> > noted
> > +   as being corrupt in the first pass.
> > +   This second pass may have to happen after the phase 4 scan for
> > duplicate
> > +   blocks, if phase 4 is also capable of zapping directories.
> > +
> > +3. The third pass resets corrupt directories to an empty shortform
> > directory.
> > +   Free space metadata has not been ensured yet, so repair cannot
> > yet use the
> > +   directory building code in libxfs.
> > +
> > +4. At the start of phase 6, space metadata have been rebuilt.
> > +   Use the parent pointer information recorded during step 2 to
> > reconstruct
> > +   the dirents and add them to the now-empty directories.
> > +
> > +This code has not yet been constructed.
> > +
> > +.. _orphanage:
> > +
> > +The Orphanage
> > +-------------
> > +
> > +Filesystems present files as a directed, and hopefully acyclic,
> > graph.
> > +In other words, a tree.
> > +The root of the filesystem is a directory, and each entry in a
> > directory points
> > +downwards either to more subdirectories or to non-directory files.
> > +Unfortunately, a disruption in the directory graph pointers results
> > in a
> > +disconnected graph, which makes files impossible to access via
> > regular path
> > +resolution.
> > +The directory parent pointer online scrub code can detect a dotdot
> > entry
> > +pointing to a parent directory that doesn't have a link back to the
> > child
> > +directory, and the file link count checker can detect a file that
> > isn't pointed
> > +to by any directory in the filesystem.
> > +If the file in question has a positive link count, the file is an
> > +orphan.
> 
> Hmm, I kinda felt like this should have flowed into something like:
> "now that we have parent pointers, we can reparent them instead of
> putting them in the orphanage..."

That's only true if we actually *find* the relevant forward or back
pointers.  If a file has positive link count but there aren't any links
to it from anywhere, we still have to dump it in the /lost+found.

Parent pointers make it a lot less likely that we'll have to put a file
in the /lost+found, but it's still possible.

I think I'll change this paragraph to start:

"Without parent pointers, the directory parent pointer online scrub code
can detect a dotdot entry pointing to a parent directory..."

and then add a new paragraph:

"With parent pointers, directories can be rebuilt by scanning parent
pointers and parent pointers can be rebuilt by scanning directories.
This should reduce the incidence of files ending up in ``/lost+found``."

> ?
> > +
> > +When orphans are found, they should be reconnected to the directory
> > tree.
> > +Offline fsck solves the problem by creating a directory
> > ``/lost+found`` to
> > +serve as an orphanage, and linking orphan files into the orphanage
> > by using the
> > +inumber as the name.
> > +Reparenting a file to the orphanage does not reset any of its
> > permissions or
> > +ACLs.
> > +
> > +This process is more involved in the kernel than it is in userspace.
> > +The directory and file link count repair setup functions must use
> > the regular
> > +VFS mechanisms to create the orphanage directory with all the
> > necessary
> > +security attributes and dentry cache entries, just like a regular
> > directory
> > +tree modification.
> > +
> > +Orphaned files are adopted by the orphanage as follows:
> > +
> > +1. Call ``xrep_orphanage_try_create`` at the start of the scrub
> > setup function
> > +   to try to ensure that the lost and found directory actually
> > exists.
> > +   This also attaches the orphanage directory to the scrub context.
> > +
> > +2. If the decision is made to reconnect a file, take the IOLOCK of
> > both the
> > +   orphanage and the file being reattached.
> > +   The ``xrep_orphanage_iolock_two`` function follows the inode
> > locking
> > +   strategy discussed earlier.
> > +
> > +3. Call ``xrep_orphanage_compute_blkres`` and
> > ``xrep_orphanage_compute_name``
> > +   to compute the new name in the orphanage and the block
> > reservation required.
> > +
> > +4. Use ``xrep_orphanage_adoption_prep`` to reserve resources to the
> > repair
> > +   transaction.
> > +
> > +5. Call ``xrep_orphanage_adopt`` to reparent the orphaned file into
> > the lost
> > +   and found, and update the kernel dentry cache.
> > +
> > +The proposed patches are in the
> > +`orphanage adoption
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > log/?h=repair-orphanage>`_
> > +series.
> 
> Certainly we'll need to come back and update all the parts that would
> be affected by the RFC, but otherwise looks ok.  It seems trying to
> document code before it's written tends to cause things to go around
> for a while, since we really just cant know how stable a design is
> until it's been through at least a few prototypes.

Agreed!

--D

> Allison

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 13/14] xfs: document the userspace fsck driver program
  2023-03-01  5:36     ` Allison Henderson
@ 2023-03-02  0:27       ` Darrick J. Wong
  2023-03-03 23:51         ` Allison Henderson
  0 siblings, 1 reply; 220+ messages in thread
From: Darrick J. Wong @ 2023-03-02  0:27 UTC (permalink / raw)
  To: Allison Henderson
  Cc: Catherine Hoang, david, willy, linux-xfs, Chandan Babu,
	linux-fsdevel, hch

On Wed, Mar 01, 2023 at 05:36:59AM +0000, Allison Henderson wrote:
> On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Add the sixth chapter of the online fsck design documentation, where
> > we discuss the details of the data structures and algorithms used by
> > the
> > driver program xfs_scrub.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  .../filesystems/xfs-online-fsck-design.rst         |  313
> > ++++++++++++++++++++
> >  1 file changed, 313 insertions(+)
> > 
> > 
> > diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst
> > b/Documentation/filesystems/xfs-online-fsck-design.rst
> > index 2e20314f1831..05b9411fac7f 100644
> > --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> > +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> > @@ -300,6 +300,9 @@ The seven phases are as follows:
> >  7. Re-check the summary counters and present the caller with a
> > summary of
> >     space usage and file counts.
> >  
> > +This allocation of responsibilities will be :ref:`revisited
> > <scrubcheck>`
> > +later in this document.
> > +
> >  Steps for Each Scrub Item
> >  -------------------------
> >  
> > @@ -4505,3 +4508,313 @@ The proposed patches are in the
> >  `orphanage adoption
> >  <
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > log/?h=repair-orphanage>`_
> >  series.
> > +
> > +6. Userspace Algorithms and Data Structures
> > +===========================================
> > +
> > +This section discusses the key algorithms and data structures of the
> > userspace
> > +program, ``xfs_scrub``, that provide the ability to drive metadata
> > checks and
> > +repairs in the kernel, verify file data, and look for other
> > potential problems.
> > +
> > +.. _scrubcheck:
> > +
> > +Checking Metadata
> > +-----------------
> > +
> > +Recall the :ref:`phases of fsck work<scrubphases>` outlined earlier.
> > +That structure follows naturally from the data dependencies designed
> > into the
> > +filesystem from its beginnings in 1993.
> > +In XFS, there are several groups of metadata dependencies:
> > +
> > +a. Filesystem summary counts depend on consistency within the inode
> > indices,
> > +   the allocation group space btrees, and the realtime volume space
> > +   information.
> > +
> > +b. Quota resource counts depend on consistency within the quota file
> > data
> > +   forks, inode indices, inode records, and the forks of every file
> > on the
> > +   system.
> > +
> > +c. The naming hierarchy depends on consistency within the directory
> > and
> > +   extended attribute structures.
> > +   This includes file link counts.
> > +
> > +d. Directories, extended attributes, and file data depend on
> > consistency within
> > +   the file forks that map directory and extended attribute data to
> > physical
> > +   storage media.
> > +
> > +e. The file forks depend on consistency within inode records and
> > the space
> > +   metadata indices of the allocation groups and the realtime
> > volume.
> > +   This includes quota and realtime metadata files.
> > +
> > +f. Inode records depend on consistency within the inode metadata
> > indices.
> > +
> > +g. Realtime space metadata depend on the inode records and data
> > forks of the
> > +   realtime metadata inodes.
> > +
> > +h. The allocation group metadata indices (free space, inodes,
> > reference count,
> > +   and reverse mapping btrees) depend on consistency within the AG
> > headers and
> > +   between all the AG metadata btrees.
> > +
> > +i. ``xfs_scrub`` depends on the filesystem being mounted and kernel
> > support
> > +   for online fsck functionality.
> > +
> > +Therefore, a metadata dependency graph is a convenient way to
> > schedule checking
> > +operations in the ``xfs_scrub`` program:
> > +
> > +- Phase 1 checks that the provided path maps to an XFS filesystem
> > and detects
> > +  the kernel's scrubbing abilities, which validates group (i).
> > +
> > +- Phase 2 scrubs groups (g) and (h) in parallel using a threaded
> > workqueue.
> > +
> > +- Phase 3 checks groups (f), (e), and (d), in that order.
> > +  These groups are all file metadata, which means that inodes are
> > scanned in
> > +  parallel.
> ...When things are done in order, then they are done in serial right?
> Things done in parallel are done at the same time.  Either the phrase
> "in that order" needs to go away, or the last line needs to drop

Each inode is processed in parallel, but individual inodes are processed
in f-e-d order.

"Phase 3 scans inodes in parallel.  For each inode, groups (f), (e), and
(d) are checked, in that order."

> > +
> > +- Phase 4 repairs everything in groups (i) through (d) so that
> > phases 5 and 6
> > +  may run reliably.
> > +
> > +- Phase 5 starts by checking groups (b) and (c) in parallel before
> > moving on
> > +  to checking names.
> > +
> > +- Phase 6 depends on groups (i) through (b) to find file data blocks
> > to verify,
> > +  to read them, and to report which blocks of which files are
> > affected.
> > +
> > +- Phase 7 checks group (a), having validated everything else.
> > +
> > +Notice that the data dependencies between groups are enforced by the
> > structure
> > +of the program flow.
> > +
> > +Parallel Inode Scans
> > +--------------------
> > +
> > +An XFS filesystem can easily contain hundreds of millions of inodes.
> > +Given that XFS targets installations with large high-performance
> > storage,
> > +it is desirable to scrub inodes in parallel to minimize runtime,
> > particularly
> > +if the program has been invoked manually from a command line.
> > +This requires careful scheduling to keep the threads as evenly
> > loaded as
> > +possible.
> > +
> > +Early iterations of the ``xfs_scrub`` inode scanner naïvely created
> > a single
> > +workqueue and scheduled a single workqueue item per AG.
> > +Each workqueue item walked the inode btree (with
> > ``XFS_IOC_INUMBERS``) to find
> > +inode chunks and then called bulkstat (``XFS_IOC_BULKSTAT``) to
> > gather enough
> > +information to construct file handles.
> > +The file handle was then passed to a function to generate scrub
> > items for each
> > +metadata object of each inode.
> > +This simple algorithm leads to thread balancing problems in phase 3
> > if the
> > +filesystem contains one AG with a few large sparse files and the
> > rest of the
> > +AGs contain many smaller files.
> > +The inode scan dispatch function was not sufficiently granular; it
> > should have
> > +been dispatching at the level of individual inodes, or, to constrain
> > memory
> > +consumption, inode btree records.
> > +
> > +Thanks to Dave Chinner, bounded workqueues in userspace enable
> > ``xfs_scrub`` to
> > +avoid this problem with ease by adding a second workqueue.
> > +Just like before, the first workqueue is seeded with one workqueue
> > item per AG,
> > +and it uses INUMBERS to find inode btree chunks.
> > +The second workqueue, however, is configured with an upper bound on
> > the number
> > +of items that can be waiting to be run.
> > +Each inode btree chunk found by the first workqueue's workers is
> > queued to the
> > +second workqueue, and it is this second workqueue that queries
> > BULKSTAT,
> > +creates a file handle, and passes it to a function to generate scrub
> > items for
> > +each metadata object of each inode.
> > +If the second workqueue is too full, the workqueue add function
> > blocks the
> > +first workqueue's workers until the backlog eases.
> > +This doesn't completely solve the balancing problem, but reduces it
> > enough to
> > +move on to more pressing issues.
> > +
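
For anyone unfamiliar with the bounded-workqueue trick, the essential part
is just a counting hand-off between the two levels.  Something like the
fragment below, which is a condensed illustration of the idea rather than
the libfrog workqueue code:

#include <pthread.h>

/*
 * Sketch of the bounded hand-off between the two workqueues: INUMBERS
 * workers block in bounded_add() whenever max_items records are already
 * waiting for the BULKSTAT workers.
 */
struct bounded_queue {
	pthread_mutex_t	lock;
	pthread_cond_t	not_full;
	pthread_cond_t	not_empty;
	unsigned int	nr_items;
	unsigned int	max_items;
	/* storage for the queued inobt records omitted */
};

static struct bounded_queue iqueue = {
	.lock		= PTHREAD_MUTEX_INITIALIZER,
	.not_full	= PTHREAD_COND_INITIALIZER,
	.not_empty	= PTHREAD_COND_INITIALIZER,
	.max_items	= 64,
};

/* Called by the per-AG INUMBERS workers for each inobt record found. */
static void bounded_add(struct bounded_queue *q)
{
	pthread_mutex_lock(&q->lock);
	while (q->nr_items >= q->max_items)
		pthread_cond_wait(&q->not_full, &q->lock);
	q->nr_items++;
	pthread_cond_signal(&q->not_empty);
	pthread_mutex_unlock(&q->lock);
}

/* Called by the BULKSTAT workers that turn records into scrub items. */
static void bounded_remove(struct bounded_queue *q)
{
	pthread_mutex_lock(&q->lock);
	while (q->nr_items == 0)
		pthread_cond_wait(&q->not_empty, &q->lock);
	q->nr_items--;
	pthread_cond_signal(&q->not_full);
	pthread_mutex_unlock(&q->lock);
}
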
> > +The proposed patchsets are the scrub
> > +`performance tweaks
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.g
> > it/log/?h=scrub-performance-tweaks>`_
> > +and the
> > +`inode scan rebalance
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.g
> > it/log/?h=scrub-iscan-rebalance>`_
> > +series.
> > +
> > +.. _scrubrepair:
> > +
> > +Scheduling Repairs
> > +------------------
> > +
> > +During phase 2, corruptions and inconsistencies reported in any AGI
> > header or
> > +inode btree are repaired immediately, because phase 3 relies on
> > proper
> > +functioning of the inode indices to find inodes to scan.
> > +Failed repairs are rescheduled to phase 4.
> > +Problems reported in any other space metadata are deferred to phase
> > 4.
> > +Optimization opportunities are always deferred to phase 4, no matter
> > their
> > +origin.
> > +
> > +During phase 3, corruptions and inconsistencies reported in any part
> > of a
> > +file's metadata are repaired immediately if all space metadata were
> > validated
> > +during phase 2.
> > +Repairs that fail or cannot be repaired immediately are scheduled
> > for phase 4.
> > +
> > +In the original design of ``xfs_scrub``, it was thought that repairs
> > would be
> > +so infrequent that the ``struct xfs_scrub_metadata`` objects used to
> > +communicate with the kernel could also be used as the primary object
> > to
> > +schedule repairs.
> > +With recent increases in the number of optimizations possible for a
> > given
> > +filesystem object, it became much more memory-efficient to track all
> > eligible
> > +repairs for a given filesystem object with a single repair item.
> > +Each repair item represents a single lockable object -- AGs,
> > metadata files,
> > +individual inodes, or a class of summary information.
> > +
> > +Phase 4 is responsible for scheduling a lot of repair work in as
> > quick a
> > +manner as is practical.
> > +The :ref:`data dependencies <scrubcheck>` outlined earlier still
> > apply, which
> > +means that ``xfs_scrub`` must try to complete the repair work
> > scheduled by
> > +phase 2 before trying repair work scheduled by phase 3.
> > +The repair process is as follows:
> > +
> > +1. Start a round of repair with a workqueue and enough workers to
> > keep the CPUs
> > +   as busy as the user desires.
> > +
> > +   a. For each repair item queued by phase 2,
> > +
> > +      i.   Ask the kernel to repair everything listed in the repair
> > item for a
> > +           given filesystem object.
> > +
> > +      ii.  Make a note if the kernel made any progress in reducing
> > the number
> > +           of repairs needed for this object.
> > +
> > +      iii. If the object no longer requires repairs, revalidate all
> > metadata
> > +           associated with this object.
> > +           If the revalidation succeeds, drop the repair item.
> > +           If not, requeue the item for more repairs.
> > +
> > +   b. If any repairs were made, jump back to 1a to retry all the
> > phase 2 items.
> > +
> > +   c. For each repair item queued by phase 3,
> > +
> > +      i.   Ask the kernel to repair everything listed in the repair
> > item for a
> > +           given filesystem object.
> > +
> > +      ii.  Make a note if the kernel made any progress in reducing
> > the number
> > +           of repairs needed for this object.
> > +
> > +      iii. If the object no longer requires repairs, revalidate all
> > metadata
> > +           associated with this object.
> > +           If the revalidation succeeds, drop the repair item.
> > +           If not, requeue the item for more repairs.
> > +
> > +   d. If any repairs were made, jump back to 1c to retry all the
> > phase 3 items.
> > +
> > +2. If step 1 made any repair progress of any kind, jump back to step
> > 1 to start
> > +   another round of repair.
> > +
> > +3. If there are items left to repair, run them all serially one more
> > time.
> > +   Complain if the repairs were not successful, since this is the
> > last chance
> > +   to repair anything.
> > +
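As an aside, the retry behavior above boils down to the condensed C sketch
below.  The list type and the repair_list_run() helper are invented for
illustration; this is not the actual xfs_scrub phase 4 code.

#include <stdbool.h>

struct repair_item;                     /* one lockable object: AG, file, etc. */

struct repair_list {
        struct repair_item *items;
        unsigned int nr;
};

/* Ask the kernel to fix each item on the list; true if anything improved. */
bool repair_list_run(struct repair_list *list);

void
repair_everything(struct repair_list *phase2, struct repair_list *phase3)
{
        bool progress = true;

        while (progress) {
                progress = false;

                /* 1a-1b: retry the phase 2 (space metadata) items to exhaustion... */
                while (repair_list_run(phase2))
                        progress = true;

                /* 1c-1d: ...then the file metadata items queued by phase 3. */
                while (repair_list_run(phase3))
                        progress = true;
        }

        /* 3: one final serial pass; complain about whatever is still broken. */
        repair_list_run(phase2);
        repair_list_run(phase3);
}

The key property is that the phase 2 items are always retried to exhaustion
before the phase 3 items, which preserves the data dependency ordering
across retries.
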
> > +Corruptions and inconsistencies encountered during phases 5 and 7
> > are repaired
> > +immediately.
> > +Corrupt file data blocks reported by phase 6 cannot be recovered by
> > the
> > +filesystem.
> > +
> > +The proposed patchsets are the
> > +`repair warning improvements
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.g
> > it/log/?h=scrub-better-repair-warnings>`_,
> > +refactoring of the
> > +`repair data dependency
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.g
> > it/log/?h=scrub-repair-data-deps>`_
> > +and
> > +`object tracking
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.g
> > it/log/?h=scrub-object-tracking>`_,
> > +and the
> > +`repair scheduling
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.g
> > it/log/?h=scrub-repair-scheduling>`_
> > +improvement series.
> > +
> > +Checking Names for Confusable Unicode Sequences
> > +-----------------------------------------------
> > +
> > +If ``xfs_scrub`` succeeds in validating the filesystem metadata by
> > the end of
> > +phase 4, it moves on to phase 5, which checks for suspicious looking
> > names in
> > +the filesystem.
> > +These names consist of the filesystem label, names in directory
> > entries, and
> > +the names of extended attributes.
> > +Like most Unix filesystems, XFS imposes the sparest of constraints
> > on the
> > +contents of a name -- slashes and null bytes are not allowed in
> > directory
> > +entries; and null bytes are not allowed in extended attributes and
> maybe say "standard user accessible extended attributes"

"userspace visible"?

I'll list-ify this too:

Like most Unix filesystems, XFS imposes the sparest of constraints on
the contents of a name:

- slashes and null bytes are not allowed in directory entries;

- null bytes are not allowed in userspace-visible extended attributes;

- null bytes are not allowed in the filesystem label

> > the
> > +filesystem label.
> > +Directory entries and attribute keys store the length of the name
> > explicitly
> > +ondisk, which means that nulls are not name terminators.
> > +For this section, the term "naming domain" refers to any place where
> > names are
> > +presented together -- all the names in a directory, or all the
> > attributes of a
> > +file.
> > +
> > +Although the Unix naming constraints are very permissive, the
> > reality of most
> > +modern-day Linux systems is that programs work with Unicode
> > character code
> > +points to support international languages.
> > +These programs typically encode those code points in UTF-8 when
> > interfacing
> > +with the C library because the kernel expects null-terminated names.
> > +In the common case, therefore, names found in an XFS filesystem are
> > actually
> > +UTF-8 encoded Unicode data.
> > +
> > +To maximize its expressiveness, the Unicode standard defines
> > separate code
> > +points for various characters that render similarly or identically
> > in writing
> > +systems around the world.
> > +For example, the character "Cyrillic Small Letter A" U+0430 "а"
> > often renders
> > +identically to "Latin Small Letter A" U+0061 "a".
> 
> 
> > +
> > +The standard also permits characters to be constructed in multiple
> > ways --
> > +either by using a defined code point, or by combining one code point
> > with
> > +various combining marks.
> > +For example, the character "Angstrom Sign" U+212B "Å" can also be
> > expressed
> > +as "Latin Capital Letter A" U+0041 "A" followed by "Combining Ring
> > Above"
> > +U+030A "◌̊".
> > +Both sequences render identically.
> > +
> > +Like the standards that preceded it, Unicode also defines various
> > control
> > +characters to alter the presentation of text.
> > +For example, the character "Right-to-Left Override" U+202E can trick
> > some
> > +programs into rendering "moo\\xe2\\x80\\xaegnp.txt" as "mootxt.png".
> > +A second category of rendering problems involves whitespace
> > characters.
> > +If the character "Zero Width Space" U+200B is encountered in a file
> > name, the
> > +name will render identically to a name that does not have the zero
> > width
> > +space.
> > +
> > +If two names within a naming domain have different byte sequences
> > but render
> > +identically, a user may be confused by it.
> > +The kernel, in its indifference to upper level encoding schemes,
> > permits this.
> > +Most filesystem drivers persist the byte sequence names that are
> > given to them
> > +by the VFS.
> > +
> > +Techniques for detecting confusable names are explained in great
> > detail in
> > +sections 4 and 5 of the
> > +`Unicode Security Mechanisms <https://unicode.org/reports/tr39/>`_
> > +document.
> I don't know that we need this much detail on character rendering.  I
> think the example above is enough to make the point that character
> strings can differ in binary, but render the same, so we need to deal
> with that.  So I think that's really all the justification we need for
> the NFD usage

I want to leave the link in, because TR39 is the canonical source for
information about confusability detection.  That is the location where
the Unicode folks publish everything they currently know on the topic.

> > +``xfs_scrub``, when it detects UTF-8 encoding in use on a system,
> > uses the
> When ``xfs_scrub`` detects UTF-8 encoding, it uses the...

Changed, thanks.

> > +Unicode normalization form NFD in conjunction with the confusable
> > name
> > +detection component of
> > +`libicu <https://github.com/unicode-org/icu>`_
> > +to identify names with a directory or within a file's extended
> > attributes that
> > +could be confused for each other.
> > +Names are also checked for control characters, non-rendering
> > characters, and
> > +mixing of bidirectional characters.
> > +All of these potential issues are reported to the system
> > administrator during
> > +phase 5.
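
To make that concrete, a minimal sketch of a skeleton-based confusability
check using libicu's spoof checker might look like the following.  Error
handling is trimmed, the 256-byte buffers are arbitrary, and this is not
the actual scrub/unicrash.c code, which also covers the control character,
non-rendering, and bidirectional checks mentioned above:

#include <string.h>
#include <stdbool.h>
#include <unicode/uspoof.h>

/* Return true if two distinct names would render confusably alike. */
static bool
names_confusable(const char *name1, const char *name2)
{
        UErrorCode err = U_ZERO_ERROR;
        USpoofChecker *sc = uspoof_open(&err);
        char skel1[256], skel2[256];

        if (U_FAILURE(err))
                return false;
        uspoof_getSkeletonUTF8(sc, 0, name1, -1, skel1, sizeof(skel1), &err);
        uspoof_getSkeletonUTF8(sc, 0, name2, -1, skel2, sizeof(skel2), &err);
        uspoof_close(sc);

        return !U_FAILURE(err) && strcmp(name1, name2) != 0 &&
               strcmp(skel1, skel2) == 0;
}

Two names in the same naming domain that differ in bytes but share a
skeleton get reported as potentially confusable.
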
> > +
> > +Media Verification of File Data Extents
> > +---------------------------------------
> > +
> > +The system administrator can elect to initiate a media scan of all
> > file data
> > +blocks.
> > +This scan runs after validation of all filesystem metadata (except for
> > the summary
> > +counters) as phase 6.
> > +The scan starts by calling ``FS_IOC_GETFSMAP`` to scan the
> > filesystem space map
> > +to find areas that are allocated to file data fork extents.
> > +Gaps between data fork extents that are smaller than 64k are
> > treated as if
> > +they were data fork extents to reduce the command setup overhead.
> > +When the space map scan accumulates a region larger than 32MB, a
> > media
> > +verification request is sent to the disk as a directio read of the
> > raw block
> > +device.
> > +
> > +If the verification read fails, ``xfs_scrub`` retries with single-
> > block reads
> > +to narrow down the failure to the specific region of the media and
> > records it.
> > +When it has finished issuing verification requests, it again uses
> > the space
> > +mapping ioctl to map the recorded media errors back to metadata
> > structures
> > +and report what has been lost.
> > +For media errors in blocks owned by files, the lack of parent
> > pointers means
> > +that the entire filesystem must be walked to report the file paths
> > and offsets
> > +corresponding to the media error.
> > 
> This last bit will need to be updated after we come to a decision with
> the rfc

I'll at least update it since this doc is now pretty deep into the pptrs
stuff:

"For media errors in blocks owned by files, parent pointers can be used
to construct file paths from inode numbers for user-friendly reporting."
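
For context, the verification loop itself reduces to roughly the sketch
below.  Buffer alignment for O_DIRECT, the GETFSMAP walk, and the error
recording are all glossed over, and none of these names come from the real
scrub/read_verify.c:

#include <unistd.h>
#include <stdint.h>

#define VERIFY_IO_SIZE  (32U << 20)     /* 32MB verification requests */

/*
 * Read back [start, start + len) of the raw block device through an
 * O_DIRECT fd; on failure, reread one filesystem block at a time to find
 * the bad spots.  Returns the number of bad blocks found.
 */
static unsigned int
media_verify_region(int fd, void *buf, uint64_t start, uint64_t len,
                    unsigned int blocksize)
{
        unsigned int bad = 0;
        uint64_t pos;

        for (pos = start; pos < start + len; pos += VERIFY_IO_SIZE) {
                uint64_t count = VERIFY_IO_SIZE;

                if (count > start + len - pos)
                        count = start + len - pos;
                if (pread(fd, buf, count, pos) == (ssize_t)count)
                        continue;

                /* Narrow the failure down to specific blocks. */
                for (uint64_t p = pos; p < pos + count; p += blocksize) {
                        if (pread(fd, buf, blocksize, p) != (ssize_t)blocksize)
                                bad++;  /* record (p, blocksize) for reporting */
                }
        }
        return bad;
}
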

> Other than that, I think it looks pretty good.

Woot.

--D

> Allison
> 

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 14/14] xfs: document future directions of online fsck
  2023-03-01  5:37     ` Allison Henderson
@ 2023-03-02  0:39       ` Darrick J. Wong
  2023-03-03 23:51         ` Allison Henderson
  0 siblings, 1 reply; 220+ messages in thread
From: Darrick J. Wong @ 2023-03-02  0:39 UTC (permalink / raw)
  To: Allison Henderson
  Cc: Catherine Hoang, david, willy, linux-xfs, Chandan Babu,
	linux-fsdevel, hch

On Wed, Mar 01, 2023 at 05:37:19AM +0000, Allison Henderson wrote:
> On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Add the seventh and final chapter of the online fsck documentation,
> > where we talk about future functionality that can tie in with the
> > functionality provided by the online fsck patchset.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  .../filesystems/xfs-online-fsck-design.rst         |  155
> > ++++++++++++++++++++
> >  1 file changed, 155 insertions(+)
> > 
> > 
> > diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst
> > b/Documentation/filesystems/xfs-online-fsck-design.rst
> > index 05b9411fac7f..41291edb02b9 100644
> > --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> > +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> > @@ -4067,6 +4067,8 @@ The extra flexibility enables several new use
> > cases:
> >    (``FIEXCHANGE_RANGE``) to exchange the file contents, thereby
> > committing all
> >    of the updates to the original file, or none of them.
> >  
> > +.. _swapext_if_unchanged:
> > +
> >  - **Transactional file updates**: The same mechanism as above, but
> > the caller
> >    only wants the commit to occur if the original file's contents
> > have not
> >    changed.
> > @@ -4818,3 +4820,156 @@ and report what has been lost.
> >  For media errors in blocks owned by files, the lack of parent
> > pointers means
> >  that the entire filesystem must be walked to report the file paths
> > and offsets
> >  corresponding to the media error.
> > +
> > +7. Conclusion and Future Work
> > +=============================
> > +
> > +It is hoped that the reader of this document has followed the
> > designs laid out
> > +in this document and now has some familiarity with how XFS performs
> > online
> > +rebuilding of its metadata indices, and how filesystem users can
> > interact with
> > +that functionality.
> > +Although the scope of this work is daunting, it is hoped that this
> > guide will
> > +make it easier for code readers to understand what has been built,
> > for whom it
> > +has been built, and why.
> > +Please feel free to contact the XFS mailing list with questions.
> > +
> > +FIEXCHANGE_RANGE
> > +----------------
> > +
> > +As discussed earlier, a second frontend to the atomic extent swap
> > mechanism is
> > +a new ioctl call that userspace programs can use to commit updates
> > to files
> > +atomically.
> > +This frontend has been out for review for several years now, though
> > the
> > +necessary refinements to online repair and lack of customer demand
> > mean that
> > +the proposal has not been pushed very hard.

Note: The "Extent Swapping with Regular User Files" section has moved
here.

> > +Vectorized Scrub
> > +----------------
> > +
> > +As it turns out, the :ref:`refactoring <scrubrepair>` of repair
> > items mentioned
> > +earlier was a catalyst for enabling a vectorized scrub system call.
> > +Since 2018, the cost of making a kernel call has increased
> > considerably on some
> > +systems to mitigate the effects of speculative execution attacks.
> > +This incentivizes program authors to make as few system calls as
> > possible to
> > +reduce the number of times an execution path crosses a security
> > boundary.
> > +
> > +With vectorized scrub, userspace pushes to the kernel the identity
> > of a
> > +filesystem object, a list of scrub types to run against that object,
> > and a
> > +simple representation of the data dependencies between the selected
> > scrub
> > +types.
> > +The kernel executes as much of the caller's plan as it can until it
> > hits a
> > +dependency that cannot be satisfied due to a corruption, and tells
> > userspace
> > +how much was accomplished.
> > +It is hoped that ``io_uring`` will pick up enough of this
> > functionality that
> > +online fsck can use that instead of adding a separate vectored scrub
> > system
> > +call to XFS.
> > +
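Purely as an illustration of the idea -- the ABI has not been settled, so
none of the structures or field names below are real:

#include <stdint.h>

struct scrubv_entry {
        uint32_t scrub_type;    /* one XFS_SCRUB_TYPE_* value */
        uint32_t flags;         /* e.g. "repair", "skip if previous entry failed" */
        int32_t  result;        /* filled out by the kernel */
        int32_t  pad;
};

struct scrubv_request {
        uint64_t ino;           /* identity of the object to scrub... */
        uint32_t gen;           /* ...and its generation number */
        uint32_t nr_entries;
        struct scrubv_entry entries[];  /* order encodes the dependencies */
};

Ordering the entries array is one simple way to express the data
dependencies: the kernel stops at the first entry whose prerequisites
failed and reports how far it got in each result field.
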
> > +The relevant patchsets are the
> > +`kernel vectorized scrub
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > log/?h=vectorized-scrub>`_
> > +and
> > +`userspace vectorized scrub
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.g
> > it/log/?h=vectorized-scrub>`_
> > +series.
> > +
> > +Quality of Service Targets for Scrub
> > +------------------------------------
> > +
> > +One serious shortcoming of the online fsck code is that the amount
> > of time that
> > +it can spend in the kernel holding resource locks is basically
> > unbounded.
> > +Userspace is allowed to send a fatal signal to the process which
> > will cause
> > +``xfs_scrub`` to exit when it reaches a good stopping point, but
> > there's no way
> > +for userspace to provide a time budget to the kernel.
> > +Given that the scrub codebase has helpers to detect fatal signals,
> > it shouldn't
> > +be too much work to allow userspace to specify a timeout for a
> > scrub/repair
> > +operation and abort the operation if it exceeds budget.
> > +However, most repair functions have the property that once they
> > begin to touch
> > +ondisk metadata, the operation cannot be cancelled cleanly, after
> > which a QoS
> > +timeout is no longer useful.
> > +
> > +Defragmenting Free Space
> > +------------------------
> > +
> > +Over the years, many XFS users have requested the creation of a
> > program to
> > +clear a portion of the physical storage underlying a filesystem so
> > that it
> > +becomes a contiguous chunk of free space.
> > +Call this free space defragmenter ``clearspace`` for short.
> > +
> > +The first piece the ``clearspace`` program needs is the ability to
> > read the
> > +reverse mapping index from userspace.
> > +This already exists in the form of the ``FS_IOC_GETFSMAP`` ioctl.
> > +The second piece it needs is a new fallocate mode
> > +(``FALLOC_FL_MAP_FREE_SPACE``) that allocates the free space in a
> > region and
> > +maps it to a file.
> > +Call this file the "space collector" file.
> > +The third piece is the ability to force an online repair.
> > +
> > +To clear all the metadata out of a portion of physical storage,
> > clearspace
> > +uses the new fallocate map-freespace call to map any free space in
> > that region
> > +to the space collector file.
> > +Next, clearspace finds all metadata blocks in that region by way of
> > +``GETFSMAP`` and issues forced repair requests on the data
> > structure.
> > +This often results in the metadata being rebuilt somewhere that is
> > not being
> > +cleared.
> > +After each relocation, clearspace calls the "map free space"
> > function again to
> > +collect any newly freed space in the region being cleared.
> > +
> > +To clear all the file data out of a portion of the physical storage,
> > clearspace
> > +uses the FSMAP information to find relevant file data blocks.
> > +Having identified a good target, it uses the ``FICLONERANGE`` call
> > on that part
> > +of the file to try to share the physical space with a dummy file.
> > +Cloning the extent means that the original owners cannot overwrite
> > the
> > +contents; any changes will be written somewhere else via copy-on-
> > write.
> > +Clearspace makes its own copy of the frozen extent in an area that
> > is not being
> > +cleared, and uses ``FIDEDUPERANGE`` (or the :ref:`atomic extent swap
> > +<swapext_if_unchanged>` feature) to change the target file's data
> > extent
> > +mapping away from the area being cleared.
> > +When all other mappings have been moved, clearspace reflinks the
> > space into the
> > +space collector file so that it becomes unavailable.
> > +
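A rough sketch of how those pieces might fit together follows.  Every
helper here is hypothetical, and FALLOC_FL_MAP_FREE_SPACE is itself only a
proposed interface:

#include <stdint.h>

struct clearspace_ctx;

/* Hypothetical helpers: trap free space in the collector file, migrate
 * metadata via forced repair, and move file data via clone/dedupe. */
int collector_map_free_space(struct clearspace_ctx *ctx, uint64_t start,
                             uint64_t len);
int relocate_metadata(struct clearspace_ctx *ctx, uint64_t start,
                      uint64_t len);
int relocate_file_data(struct clearspace_ctx *ctx, uint64_t start,
                       uint64_t len);

static int
clear_physical_region(struct clearspace_ctx *ctx, uint64_t start, uint64_t len)
{
        int error;

        /* Trap existing free space in the region in the collector file. */
        error = collector_map_free_space(ctx, start, len);
        if (error)
                return error;

        /* Walk GETFSMAP; force repair so metadata is rebuilt elsewhere. */
        error = relocate_metadata(ctx, start, len);
        if (error)
                return error;

        /* Freeze data with FICLONERANGE, then move the mappings away. */
        error = relocate_file_data(ctx, start, len);
        if (error)
                return error;

        /* Catch anything freed along the way. */
        return collector_map_free_space(ctx, start, len);
}
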
> > +There are further optimizations that could apply to the above
> > algorithm.
> > +To clear a piece of physical storage that has a high sharing factor,
> > it is
> > +strongly desirable to retain this sharing factor.
> > +In fact, these extents should be moved first to maximize sharing
> > factor after
> > +the operation completes.
> > +To make this work smoothly, clearspace needs a new ioctl
> > +(``FS_IOC_GETREFCOUNTS``) to report reference count information to
> > userspace.
> > +With the refcount information exposed, clearspace can quickly find
> > the longest,
> > +most shared data extents in the filesystem, and target them first.
> > +
> 
> 
> > +**Question**: How might the filesystem move inode chunks?
> > +
> > +*Answer*: 
> "In order to move inode chunks.."

Done.

> > Dave Chinner has a prototype that creates a new file with the old
> > +contents and then locklessly runs around the filesystem updating
> > directory
> > +entries.
> > +The operation cannot complete if the filesystem goes down.
> > +That problem isn't totally insurmountable: create an inode remapping
> > table
> > +hidden behind a jump label, and a log item that tracks the kernel
> > walking the
> > +filesystem to update directory entries.
> > +The trouble is, the kernel can't do anything about open files, since
> > it cannot
> > +revoke them.
> > +
> 
> 
> > +**Question**: Can static keys be used to add a revoke bailout return
> > to
> > +*every* code path coming in from userspace?
> > +
> > +*Answer*: In principle, yes.
> > +This 
> 
> "It is also possible to use static keys to add a revoke bailout return
> to each code path coming in from userspace.  This..."

I think this change would make the answer redundant with the question.

"Can static keys be used to minimize the runtime cost of supporting
``revoke()`` on XFS files?"

"Yes.  Until the first revocation, the bailout code need not be in the
call path at all."

> > would eliminate the overhead of the check until a revocation happens.
> > +It's not clear what we do to a revoked file after all the callers
> > are finished
> > +with it, however.
> > +
> > +The relevant patchsets are the
> > +`kernel freespace defrag
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > log/?h=defrag-freespace>`_
> > +and
> > +`userspace freespace defrag
> > +<
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.g
> > it/log/?h=defrag-freespace>`_
> > +series.
> 
> I guess since they're just future ideas just light documentation is
> fine.  Other than cleaning out the Q & A's, I think it looks pretty
> good.

Ok.  Thank you x100000000 for being the first person to publicly comment
on the entire document!

--D

> Allison
> 
> > +
> > +Shrinking Filesystems
> > +---------------------
> > +
> > +Removing the end of the filesystem ought to be a simple matter of
> > evacuating
> > +the data and metadata at the end of the filesystem, and handing the
> > freed space
> > +to the shrink code.
> > +That requires an evacuation of the space at the end of the filesystem,
> > which is a
> > +use of free space defragmentation!
> > 
> 

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH v24.3 12/14] xfs: document directory tree repairs
  2023-03-02  0:14         ` Darrick J. Wong
@ 2023-03-03 23:50           ` Allison Henderson
  2023-03-04  2:19             ` Darrick J. Wong
  0 siblings, 1 reply; 220+ messages in thread
From: Allison Henderson @ 2023-03-03 23:50 UTC (permalink / raw)
  To: djwong
  Cc: david, Catherine Hoang, linux-fsdevel, hch, linux-xfs, willy,
	Chandan Babu

On Wed, 2023-03-01 at 16:14 -0800, Darrick J. Wong wrote:
> On Sat, Feb 25, 2023 at 07:33:23AM +0000, Allison Henderson wrote:
> > On Thu, 2023-02-02 at 18:12 -0800, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > Directory tree repairs are the least complete part of online
> > > fsck,
> > > due
> > > to the lack of directory parent pointers.  However, even without
> > > that
> > > feature, we can still make some corrections to the directory tree
> > > --
> > > we
> > > can salvage as many directory entries as we can from a damaged
> > > directory, and we can reattach orphaned inodes to the lost+found,
> > > just
> > > as xfs_repair does now.
> > > 
> > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > ---
> > > v24.2: updated with my latest thoughts about how to use parent
> > > pointers
> > > v24.3: updated to reflect the online fsck code I built for parent
> > > pointers
> > > ---
> > >  .../filesystems/xfs-online-fsck-design.rst         |  410
> > > ++++++++++++++++++++
> > >  1 file changed, 410 insertions(+)
> > > 
> > > diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst
> > > b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > index af7755fe0107..51d040e4a2d0 100644
> > > --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> > > +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > @@ -4359,3 +4359,413 @@ The proposed patchset is the
> > >  `extended attribute repair
> > >  <
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > > log/?h=repair-xattrs>`_
> > >  series.
> > > +
> > > +Fixing Directories
> > > +------------------
> > > +
> > > +Fixing directories is difficult with currently available
> > > filesystem
> > > features,
> > > +since directory entries are not redundant.
> > > +The offline repair tool scans all inodes to find files with
> > > nonzero
> > > link count,
> > > +and then it scans all directories to establish parentage of
> > > those
> > > linked files.
> > > +Damaged files and directories are zapped, and files with no
> > > parent
> > > are
> > > +moved to the ``/lost+found`` directory.
> > > +It does not try to salvage anything.
> > > +
> > > +The best that online repair can do at this time is to read
> > > directory
> > > data
> > > +blocks and salvage any dirents that look plausible, correct link
> > > counts, and
> > > +move orphans back into the directory tree.
> > > +The salvage process is discussed in the case study at the end of
> > > this section.
> > > +The :ref:`file link count fsck <nlinks>` code takes care of
> > > fixing
> > > link counts
> > > +and moving orphans to the ``/lost+found`` directory.
> > > +
> > > +Case Study: Salvaging Directories
> > > +`````````````````````````````````
> > > +
> > > +Unlike extended attributes, directory blocks are all the same
> > > size,
> > > so
> > > +salvaging directories is straightforward:
> > > +
> > > +1. Find the parent of the directory.
> > > +   If the dotdot entry is readable, try to confirm that
> > > the
> > > alleged
> > > +   parent has a child entry pointing back to the directory being
> > > repaired.
> > > +   Otherwise, walk the filesystem to find it.
> > > +
> > > +2. Walk the first partition of the data fork of the directory to
> > > find
> > > the directory
> > > +   entry data blocks.
> > > +   When one is found,
> > > +
> > > +   a. Walk the directory data block to find candidate entries.
> > > +      When an entry is found:
> > > +
> > > +      i. Check the name for problems, and ignore the name if
> > > there
> > > are any.
> > > +
> > > +      ii. Retrieve the inumber and grab the inode.
> > > +          If that succeeds, add the name, inode number, and file
> > > type to the
> > > +          staging xfarray and xfblob.
> > > +
> > > +3. If the memory usage of the xfarray and xfblob exceeds a
> > > certain
> > > amount of
> > > +   memory or there are no more directory data blocks to examine,
> > > unlock the
> > > +   directory and add the staged dirents into the temporary
> > > directory.
> > > +   Truncate the staging files.
> > > +
> > > +4. Use atomic extent swapping to exchange the new and old
> > > directory
> > > structures.
> > > +   The old directory blocks are now attached to the temporary
> > > file.
> > > +
> > > +5. Reap the temporary file.
> > > +
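Condensed into code, the salvage loop looks roughly like the sketch below;
the context structure and helpers are stand-ins invented for illustration,
not the real kernel functions:

#include <stdbool.h>
#include <stdint.h>

struct salvage_ctx;
struct salvage_dirent { uint64_t ino; uint8_t ftype; const char *name; };

/* All of these helpers are invented for the sketch. */
bool next_candidate_dirent(struct salvage_ctx *sc, struct salvage_dirent *de);
bool dirent_looks_ok(struct salvage_ctx *sc, const struct salvage_dirent *de);
void stash_dirent(struct salvage_ctx *sc, const struct salvage_dirent *de);
bool stash_is_full(struct salvage_ctx *sc);
int  flush_stash_to_tempdir(struct salvage_ctx *sc);
int  swap_and_reap(struct salvage_ctx *sc);

int
salvage_directory(struct salvage_ctx *sc)
{
        struct salvage_dirent de;

        /* Steps 2-3: walk the data blocks, stashing plausible entries. */
        while (next_candidate_dirent(sc, &de)) {
                if (!dirent_looks_ok(sc, &de))
                        continue;
                stash_dirent(sc, &de);          /* xfarray + xfblob */
                if (stash_is_full(sc))
                        flush_stash_to_tempdir(sc);
        }
        flush_stash_to_tempdir(sc);

        /* Steps 4-5: atomic extent swap, then reap the old blocks. */
        return swap_and_reap(sc);
}
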
> > 
> > 
> > 
> > > +**Future Work Question**: Should repair revalidate the dentry
> > > cache
> > > when
> > > +rebuilding a directory?
> > > +
> > > +*Answer*: Yes, though the current dentry cache code doesn't
> > > provide
> > > a means
> > > +to walk every dentry of a specific directory.
> > > +If the cache contains an entry that the salvaging code does not
> > > find, the
> > > +repair cannot proceed.
> > > +
> > > +**Future Work Question**: Can the dentry cache know about a
> > > directory entry
> > > +that cannot be salvaged?
> > > +
> > > +*Answer*: In theory, the dentry cache should be a subset of the
> > > directory
> > > +entries on disk because there's no way to load a dentry without
> > > having
> > > +something to read in the directory.
> > > +However, it is possible for a coherency problem to be introduced
> > > if
> > > the ondisk
> > > +structures become corrupt *after* the cache loads.
> > > +In theory it is necessary to scan all dentry cache entries for a
> > > directory to
> > > +ensure that one of the following applies:
> > 
> > "Currently the dentry cache code doesn't provide a means to walk
> > every
> > dentry of a specific directory.  This makes validation of the
> > rebuilt
> > directory difficult, and it is possible that an ondisk structure to
> > become corrupt *after* the cache loads.  Walking the dentry cache
> > is
> > currently being considered as a future improvement.  This will also
> > enable the ability to report which entries were not salvageable
> > since
> > these will be the subset of entries that are absent after the walk.
> > This improvement will ensure that one of the following apply:"
> 
> The thing is -- I'm not considering restructuring the dentry cache. 
> The
> cache key is a one-way hash function of the parent_ino and the dirent
> name, and I can't even imagine how one would support using that for
> arbitrary lookups or walks.
> 
> This is the giant hole in all of the online repair code -- the design
> of
> the dentry cache is such that we can't invalidate the entire cache. 
> We
> also cannot walk it to perform targeted invalidation of just the
> pieces
> we want.  If after a repair the cache contains a dentry that isn't
> backed by an actual ondisk directory entry ... kaboom.
> 
> The one thing I'll grant you is that I don't think it's likely that
> the
> dentry cache will get populated with some information and later the
> ondisk directory bitrots undetectably.
> 
> > ?
> > 
> > I just think it reads cleaner.  I realize this is an area that
> > still
> > sort of in flux, but definitely before we call the document done we
> > should probably strip out the Q's and just document the A's.  If
> > someone re-raises the Q's we can always refer to the archives and
> > then
> > have the discussion on the mailing list.  But I think the document
> > should maintain the goal of making clear whatever the current plan
> > is
> > just to keep it reading cleanly. 
> 
> Yeah, I'll shorten this section so that it only mentions these things
> once and clearly states that I have no solution.
I see, yes I got the impression from the original phrasing that it was
an intended "todo", so clarifying that it's not should help.

> 
> > > +
> > > +1. The cached dentry reflects an ondisk dirent in the new
> > > directory.
> > > +
> > > +2. The cached dentry no longer has a corresponding ondisk dirent
> > > in
> > > the new
> > > +   directory and the dentry can be purged from the cache.
> > > +
> > > +3. The cached dentry no longer has an ondisk dirent but the
> > > dentry
> > > cannot be
> > > +   purged.
> > 
> > > +   This is bad.
> > These entries are irrecoverable, but can now be reported.
> > 
> > 
> > 
> > > +
> > > +As mentioned above, the dentry cache does not have a means to
> > > walk
> > > all the
> > > +dentries with a particular directory as a parent.
> > > +This makes detecting situations #2 and #3 impossible, and
> > > remains an
> > > +interesting question for research.
> > I think the above paraphrase makes this last bit redundant.
> 
> N
Not sure if this is "no" or an unfinished thought?

> 
> > > +
> > > +The proposed patchset is the
> > > +`directory repair
> > > +<
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > > log/?h=repair-dirs>`_
> > > +series.
> > > +
> > > +Parent Pointers
> > > +```````````````
> > > +
> > "Generally speaking, a parent pointer is any kind of metadata that
> > enables an inode to locate its parent with out having to traverse
> > the
> > directory tree from the root."
> > 
> > > +The lack of secondary directory metadata hinders directory tree
> > "Without them, the lack of secondary..." 
> 
> Ok.  I want to reword the first sentence slightly, yielding this:
> 
> "A parent pointer is a piece of file metadata that enables a user to
> locate the file's parent directory without having to traverse the
> directory tree from the root.  Without them, reconstruction of
> directory
> trees is hindered in much the same way that the historic lack of
> reverse
> space mapping information once hindered reconstruction of filesystem
> space metadata.  The parent pointer feature, however, makes total
> directory reconstruction
> possible."

Alrighty, that sounds good

> 
> But that's a much better start to the paragraph, thank you.
> 
> > > reconstruction
> > > +in much the same way that the historic lack of reverse space
> > > mapping
> > > +information once hindered reconstruction of filesystem space
> > > metadata.
> > > +The parent pointer feature, however, makes total directory
> > > reconstruction
> > > +possible.
> > > +
> > 
> > History side bar the below chunk...
> 
> Done.
> 
> > > +Directory parent pointers were first proposed as an XFS feature
> > > more
> > > than a
> > > +decade ago by SGI.
> > > +Each link from a parent directory to a child file is mirrored
> > > with
> > > an extended
> > > +attribute in the child that could be used to identify the parent
> > > directory.
> > > +Unfortunately, this early implementation had major shortcomings
> > > and
> > > was never
> > > +merged into Linux XFS:
> > > +
> > > +1. The XFS codebase of the late 2000s did not have the
> > > infrastructure to
> > > +   enforce strong referential integrity in the directory tree.
> > > +   It did not guarantee that a change in a forward link would
> > > always
> > > be
> > > +   followed up with the corresponding change to the reverse
> > > links.
> > > +
> > > +2. Referential integrity was not integrated into offline repair.
> > > +   Checking and repairs were performed on mounted filesystems
> > > without taking
> > > +   any kernel or inode locks to coordinate access.
> > > +   It is not clear how this actually worked properly.
> > > +
> > > +3. The extended attribute did not record the name of the
> > > directory
> > > entry in the
> > > +   parent, so the SGI parent pointer implementation cannot be
> > > used
> > > to reconnect
> > > +   the directory tree.
> > > +
> > > +4. Extended attribute forks only support 65,536 extents, which
> > > means
> > > that
> > > +   parent pointer attribute creation is likely to fail at some
> > > point
> > > before the
> > > +   maximum file link count is achieved.
> > 
> > 
> > "The original parent pointer design was too unstable for something
> > like
> > a file system repair to depend on."
> 
> Er... I think this is addressed by #2 above?
Sorry, I meant for the history side bar to go through the list, and
then add that quotation to connect the paragraphs.  In a way, simply
talking about the new improvements below implies everything that the
old design lacked.

> 
> > > +
> > > +Allison Henderson, Chandan Babu, and Catherine Hoang are working
> > > on
> > > a second
> > > +implementation that solves all shortcomings of the first.
> > > +During 2022, Allison introduced log intent items to track
> > > physical
> > > +manipulations of the extended attribute structures.
> > > +This solves the referential integrity problem by making it
> > > possible
> > > to commit
> > > +a dirent update and a parent pointer update in the same
> > > transaction.
> > > +Chandan increased the maximum extent counts of both data and
> > > attribute forks,
> > 
> > > +thereby addressing the fourth problem.
> > which ensures the parent pointer creation will succeed even if the
> > max
> > extent count is reached.
> 
> The max extent count cannot be exceeded, but the nrext64 feature
> ensures
> that the xattr structure can grow enough to handle maximal
> hardlinking.
> 
> "Chandan increased the maximum extent counts of both data and
> attribute
> forks, thereby ensuring that the extended attribute structure can
> grow
> to handle the maximum hardlink count of any file."

Ok, sounds good.

> 
> > > +
> > > +To solve the third problem, parent pointers include the dirent
> > > name
> > "Lastly, the new design includes the dirent name..."
> 
> <nod>
> 
> > > and
> > > +location of the entry within the parent directory.
> > > +In other words, child files use extended attributes to store
> > > pointers to
> > > +parents in the form ``(parent_inum, parent_gen, dirent_pos) →
> > > (dirent_name)``.
> > This parts still in flux, so probably this will have to get updated
> > later...
> 
> Yep, I'll add a note about that.
> 
> > > +
> > > +On a filesystem with parent pointers, the directory checking
> > > process
> > > can be
> > > +strengthened to ensure that the target of each dirent also
> > > contains
> > > a parent
> > > +pointer pointing back to the dirent.
> > > +Likewise, each parent pointer can be checked by ensuring that
> > > the
> > > target of
> > > +each parent pointer is a directory and that it contains a dirent
> > > matching
> > > +the parent pointer.
> > > +Both online and offline repair can use this strategy.
> 
> I moved this paragraph up to become the second paragraph, and now it
> reads:
> 
> "XFS parent pointers include the dirent name and location of the
> entry
> within the parent directory.  In other words, child files use
> extended
> attributes to store pointers to parents in the form ``(parent_inum,
> parent_gen, dirent_pos) → (dirent_name)``.  The directory checking
> process can be strengthened to ensure that the target of each dirent
> also contains a parent pointer pointing back to the dirent. 
> Likewise,
> each parent pointer can be checked by ensuring that the target of
> each
> parent pointer is a directory and that it contains a dirent matching
> the
> parent pointer.  Both online and offline repair can use this
> strategy.
> 
> Note: The ondisk format of parent pointers is not yet finalized."
> 
> After which comes the historical sidebar.
Alrighty, I think that's fine for now

> 
> > > +
> > > +Case Study: Repairing Directories with Parent Pointers
> > > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > +
> > > +Directory rebuilding uses a :ref:`coordinated inode scan
> > > <iscan>`
> > > and
> > > +a :ref:`directory entry live update hook <liveupdate>` as
> > > follows:
> > > +
> > > +1. Set up a temporary directory for generating the new directory
> > > structure,
> > > +   an xfblob for storing entry names, and an xfarray for
> > > stashing
> > > directory
> > > +   updates.
> > > +
> > > +2. Set up an inode scanner and hook into the directory entry
> > > code to
> > > receive
> > > +   updates on directory operations.
> > > +
> > > +3. For each parent pointer found in each file scanned, decide if
> > > the
> > > parent
> > > +   pointer references the directory of interest.
> > > +   If so:
> > > +
> > > +   a. Stash an addname entry for this dirent in the xfarray for
> > > later.
> > > +
> > > +   b. When finished scanning that file, flush the stashed
> > > updates to
> > > the
> > > +      temporary directory.
> > > +
> > > +4. For each live directory update received via the hook, decide
> > > if
> > > the child
> > > +   has already been scanned.
> > > +   If so:
> > > +
> > > +   a. Stash an addname or removename entry for this dirent
> > > update in
> > > the
> > > +      xfarray for later.
> > > +      We cannot write directly to the temporary directory
> > > because
> > > hook
> > > +      functions are not allowed to modify filesystem metadata.
> > > +      Instead, we stash updates in the xfarray and rely on the
> > > scanner thread
> > > +      to apply the stashed updates to the temporary directory.
> > > +
> > > +5. When the scan is complete, atomically swap the contents of
> > > the
> > > temporary
> > > +   directory and the directory being repaired.
> > > +   The temporary directory now contains the damaged directory
> > > structure.
> > > +
> > > +6. Reap the temporary directory.
> > > +
> > > +7. Update the dirent position field of parent pointers as
> > > necessary.
> > > +   This may require the queuing of a substantial number of xattr
> > > log
> > > intent
> > > +   items.
> > > +
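The "stash now, apply later" rule in steps 3 and 4 might look roughly like
this; the helpers are invented stand-ins rather than the actual
xfarray/xfblob interfaces:

#include <stdint.h>

struct rebuild_ctx;
enum stash_op { STASH_ADDNAME, STASH_REMOVENAME };

/* Hypothetical stash helpers standing in for the xfarray/xfblob calls. */
uint64_t stash_name(struct rebuild_ctx *rc, const char *name, uint32_t len);
void stash_append(struct rebuild_ctx *rc, enum stash_op op,
                  uint64_t child_ino, uint64_t name_cookie);

/* Called from the dirent update hook; must not touch ondisk metadata. */
void
dirent_hook(struct rebuild_ctx *rc, enum stash_op op, uint64_t child_ino,
            const char *name, uint32_t namelen)
{
        uint64_t cookie = stash_name(rc, name, namelen);

        stash_append(rc, op, child_ino, cookie);
        /* The scanner thread later replays these into the temp directory. */
}

Only the scanner thread, which runs in normal repair context, drains the
stash into the temporary directory.
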
> > > +The proposed patchset is the
> > > +`parent pointers directory repair
> > > +<
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > > log/?h=pptrs-online-dir-repair>`_
> > > +series.
> > > +
> > > +**Unresolved Question**: How will repair ensure that the
> > > ``dirent_pos`` fields
> > > +match in the reconstructed directory?
> > > +
> > > +*Answer*: There are a few ways to solve this problem:
> > > +
> > > +1. The field could be designated advisory, since the other three
> > > values are
> > > +   sufficient to find the entry in the parent.
> > > +   However, this makes indexed key lookup impossible while
> > > repairs
> > > are ongoing.
> > > +
> > > +2. We could allow creating directory entries at specified
> > > offsets,
> > > which solves
> > > +   the referential integrity problem but runs the risk that
> > > dirent
> > > creation
> > > +   will fail due to conflicts with the free space in the
> > > directory.
> > > +
> > > +   These conflicts could be resolved by appending the directory
> > > entry and
> > > +   amending the xattr code to support updating an xattr key and
> > > reindexing the
> > > +   dabtree, though this would have to be performed with the
> > > parent
> > > directory
> > > +   still locked.
> > > +
> > > +3. Same as above, but remove the old parent pointer entry and
> > > add a
> > > new one
> > > +   atomically.
> > > +
> > > +4. Change the ondisk xattr format to ``(parent_inum, name) →
> > > (parent_gen)``,
> > > +   which would provide the attr name uniqueness that we require,
> > > without
> > > +   forcing repair code to update the dirent position.
> > > +   Unfortunately, this requires changes to the xattr code to
> > > support
> > > attr
> > > +   names as long as 263 bytes.
> > > +
> > > +5. Change the ondisk xattr format to ``(parent_inum, hash(name))
> > > →
> > > +   (name, parent_gen)``.
> > > +   If the hash is sufficiently resistant to collisions (e.g.
> > > sha256)
> > > then
> > > +   this should provide the attr name uniqueness that we require.
> > > +   Names shorter than 247 bytes could be stored directly.
> > I think the RFC deluge is the same question but more context, so
> > probably this section will follow what we decide there.  I will
> > save
> > commentary to keep the discussion in the same thread...
> > 
> > I'll just link it here for anyone else following this for now...
> > https://www.spinics.net/lists/linux-xfs/msg69397.html
> 
> Yes, the deluge has much more detailed information.  I'll add this
> link
> (for now) to the doc.
> 
> > > +
> > > +Case Study: Repairing Parent Pointers
> > > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > +
> > > +Online reconstruction of a file's parent pointer information
> > > works
> > > similarly to
> > > +directory reconstruction:
> > > +
> > > +1. Set up a temporary file for generating a new extended
> > > attribute
> > > structure,
> > > +   an xfblob for storing parent pointer names, and an xfarray
> > > for
> > > stashing
> > > +   parent pointer updates.
> > we did talk about blobs in patch 6 though it took me a moment to
> > remember... if there's a way to link or tag it, that would be
> > helpful
> > for a quick refresh.  kinda like wikipedia hyperlinks, you
> > really only need like the first line or two to get it to snap back
> 
> There is; I'll put in a backreference.
> 
> > > +
> > > +2. Set up an inode scanner and hook into the directory entry
> > > code to
> > > receive
> > > +   updates on directory operations.
> > > +
> > > +3. For each directory entry found in each directory scanned,
> > > decide
> > > if the
> > > +   dirent references the file of interest.
> > > +   If so:
> > > +
> > > +   a. Stash an addpptr entry for this parent pointer in the
> > > xfblob
> > > and xfarray
> > > +      for later.
> > > +
> > > +   b. When finished scanning the directory, flush the stashed
> > > updates to the
> > > +      temporary directory.
> > > +
> > > +4. For each live directory update received via the hook, decide
> > > if
> > > the parent
> > > +   has already been scanned.
> > > +   If so:
> > > +
> > > +   a. Stash an addpptr or removepptr entry for this dirent
> > > update in
> > > the
> > > +      xfarray for later.
> > > +      We cannot write parent pointers directly to the temporary
> > > file
> > > because
> > > +      hook functions are not allowed to modify filesystem
> > > metadata.
> > > +      Instead, we stash updates in the xfarray and rely on the
> > > scanner thread
> > > +      to apply the stashed parent pointer updates to the
> > > temporary
> > > file.
> > > +
> > > +5. Copy all non-parent pointer extended attributes to the
> > > temporary
> > > file.
> > > +
> > > +6. When the scan is complete, atomically swap the attribute fork
> > > of
> > > the
> > > +   temporary file and the file being repaired.
> > > +   The temporary file now contains the damaged extended
> > > attribute
> > > structure.
> > > +
> > > +7. Reap the temporary file.
> > Seems like it should work
> 
> Let's hope so!
> 
> > > +
> > > +The proposed patchset is the
> > > +`parent pointers repair
> > > +<
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > > log/?h=pptrs-online-parent-repair>`_
> > > +series.
> > > +
> > > +Digression: Offline Checking of Parent Pointers
> > > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > +
> > > +Examining parent pointers in offline repair works differently
> > > because corrupt
> > > +files are erased long before directory tree connectivity checks
> > > are
> > > performed.
> > > +Parent pointer checks are therefore a second pass to be added to
> > > the
> > > existing
> > > +connectivity checks:
> > > +
> > > +1. After the set of surviving files has been established (i.e.
> > > phase
> > > 6),
> > > +   walk the surviving directories of each AG in the filesystem.
> > > +   This is already performed as part of the connectivity checks.
> > > +
> > > +2. For each directory entry found, record the name in an xfblob,
> > > and
> > > store
> > > +   ``(child_ag_inum, parent_inum, parent_gen, dirent_pos)``
> > > tuples
> > > in a
> > > +   per-AG in-memory slab.
> > > +
> > > +3. For each AG in the filesystem,
> > > +
> > > +   a. Sort the per-AG tuples in order of child_ag_inum,
> > > parent_inum,
> > > and
> > > +      dirent_pos.
> > > +
> > > +   b. For each inode in the AG,
> > > +
> > > +      1. Scan the inode for parent pointers.
> > > +         Record the names in a per-file xfblob, and store
> > > ``(parent_inum,
> > > +         parent_gen, dirent_pos)`` tuples in a per-file slab.
> > > +
> > > +      2. Sort the per-file tuples in order of parent_inum, and
> > > dirent_pos.
> > > +
> > > +      3. Position one slab cursor at the start of the inode's
> > > records in the
> > > +         per-AG tuple slab.
> > > +         This should be trivial since the per-AG tuples are in
> > > child
> > > inumber
> > > +         order.
> > > +
> > > +      4. Position a second slab cursor at the start of the per-
> > > file
> > > tuple slab.
> > > +
> > > +      5. Iterate the two cursors in lockstep, comparing the
> > > parent_ino and
> > > +         dirent_pos fields of the records under each cursor.
> > > +
> > > +         a. Tuples in the per-AG list but not the per-file list
> > > are
> > > missing and
> > > +            need to be written to the inode.
> > > +
> > > +         b. Tuples in the per-file list but not the per-AG list
> > > are
> > > dangling
> > > +            and need to be removed from the inode.
> > > +
> > > +         c. For tuples in both lists, update the parent_gen and
> > > name
> > > components
> > > +            of the parent pointer if necessary.
> > > +
> > > +4. Move on to examining link counts, as we do today.
> > > +
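The lockstep comparison in step 3.b.5 is easier to see as code.  The tuple
type and the three repair actions below are simplified stand-ins for the
slab cursors and xfs_repair helpers, not the real implementation:

#include <stdint.h>
#include <stddef.h>

struct pptr_tuple {
        uint64_t parent_ino;
        uint32_t parent_gen;
        uint32_t dirent_pos;
};

/* Hypothetical repair actions for the three outcomes. */
void add_missing_pptr(const struct pptr_tuple *t);
void remove_dangling_pptr(const struct pptr_tuple *t);
void update_pptr_if_needed(const struct pptr_tuple *want,
                           const struct pptr_tuple *have);

static int
tuple_cmp(const struct pptr_tuple *a, const struct pptr_tuple *b)
{
        if (a->parent_ino != b->parent_ino)
                return a->parent_ino < b->parent_ino ? -1 : 1;
        if (a->dirent_pos != b->dirent_pos)
                return a->dirent_pos < b->dirent_pos ? -1 : 1;
        return 0;
}

static void
compare_pptrs(const struct pptr_tuple *ag, size_t nr_ag,
              const struct pptr_tuple *file, size_t nr_file)
{
        size_t i = 0, j = 0;

        while (i < nr_ag && j < nr_file) {
                int cmp = tuple_cmp(&ag[i], &file[j]);

                if (cmp < 0)
                        add_missing_pptr(&ag[i++]);             /* 5a */
                else if (cmp > 0)
                        remove_dangling_pptr(&file[j++]);       /* 5b */
                else {
                        update_pptr_if_needed(&ag[i], &file[j]); /* 5c */
                        i++;
                        j++;
                }
        }
        while (i < nr_ag)
                add_missing_pptr(&ag[i++]);
        while (j < nr_file)
                remove_dangling_pptr(&file[j++]);
}
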
> > > +The proposed patchset is the
> > > +`offline parent pointers repair
> > > +<
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.g
> > > it/log/?h=pptrs-repair>`_
> > > +series.
> > > +
> > > +Rebuilding directories from parent pointers in offline repair is
> > > very
> > > +challenging because it currently uses a single-pass scan of the
> > > filesystem
> > > +during phase 3 to decide which files are corrupt enough to be
> > > zapped.
> > > +This scan would have to be converted into a multi-pass scan:
> > > +
> > > +1. The first pass of the scan zaps corrupt inodes, forks, and
> > > attributes
> > > +   much as it does now.
> > > +   Corrupt directories are noted but not zapped.
> > > +
> > > +2. The next pass records parent pointers pointing to the
> > > directories
> > > noted
> > > +   as being corrupt in the first pass.
> > > +   This second pass may have to happen after the phase 4 scan
> > > for
> > > duplicate
> > > +   blocks, if phase 4 is also capable of zapping directories.
> > > +
> > > +3. The third pass resets corrupt directories to an empty
> > > shortform
> > > directory.
> > > +   Free space metadata has not been ensured yet, so repair
> > > cannot
> > > yet use the
> > > +   directory building code in libxfs.
> > > +
> > > +4. At the start of phase 6, space metadata have been rebuilt.
> > > +   Use the parent pointer information recorded during step 2 to
> > > reconstruct
> > > +   the dirents and add them to the now-empty directories.
> > > +
> > > +This code has not yet been constructed.
> > > +
> > > +.. _orphanage:
> > > +
> > > +The Orphanage
> > > +-------------
> > > +
> > > +Filesystems present files as a directed, and hopefully acyclic,
> > > graph.
> > > +In other words, a tree.
> > > +The root of the filesystem is a directory, and each entry in a
> > > directory points
> > > +downwards either to more subdirectories or to non-directory
> > > files.
> > > +Unfortunately, a disruption in the directory graph pointers
> > > results
> > > in a
> > > +disconnected graph, which makes files impossible to access via
> > > regular path
> > > +resolution.
> > > +The directory parent pointer online scrub code can detect a
> > > dotdot
> > > entry
> > > +pointing to a parent directory that doesn't have a link back to
> > > the
> > > child
> > > +directory, and the file link count checker can detect a file
> > > that
> > > isn't pointed
> > > +to by any directory in the filesystem.
> > > +If the file in question has a positive link count, the file is an
> > > +orphan.
> > 
> > Hmm, I kinda felt like this should have flowed into something like:
> > "now that we have parent pointers, we can reparent them instead of
> > putting them in the orphanage..."
> 
> That's only true if we actually *find* the relevant forward or back
> pointers.  If a file has positive link count but there aren't any
> links
> to it from anywhere, we still have to dump it in the /lost+found.
> 
> Parent pointers make it a lot less likely that we'll have to put a
> file
> in the /lost+found, but it's still possible.
> 
> I think I'll change this paragraph to start:
> 
> "Without parent pointers, the directory parent pointer online scrub
> code
> can detect a dotdot entry pointing to a parent directory..."
> 
> and then add a new paragraph:
> 
> "With parent pointers, directories can be rebuilt by scanning parent
> pointers and parent pointers can be rebuilt by scanning directories.
> This should reduce the incidence of files ending up in
> ``/lost+found``."
I see, ok i think that sounds good then.

Allison
> 
> > ?
> > > +
> > > +When orphans are found, they should be reconnected to the
> > > directory
> > > tree.
> > > +Offline fsck solves the problem by creating a directory
> > > ``/lost+found`` to
> > > +serve as an orphanage, and linking orphan files into the
> > > orphanage
> > > by using the
> > > +inumber as the name.
> > > +Reparenting a file to the orphanage does not reset any of its
> > > permissions or
> > > +ACLs.
> > > +
> > > +This process is more involved in the kernel than it is in
> > > userspace.
> > > +The directory and file link count repair setup functions must
> > > use
> > > the regular
> > > +VFS mechanisms to create the orphanage directory with all the
> > > necessary
> > > +security attributes and dentry cache entries, just like a
> > > regular
> > > directory
> > > +tree modification.
> > > +
> > > +Orphaned files are adopted by the orphanage as follows:
> > > +
> > > +1. Call ``xrep_orphanage_try_create`` at the start of the scrub
> > > setup function
> > > +   to try to ensure that the lost and found directory actually
> > > exists.
> > > +   This also attaches the orphanage directory to the scrub
> > > context.
> > > +
> > > +2. If the decision is made to reconnect a file, take the IOLOCK
> > > of
> > > both the
> > > +   orphanage and the file being reattached.
> > > +   The ``xrep_orphanage_iolock_two`` function follows the inode
> > > locking
> > > +   strategy discussed earlier.
> > > +
> > > +3. Call ``xrep_orphanage_compute_blkres`` and
> > > ``xrep_orphanage_compute_name``
> > > +   to compute the new name in the orphanage and the block
> > > reservation required.
> > > +
> > > +4. Use ``xrep_orphanage_adoption_prep`` to reserve resources to
> > > the
> > > repair
> > > +   transaction.
> > > +
> > > +5. Call ``xrep_orphanage_adopt`` to reparent the orphaned file
> > > into
> > > the lost
> > > +   and found, and update the kernel dentry cache.
> > > +
> > > +The proposed patches are in the
> > > +`orphanage adoption
> > > +<
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > > log/?h=repair-orphanage>`_
> > > +series.
> > 
> > Certainly we'll need to come back and update all the parts that
> > would
> > be affected by the RFC, but otherwise looks ok.  It seems trying to
> > document code before it's written tends to cause things to go
> > around
> > for a while, since we really just cant know how stable a design is
> > until it's been through at least a few prototypes.
> 
> Agreed!
> 
> --D
> 
> > Allison


^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 13/14] xfs: document the userspace fsck driver program
  2023-03-02  0:27       ` Darrick J. Wong
@ 2023-03-03 23:51         ` Allison Henderson
  2023-03-04  2:25           ` Darrick J. Wong
  0 siblings, 1 reply; 220+ messages in thread
From: Allison Henderson @ 2023-03-03 23:51 UTC (permalink / raw)
  To: djwong
  Cc: Catherine Hoang, david, willy, linux-xfs, Chandan Babu,
	linux-fsdevel, hch

On Wed, 2023-03-01 at 16:27 -0800, Darrick J. Wong wrote:
> On Wed, Mar 01, 2023 at 05:36:59AM +0000, Allison Henderson wrote:
> > On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > Add the sixth chapter of the online fsck design documentation,
> > > where
> > > we discuss the details of the data structures and algorithms used
> > > by
> > > the
> > > driver program xfs_scrub.
> > > 
> > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > ---
> > >  .../filesystems/xfs-online-fsck-design.rst         |  313
> > > ++++++++++++++++++++
> > >  1 file changed, 313 insertions(+)
> > > 
> > > 
> > > diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst
> > > b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > index 2e20314f1831..05b9411fac7f 100644
> > > --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> > > +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > @@ -300,6 +300,9 @@ The seven phases are as follows:
> > >  7. Re-check the summary counters and presents the caller with a
> > > summary of
> > >     space usage and file counts.
> > >  
> > > +This allocation of responsibilities will be :ref:`revisited
> > > <scrubcheck>`
> > > +later in this document.
> > > +
> > >  Steps for Each Scrub Item
> > >  -------------------------
> > >  
> > > @@ -4505,3 +4508,313 @@ The proposed patches are in the
> > >  `orphanage adoption
> > >  <
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > > log/?h=repair-orphanage>`_
> > >  series.
> > > +
> > > +6. Userspace Algorithms and Data Structures
> > > +===========================================
> > > +
> > > +This section discusses the key algorithms and data structures of
> > > the
> > > userspace
> > > +program, ``xfs_scrub``, that provide the ability to drive
> > > metadata
> > > checks and
> > > +repairs in the kernel, verify file data, and look for other
> > > potential problems.
> > > +
> > > +.. _scrubcheck:
> > > +
> > > +Checking Metadata
> > > +-----------------
> > > +
> > > +Recall the :ref:`phases of fsck work<scrubphases>` outlined
> > > earlier.
> > > +That structure follows naturally from the data dependencies
> > > designed
> > > into the
> > > +filesystem from its beginnings in 1993.
> > > +In XFS, there are several groups of metadata dependencies:
> > > +
> > > +a. Filesystem summary counts depend on consistency within the
> > > inode
> > > indices,
> > > +   the allocation group space btrees, and the realtime volume
> > > space
> > > +   information.
> > > +
> > > +b. Quota resource counts depend on consistency within the quota
> > > file
> > > data
> > > +   forks, inode indices, inode records, and the forks of every
> > > file
> > > on the
> > > +   system.
> > > +
> > > +c. The naming hierarchy depends on consistency within the
> > > directory
> > > and
> > > +   extended attribute structures.
> > > +   This includes file link counts.
> > > +
> > > +d. Directories, extended attributes, and file data depend on
> > > consistency within
> > > +   the file forks that map directory and extended attribute data
> > > to
> > > physical
> > > +   storage media.
> > > +
> > > +e. The file forks depend on consistency within inode records
> > > and
> > > the space
> > > +   metadata indices of the allocation groups and the realtime
> > > volume.
> > > +   This includes quota and realtime metadata files.
> > > +
> > > +f. Inode records depend on consistency within the inode
> > > metadata
> > > indices.
> > > +
> > > +g. Realtime space metadata depend on the inode records and data
> > > forks of the
> > > +   realtime metadata inodes.
> > > +
> > > +h. The allocation group metadata indices (free space, inodes,
> > > reference count,
> > > +   and reverse mapping btrees) depend on consistency within the
> > > AG
> > > headers and
> > > +   between all the AG metadata btrees.
> > > +
> > > +i. ``xfs_scrub`` depends on the filesystem being mounted and
> > > kernel
> > > support
> > > +   for online fsck functionality.
> > > +
> > > +Therefore, a metadata dependency graph is a convenient way to
> > > schedule checking
> > > +operations in the ``xfs_scrub`` program:
> > > +
> > > +- Phase 1 checks that the provided path maps to an XFS
> > > filesystem
> > > and detects
> > > +  the kernel's scrubbing abilities, which validates group (i).
> > > +
> > > +- Phase 2 scrubs groups (g) and (h) in parallel using a threaded
> > > workqueue.
> > > +
> > > +- Phase 3 checks groups (f), (e), and (d), in that order.
> > > +  These groups are all file metadata, which means that inodes
> > > are
> > > scanned in
> > > +  parallel.
> > ...When things are done in order, then they are done in serial
> > right?
> > Things done in parallel are done at the same time.  Either the
> > phrase
> > "in that order" needs to go away, or the last line needs to drop
> 
> Each inode is processed in parallel, but individual inodes are
> processed
> in f-e-d order.
> 
> "Phase 3 scans inodes in parallel.  For each inode, groups (f), (e),
> and
> (d) are checked, in that order."
Ohh, ok.  Now that I re-read it, it makes sense but let's keep the new
one

> 
> > > +
> > > +- Phase 4 repairs everything in groups (i) through (d) so that
> > > phases 5 and 6
> > > +  may run reliably.
> > > +
> > > +- Phase 5 starts by checking groups (b) and (c) in parallel
> > > before
> > > moving on
> > > +  to checking names.
> > > +
> > > +- Phase 6 depends on groups (i) through (b) to find file data
> > > blocks
> > > to verify,
> > > +  to read them, and to report which blocks of which files are
> > > affected.
> > > +
> > > +- Phase 7 checks group (a), having validated everything else.
> > > +
> > > +Notice that the data dependencies between groups are enforced by
> > > the
> > > structure
> > > +of the program flow.
> > > +
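As an aside for other readers: the phase-to-group mapping above is easy to
picture as a small static table.  Here's an illustrative C sketch -- the enum
and array names are invented and are not how xfs_scrub actually encodes its
phases:

/* Dependency groups (a)-(i) from the list above; illustration only. */
enum dep_group {
	GRP_SUMMARY,	/* (a) filesystem summary counts */
	GRP_QUOTA,	/* (b) quota resource counts */
	GRP_NAMES,	/* (c) naming hierarchy */
	GRP_DIRATTR,	/* (d) dirs, xattrs, file data maps */
	GRP_FORKS,	/* (e) file forks */
	GRP_INODES,	/* (f) inode records */
	GRP_RTMETA,	/* (g) realtime space metadata */
	GRP_AGMETA,	/* (h) AG space metadata */
	GRP_RUNTIME,	/* (i) mounted fs and kernel support */
};

#define G(x)	(1U << (x))

/* Groups validated by each phase, mirroring the bullets above. */
static const unsigned int phase_groups[] = {
	[1] = G(GRP_RUNTIME),
	[2] = G(GRP_RTMETA) | G(GRP_AGMETA),
	[3] = G(GRP_INODES) | G(GRP_FORKS) | G(GRP_DIRATTR),
	[5] = G(GRP_QUOTA) | G(GRP_NAMES),
	[7] = G(GRP_SUMMARY),
	/* phases 4 and 6 consume earlier groups rather than adding new ones */
};
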
> > > +Parallel Inode Scans
> > > +--------------------
> > > +
> > > +An XFS filesystem can easily contain hundreds of millions of
> > > inodes.
> > > +Given that XFS targets installations with large high-performance
> > > storage,
> > > +it is desirable to scrub inodes in parallel to minimize runtime,
> > > particularly
> > > +if the program has been invoked manually from a command line.
> > > +This requires careful scheduling to keep the threads as evenly
> > > loaded as
> > > +possible.
> > > +
> > > +Early iterations of the ``xfs_scrub`` inode scanner naïvely
> > > created
> > > a single
> > > +workqueue and scheduled a single workqueue item per AG.
> > > +Each workqueue item walked the inode btree (with
> > > ``XFS_IOC_INUMBERS``) to find
> > > +inode chunks and then called bulkstat (``XFS_IOC_BULKSTAT``) to
> > > gather enough
> > > +information to construct file handles.
> > > +The file handle was then passed to a function to generate scrub
> > > items for each
> > > +metadata object of each inode.
> > > +This simple algorithm leads to thread balancing problems in
> > > phase 3
> > > if the
> > > +filesystem contains one AG with a few large sparse files and the
> > > rest of the
> > > +AGs contain many smaller files.
> > > +The inode scan dispatch function was not sufficiently granular;
> > > it
> > > should have
> > > +been dispatching at the level of individual inodes, or, to
> > > constrain
> > > memory
> > > +consumption, inode btree records.
> > > +
> > > +Thanks to Dave Chinner, bounded workqueues in userspace enable
> > > ``xfs_scrub`` to
> > > +avoid this problem with ease by adding a second workqueue.
> > > +Just like before, the first workqueue is seeded with one
> > > workqueue
> > > item per AG,
> > > +and it uses INUMBERS to find inode btree chunks.
> > > +The second workqueue, however, is configured with an upper bound
> > > on
> > > the number
> > > +of items that can be waiting to be run.
> > > +Each inode btree chunk found by the first workqueue's workers
> > > is
> > > queued to the
> > > +second workqueue, and it is this second workqueue that queries
> > > BULKSTAT,
> > > +creates a file handle, and passes it to a function to generate
> > > scrub
> > > items for
> > > +each metadata object of each inode.
> > > +If the second workqueue is too full, the workqueue add function
> > > blocks the
> > > +first workqueue's workers until the backlog eases.
> > > +This doesn't completely solve the balancing problem, but reduces
> > > it
> > > enough to
> > > +move on to more pressing issues.
> > > +
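For readers who want to see the shape of that bounded handoff, below is a
stripped-down pthreads sketch of the idea.  It is not the actual xfsprogs
workqueue code; the bound and all names are invented for illustration.

#include <pthread.h>

#define QUEUE_BOUND	64	/* arbitrary backlog limit for illustration */

struct bounded_queue {
	pthread_mutex_t	lock;
	pthread_cond_t	not_full;
	pthread_cond_t	not_empty;
	void		*items[QUEUE_BOUND];
	unsigned int	head, tail, count;
};

static struct bounded_queue bulkstat_queue = {
	.lock = PTHREAD_MUTEX_INITIALIZER,
	.not_full = PTHREAD_COND_INITIALIZER,
	.not_empty = PTHREAD_COND_INITIALIZER,
};

/* Called by the per-AG INUMBERS workers; blocks while the backlog is full. */
static void bq_add(struct bounded_queue *q, void *item)
{
	pthread_mutex_lock(&q->lock);
	while (q->count == QUEUE_BOUND)
		pthread_cond_wait(&q->not_full, &q->lock);
	q->items[q->tail] = item;
	q->tail = (q->tail + 1) % QUEUE_BOUND;
	q->count++;
	pthread_cond_signal(&q->not_empty);
	pthread_mutex_unlock(&q->lock);
}

/* Called by the workers that turn inode chunks into BULKSTAT/scrub items. */
static void *bq_take(struct bounded_queue *q)
{
	void *item;

	pthread_mutex_lock(&q->lock);
	while (q->count == 0)
		pthread_cond_wait(&q->not_empty, &q->lock);
	item = q->items[q->head];
	q->head = (q->head + 1) % QUEUE_BOUND;
	q->count--;
	pthread_cond_signal(&q->not_full);
	pthread_mutex_unlock(&q->lock);
	return item;
}
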
> > > +The proposed patchsets are the scrub
> > > +`performance tweaks
> > > +<
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.g
> > > it/log/?h=scrub-performance-tweaks>`_
> > > +and the
> > > +`inode scan rebalance
> > > +<
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.g
> > > it/log/?h=scrub-iscan-rebalance>`_
> > > +series.
> > > +
> > > +.. _scrubrepair:
> > > +
> > > +Scheduling Repairs
> > > +------------------
> > > +
> > > +During phase 2, corruptions and inconsistencies reported in any
> > > AGI
> > > header or
> > > +inode btree are repaired immediately, because phase 3 relies on
> > > proper
> > > +functioning of the inode indices to find inodes to scan.
> > > +Failed repairs are rescheduled to phase 4.
> > > +Problems reported in any other space metadata are deferred to
> > > phase
> > > 4.
> > > +Optimization opportunities are always deferred to phase 4, no
> > > matter
> > > their
> > > +origin.
> > > +
> > > +During phase 3, corruptions and inconsistencies reported in any
> > > part
> > > of a
> > > +file's metadata are repaired immediately if all space metadata
> > > were
> > > validated
> > > +during phase 2.
> > > +Repairs that fail, and problems that cannot be repaired immediately, are
> > > scheduled
> > > for phase 4.
> > > +
> > > +In the original design of ``xfs_scrub``, it was thought that
> > > repairs
> > > would be
> > > +so infrequent that the ``struct xfs_scrub_metadata`` objects
> > > used to
> > > +communicate with the kernel could also be used as the primary
> > > object
> > > to
> > > +schedule repairs.
> > > +With recent increases in the number of optimizations possible
> > > for a
> > > given
> > > +filesystem object, it became much more memory-efficient to track
> > > all
> > > eligible
> > > +repairs for a given filesystem object with a single repair item.
> > > +Each repair item represents a single lockable object -- AGs,
> > > metadata files,
> > > +individual inodes, or a class of summary information.
> > > +
> > > +Phase 4 is responsible for scheduling a lot of repair work in as
> > > quick a
> > > +manner as is practical.
> > > +The :ref:`data dependencies <scrubcheck>` outlined earlier still
> > > apply, which
> > > +means that ``xfs_scrub`` must try to complete the repair work
> > > scheduled by
> > > +phase 2 before trying repair work scheduled by phase 3.
> > > +The repair process is as follows:
> > > +
> > > +1. Start a round of repair with a workqueue and enough workers
> > > to
> > > keep the CPUs
> > > +   as busy as the user desires.
> > > +
> > > +   a. For each repair item queued by phase 2,
> > > +
> > > +      i.   Ask the kernel to repair everything listed in the
> > > repair
> > > item for a
> > > +           given filesystem object.
> > > +
> > > +      ii.  Make a note if the kernel made any progress in
> > > reducing
> > > the number
> > > +           of repairs needed for this object.
> > > +
> > > +      iii. If the object no longer requires repairs, revalidate
> > > all
> > > metadata
> > > +           associated with this object.
> > > +           If the revalidation succeeds, drop the repair item.
> > > +           If not, requeue the item for more repairs.
> > > +
> > > +   b. If any repairs were made, jump back to 1a to retry all the
> > > phase 2 items.
> > > +
> > > +   c. For each repair item queued by phase 3,
> > > +
> > > +      i.   Ask the kernel to repair everything listed in the
> > > repair
> > > item for a
> > > +           given filesystem object.
> > > +
> > > +      ii.  Make a note if the kernel made any progress in
> > > reducing
> > > the number
> > > +           of repairs needed for this object.
> > > +
> > > +      iii. If the object no longer requires repairs, revalidate
> > > all
> > > metadata
> > > +           associated with this object.
> > > +           If the revalidation succeeds, drop the repair item.
> > > +           If not, requeue the item for more repairs.
> > > +
> > > +   d. If any repairs were made, jump back to 1c to retry all the
> > > phase 3 items.
> > > +
> > > +2. If step 1 made any repair progress of any kind, jump back to
> > > step
> > > 1 to start
> > > +   another round of repair.
> > > +
> > > +3. If there are items left to repair, run them all serially one
> > > more
> > > time.
> > > +   Complain if the repairs were not successful, since this is
> > > the
> > > last chance
> > > +   to repair anything.
> > > +
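Purely to illustrate the control flow of the list above, a C sketch of the
rounds follows; the type and helper names are invented and are not the real
xfs_scrub symbols:

#include <stdbool.h>

struct repair_list;

/* Hypothetical helpers standing in for steps 1a/1c (i-iii) and step 3. */
unsigned int repair_list_round(struct repair_list *items);
void repair_list_final_pass(struct repair_list *items);

static void phase4_repair(struct repair_list *phase2_items,
			  struct repair_list *phase3_items)
{
	bool progress;

	do {
		progress = false;

		/* Steps 1a-1b: retry the phase 2 items until a pass fixes nothing. */
		while (repair_list_round(phase2_items) > 0)
			progress = true;

		/* Steps 1c-1d: then do the same for the phase 3 items. */
		while (repair_list_round(phase3_items) > 0)
			progress = true;
	} while (progress);	/* step 2: go again if anything improved */

	/* Step 3: one final serial pass; complain about anything still broken. */
	repair_list_final_pass(phase2_items);
	repair_list_final_pass(phase3_items);
}
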
> > > +Corruptions and inconsistencies encountered during phases 5 and
> > > 7
> > > are repaired
> > > +immediately.
> > > +Corrupt file data blocks reported by phase 6 cannot be recovered
> > > by
> > > the
> > > +filesystem.
> > > +
> > > +The proposed patchsets are the
> > > +`repair warning improvements
> > > +<
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.g
> > > it/log/?h=scrub-better-repair-warnings>`_,
> > > +refactoring of the
> > > +`repair data dependency
> > > +<
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.g
> > > it/log/?h=scrub-repair-data-deps>`_
> > > +and
> > > +`object tracking
> > > +<
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.g
> > > it/log/?h=scrub-object-tracking>`_,
> > > +and the
> > > +`repair scheduling
> > > +<
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.g
> > > it/log/?h=scrub-repair-scheduling>`_
> > > +improvement series.
> > > +
> > > +Checking Names for Confusable Unicode Sequences
> > > +-----------------------------------------------
> > > +
> > > +If ``xfs_scrub`` succeeds in validating the filesystem metadata
> > > by
> > > the end of
> > > +phase 4, it moves on to phase 5, which checks for suspicious
> > > looking
> > > names in
> > > +the filesystem.
> > > +These names consist of the filesystem label, names in directory
> > > entries, and
> > > +the names of extended attributes.
> > > +Like most Unix filesystems, XFS imposes the sparest of
> > > constraints
> > > on the
> > > +contents of a name -- slashes and null bytes are not allowed in
> > > directory
> > > +entries; and null bytes are not allowed in extended attributes
> > > and
> > maybe say "standard user accessible extended attributes"
> 
> "userspace visible"?
Thats fine, mostly I meant to exclude parent pointers, but I've seen
other ideas that talk about using xattrs to store binary metadata, so
pptrs may not be the last to do this.

> 
> I'll list-ify this too:
> 
> Like most Unix filesystems, XFS imposes the sparest of constraints on
> the contents of a name:
> 
> - slashes and null bytes are not allowed in directory entries;
> 
> - null bytes are not allowed in userspace-visible extended
> attributes;
> 
> - null bytes are not allowed in the filesystem label
Ok, I think that works

> 
> > > the
> > > +filesystem label.
> > > +Directory entries and attribute keys store the length of the
> > > name
> > > explicitly
> > > +ondisk, which means that nulls are not name terminators.
> > > +For this section, the term "naming domain" refers to any place
> > > where
> > > names are
> > > +presented together -- all the names in a directory, or all the
> > > attributes of a
> > > +file.
> > > +
> > > +Although the Unix naming constraints are very permissive, the
> > > reality of most
> > > +modern-day Linux systems is that programs work with Unicode
> > > character code
> > > +points to support international languages.
> > > +These programs typically encode those code points in UTF-8 when
> > > interfacing
> > > +with the C library because the kernel expects null-terminated
> > > names.
> > > +In the common case, therefore, names found in an XFS filesystem
> > > are
> > > actually
> > > +UTF-8 encoded Unicode data.
> > > +
> > > +To maximize its expressiveness, the Unicode standard defines
> > > separate code
> > > +points for various characters that render similarly or
> > > identically
> > > in writing
> > > +systems around the world.
> > > +For example, the character "Cyrillic Small Letter A" U+0430 "а"
> > > often renders
> > > +identically to "Latin Small Letter A" U+0061 "a".
> > 
> > 
> > > +
> > > +The standard also permits characters to be constructed in
> > > multiple
> > > ways --
> > > +either by using a defined code point, or by combining one code
> > > point
> > > with
> > > +various combining marks.
> > > +For example, the character "Angstrom Sign" U+212B "Å" can also be
> > > expressed
> > > +as "Latin Capital Letter A" U+0041 "A" followed by "Combining
> > > Ring
> > > Above"
> > > +U+030A "◌̊".
> > > +Both sequences render identically.
> > > +
> > > +Like the standards that preceded it, Unicode also defines
> > > various
> > > control
> > > +characters to alter the presentation of text.
> > > +For example, the character "Right-to-Left Override" U+202E can
> > > trick
> > > some
> > > +programs into rendering "moo\\xe2\\x80\\xaegnp.txt" as
> > > "mootxt.png".
> > > +A second category of rendering problems involves whitespace
> > > characters.
> > > +If the character "Zero Width Space" U+200B is encountered in a
> > > file
> > > name, the
> > > +name will render identically to a name that does not have the
> > > zero
> > > width
> > > +space.
> > > +
> > > +If two names within a naming domain have different byte
> > > sequences
> > > but render
> > > +identically, a user may be confused by it.
> > > +The kernel, in its indifference to upper level encoding schemes,
> > > permits this.
> > > +Most filesystem drivers persist the byte sequence names that are
> > > given to them
> > > +by the VFS.
> > > +
> > > +Techniques for detecting confusable names are explained in great
> > > detail in
> > > +sections 4 and 5 of the
> > > +`Unicode Security Mechanisms
> > > <https://unicode.org/reports/tr39/>`_
> > > +document.
> > I don't know that we need this much detail on character rendering. 
> > I
> > think the example above is enough to make the point that character
> > strings can differ in binary, but render the same, so we need to
> > deal
> > with that.  So I think that's really all the justification we need
> > for
> > the NFD usage
> 
> I want to leave the link in, because TR39 is the canonical source for
> information about confusability detection.  That is the location
> where
> the Unicode folks publish everything they currently know on the
> topic.

Sure, maybe just keep the last line then.

Allison

> 
> > > +``xfs_scrub``, when it detects UTF-8 encoding in use on a
> > > system,
> > > uses the
> > When ``xfs_scrub`` detects UTF-8 encoding, it uses the...
> 
> Changed, thanks.
> 
> > > +Unicode normalization form NFD in conjunction with the
> > > confusable
> > > name
> > > +detection component of
> > > +`libicu <https://github.com/unicode-org/icu>`_
> > > +to identify names within a directory or within a file's extended
> > > attributes that
> > > +could be confused for each other.
> > > +Names are also checked for control characters, non-rendering
> > > characters, and
> > > +mixing of bidirectional characters.
> > > +All of these potential issues are reported to the system
> > > administrator during
> > > +phase 5.
> > > +
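For the curious, the core of that check is TR39 "skeleton" comparison.  A
minimal sketch using ICU's C spoof-checker API might look like the following;
the real unicrash code in xfs_scrub also applies NFD normalization, caches
skeletons per naming domain, and flags control/bidirectional characters, none
of which is shown here:

#include <unicode/uspoof.h>
#include <stdbool.h>
#include <string.h>

/*
 * Two UTF-8 names are considered confusable if their TR39 skeletons compare
 * equal.  Error handling is elided; a real scanner would reuse a single
 * USpoofChecker for the whole run.
 */
static bool names_confusable(const char *a, const char *b)
{
	UErrorCode	err = U_ZERO_ERROR;
	USpoofChecker	*sc = uspoof_open(&err);
	char		skel_a[1024], skel_b[1024];

	if (U_FAILURE(err))
		return false;
	uspoof_getSkeletonUTF8(sc, 0, a, -1, skel_a, sizeof(skel_a), &err);
	uspoof_getSkeletonUTF8(sc, 0, b, -1, skel_b, sizeof(skel_b), &err);
	uspoof_close(sc);
	return U_SUCCESS(err) && !strcmp(skel_a, skel_b);
}
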
> > > +Media Verification of File Data Extents
> > > +---------------------------------------
> > > +
> > > +The system administrator can elect to initiate a media scan of
> > > all
> > > file data
> > > +blocks.
> > > +This scan occurs after validation of all filesystem metadata (except
> > > for
> > > the summary
> > > +counters) as phase 6.
> > > +The scan starts by calling ``FS_IOC_GETFSMAP`` to scan the
> > > filesystem space map
> > > +to find areas that are allocated to file data fork extents.
> > > +Gaps between data fork extents that are smaller than 64k are
> > > treated as if
> > > +they were data fork extents to reduce the command setup
> > > overhead.
> > > +When the space map scan accumulates a region larger than 32MB, a
> > > media
> > > +verification request is sent to the disk as a directio read of
> > > the
> > > raw block
> > > +device.
> > > +
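To give a sense of the calling convention, here is a bare-bones
FS_IOC_GETFSMAP loop.  The real phase 6 code filters for file data fork
owners, merges the small gaps described above, and hands ~32MB batches to
O_DIRECT readers; none of that appears in this sketch, and error handling is
minimal:

#include <linux/fsmap.h>
#include <sys/ioctl.h>
#include <stdio.h>
#include <stdlib.h>

#define NR_RECS	128

/* Print every space mapping of the filesystem open at @fd (byte units). */
static int walk_fsmap(int fd)
{
	struct fsmap_head	*head;
	struct fsmap		*last;
	unsigned int		i;

	head = calloc(1, sizeof(*head) + NR_RECS * sizeof(struct fsmap));
	if (!head)
		return -1;
	head->fmh_count = NR_RECS;
	/* Low key is all zeroes from calloc; high key covers everything. */
	head->fmh_keys[1].fmr_device = ~0U;
	head->fmh_keys[1].fmr_flags = ~0U;
	head->fmh_keys[1].fmr_physical = ~0ULL;
	head->fmh_keys[1].fmr_owner = ~0ULL;
	head->fmh_keys[1].fmr_offset = ~0ULL;

	while (ioctl(fd, FS_IOC_GETFSMAP, head) == 0 && head->fmh_entries) {
		for (i = 0; i < head->fmh_entries; i++) {
			struct fsmap *r = &head->fmh_recs[i];

			printf("dev %u phys %llu len %llu owner %llu\n",
			       (unsigned int)r->fmr_device,
			       (unsigned long long)r->fmr_physical,
			       (unsigned long long)r->fmr_length,
			       (unsigned long long)r->fmr_owner);
		}
		last = &head->fmh_recs[head->fmh_entries - 1];
		if (last->fmr_flags & FMR_OF_LAST)
			break;
		head->fmh_keys[0] = *last;	/* resume after the last record */
	}
	free(head);
	return 0;
}
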
> > > +If the verification read fails, ``xfs_scrub`` retries with
> > > single-
> > > block reads
> > > +to narrow down the failure to the specific region of the media, which
> > > is then
> > > recorded.
> > > +When it has finished issuing verification requests, it again
> > > uses
> > > the space
> > > +mapping ioctl to map the recorded media errors back to metadata
> > > structures
> > > +and report what has been lost.
> > > +For media errors in blocks owned by files, the lack of parent
> > > pointers means
> > > +that the entire filesystem must be walked to report the file
> > > paths
> > > and offsets
> > > +corresponding to the media error.
> > > 
> > This last bit will need to be updated after we come to a decision
> > with
> > the rfc
> 
> I'll at least update it since this doc is now pretty deep into the
> pptrs
> stuff:
> 
> "For media errors in blocks owned by files, parent pointers can be
> used
> to construct file paths from inode numbers for user-friendly
> reporting."
> 
> > Other than that, I think it looks pretty good.
> 
> Woot.
> 
> --D
> 
> > Allison
> > 


^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 14/14] xfs: document future directions of online fsck
  2023-03-02  0:39       ` Darrick J. Wong
@ 2023-03-03 23:51         ` Allison Henderson
  2023-03-04  2:28           ` Darrick J. Wong
  0 siblings, 1 reply; 220+ messages in thread
From: Allison Henderson @ 2023-03-03 23:51 UTC (permalink / raw)
  To: djwong
  Cc: Catherine Hoang, david, willy, linux-xfs, Chandan Babu,
	linux-fsdevel, hch

On Wed, 2023-03-01 at 16:39 -0800, Darrick J. Wong wrote:
> On Wed, Mar 01, 2023 at 05:37:19AM +0000, Allison Henderson wrote:
> > On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > Add the seventh and final chapter of the online fsck
> > > documentation,
> > > where we talk about future functionality that can tie in with the
> > > functionality provided by the online fsck patchset.
> > > 
> > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > ---
> > >  .../filesystems/xfs-online-fsck-design.rst         |  155
> > > ++++++++++++++++++++
> > >  1 file changed, 155 insertions(+)
> > > 
> > > 
> > > diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst
> > > b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > index 05b9411fac7f..41291edb02b9 100644
> > > --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> > > +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > @@ -4067,6 +4067,8 @@ The extra flexibility enables several new
> > > use
> > > cases:
> > >    (``FIEXCHANGE_RANGE``) to exchange the file contents, thereby
> > > committing all
> > >    of the updates to the original file, or none of them.
> > >  
> > > +.. _swapext_if_unchanged:
> > > +
> > >  - **Transactional file updates**: The same mechanism as above,
> > > but
> > > the caller
> > >    only wants the commit to occur if the original file's contents
> > > have not
> > >    changed.
> > > @@ -4818,3 +4820,156 @@ and report what has been lost.
> > >  For media errors in blocks owned by files, the lack of parent
> > > pointers means
> > >  that the entire filesystem must be walked to report the file
> > > paths
> > > and offsets
> > >  corresponding to the media error.
> > > +
> > > +7. Conclusion and Future Work
> > > +=============================
> > > +
> > > +It is hoped that the reader of this document has followed the
> > > designs laid out
> > > +in this document and now has some familiarity with how XFS
> > > performs
> > > online
> > > +rebuilding of its metadata indices, and how filesystem users can
> > > interact with
> > > +that functionality.
> > > +Although the scope of this work is daunting, it is hoped that
> > > this
> > > guide will
> > > +make it easier for code readers to understand what has been
> > > built,
> > > for whom it
> > > +has been built, and why.
> > > +Please feel free to contact the XFS mailing list with questions.
> > > +
> > > +FIEXCHANGE_RANGE
> > > +----------------
> > > +
> > > +As discussed earlier, a second frontend to the atomic extent
> > > swap
> > > mechanism is
> > > +a new ioctl call that userspace programs can use to commit
> > > updates
> > > to files
> > > +atomically.
> > > +This frontend has been out for review for several years now,
> > > though
> > > the
> > > +necessary refinements to online repair and lack of customer
> > > demand
> > > mean that
> > > +the proposal has not been pushed very hard.
> 
> Note: The "Extent Swapping with Regular User Files" section has moved
> here.
> 
> > > +Vectorized Scrub
> > > +----------------
> > > +
> > > +As it turns out, the :ref:`refactoring <scrubrepair>` of repair
> > > items mentioned
> > > +earlier was a catalyst for enabling a vectorized scrub system
> > > call.
> > > +Since 2018, the cost of making a kernel call has increased
> > > considerably on some
> > > +systems to mitigate the effects of speculative execution
> > > attacks.
> > > +This incentivizes program authors to make as few system calls as
> > > possible to
> > > +reduce the number of times an execution path crosses a security
> > > boundary.
> > > +
> > > +With vectorized scrub, userspace pushes to the kernel the
> > > identity
> > > of a
> > > +filesystem object, a list of scrub types to run against that
> > > object,
> > > and a
> > > +simple representation of the data dependencies between the
> > > selected
> > > scrub
> > > +types.
> > > +The kernel executes as much of the caller's plan as it can until
> > > it
> > > hits a
> > > +dependency that cannot be satisfied due to a corruption, and
> > > tells
> > > userspace
> > > +how much was accomplished.
> > > +It is hoped that ``io_uring`` will pick up enough of this
> > > functionality that
> > > +online fsck can use that instead of adding a separate vectored
> > > scrub
> > > system
> > > +call to XFS.
> > > +
> > > +The relevant patchsets are the
> > > +`kernel vectorized scrub
> > > +<
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > > log/?h=vectorized-scrub>`_
> > > +and
> > > +`userspace vectorized scrub
> > > +<
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.g
> > > it/log/?h=vectorized-scrub>`_
> > > +series.
> > > +
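To make the description above slightly more tangible, the call might carry
something shaped like the structures below.  To be clear, every name here is
made up for illustration; the actual interface is whatever the linked
vectorized-scrub branches define.

#include <linux/types.h>

/* Illustrative only -- not the real XFS vectored scrub uAPI. */
struct scrubv_vec {
	__u32	sv_type;	/* which scrub type to run */
	__u32	sv_flags;	/* e.g. attempt repair */
	__s32	sv_ret;		/* kernel's result for this vector */
	__u32	sv_barrier;	/* nonzero: stop if earlier vectors failed */
};

struct scrubv_head {
	__u64		svh_ino;	/* target inode (if applicable) */
	__u32		svh_gen;	/* inode generation */
	__u32		svh_agno;	/* target AG (if applicable) */
	__u32		svh_nr;		/* number of vectors that follow */
	__u32		svh_pad;
	struct scrubv_vec svh_vecs[];	/* the caller's ordered plan */
};
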
> > > +Quality of Service Targets for Scrub
> > > +------------------------------------
> > > +
> > > +One serious shortcoming of the online fsck code is that the
> > > amount
> > > of time that
> > > +it can spend in the kernel holding resource locks is basically
> > > unbounded.
> > > +Userspace is allowed to send a fatal signal to the process which
> > > will cause
> > > +``xfs_scrub`` to exit when it reaches a good stopping point, but
> > > there's no way
> > > +for userspace to provide a time budget to the kernel.
> > > +Given that the scrub codebase has helpers to detect fatal
> > > signals,
> > > it shouldn't
> > > +be too much work to allow userspace to specify a timeout for a
> > > scrub/repair
> > > +operation and abort the operation if it exceeds budget.
> > > +However, most repair functions have the property that once they
> > > begin to touch
> > > +ondisk metadata, the operation cannot be cancelled cleanly,
> > > after
> > > which a QoS
> > > +timeout is no longer useful.
> > > +
> > > +Defragmenting Free Space
> > > +------------------------
> > > +
> > > +Over the years, many XFS users have requested the creation of a
> > > program to
> > > +clear a portion of the physical storage underlying a filesystem
> > > so
> > > that it
> > > +becomes a contiguous chunk of free space.
> > > +Call this free space defragmenter ``clearspace`` for short.
> > > +
> > > +The first piece the ``clearspace`` program needs is the ability
> > > to
> > > read the
> > > +reverse mapping index from userspace.
> > > +This already exists in the form of the ``FS_IOC_GETFSMAP``
> > > ioctl.
> > > +The second piece it needs is a new fallocate mode
> > > +(``FALLOC_FL_MAP_FREE_SPACE``) that allocates the free space in
> > > a
> > > region and
> > > +maps it to a file.
> > > +Call this file the "space collector" file.
> > > +The third piece is the ability to force an online repair.
> > > +
> > > +To clear all the metadata out of a portion of physical storage,
> > > clearspace
> > > +uses the new fallocate map-freespace call to map any free space
> > > in
> > > that region
> > > +to the space collector file.
> > > +Next, clearspace finds all metadata blocks in that region by way
> > > of
> > > +``GETFSMAP`` and issues forced repair requests on the data
> > > structure.
> > > +This often results in the metadata being rebuilt somewhere that
> > > is
> > > not being
> > > +cleared.
> > > +After each relocation, clearspace calls the "map free space"
> > > function again to
> > > +collect any newly freed space in the region being cleared.
> > > +
> > > +To clear all the file data out of a portion of the physical
> > > storage,
> > > clearspace
> > > +uses the FSMAP information to find relevant file data blocks.
> > > +Having identified a good target, it uses the ``FICLONERANGE``
> > > call
> > > on that part
> > > +of the file to try to share the physical space with a dummy
> > > file.
> > > +Cloning the extent means that the original owners cannot
> > > overwrite
> > > the
> > > +contents; any changes will be written somewhere else via copy-
> > > on-
> > > write.
> > > +Clearspace makes its own copy of the frozen extent in an area
> > > that
> > > is not being
> > > +cleared, and uses ``FIDEDUPERANGE`` (or the :ref:`atomic extent
> > > swap
> > > +<swapext_if_unchanged>` feature) to change the target file's
> > > data
> > > extent
> > > +mapping away from the area being cleared.
> > > +When all other mappings have been moved, clearspace reflinks the
> > > space into the
> > > +space collector file so that it becomes unavailable.
> > > +
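To illustrate just the "freeze the extent with a clone" step from the
paragraph above: the userspace side is an ordinary FICLONERANGE call.
Everything else -- choosing targets from GETFSMAP, the dedupe/atomic-swap
step, and the proposed FALLOC_FL_MAP_FREE_SPACE mode (which does not exist
upstream) -- is left out of this sketch:

#include <linux/fs.h>
#include <sys/ioctl.h>

/*
 * Share @len bytes at @src_off of @src_fd into @dummy_fd at @dst_off.
 * Once the blocks are shared, the original owner's writes are redirected
 * via copy-on-write, so the frozen range can be copied and remapped away
 * from the region being cleared.
 */
static int freeze_range(int src_fd, long long src_off,
			int dummy_fd, long long dst_off, long long len)
{
	struct file_clone_range fcr = {
		.src_fd		= src_fd,
		.src_offset	= src_off,
		.src_length	= len,
		.dest_offset	= dst_off,
	};

	return ioctl(dummy_fd, FICLONERANGE, &fcr);
}
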
> > > +There are further optimizations that could apply to the above
> > > algorithm.
> > > +To clear a piece of physical storage that has a high sharing
> > > factor,
> > > it is
> > > +strongly desirable to retain this sharing factor.
> > > +In fact, these extents should be moved first to maximize sharing
> > > factor after
> > > +the operation completes.
> > > +To make this work smoothly, clearspace needs a new ioctl
> > > +(``FS_IOC_GETREFCOUNTS``) to report reference count information
> > > to
> > > userspace.
> > > +With the refcount information exposed, clearspace can quickly
> > > find
> > > the longest,
> > > +most shared data extents in the filesystem, and target them
> > > first.
> > > +
> > 
> > 
> > > +**Question**: How might the filesystem move inode chunks?
> > > +
> > > +*Answer*: 
> > "In order to move inode chunks.."
> 
> Done.
> 
> > > Dave Chinner has a prototype that creates a new file with the old
> > > +contents and then locklessly runs around the filesystem updating
> > > directory
> > > +entries.
> > > +The operation cannot complete if the filesystem goes down.
> > > +That problem isn't totally insurmountable: create an inode
> > > remapping
> > > table
> > > +hidden behind a jump label, and a log item that tracks the
> > > kernel
> > > walking the
> > > +filesystem to update directory entries.
> > > +The trouble is, the kernel can't do anything about open files,
> > > since
> > > it cannot
> > > +revoke them.
> > > +
> > 
> > 
> > > +**Question**: Can static keys be used to add a revoke bailout
> > > return
> > > to
> > > +*every* code path coming in from userspace?
> > > +
> > > +*Answer*: In principle, yes.
> > > +This 
> > 
> > "It is also possible to use static keys to add a revoke bailout
> > return
> > to each code path coming in from userspace.  This..."
> 
> I think this change would make the answer redundant with the
> question.
Sorry, I meant for the quotations to replace everything between the
line breaks.  So from Q through the answer, just to break out of the
Q&A format.

I sort of feel like if a document leaves the reader with questions that
they didn't have before they started reading, then ideally we should
simply just incorporate the answer in the document.  Just makes the
read easier imho.

> 
> "Can static keys be used to minimize the runtime cost of supporting
> ``revoke()`` on XFS files?"
> 
> "Yes.  Until the first revocation, the bailout code need not be in
> the
> call path at all."

That's an implied Q&A format, but I suppose it's not a big deal either
way though.

> 
> > > would eliminate the overhead of the check until a revocation
> > > happens.
> > > +It's not clear what we do to a revoked file after all the
> > > callers
> > > are finished
> > > +with it, however.
> > > +
> > > +The relevant patchsets are the
> > > +`kernel freespace defrag
> > > +<
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > > log/?h=defrag-freespace>`_
> > > +and
> > > +`userspace freespace defrag
> > > +<
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.g
> > > it/log/?h=defrag-freespace>`_
> > > +series.
> > 
> > I guess since they're just future ideas just light documentation is
> > fine.  Other than cleaning out the Q & A's, I think it looks pretty
> > good.
> 
> Ok.  Thank you x100000000 for being the first person to publicly
> comment
> on the entire document!

Sure, glad to help!  :-)

Allison

> 
> --D
> 
> > Allison
> > 
> > > +
> > > +Shrinking Filesystems
> > > +---------------------
> > > +
> > > +Removing the end of the filesystem ought to be a simple matter
> > > of
> > > evacuating
> > > +the data and metadata at the end of the filesystem, and handing
> > > the
> > > freed space
> > > +to the shrink code.
> > > +That requires an evacuation of the space at the end of the
> > > filesystem,
> > > which is a
> > > +use of free space defragmentation!
> > > 
> > 


^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH v24.3 12/14] xfs: document directory tree repairs
  2023-03-03 23:50           ` Allison Henderson
@ 2023-03-04  2:19             ` Darrick J. Wong
  0 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2023-03-04  2:19 UTC (permalink / raw)
  To: Allison Henderson
  Cc: david, Catherine Hoang, linux-fsdevel, hch, linux-xfs, willy,
	Chandan Babu

On Fri, Mar 03, 2023 at 11:50:57PM +0000, Allison Henderson wrote:
> On Wed, 2023-03-01 at 16:14 -0800, Darrick J. Wong wrote:
> > On Sat, Feb 25, 2023 at 07:33:23AM +0000, Allison Henderson wrote:
> > > On Thu, 2023-02-02 at 18:12 -0800, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <djwong@kernel.org>
> > > > 
> > > > Directory tree repairs are the least complete part of online
> > > > fsck,
> > > > due
> > > > to the lack of directory parent pointers.  However, even without
> > > > that
> > > > feature, we can still make some corrections to the directory tree
> > > > --
> > > > we
> > > > can salvage as many directory entries as we can from a damaged
> > > > directory, and we can reattach orphaned inodes to the lost+found,
> > > > just
> > > > as xfs_repair does now.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > > ---
> > > > v24.2: updated with my latest thoughts about how to use parent
> > > > pointers
> > > > v24.3: updated to reflect the online fsck code I built for parent
> > > > pointers
> > > > ---
> > > >  .../filesystems/xfs-online-fsck-design.rst         |  410
> > > > ++++++++++++++++++++
> > > >  1 file changed, 410 insertions(+)
> > > > 
> > > > diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst
> > > > b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > > index af7755fe0107..51d040e4a2d0 100644
> > > > --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> > > > +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > > @@ -4359,3 +4359,413 @@ The proposed patchset is the
> > > >  `extended attribute repair
> > > >  <
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > > > log/?h=repair-xattrs>`_
> > > >  series.
> > > > +
> > > > +Fixing Directories
> > > > +------------------
> > > > +
> > > > +Fixing directories is difficult with currently available
> > > > filesystem
> > > > features,
> > > > +since directory entries are not redundant.
> > > > +The offline repair tool scans all inodes to find files with
> > > > nonzero
> > > > link count,
> > > > +and then it scans all directories to establish parentage of
> > > > those
> > > > linked files.
> > > > +Damaged files and directories are zapped, and files with no
> > > > parent
> > > > are
> > > > +moved to the ``/lost+found`` directory.
> > > > +It does not try to salvage anything.
> > > > +
> > > > +The best that online repair can do at this time is to read
> > > > directory
> > > > data
> > > > +blocks and salvage any dirents that look plausible, correct link
> > > > counts, and
> > > > +move orphans back into the directory tree.
> > > > +The salvage process is discussed in the case study at the end of
> > > > this section.
> > > > +The :ref:`file link count fsck <nlinks>` code takes care of
> > > > fixing
> > > > link counts
> > > > +and moving orphans to the ``/lost+found`` directory.
> > > > +
> > > > +Case Study: Salvaging Directories
> > > > +`````````````````````````````````
> > > > +
> > > > +Unlike extended attributes, directory blocks are all the same
> > > > size,
> > > > so
> > > > +salvaging directories is straightforward:
> > > > +
> > > > +1. Find the parent of the directory.
> > > > +   If the dotdot entry is not unreadable, try to confirm that
> > > > the
> > > > alleged
> > > > +   parent has a child entry pointing back to the directory being
> > > > repaired.
> > > > +   Otherwise, walk the filesystem to find it.
> > > > +
> > > > +2. Walk the first partition of the data fork of the directory to
> > > > find
> > > > the directory
> > > > +   entry data blocks.
> > > > +   When one is found,
> > > > +
> > > > +   a. Walk the directory data block to find candidate entries.
> > > > +      When an entry is found:
> > > > +
> > > > +      i. Check the name for problems, and ignore the name if
> > > > there
> > > > are.
> > > > +
> > > > +      ii. Retrieve the inumber and grab the inode.
> > > > +          If that succeeds, add the name, inode number, and file
> > > > type to the
> > > > +          staging xfarray and xfblob.
> > > > +
> > > > +3. If the memory usage of the xfarray and xfblob exceeds a
> > > > certain
> > > > amount of
> > > > +   memory or there are no more directory data blocks to examine,
> > > > unlock the
> > > > +   directory and add the staged dirents into the temporary
> > > > directory.
> > > > +   Truncate the staging files.
> > > > +
> > > > +4. Use atomic extent swapping to exchange the new and old
> > > > directory
> > > > structures.
> > > > +   The old directory blocks are now attached to the temporary
> > > > file.
> > > > +
> > > > +5. Reap the temporary file.
> > > > +
> > > 
> > > 
> > > 
> > > > +**Future Work Question**: Should repair revalidate the dentry
> > > > cache
> > > > when
> > > > +rebuilding a directory?
> > > > +
> > > > +*Answer*: Yes, though the current dentry cache code doesn't
> > > > provide
> > > > a means
> > > > +to walk every dentry of a specific directory.
> > > > +If the cache contains an entry that the salvaging code does not
> > > > find, the
> > > > +repair cannot proceed.
> > > > +
> > > > +**Future Work Question**: Can the dentry cache know about a
> > > > directory entry
> > > > +that cannot be salvaged?
> > > > +
> > > > +*Answer*: In theory, the dentry cache should be a subset of the
> > > > directory
> > > > +entries on disk because there's no way to load a dentry without
> > > > having
> > > > +something to read in the directory.
> > > > +However, it is possible for a coherency problem to be introduced
> > > > if
> > > > the ondisk
> > > > +structures become corrupt *after* the cache loads.
> > > > +In theory it is necessary to scan all dentry cache entries for a
> > > > directory to
> > > > +ensure that one of the following apply:
> > > 
> > > "Currently the dentry cache code doesn't provide a means to walk
> > > every
> > > dentry of a specific directory.  This makes validation of the
> > > rebuilt
> > > directory difficult, and it is possible for an ondisk structure to
> > > become corrupt *after* the cache loads.  Walking the dentry cache
> > > is
> > > currently being considered as a future improvement.  This will also
> > > enable the ability to report which entries were not salvageable
> > > since
> > > these will be the subset of entries that are absent after the walk.
> > > This improvement will ensure that one of the following apply:"
> > 
> > The thing is -- I'm not considering restructuring the dentry cache. 
> > The
> > cache key is a one-way hash function of the parent_ino and the dirent
> > name, and I can't even imagine how one would support using that for
> > arbitrary lookups or walks.
> > 
> > This is the giant hole in all of the online repair code -- the design
> > of
> > the dentry cache is such that we can't invalidate the entire cache. 
> > We
> > also cannot walk it to perform targeted invalidation of just the
> > pieces
> > we want.  If after a repair the cache contains a dentry that isn't
> > backed by an actual ondisk directory entry ... kaboom.
> > 
> > The one thing I'll grant you is that I don't think it's likely that
> > the
> > dentry cache will get populated with some information and later the
> > ondisk directory bitrots undetectably.
> > 
> > > ?
> > > 
> > > I just think it reads cleaner.  I realize this is an area that
> > > still
> > > sort of in flux, but definitely before we call the document done we
> > > should probably strip out the Q's and just document the A's.  If
> > > someone re-raises the Q's we can always refer to the archives and
> > > then
> > > have the discussion on the mailing list.  But I think the document
> > > should maintain the goal of making clear whatever the current plan
> > > is
> > > just to keep it reading cleanly. 
> > 
> > Yeah, I'll shorten this section so that it only mentions these things
> > once and clearly states that I have no solution.
> I see, yes I got the impression from the original phrasing that it was
> an intended "todo", so clarifying that its not should help. 

Ahh, ok. :)

> > 
> > > > +
> > > > +1. The cached dentry reflects an ondisk dirent in the new
> > > > directory.
> > > > +
> > > > +2. The cached dentry no longer has a corresponding ondisk dirent
> > > > in
> > > > the new
> > > > +   directory and the dentry can be purged from the cache.
> > > > +
> > > > +3. The cached dentry no longer has an ondisk dirent but the
> > > > dentry
> > > > cannot be
> > > > +   purged.
> > > 
> > > > +   This is bad.
> > > These entries are irrecoverable, but can now be reported.
> > > 
> > > 
> > > 
> > > > +
> > > > +As mentioned above, the dentry cache does not have a means to
> > > > walk
> > > > all the
> > > > +dentries with a particular directory as a parent.
> > > > +This makes detecting situations #2 and #3 impossible, and
> > > > remains an
> > > > +interesting question for research.
> > > I think the above paraphrase makes this last bit redundant.
> > 
> > N
> Not sure if this is "no" or an unfinished thought?

N[ot sure either.] :(

N[ot remembering what I was thinking here.]

N[ever mind].

<giggle>

> > 
> > > > +
> > > > +The proposed patchset is the
> > > > +`directory repair
> > > > +<
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > > > log/?h=repair-dirs>`_
> > > > +series.
> > > > +
> > > > +Parent Pointers
> > > > +```````````````
> > > > +
> > > "Generally speaking, a parent pointer is any kind of metadata that
> > > enables an inode to locate its parent without having to traverse
> > > the
> > > directory tree from the root."
> > > 
> > > > +The lack of secondary directory metadata hinders directory tree
> > > "Without them, the lack of secondary..." 
> > 
> > Ok.  I want to reword the first sentence slightly, yielding this:
> > 
> > "A parent pointer is a piece of file metadata that enables a user to
> > locate the file's parent directory without having to traverse the
> > directory tree from the root.  Without them, reconstruction of
> > directory
> > trees is hindered in much the same way that the historic lack of
> > reverse
> > space mapping information once hindered reconstruction of filesystem
> > space metadata.  The parent pointer feature, however, makes total
> > directory reconstruction
> > possible."
> 
> Alrighty, that sounds good
> 
> > 
> > But that's a much better start to the paragraph, thank you.
> > 
> > > > reconstruction
> > > > +in much the same way that the historic lack of reverse space
> > > > mapping
> > > > +information once hindered reconstruction of filesystem space
> > > > metadata.
> > > > +The parent pointer feature, however, makes total directory
> > > > reconstruction
> > > > +possible.
> > > > +
> > > 
> > > History side bar the below chunk...
> > 
> > Done.
> > 
> > > > +Directory parent pointers were first proposed as an XFS feature
> > > > more
> > > > than a
> > > > +decade ago by SGI.
> > > > +Each link from a parent directory to a child file is mirrored
> > > > with
> > > > an extended
> > > > +attribute in the child that could be used to identify the parent
> > > > directory.
> > > > +Unfortunately, this early implementation had major shortcomings
> > > > and
> > > > was never
> > > > +merged into Linux XFS:
> > > > +
> > > > +1. The XFS codebase of the late 2000s did not have the
> > > > infrastructure to
> > > > +   enforce strong referential integrity in the directory tree.
> > > > +   It did not guarantee that a change in a forward link would
> > > > always
> > > > be
> > > > +   followed up with the corresponding change to the reverse
> > > > links.
> > > > +
> > > > +2. Referential integrity was not integrated into offline repair.
> > > > +   Checking and repairs were performed on mounted filesystems
> > > > without taking
> > > > +   any kernel or inode locks to coordinate access.
> > > > +   It is not clear how this actually worked properly.
> > > > +
> > > > +3. The extended attribute did not record the name of the
> > > > directory
> > > > entry in the
> > > > +   parent, so the SGI parent pointer implementation cannot be
> > > > used
> > > > to reconnect
> > > > +   the directory tree.
> > > > +
> > > > +4. Extended attribute forks only support 65,536 extents, which
> > > > means
> > > > that
> > > > +   parent pointer attribute creation is likely to fail at some
> > > > point
> > > > before the
> > > > +   maximum file link count is achieved.
> > > 
> > > 
> > > "The original parent pointer design was too unstable for something
> > > like
> > > a file system repair to depend on."
> > 
> > Er... I think this is addressed by #2 above?
> Sorry, I meant for the history side bar to go through the list, and
> then add that quotation to connect the paragraphs.  In a way, simply
> talking about the new improvements below implies everything that the
> old design lacked.

*OH* ok, I think I understand now.  You're suggesting this sentence as
an introduction to the paragraph below, not as something to be appended
to point #4.  That makes more sense, I'll go add that, thanks!

> > 
> > > > +
> > > > +Allison Henderson, Chandan Babu, and Catherine Hoang are working
> > > > on
> > > > a second
> > > > +implementation that solves all shortcomings of the first.
> > > > +During 2022, Allison introduced log intent items to track
> > > > physical
> > > > +manipulations of the extended attribute structures.
> > > > +This solves the referential integrity problem by making it
> > > > possible
> > > > to commit
> > > > +a dirent update and a parent pointer update in the same
> > > > transaction.
> > > > +Chandan increased the maximum extent counts of both data and
> > > > attribute forks,
> > > 
> > > > +thereby addressing the fourth problem.
> > > which ensures the parent pointer creation will succeed even if the
> > > max
> > > extent count is reached.
> > 
> > The max extent count cannot be exceeded, but the nrext64 feature
> > ensures
> > that the xattr structure can grow enough to handle maximal
> > hardlinking.
> > 
> > "Chandan increased the maximum extent counts of both data and
> > attribute
> > forks, thereby ensuring that the extended attribute structure can
> > grow
> > to handle the maximum hardlink count of any file."
> 
> Ok, sounds good.
> 
> > 
> > > > +
> > > > +To solve the third problem, parent pointers include the dirent
> > > > name
> > > "Lastly, the new design includes the dirent name..."
> > 
> > <nod>
> > 
> > > > and
> > > > +location of the entry within the parent directory.
> > > > +In other words, child files use extended attributes to store
> > > > pointers to
> > > > +parents in the form ``(parent_inum, parent_gen, dirent_pos) →
> > > > (dirent_name)``.
> > > This part is still in flux, so probably this will have to get updated
> > > later...
> > 
> > Yep, I'll add a note about that.
> > 
> > > > +
> > > > +On a filesystem with parent pointers, the directory checking
> > > > process
> > > > can be
> > > > +strengthened to ensure that the target of each dirent also
> > > > contains
> > > > a parent
> > > > +pointer pointing back to the dirent.
> > > > +Likewise, each parent pointer can be checked by ensuring that
> > > > the
> > > > target of
> > > > +each parent pointer is a directory and that it contains a dirent
> > > > matching
> > > > +the parent pointer.
> > > > +Both online and offline repair can use this strategy.
> > 
> > I moved this paragraph up to become the second paragraph, and now it
> > reads:
> > 
> > "XFS parent pointers include the dirent name and location of the
> > entry
> > within the parent directory.  In other words, child files use
> > extended
> > attributes to store pointers to parents in the form ``(parent_inum,
> > parent_gen, dirent_pos) → (dirent_name)``.  The directory checking
> > process can be strengthened to ensure that the target of each dirent
> > also contains a parent pointer pointing back to the dirent. 
> > Likewise,
> > each parent pointer can be checked by ensuring that the target of
> > each
> > parent pointer is a directory and that it contains a dirent matching
> > the
> > parent pointer.  Both online and offline repair can use this
> > strategy.
> > 
> > Note: The ondisk format of parent pointers is not yet finalized."
> > 
> > After which comes the historical sidebar.
> Alrighty, I think that's fine for now
> 
> > 
> > > > +
> > > > +Case Study: Repairing Directories with Parent Pointers
> > > > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > > +
> > > > +Directory rebuilding uses a :ref:`coordinated inode scan
> > > > <iscan>`
> > > > and
> > > > +a :ref:`directory entry live update hook <liveupdate>` as
> > > > follows:
> > > > +
> > > > +1. Set up a temporary directory for generating the new directory
> > > > structure,
> > > > +   an xfblob for storing entry names, and an xfarray for
> > > > stashing
> > > > directory
> > > > +   updates.
> > > > +
> > > > +2. Set up an inode scanner and hook into the directory entry
> > > > code to
> > > > receive
> > > > +   updates on directory operations.
> > > > +
> > > > +3. For each parent pointer found in each file scanned, decide if
> > > > the
> > > > parent
> > > > +   pointer references the directory of interest.
> > > > +   If so:
> > > > +
> > > > +   a. Stash an addname entry for this dirent in the xfarray for
> > > > later.
> > > > +
> > > > +   b. When finished scanning that file, flush the stashed
> > > > updates to
> > > > the
> > > > +      temporary directory.
> > > > +
> > > > +4. For each live directory update received via the hook, decide
> > > > if
> > > > the child
> > > > +   has already been scanned.
> > > > +   If so:
> > > > +
> > > > +   a. Stash an addname or removename entry for this dirent
> > > > update in
> > > > the
> > > > +      xfarray for later.
> > > > +      We cannot write directly to the temporary directory
> > > > because
> > > > hook
> > > > +      functions are not allowed to modify filesystem metadata.
> > > > +      Instead, we stash updates in the xfarray and rely on the
> > > > scanner thread
> > > > +      to apply the stashed updates to the temporary directory.
> > > > +
> > > > +5. When the scan is complete, atomically swap the contents of
> > > > the
> > > > temporary
> > > > +   directory and the directory being repaired.
> > > > +   The temporary directory now contains the damaged directory
> > > > structure.
> > > > +
> > > > +6. Reap the temporary directory.
> > > > +
> > > > +7. Update the dirent position field of parent pointers as
> > > > necessary.
> > > > +   This may require the queuing of a substantial number of xattr
> > > > log
> > > > intent
> > > > +   items.
> > > > +
> > > > +The proposed patchset is the
> > > > +`parent pointers directory repair
> > > > +<
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > > > log/?h=pptrs-online-dir-repair>`_
> > > > +series.
> > > > +
> > > > +**Unresolved Question**: How will repair ensure that the
> > > > ``dirent_pos`` fields
> > > > +match in the reconstructed directory?
> > > > +
> > > > +*Answer*: There are a few ways to solve this problem:
> > > > +
> > > > +1. The field could be designated advisory, since the other three
> > > > values are
> > > > +   sufficient to find the entry in the parent.
> > > > +   However, this makes indexed key lookup impossible while
> > > > repairs
> > > > are ongoing.
> > > > +
> > > > +2. We could allow creating directory entries at specified
> > > > offsets,
> > > > which solves
> > > > +   the referential integrity problem but runs the risk that
> > > > dirent
> > > > creation
> > > > +   will fail due to conflicts with the free space in the
> > > > directory.
> > > > +
> > > > +   These conflicts could be resolved by appending the directory
> > > > entry and
> > > > +   amending the xattr code to support updating an xattr key and
> > > > reindexing the
> > > > +   dabtree, though this would have to be performed with the
> > > > parent
> > > > directory
> > > > +   still locked.
> > > > +
> > > > +3. Same as above, but remove the old parent pointer entry and
> > > > add a
> > > > new one
> > > > +   atomically.
> > > > +
> > > > +4. Change the ondisk xattr format to ``(parent_inum, name) →
> > > > (parent_gen)``,
> > > > +   which would provide the attr name uniqueness that we require,
> > > > without
> > > > +   forcing repair code to update the dirent position.
> > > > +   Unfortunately, this requires changes to the xattr code to
> > > > support
> > > > attr
> > > > +   names as long as 263 bytes.
> > > > +
> > > > +5. Change the ondisk xattr format to ``(parent_inum, hash(name))
> > > > →
> > > > +   (name, parent_gen)``.
> > > > +   If the hash is sufficiently resistant to collisions (e.g.
> > > > sha256)
> > > > then
> > > > +   this should provide the attr name uniqueness that we require.
> > > > +   Names shorter than 247 bytes could be stored directly.
> > > I think the RFC deluge is the same question but with more context, so
> > > probably this section will follow what we decide there.  I will
> > > save
> > > commentary to keep the discussion in the same thread...
> > > 
> > > I'll just link it here for anyone else following this for now...
> > > https://www.spinics.net/lists/linux-xfs/msg69397.html
> > 
> > Yes, the deluge has much more detailed information.  I'll add this
> > link
> > (for now) to the doc.
> > 
> > > > +
> > > > +Case Study: Repairing Parent Pointers
> > > > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > > +
> > > > +Online reconstruction of a file's parent pointer information
> > > > works
> > > > similarly to
> > > > +directory reconstruction:
> > > > +
> > > > +1. Set up a temporary file for generating a new extended
> > > > attribute
> > > > structure,
> > > > +   an xfblob for storing parent pointer names, and an xfarray
> > > > for
> > > > stashing
> > > > +   parent pointer updates.
> > > we did talk about blobs in patch 6 though it took me a moment to
> > > remember... if there's a way to link or tag it, that would be
> > > helpful
> > > for the quick refresh.  kinda like wikipedia hyperlinks, you
> > > really only need like the first line or two to get it to snap back
> > 
> > There is; I'll put in a backreference.
> > 
> > > > +
> > > > +2. Set up an inode scanner and hook into the directory entry
> > > > code to
> > > > receive
> > > > +   updates on directory operations.
> > > > +
> > > > +3. For each directory entry found in each directory scanned,
> > > > decide
> > > > if the
> > > > +   dirent references the file of interest.
> > > > +   If so:
> > > > +
> > > > +   a. Stash an addpptr entry for this parent pointer in the
> > > > xfblob
> > > > and xfarray
> > > > +      for later.
> > > > +
> > > > +   b. When finished scanning the directory, flush the stashed
> > > > updates to the
> > > > +      temporary directory.
> > > > +
> > > > +4. For each live directory update received via the hook, decide
> > > > if
> > > > the parent
> > > > +   has already been scanned.
> > > > +   If so:
> > > > +
> > > > +   a. Stash an addpptr or removepptr entry for this dirent
> > > > update in
> > > > the
> > > > +      xfarray for later.
> > > > +      We cannot write parent pointers directly to the temporary
> > > > file
> > > > because
> > > > +      hook functions are not allowed to modify filesystem
> > > > metadata.
> > > > +      Instead, we stash updates in the xfarray and rely on the
> > > > scanner thread
> > > > +      to apply the stashed parent pointer updates to the
> > > > temporary
> > > > file.
> > > > +
> > > > +5. Copy all non-parent pointer extended attributes to the
> > > > temporary
> > > > file.
> > > > +
> > > > +6. When the scan is complete, atomically swap the attribute fork
> > > > of
> > > > the
> > > > +   temporary file and the file being repaired.
> > > > +   The temporary file now contains the damaged extended
> > > > attribute
> > > > structure.
> > > > +
> > > > +7. Reap the temporary file.
> > > Seems like it should work
> > 
> > Let's hope so!
> > 
> > > > +
> > > > +The proposed patchset is the
> > > > +`parent pointers repair
> > > > +<
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > > > log/?h=pptrs-online-parent-repair>`_
> > > > +series.
> > > > +
> > > > +Digression: Offline Checking of Parent Pointers
> > > > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > > +
> > > > +Examining parent pointers in offline repair works differently
> > > > because corrupt
> > > > +files are erased long before directory tree connectivity checks
> > > > are
> > > > performed.
> > > > +Parent pointer checks are therefore a second pass to be added to
> > > > the
> > > > existing
> > > > +connectivity checks:
> > > > +
> > > > +1. After the set of surviving files has been established (i.e.
> > > > phase
> > > > 6),
> > > > +   walk the surviving directories of each AG in the filesystem.
> > > > +   This is already performed as part of the connectivity checks.
> > > > +
> > > > +2. For each directory entry found, record the name in an xfblob,
> > > > and
> > > > store
> > > > +   ``(child_ag_inum, parent_inum, parent_gen, dirent_pos)``
> > > > tuples
> > > > in a
> > > > +   per-AG in-memory slab.
> > > > +
> > > > +3. For each AG in the filesystem,
> > > > +
> > > > +   a. Sort the per-AG tuples in order of child_ag_inum,
> > > > parent_inum,
> > > > and
> > > > +      dirent_pos.
> > > > +
> > > > +   b. For each inode in the AG,
> > > > +
> > > > +      1. Scan the inode for parent pointers.
> > > > +         Record the names in a per-file xfblob, and store
> > > > ``(parent_inum,
> > > > +         parent_gen, dirent_pos)`` tuples in a per-file slab.
> > > > +
> > > > +      2. Sort the per-file tuples in order of parent_inum, and
> > > > dirent_pos.
> > > > +
> > > > +      3. Position one slab cursor at the start of the inode's
> > > > records in the
> > > > +         per-AG tuple slab.
> > > > +         This should be trivial since the per-AG tuples are in
> > > > child
> > > > inumber
> > > > +         order.
> > > > +
> > > > +      4. Position a second slab cursor at the start of the per-
> > > > file
> > > > tuple slab.
> > > > +
> > > > +      5. Iterate the two cursors in lockstep, comparing the
> > > > parent_ino and
> > > > +         dirent_pos fields of the records under each cursor.
> > > > +
> > > > +         a. Tuples in the per-AG list but not the per-file list
> > > > are
> > > > missing and
> > > > +            need to be written to the inode.
> > > > +
> > > > +         b. Tuples in the per-file list but not the per-AG list
> > > > are
> > > > dangling
> > > > +            and need to be removed from the inode.
> > > > +
> > > > +         c. For tuples in both lists, update the parent_gen and
> > > > name
> > > > components
> > > > +            of the parent pointer if necessary.
> > > > +
> > > > +4. Move on to examining link counts, as we do today.
> > > > +
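Side note for anyone following along: step 5 is a standard lockstep merge
over two sorted lists.  A generic sketch, with entirely hypothetical record
and cursor helpers, might read:

struct pptr_rec {
	unsigned long long	parent_ino;
	unsigned long long	dirent_pos;
	/* parent_gen, name cookie, ... */
};
struct cursor;

/* Hypothetical slab-cursor helpers, stand-ins for the real iterators. */
const struct pptr_rec *cursor_peek(struct cursor *c);	/* NULL at end */
void cursor_next(struct cursor *c);
void add_missing_pptr(const struct pptr_rec *ag_rec);		/* case 5a */
void remove_dangling_pptr(const struct pptr_rec *file_rec);	/* case 5b */
void update_pptr_if_stale(const struct pptr_rec *ag_rec,
			  const struct pptr_rec *file_rec);	/* case 5c */

/* Order records by (parent_ino, dirent_pos), matching the sort above. */
static int pptr_cmp(const struct pptr_rec *a, const struct pptr_rec *b)
{
	if (a->parent_ino != b->parent_ino)
		return a->parent_ino < b->parent_ino ? -1 : 1;
	if (a->dirent_pos != b->dirent_pos)
		return a->dirent_pos < b->dirent_pos ? -1 : 1;
	return 0;
}

static void compare_pptrs(struct cursor *ag_cur, struct cursor *file_cur)
{
	for (;;) {
		const struct pptr_rec *ag = cursor_peek(ag_cur);
		const struct pptr_rec *fi = cursor_peek(file_cur);
		int cmp;

		if (!ag && !fi)
			break;
		if (!fi)
			cmp = -1;	/* only per-AG records remain */
		else if (!ag)
			cmp = 1;	/* only per-file records remain */
		else
			cmp = pptr_cmp(ag, fi);

		if (cmp < 0) {
			add_missing_pptr(ag);
			cursor_next(ag_cur);
		} else if (cmp > 0) {
			remove_dangling_pptr(fi);
			cursor_next(file_cur);
		} else {
			update_pptr_if_stale(ag, fi);
			cursor_next(ag_cur);
			cursor_next(file_cur);
		}
	}
}
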
> > > > +The proposed patchset is the
> > > > +`offline parent pointers repair
> > > > +<
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.g
> > > > it/log/?h=pptrs-repair>`_
> > > > +series.
> > > > +
> > > > +Rebuilding directories from parent pointers in offline repair is
> > > > very
> > > > +challenging because it currently uses a single-pass scan of the
> > > > filesystem
> > > > +during phase 3 to decide which files are corrupt enough to be
> > > > zapped.
> > > > +This scan would have to be converted into a multi-pass scan:
> > > > +
> > > > +1. The first pass of the scan zaps corrupt inodes, forks, and
> > > > attributes
> > > > +   much as it does now.
> > > > +   Corrupt directories are noted but not zapped.
> > > > +
> > > > +2. The next pass records parent pointers pointing to the
> > > > directories
> > > > noted
> > > > +   as being corrupt in the first pass.
> > > > +   This second pass may have to happen after the phase 4 scan
> > > > for
> > > > duplicate
> > > > +   blocks, if phase 4 is also capable of zapping directories.
> > > > +
> > > > +3. The third pass resets corrupt directories to an empty
> > > > shortform
> > > > directory.
> > > > +   Free space metadata has not been ensured yet, so repair
> > > > cannot
> > > > yet use the
> > > > +   directory building code in libxfs.
> > > > +
> > > > +4. At the start of phase 6, space metadata have been rebuilt.
> > > > +   Use the parent pointer information recorded during step 2 to
> > > > reconstruct
> > > > +   the dirents and add them to the now-empty directories.
> > > > +
> > > > +This code has not yet been constructed.
> > > > +
> > > > +.. _orphanage:
> > > > +
> > > > +The Orphanage
> > > > +-------------
> > > > +
> > > > +Filesystems present files as a directed, and hopefully acyclic,
> > > > graph.
> > > > +In other words, a tree.
> > > > +The root of the filesystem is a directory, and each entry in a
> > > > directory points
> > > > +downwards either to more subdirectories or to non-directory
> > > > files.
> > > > +Unfortunately, a disruption in the directory graph pointers
> > > > results
> > > > in a
> > > > +disconnected graph, which makes files impossible to access via
> > > > regular path
> > > > +resolution.
> > > > +The directory parent pointer online scrub code can detect a
> > > > dotdot
> > > > entry
> > > > +pointing to a parent directory that doesn't have a link back to
> > > > the
> > > > child
> > > > +directory, and the file link count checker can detect a file
> > > > that
> > > > isn't pointed
> > > > +to by any directory in the filesystem.
> > > > +If the file in question has a positive link count, the file in
> > > > question is an
> > > > +orphan.
> > > 
> > > Hmm, I kinda felt like this should have flowed into something like:
> > > "now that we have parent pointers, we can reparent them instead of
> > > putting them in the orphanage..."
> > 
> > That's only true if we actually *find* the relevant forward or back
> > pointers.  If a file has positive link count but there aren't any
> > links
> > to it from anywhere, we still have to dump it in the /lost+found.
> > 
> > Parent pointers make it a lot less likely that we'll have to put a
> > file
> > in the /lost+found, but it's still possible.
> > 
> > I think I'll change this paragraph to start:
> > 
> > "Without parent pointers, the directory parent pointer online scrub
> > code
> > can detect a dotdot entry pointing to a parent directory..."
> > 
> > and then add a new paragraph:
> > 
> > "With parent pointers, directories can be rebuilt by scanning parent
> > pointers and parent pointers can be rebuilt by scanning directories.
> > This should reduce the incidence of files ending up in
> > ``/lost+found``."
> I see, ok i think that sounds good then.

<nod>

--D

> Allison
> > 
> > > ?
> > > > +
> > > > +When orphans are found, they should be reconnected to the
> > > > directory
> > > > tree.
> > > > +Offline fsck solves the problem by creating a directory
> > > > ``/lost+found`` to
> > > > +serve as an orphanage, and linking orphan files into the
> > > > orphanage
> > > > by using the
> > > > +inumber as the name.
> > > > +Reparenting a file to the orphanage does not reset any of its
> > > > permissions or
> > > > +ACLs.
> > > > +
> > > > +This process is more involved in the kernel than it is in
> > > > userspace.
> > > > +The directory and file link count repair setup functions must
> > > > use
> > > > the regular
> > > > +VFS mechanisms to create the orphanage directory with all the
> > > > necessary
> > > > +security attributes and dentry cache entries, just like a
> > > > regular
> > > > directory
> > > > +tree modification.
> > > > +
> > > > +Orphaned files are adopted by the orphanage as follows:
> > > > +
> > > > +1. Call ``xrep_orphanage_try_create`` at the start of the scrub
> > > > setup function
> > > > +   to try to ensure that the lost and found directory actually
> > > > exists.
> > > > +   This also attaches the orphanage directory to the scrub
> > > > context.
> > > > +
> > > > +2. If the decision is made to reconnect a file, take the IOLOCK
> > > > of
> > > > both the
> > > > +   orphanage and the file being reattached.
> > > > +   The ``xrep_orphanage_iolock_two`` function follows the inode
> > > > locking
> > > > +   strategy discussed earlier.
> > > > +
> > > > +3. Call ``xrep_orphanage_compute_blkres`` and
> > > > ``xrep_orphanage_compute_name``
> > > > +   to compute the new name in the orphanage and the block
> > > > reservation required.
> > > > +
> > > > +4. Use ``xrep_orphanage_adoption_prep`` to reserve resources to
> > > > the
> > > > repair
> > > > +   transaction.
> > > > +
> > > > +5. Call ``xrep_orphanage_adopt`` to reparent the orphaned file
> > > > into
> > > > the lost
> > > > +   and found, and update the kernel dentry cache.
> > > > +
> > > > +The proposed patches are in the
> > > > +`orphanage adoption
> > > > +<
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > > > log/?h=repair-orphanage>`_
> > > > +series.
> > > 
> > > Certainly we'll need to come back and update all the parts that
> > > would
> > > be affected by the RFC, but otherwise looks ok.  It seems trying to
> > > document code before it's written tends to cause things to go
> > > around
> > > for a while, since we really just can't know how stable a design is
> > > until it's been through at least a few prototypes.
> > 
> > Agreed!
> > 
> > --D
> > 
> > > Allison
> 

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 13/14] xfs: document the userspace fsck driver program
  2023-03-03 23:51         ` Allison Henderson
@ 2023-03-04  2:25           ` Darrick J. Wong
  0 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2023-03-04  2:25 UTC (permalink / raw)
  To: Allison Henderson
  Cc: Catherine Hoang, david, willy, linux-xfs, Chandan Babu,
	linux-fsdevel, hch

On Fri, Mar 03, 2023 at 11:51:02PM +0000, Allison Henderson wrote:
> On Wed, 2023-03-01 at 16:27 -0800, Darrick J. Wong wrote:
> > On Wed, Mar 01, 2023 at 05:36:59AM +0000, Allison Henderson wrote:
> > > On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <djwong@kernel.org>
> > > > 
> > > > Add the sixth chapter of the online fsck design documentation,
> > > > where
> > > > we discuss the details of the data structures and algorithms used
> > > > by
> > > > the
> > > > driver program xfs_scrub.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > > ---
> > > >  .../filesystems/xfs-online-fsck-design.rst         |  313
> > > > ++++++++++++++++++++
> > > >  1 file changed, 313 insertions(+)
> > > > 
> > > > 
> > > > diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst
> > > > b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > > index 2e20314f1831..05b9411fac7f 100644
> > > > --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> > > > +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > > @@ -300,6 +300,9 @@ The seven phases are as follows:
> > > >  7. Re-check the summary counters and presents the caller with a
> > > > summary of
> > > >     space usage and file counts.
> > > >  
> > > > +This allocation of responsibilities will be :ref:`revisited
> > > > <scrubcheck>`
> > > > +later in this document.
> > > > +
> > > >  Steps for Each Scrub Item
> > > >  -------------------------
> > > >  
> > > > @@ -4505,3 +4508,313 @@ The proposed patches are in the
> > > >  `orphanage adoption
> > > >  <
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > > > log/?h=repair-orphanage>`_
> > > >  series.
> > > > +
> > > > +6. Userspace Algorithms and Data Structures
> > > > +===========================================
> > > > +
> > > > +This section discusses the key algorithms and data structures of
> > > > the
> > > > userspace
> > > > +program, ``xfs_scrub``, that provide the ability to drive
> > > > metadata
> > > > checks and
> > > > +repairs in the kernel, verify file data, and look for other
> > > > potential problems.
> > > > +
> > > > +.. _scrubcheck:
> > > > +
> > > > +Checking Metadata
> > > > +-----------------
> > > > +
> > > > +Recall the :ref:`phases of fsck work<scrubphases>` outlined
> > > > earlier.
> > > > +That structure follows naturally from the data dependencies
> > > > designed
> > > > into the
> > > > +filesystem from its beginnings in 1993.
> > > > +In XFS, there are several groups of metadata dependencies:
> > > > +
> > > > +a. Filesystem summary counts depend on consistency within the
> > > > inode
> > > > indices,
> > > > +   the allocation group space btrees, and the realtime volume
> > > > space
> > > > +   information.
> > > > +
> > > > +b. Quota resource counts depend on consistency within the quota
> > > > file
> > > > data
> > > > +   forks, inode indices, inode records, and the forks of every
> > > > file
> > > > on the
> > > > +   system.
> > > > +
> > > > +c. The naming hierarchy depends on consistency within the
> > > > directory
> > > > and
> > > > +   extended attribute structures.
> > > > +   This includes file link counts.
> > > > +
> > > > +d. Directories, extended attributes, and file data depend on
> > > > consistency within
> > > > +   the file forks that map directory and extended attribute data
> > > > to
> > > > physical
> > > > +   storage media.
> > > > +
> > > > +e. The file forks depend on consistency within inode records
> > > > and
> > > > the space
> > > > +   metadata indices of the allocation groups and the realtime
> > > > volume.
> > > > +   This includes quota and realtime metadata files.
> > > > +
> > > > +f. Inode records depend on consistency within the inode
> > > > metadata
> > > > indices.
> > > > +
> > > > +g. Realtime space metadata depend on the inode records and data
> > > > forks of the
> > > > +   realtime metadata inodes.
> > > > +
> > > > +h. The allocation group metadata indices (free space, inodes,
> > > > reference count,
> > > > +   and reverse mapping btrees) depend on consistency within the
> > > > AG
> > > > headers and
> > > > +   between all the AG metadata btrees.
> > > > +
> > > > +i. ``xfs_scrub`` depends on the filesystem being mounted and
> > > > kernel
> > > > support
> > > > +   for online fsck functionality.
> > > > +
> > > > +Therefore, a metadata dependency graph is a convenient way to
> > > > schedule checking
> > > > +operations in the ``xfs_scrub`` program:
> > > > +
> > > > +- Phase 1 checks that the provided path maps to an XFS
> > > > filesystem
> > > > and detects
> > > > +  the kernel's scrubbing abilities, which validates group (i).
> > > > +
> > > > +- Phase 2 scrubs groups (g) and (h) in parallel using a threaded
> > > > workqueue.
> > > > +
> > > > +- Phase 3 checks groups (f), (e), and (d), in that order.
> > > > +  These groups are all file metadata, which means that inodes
> > > > are
> > > > scanned in
> > > > +  parallel.
> > > ...When things are done in order, then they are done in serial
> > > right?
> > > Things done in parallel are done at the same time.  Either the
> > > phrase
> > > "in that order" needs to go away, or the last line needs to drop
> > 
> > Each inode is processed in parallel, but individual inodes are
> > processed
> > in f-e-d order.
> > 
> > "Phase 3 scans inodes in parallel.  For each inode, groups (f), (e),
> > and
> > (d) are checked, in that order."
> Ohh, ok.  Now that I re-read it, it makes sense but lets keep the new
> one
> 
> > 
> > > > +
> > > > +- Phase 4 repairs everything in groups (i) through (d) so that
> > > > phases 5 and 6
> > > > +  may run reliably.
> > > > +
> > > > +- Phase 5 starts by checking groups (b) and (c) in parallel
> > > > before
> > > > moving on
> > > > +  to checking names.
> > > > +
> > > > +- Phase 6 depends on groups (i) through (b) to find file data
> > > > blocks
> > > > to verify,
> > > > +  to read them, and to report which blocks of which files are
> > > > affected.
> > > > +
> > > > +- Phase 7 checks group (a), having validated everything else.
> > > > +
> > > > +Notice that the data dependencies between groups are enforced by
> > > > the
> > > > structure
> > > > +of the program flow.
> > > > +
> > > > +Parallel Inode Scans
> > > > +--------------------
> > > > +
> > > > +An XFS filesystem can easily contain hundreds of millions of
> > > > inodes.
> > > > +Given that XFS targets installations with large high-performance
> > > > storage,
> > > > +it is desirable to scrub inodes in parallel to minimize runtime,
> > > > particularly
> > > > +if the program has been invoked manually from a command line.
> > > > +This requires careful scheduling to keep the threads as evenly
> > > > loaded as
> > > > +possible.
> > > > +
> > > > +Early iterations of the ``xfs_scrub`` inode scanner naïvely
> > > > created
> > > > a single
> > > > +workqueue and scheduled a single workqueue item per AG.
> > > > +Each workqueue item walked the inode btree (with
> > > > ``XFS_IOC_INUMBERS``) to find
> > > > +inode chunks and then called bulkstat (``XFS_IOC_BULKSTAT``) to
> > > > gather enough
> > > > +information to construct file handles.
> > > > +The file handle was then passed to a function to generate scrub
> > > > items for each
> > > > +metadata object of each inode.
> > > > +This simple algorithm leads to thread balancing problems in
> > > > phase 3
> > > > if the
> > > > +filesystem contains one AG with a few large sparse files and the
> > > > rest of the
> > > > +AGs contain many smaller files.
> > > > +The inode scan dispatch function was not sufficiently granular;
> > > > it
> > > > should have
> > > > +been dispatching at the level of individual inodes, or, to
> > > > constrain
> > > > memory
> > > > +consumption, inode btree records.
> > > > +
> > > > +Thanks to Dave Chinner, bounded workqueues in userspace enable
> > > > ``xfs_scrub`` to
> > > > +avoid this problem with ease by adding a second workqueue.
> > > > +Just like before, the first workqueue is seeded with one
> > > > workqueue
> > > > item per AG,
> > > > +and it uses INUMBERS to find inode btree chunks.
> > > > +The second workqueue, however, is configured with an upper bound
> > > > on
> > > > the number
> > > > +of items that can be waiting to be run.
> > > > +Each inode btree chunk found by the first workqueue's workers
> > > > is
> > > > queued to the
> > > > +second workqueue, and it is this second workqueue that queries
> > > > BULKSTAT,
> > > > +creates a file handle, and passes it to a function to generate
> > > > scrub
> > > > items for
> > > > +each metadata object of each inode.
> > > > +If the second workqueue is too full, the workqueue add function
> > > > blocks the
> > > > +first workqueue's workers until the backlog eases.
> > > > +This doesn't completely solve the balancing problem, but reduces
> > > > it
> > > > enough to
> > > > +move on to more pressing issues.
> > > > +
> > > > +The proposed patchsets are the scrub
> > > > +`performance tweaks
> > > > +<
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.g
> > > > it/log/?h=scrub-performance-tweaks>`_
> > > > +and the
> > > > +`inode scan rebalance
> > > > +<
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.g
> > > > it/log/?h=scrub-iscan-rebalance>`_
> > > > +series.
> > > > +
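
The bounded hand-off described above is essentially a classic bounded
buffer sitting between the INUMBERS walkers and the BULKSTAT workers.
A rough sketch using plain POSIX primitives (the names and the
fixed-depth queue here are made up for illustration; the real xfs_scrub
uses its own workqueue implementation) follows:

#include <pthread.h>
#include <semaphore.h>
#include <stdint.h>

#define ICHUNK_QUEUE_DEPTH	64

/* Bounded queue of inode btree chunks awaiting BULKSTAT processing. */
struct ichunk_queue {
	uint64_t	items[ICHUNK_QUEUE_DEPTH];
	unsigned int	head, tail;
	pthread_mutex_t	lock;
	sem_t		free_slots;	/* producers block here when full */
	sem_t		used_slots;	/* consumers block here when empty */
};

static void ichunk_queue_init(struct ichunk_queue *q)
{
	q->head = q->tail = 0;
	pthread_mutex_init(&q->lock, NULL);
	sem_init(&q->free_slots, 0, ICHUNK_QUEUE_DEPTH);
	sem_init(&q->used_slots, 0, 0);
}

/* Called by the per-AG INUMBERS walkers; blocks if the backlog is full. */
static void ichunk_queue_add(struct ichunk_queue *q, uint64_t startino)
{
	sem_wait(&q->free_slots);
	pthread_mutex_lock(&q->lock);
	q->items[q->tail++ % ICHUNK_QUEUE_DEPTH] = startino;
	pthread_mutex_unlock(&q->lock);
	sem_post(&q->used_slots);
}

/* Called by the workers that BULKSTAT each chunk and emit scrub items. */
static uint64_t ichunk_queue_take(struct ichunk_queue *q)
{
	uint64_t startino;

	sem_wait(&q->used_slots);
	pthread_mutex_lock(&q->lock);
	startino = q->items[q->head++ % ICHUNK_QUEUE_DEPTH];
	pthread_mutex_unlock(&q->lock);
	sem_post(&q->free_slots);
	return startino;
}

The thread setup and the INUMBERS/BULKSTAT calls themselves are omitted;
the point is only that ichunk_queue_add() throttles the first workqueue
whenever the second one falls behind.
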
> > > > +.. _scrubrepair:
> > > > +
> > > > +Scheduling Repairs
> > > > +------------------
> > > > +
> > > > +During phase 2, corruptions and inconsistencies reported in any
> > > > AGI
> > > > header or
> > > > +inode btree are repaired immediately, because phase 3 relies on
> > > > proper
> > > > +functioning of the inode indices to find inodes to scan.
> > > > +Failed repairs are rescheduled to phase 4.
> > > > +Problems reported in any other space metadata are deferred to
> > > > phase
> > > > 4.
> > > > +Optimization opportunities are always deferred to phase 4, no
> > > > matter
> > > > their
> > > > +origin.
> > > > +
> > > > +During phase 3, corruptions and inconsistencies reported in any
> > > > part
> > > > of a
> > > > +file's metadata are repaired immediately if all space metadata
> > > > were
> > > > validated
> > > > +during phase 2.
> > > > +Repairs that fail or cannot be repaired immediately are
> > > > scheduled
> > > > for phase 4.
> > > > +
> > > > +In the original design of ``xfs_scrub``, it was thought that
> > > > repairs
> > > > would be
> > > > +so infrequent that the ``struct xfs_scrub_metadata`` objects
> > > > used to
> > > > +communicate with the kernel could also be used as the primary
> > > > object
> > > > to
> > > > +schedule repairs.
> > > > +With recent increases in the number of optimizations possible
> > > > for a
> > > > given
> > > > +filesystem object, it became much more memory-efficient to track
> > > > all
> > > > eligible
> > > > +repairs for a given filesystem object with a single repair item.
> > > > +Each repair item represents a single lockable object -- AGs,
> > > > metadata files,
> > > > +individual inodes, or a class of summary information.
> > > > +
> > > > +Phase 4 is responsible for scheduling a lot of repair work in as
> > > > quick a
> > > > +manner as is practical.
> > > > +The :ref:`data dependencies <scrubcheck>` outlined earlier still
> > > > apply, which
> > > > +means that ``xfs_scrub`` must try to complete the repair work
> > > > scheduled by
> > > > +phase 2 before trying repair work scheduled by phase 3.
> > > > +The repair process is as follows:
> > > > +
> > > > +1. Start a round of repair with a workqueue and enough workers
> > > > to
> > > > keep the CPUs
> > > > +   as busy as the user desires.
> > > > +
> > > > +   a. For each repair item queued by phase 2,
> > > > +
> > > > +      i.   Ask the kernel to repair everything listed in the
> > > > repair
> > > > item for a
> > > > +           given filesystem object.
> > > > +
> > > > +      ii.  Make a note if the kernel made any progress in
> > > > reducing
> > > > the number
> > > > +           of repairs needed for this object.
> > > > +
> > > > +      iii. If the object no longer requires repairs, revalidate
> > > > all
> > > > metadata
> > > > +           associated with this object.
> > > > +           If the revalidation succeeds, drop the repair item.
> > > > +           If not, requeue the item for more repairs.
> > > > +
> > > > +   b. If any repairs were made, jump back to 1a to retry all the
> > > > phase 2 items.
> > > > +
> > > > +   c. For each repair item queued by phase 3,
> > > > +
> > > > +      i.   Ask the kernel to repair everything listed in the
> > > > repair
> > > > item for a
> > > > +           given filesystem object.
> > > > +
> > > > +      ii.  Make a note if the kernel made any progress in
> > > > reducing
> > > > the number
> > > > +           of repairs needed for this object.
> > > > +
> > > > +      iii. If the object no longer requires repairs, revalidate
> > > > all
> > > > metadata
> > > > +           associated with this object.
> > > > +           If the revalidation succeeds, drop the repair item.
> > > > +           If not, requeue the item for more repairs.
> > > > +
> > > > +   d. If any repairs were made, jump back to 1c to retry all the
> > > > phase 3 items.
> > > > +
> > > > +2. If step 1 made any repair progress of any kind, jump back to
> > > > step
> > > > 1 to start
> > > > +   another round of repair.
> > > > +
> > > > +3. If there are items left to repair, run them all serially one
> > > > more
> > > > time.
> > > > +   Complain if the repairs were not successful, since this is
> > > > the
> > > > last chance
> > > > +   to repair anything.
> > > > +
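
Expressed as a toy program (the items and their "repairs" are simulated
counters; the real xfs_scrub drives each round from a workqueue and
revalidates each object after the kernel call), the retry logic above is
roughly:

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct repair_item {
	const char	*name;
	int		phase;		/* phase that queued this item: 2 or 3 */
	int		remaining;	/* simulated repairs still needed */
	bool		stuck;		/* simulate a repair that never succeeds */
};

/* Simulate one kernel repair call; returns true if any progress was made. */
static bool repair_item_attempt(struct repair_item *ri)
{
	if (ri->remaining == 0 || ri->stuck)
		return false;
	ri->remaining--;
	return true;
}

/* Steps 1a/1c: run every item queued by @phase once. */
static bool repair_round(struct repair_item *items, size_t nr, int phase)
{
	bool progress = false;

	for (size_t i = 0; i < nr; i++)
		if (items[i].phase == phase && repair_item_attempt(&items[i]))
			progress = true;
	return progress;
}

int main(void)
{
	struct repair_item items[] = {
		{ "AG 3 inobt",		2, 1, false },
		{ "AG 0 refcountbt",	2, 1, true  },	/* never succeeds */
		{ "inode 128 bmapbtd",	3, 2, false },
		{ "inode 129 dirents",	3, 1, false },
	};
	size_t nr = sizeof(items) / sizeof(items[0]);
	bool any;

	do {						/* step 2 */
		any = false;
		while (repair_round(items, nr, 2))	/* steps 1a-1b */
			any = true;
		while (repair_round(items, nr, 3))	/* steps 1c-1d */
			any = true;
	} while (any);

	/* step 3: one last serial pass, then complain about what's left */
	for (size_t i = 0; i < nr; i++) {
		repair_item_attempt(&items[i]);
		if (items[i].remaining)
			fprintf(stderr, "%s: could not be repaired\n",
				items[i].name);
	}
	return 0;
}
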
> > > > +Corruptions and inconsistencies encountered during phases 5 and
> > > > 7
> > > > are repaired
> > > > +immediately.
> > > > +Corrupt file data blocks reported by phase 6 cannot be recovered
> > > > by
> > > > the
> > > > +filesystem.
> > > > +
> > > > +The proposed patchsets are the
> > > > +`repair warning improvements
> > > > +<
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.g
> > > > it/log/?h=scrub-better-repair-warnings>`_,
> > > > +refactoring of the
> > > > +`repair data dependency
> > > > +<
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.g
> > > > it/log/?h=scrub-repair-data-deps>`_
> > > > +and
> > > > +`object tracking
> > > > +<
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.g
> > > > it/log/?h=scrub-object-tracking>`_,
> > > > +and the
> > > > +`repair scheduling
> > > > +<
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.g
> > > > it/log/?h=scrub-repair-scheduling>`_
> > > > +improvement series.
> > > > +
> > > > +Checking Names for Confusable Unicode Sequences
> > > > +-----------------------------------------------
> > > > +
> > > > +If ``xfs_scrub`` succeeds in validating the filesystem metadata
> > > > by
> > > > the end of
> > > > +phase 4, it moves on to phase 5, which checks for suspicious
> > > > looking
> > > > names in
> > > > +the filesystem.
> > > > +These names consist of the filesystem label, names in directory
> > > > entries, and
> > > > +the names of extended attributes.
> > > > +Like most Unix filesystems, XFS imposes the sparest of
> > > > constraints
> > > > on the
> > > > +contents of a name -- slashes and null bytes are not allowed in
> > > > directory
> > > > +entries; and null bytes are not allowed in extended attributes
> > > > and
> > > maybe say "standard user accessible extended attributes"
> > 
> > "userspace visible"?
> That's fine, mostly I meant to exclude parent pointers, but I've seen
> other ideas that talk about using xattrs to store binary metadata, so
> pptrs may not be the last to do this.

Yeah.  I think Andrey's fsverity mechanism is preparing to store merkle
tree data in the format:

   (merkle tree block number) -> (pile of hashes or whatever)

So there's more coming. :)

--D

> > 
> > I'll list-ify this too:
> > 
> > Like most Unix filesystems, XFS imposes the sparest of constraints on
> > the contents of a name:
> > 
> > - slashes and null bytes are not allowed in directory entries;
> > 
> > - null bytes are not allowed in userspace-visible extended
> > attributes;
> > 
> > - null bytes are not allowed in the filesystem label
> Ok, I think that works
> 
> > 
> > > > the
> > > > +filesystem label.
> > > > +Directory entries and attribute keys store the length of the
> > > > name
> > > > explicitly
> > > > +ondisk, which means that nulls are not name terminators.
> > > > +For this section, the term "naming domain" refers to any place
> > > > where
> > > > names are
> > > > +presented together -- all the names in a directory, or all the
> > > > attributes of a
> > > > +file.
> > > > +
> > > > +Although the Unix naming constraints are very permissive, the
> > > > reality of most
> > > > +modern-day Linux systems is that programs work with Unicode
> > > > character code
> > > > +points to support international languages.
> > > > +These programs typically encode those code points in UTF-8 when
> > > > interfacing
> > > > +with the C library because the kernel expects null-terminated
> > > > names.
> > > > +In the common case, therefore, names found in an XFS filesystem
> > > > are
> > > > actually
> > > > +UTF-8 encoded Unicode data.
> > > > +
> > > > +To maximize its expressiveness, the Unicode standard defines
> > > > separate code
> > > > +points for various characters that render similarly or
> > > > identically
> > > > in writing
> > > > +systems around the world.
> > > > +For example, the character "Cyrillic Small Letter A" U+0430 "а"
> > > > often renders
> > > > +identically to "Latin Small Letter A" U+0061 "a".
> > > 
> > > 
> > > > +
> > > > +The standard also permits characters to be constructed in
> > > > multiple
> > > > ways --
> > > > +either by using a defined code point, or by combining one code
> > > > point
> > > > with
> > > > +various combining marks.
> > > > +For example, the character "Angstrom Sign U+212B "Å" can also be
> > > > expressed
> > > > +as "Latin Capital Letter A" U+0041 "A" followed by "Combining
> > > > Ring
> > > > Above"
> > > > +U+030A "◌̊".
> > > > +Both sequences render identically.
> > > > +
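
A trivial demonstration of the point: both strings below render as "Å"
on a UTF-8 terminal, yet they are different byte sequences, which is why
the normalization described later is needed before names can be compared
(plain C, no libicu needed for this demonstration):

#include <stdio.h>
#include <string.h>

int main(void)
{
	const char *precomposed = "\xe2\x84\xab";	/* U+212B ANGSTROM SIGN */
	const char *decomposed  = "\x41\xcc\x8a";	/* U+0041 + U+030A */

	printf("precomposed: %s (%zu bytes)\n", precomposed, strlen(precomposed));
	printf("decomposed:  %s (%zu bytes)\n", decomposed, strlen(decomposed));
	printf("bytewise equal? %s\n",
	       strcmp(precomposed, decomposed) == 0 ? "yes" : "no");
	return 0;
}
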
> > > > +Like the standards that preceded it, Unicode also defines
> > > > various
> > > > control
> > > > +characters to alter the presentation of text.
> > > > +For example, the character "Right-to-Left Override" U+202E can
> > > > trick
> > > > some
> > > > +programs into rendering "moo\\xe2\\x80\\xaegnp.txt" as
> > > > "mootxt.png".
> > > > +A second category of rendering problems involves whitespace
> > > > characters.
> > > > +If the character "Zero Width Space" U+200B is encountered in a
> > > > file
> > > > name, the
> > > > +name will render identically to a name that does not have the
> > > > zero
> > > > width
> > > > +space.
> > > > +
> > > > +If two names within a naming domain have different byte
> > > > sequences
> > > > but render
> > > > +identically, a user may be confused by it.
> > > > +The kernel, in its indifference to upper level encoding schemes,
> > > > permits this.
> > > > +Most filesystem drivers persist the byte sequence names that are
> > > > given to them
> > > > +by the VFS.
> > > > +
> > > > +Techniques for detecting confusable names are explained in great
> > > > detail in
> > > > +sections 4 and 5 of the
> > > > +`Unicode Security Mechanisms
> > > > <https://unicode.org/reports/tr39/>`_
> > > > +document.
> > > I don't know that we need this much detail on character rendering. 
> > > I
> > > think the example above is enough to make the point that character
> > > strings can differ in binary, but render the same, so we need to
> > > deal
> > > with that.  So I think that's really all the justification we need
> > > for
> > > the NFD usage
> > 
> > I want to leave the link in, because TR39 is the canonical source for
> > information about confusability detection.  That is the location
> > where
> > the Unicode folks publish everything they currently know on the
> > topic.
> 
> Sure, maybe just keep the last line then.
> 
> Allison
> 
> > 
> > > > +``xfs_scrub``, when it detects UTF-8 encoding in use on a
> > > > system,
> > > > uses the
> > > When ``xfs_scrub`` detects UTF-8 encoding, it uses the...
> > 
> > Changed, thanks.
> > 
> > > > +Unicode normalization form NFD in conjunction with the
> > > > confusable
> > > > name
> > > > +detection component of
> > > > +`libicu <https://github.com/unicode-org/icu>`_
> > > > +to identify names within a directory or within a file's extended
> > > > attributes that
> > > > +could be confused for each other.
> > > > +Names are also checked for control characters, non-rendering
> > > > characters, and
> > > > +mixing of bidirectional characters.
> > > > +All of these potential issues are reported to the system
> > > > administrator during
> > > > +phase 5.
> > > > +
> > > > +Media Verification of File Data Extents
> > > > +---------------------------------------
> > > > +
> > > > +The system administrator can elect to initiate a media scan of
> > > > all
> > > > file data
> > > > +blocks.
> > > > +This scan occurs after validation of all filesystem metadata (except
> > > > for
> > > > the summary
> > > > +counters) as phase 6.
> > > > +The scan starts by calling ``FS_IOC_GETFSMAP`` to scan the
> > > > filesystem space map
> > > > +to find areas that are allocated to file data fork extents.
> > > > +Gaps between data fork extents that are smaller than 64k are
> > > > treated as if
> > > > +they were data fork extents to reduce the command setup
> > > > overhead.
> > > > +When the space map scan accumulates a region larger than 32MB, a
> > > > media
> > > > +verification request is sent to the disk as a directio read of
> > > > the
> > > > raw block
> > > > +device.
> > > > +
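
Reduced to its coalescing core (the FS_IOC_GETFSMAP loop and the
O_DIRECT reads of the raw block device are replaced here by a hard-coded
extent list and a printf), the policy in the paragraph above looks
roughly like this:

#include <stdint.h>
#include <stdio.h>

#define GAP_THRESHOLD	(64ULL << 10)	/* fold in gaps smaller than 64k */
#define REGION_LIMIT	(32ULL << 20)	/* flush a region once it hits 32MB */

struct extent { uint64_t start; uint64_t len; };	/* in bytes, sorted */

static void verify_region(uint64_t start, uint64_t len)
{
	/* the real code issues an O_DIRECT read against the block device */
	printf("media verify: start 0x%llx, %llu bytes\n",
	       (unsigned long long)start, (unsigned long long)len);
}

static void scan_data_extents(const struct extent *ext, size_t nr)
{
	uint64_t start = 0, end = 0;
	int have_region = 0;

	for (size_t i = 0; i < nr; i++) {
		if (!have_region) {
			start = ext[i].start;
			end = ext[i].start + ext[i].len;
			have_region = 1;
		} else if (ext[i].start - end < GAP_THRESHOLD) {
			/* treat the small gap as if it were file data */
			end = ext[i].start + ext[i].len;
		} else {
			verify_region(start, end - start);
			start = ext[i].start;
			end = ext[i].start + ext[i].len;
		}

		if (end - start >= REGION_LIMIT) {
			verify_region(start, end - start);
			have_region = 0;
		}
	}
	if (have_region)
		verify_region(start, end - start);
}

int main(void)
{
	const struct extent ext[] = {
		{ 0, 65536 }, { 69632, 65536 }, { 16777216, 1048576 },
	};

	scan_data_extents(ext, sizeof(ext) / sizeof(ext[0]));
	return 0;
}
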
> > > > +If the verification read fails, ``xfs_scrub`` retries with
> > > > single-
> > > > block reads
> > > > +to narrow down the failure to the specific region of the media and
> > > > records it.
> > > > +When it has finished issuing verification requests, it again
> > > > uses
> > > > the space
> > > > +mapping ioctl to map the recorded media errors back to metadata
> > > > structures
> > > > +and report what has been lost.
> > > > +For media errors in blocks owned by files, the lack of parent
> > > > pointers means
> > > > +that the entire filesystem must be walked to report the file
> > > > paths
> > > > and offsets
> > > > +corresponding to the media error.
> > > > 
> > > This last bit will need to be updated after we come to a decision
> > > with
> > > the rfc
> > 
> > I'll at least update it since this doc is now pretty deep into the
> > pptrs
> > stuff:
> > 
> > "For media errors in blocks owned by files, parent pointers can be
> > used
> > to construct file paths from inode numbers for user-friendly
> > reporting."
> > 
> > > Other than that, I think it looks pretty good.
> > 
> > Woot.
> > 
> > --D
> > 
> > > Allison
> > > 
> 

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 14/14] xfs: document future directions of online fsck
  2023-03-03 23:51         ` Allison Henderson
@ 2023-03-04  2:28           ` Darrick J. Wong
  0 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2023-03-04  2:28 UTC (permalink / raw)
  To: Allison Henderson
  Cc: Catherine Hoang, david, willy, linux-xfs, Chandan Babu,
	linux-fsdevel, hch

On Fri, Mar 03, 2023 at 11:51:05PM +0000, Allison Henderson wrote:
> On Wed, 2023-03-01 at 16:39 -0800, Darrick J. Wong wrote:
> > On Wed, Mar 01, 2023 at 05:37:19AM +0000, Allison Henderson wrote:
> > > On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <djwong@kernel.org>
> > > > 
> > > > Add the seventh and final chapter of the online fsck
> > > > documentation,
> > > > where we talk about future functionality that can tie in with the
> > > > functionality provided by the online fsck patchset.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > > ---
> > > >  .../filesystems/xfs-online-fsck-design.rst         |  155
> > > > ++++++++++++++++++++
> > > >  1 file changed, 155 insertions(+)
> > > > 
> > > > 
> > > > diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst
> > > > b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > > index 05b9411fac7f..41291edb02b9 100644
> > > > --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> > > > +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> > > > @@ -4067,6 +4067,8 @@ The extra flexibility enables several new
> > > > use
> > > > cases:
> > > >    (``FIEXCHANGE_RANGE``) to exchange the file contents, thereby
> > > > committing all
> > > >    of the updates to the original file, or none of them.
> > > >  
> > > > +.. _swapext_if_unchanged:
> > > > +
> > > >  - **Transactional file updates**: The same mechanism as above,
> > > > but
> > > > the caller
> > > >    only wants the commit to occur if the original file's contents
> > > > have not
> > > >    changed.
> > > > @@ -4818,3 +4820,156 @@ and report what has been lost.
> > > >  For media errors in blocks owned by files, the lack of parent
> > > > pointers means
> > > >  that the entire filesystem must be walked to report the file
> > > > paths
> > > > and offsets
> > > >  corresponding to the media error.
> > > > +
> > > > +7. Conclusion and Future Work
> > > > +=============================
> > > > +
> > > > +It is hoped that the reader of this document has followed the
> > > > designs laid out
> > > > +in this document and now has some familiarity with how XFS
> > > > performs
> > > > online
> > > > +rebuilding of its metadata indices, and how filesystem users can
> > > > interact with
> > > > +that functionality.
> > > > +Although the scope of this work is daunting, it is hoped that
> > > > this
> > > > guide will
> > > > +make it easier for code readers to understand what has been
> > > > built,
> > > > for whom it
> > > > +has been built, and why.
> > > > +Please feel free to contact the XFS mailing list with questions.
> > > > +
> > > > +FIEXCHANGE_RANGE
> > > > +----------------
> > > > +
> > > > +As discussed earlier, a second frontend to the atomic extent
> > > > swap
> > > > mechanism is
> > > > +a new ioctl call that userspace programs can use to commit
> > > > updates
> > > > to files
> > > > +atomically.
> > > > +This frontend has been out for review for several years now,
> > > > though
> > > > the
> > > > +necessary refinements to online repair and lack of customer
> > > > demand
> > > > mean that
> > > > +the proposal has not been pushed very hard.
> > 
> > Note: The "Extent Swapping with Regular User Files" section has moved
> > here.
> > 
> > > > +Vectorized Scrub
> > > > +----------------
> > > > +
> > > > +As it turns out, the :ref:`refactoring <scrubrepair>` of repair
> > > > items mentioned
> > > > +earlier was a catalyst for enabling a vectorized scrub system
> > > > call.
> > > > +Since 2018, the cost of making a kernel call has increased
> > > > considerably on some
> > > > +systems to mitigate the effects of speculative execution
> > > > attacks.
> > > > +This incentivizes program authors to make as few system calls as
> > > > possible to
> > > > +reduce the number of times an execution path crosses a security
> > > > boundary.
> > > > +
> > > > +With vectorized scrub, userspace pushes to the kernel the
> > > > identity
> > > > of a
> > > > +filesystem object, a list of scrub types to run against that
> > > > object,
> > > > and a
> > > > +simple representation of the data dependencies between the
> > > > selected
> > > > scrub
> > > > +types.
> > > > +The kernel executes as much of the caller's plan as it can until
> > > > it
> > > > hits a
> > > > +dependency that cannot be satisfied due to a corruption, and
> > > > tells
> > > > userspace
> > > > +how much was accomplished.
> > > > +It is hoped that ``io_uring`` will pick up enough of this
> > > > functionality that
> > > > +online fsck can use that instead of adding a separate vectored
> > > > scrub
> > > > system
> > > > +call to XFS.
> > > > +
> > > > +The relevant patchsets are the
> > > > +`kernel vectorized scrub
> > > > +<
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > > > log/?h=vectorized-scrub>`_
> > > > +and
> > > > +`userspace vectorized scrub
> > > > +<
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.g
> > > > it/log/?h=vectorized-scrub>`_
> > > > +series.
> > > > +
> > > > +Quality of Service Targets for Scrub
> > > > +------------------------------------
> > > > +
> > > > +One serious shortcoming of the online fsck code is that the
> > > > amount
> > > > of time that
> > > > +it can spend in the kernel holding resource locks is basically
> > > > unbounded.
> > > > +Userspace is allowed to send a fatal signal to the process which
> > > > will cause
> > > > +``xfs_scrub`` to exit when it reaches a good stopping point, but
> > > > there's no way
> > > > +for userspace to provide a time budget to the kernel.
> > > > +Given that the scrub codebase has helpers to detect fatal
> > > > signals,
> > > > it shouldn't
> > > > +be too much work to allow userspace to specify a timeout for a
> > > > scrub/repair
> > > > +operation and abort the operation if it exceeds budget.
> > > > +However, most repair functions have the property that once they
> > > > begin to touch
> > > > +ondisk metadata, the operation cannot be cancelled cleanly,
> > > > after
> > > > which a QoS
> > > > +timeout is no longer useful.
> > > > +
> > > > +Defragmenting Free Space
> > > > +------------------------
> > > > +
> > > > +Over the years, many XFS users have requested the creation of a
> > > > program to
> > > > +clear a portion of the physical storage underlying a filesystem
> > > > so
> > > > that it
> > > > +becomes a contiguous chunk of free space.
> > > > +Call this free space defragmenter ``clearspace`` for short.
> > > > +
> > > > +The first piece the ``clearspace`` program needs is the ability
> > > > to
> > > > read the
> > > > +reverse mapping index from userspace.
> > > > +This already exists in the form of the ``FS_IOC_GETFSMAP``
> > > > ioctl.
> > > > +The second piece it needs is a new fallocate mode
> > > > +(``FALLOC_FL_MAP_FREE_SPACE``) that allocates the free space in
> > > > a
> > > > region and
> > > > +maps it to a file.
> > > > +Call this file the "space collector" file.
> > > > +The third piece is the ability to force an online repair.
> > > > +
> > > > +To clear all the metadata out of a portion of physical storage,
> > > > clearspace
> > > > +uses the new fallocate map-freespace call to map any free space
> > > > in
> > > > that region
> > > > +to the space collector file.
> > > > +Next, clearspace finds all metadata blocks in that region by way
> > > > of
> > > > +``GETFSMAP`` and issues forced repair requests on the data
> > > > structure.
> > > > +This often results in the metadata being rebuilt somewhere that
> > > > is
> > > > not being
> > > > +cleared.
> > > > +After each relocation, clearspace calls the "map free space"
> > > > function again to
> > > > +collect any newly freed space in the region being cleared.
> > > > +
> > > > +To clear all the file data out of a portion of the physical
> > > > storage,
> > > > clearspace
> > > > +uses the FSMAP information to find relevant file data blocks.
> > > > +Having identified a good target, it uses the ``FICLONERANGE``
> > > > call
> > > > on that part
> > > > +of the file to try to share the physical space with a dummy
> > > > file.
> > > > +Cloning the extent means that the original owners cannot
> > > > overwrite
> > > > the
> > > > +contents; any changes will be written somewhere else via copy-
> > > > on-
> > > > write.
> > > > +Clearspace makes its own copy of the frozen extent in an area
> > > > that
> > > > is not being
> > > > +cleared, and uses ``FIDEDUPERANGE`` (or the :ref:`atomic extent
> > > > swap
> > > > +<swapext_if_unchanged>` feature) to change the target file's
> > > > data
> > > > extent
> > > > +mapping away from the area being cleared.
> > > > +When all other mappings have been moved, clearspace reflinks the
> > > > space into the
> > > > +space collector file so that it becomes unavailable.
> > > > +
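
The "freeze" step above can be demonstrated with the existing
FICLONERANGE interface alone.  This sketch merely shares one range of a
victim file into a dummy file so the original blocks can no longer be
overwritten in place; it leaves out the private copy, the
FIDEDUPERANGE/atomic swap step, and everything else clearspace would
need (including the proposed FALLOC_FL_MAP_FREE_SPACE mode, which does
not exist yet):

#include <fcntl.h>
#include <linux/fs.h>		/* FICLONERANGE, struct file_clone_range */
#include <stdio.h>
#include <sys/ioctl.h>

/*
 * Share @length bytes at @offset of the victim file with a dummy file.
 * Once the blocks are shared, any write through the victim must be
 * redirected elsewhere via copy on write, so the old physical extent
 * stays stable while it is being relocated.
 */
static int freeze_extent(int victim_fd, int dummy_fd,
			 long long offset, long long length)
{
	struct file_clone_range fcr = {
		.src_fd		= victim_fd,
		.src_offset	= offset,
		.src_length	= length,
		.dest_offset	= offset,
	};

	return ioctl(dummy_fd, FICLONERANGE, &fcr);
}

int main(int argc, char *argv[])
{
	int victim, dummy;

	if (argc != 3) {
		fprintf(stderr, "usage: %s victimfile dummyfile\n", argv[0]);
		return 1;
	}
	victim = open(argv[1], O_RDONLY);
	dummy = open(argv[2], O_RDWR | O_CREAT, 0600);
	if (victim < 0 || dummy < 0) {
		perror("open");
		return 1;
	}
	/* share the first 1MiB; both files must be on the same filesystem */
	if (freeze_extent(victim, dummy, 0, 1 << 20)) {
		perror("FICLONERANGE");
		return 1;
	}
	return 0;
}
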
> > > > +There are further optimizations that could apply to the above
> > > > algorithm.
> > > > +To clear a piece of physical storage that has a high sharing
> > > > factor,
> > > > it is
> > > > +strongly desirable to retain this sharing factor.
> > > > +In fact, these extents should be moved first to maximize sharing
> > > > factor after
> > > > +the operation completes.
> > > > +To make this work smoothly, clearspace needs a new ioctl
> > > > +(``FS_IOC_GETREFCOUNTS``) to report reference count information
> > > > to
> > > > userspace.
> > > > +With the refcount information exposed, clearspace can quickly
> > > > find
> > > > the longest,
> > > > +most shared data extents in the filesystem, and target them
> > > > first.
> > > > +
> > > 
> > > 
> > > > +**Question**: How might the filesystem move inode chunks?
> > > > +
> > > > +*Answer*: 
> > > "In order to move inode chunks.."
> > 
> > Done.
> > 
> > > > Dave Chinner has a prototype that creates a new file with the old
> > > > +contents and then locklessly runs around the filesystem updating
> > > > directory
> > > > +entries.
> > > > +The operation cannot complete if the filesystem goes down.
> > > > +That problem isn't totally insurmountable: create an inode
> > > > remapping
> > > > table
> > > > +hidden behind a jump label, and a log item that tracks the
> > > > kernel
> > > > walking the
> > > > +filesystem to update directory entries.
> > > > +The trouble is, the kernel can't do anything about open files,
> > > > since
> > > > it cannot
> > > > +revoke them.
> > > > +
> > > 
> > > 
> > > > +**Question**: Can static keys be used to add a revoke bailout
> > > > return
> > > > to
> > > > +*every* code path coming in from userspace?
> > > > +
> > > > +*Answer*: In principle, yes.
> > > > +This 
> > > 
> > > "It is also possible to use static keys to add a revoke bailout
> > > return
> > > to each code path coming in from userspace.  This..."
> > 
> > I think this change would make the answer redundant with the
> > question.
> Sorry, I meant for the quotations to replace everything between the
> line breaks.  So from Q through the answer, just to break out of the
> Q&A format.
> 
> I sort of feel like if a document leaves the reader with questions that
> they didn't have before they started reading, then ideally we should
> simply just incorporate the answer in the document.  Just makes the
> read easier imho.

Oh, I see.  Let me think about that over the weekend.  These are all
highly speculative questions about prototype code that nobody's really
worked through yet, so they need to make it clear that we're not talking
about anything close to future features.

--D

> > 
> > "Can static keys be used to minimize the runtime cost of supporting
> > ``revoke()`` on XFS files?"
> > 
> > "Yes.  Until the first revocation, the bailout code need not be in
> > the
> > call path at all."
> 
> That's an implied Q&A format, but I suppose it's not a big deal either
> way though.
> 
> > 
> > > > would eliminate the overhead of the check until a revocation
> > > > happens.
> > > > +It's not clear what we do to a revoked file after all the
> > > > callers
> > > > are finished
> > > > +with it, however.
> > > > +
> > > > +The relevant patchsets are the
> > > > +`kernel freespace defrag
> > > > +<
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/
> > > > log/?h=defrag-freespace>`_
> > > > +and
> > > > +`userspace freespace defrag
> > > > +<
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.g
> > > > it/log/?h=defrag-freespace>`_
> > > > +series.
> > > 
> > > I guess since they're just future ideas just light documentation is
> > > fine.  Other than cleaning out the Q & A's, I think it looks pretty
> > > good.
> > 
> > Ok.  Thank you x100000000 for being the first person to publicly
> > comment
> > on the entire document!
> 
> Sure, glad to help!  :-)
> 
> Allison
> 
> > 
> > --D
> > 
> > > Allison
> > > 
> > > > +
> > > > +Shrinking Filesystems
> > > > +---------------------
> > > > +
> > > > +Removing the end of the filesystem ought to be a simple matter
> > > > of
> > > > evacuating
> > > > +the data and metadata at the end of the filesystem, and handing
> > > > the
> > > > freed space
> > > > +to the shrink code.
> > > > +That requires an evacuation of the space at the end of the
> > > > filesystem,
> > > > which is a
> > > > +use of free space defragmentation!
> > > > 
> > > 
> 

^ permalink raw reply	[flat|nested] 220+ messages in thread

* [PATCHSET v24.3 00/14] xfs: design documentation for online fsck
  2022-12-30 22:10 ` [PATCHSET v24.0 00/14] xfs: design documentation for online fsck Darrick J. Wong
                     ` (14 preceding siblings ...)
  2023-03-07  1:30   ` [PATCHSET v24.3 00/14] xfs: design documentation for online fsck Darrick J. Wong
@ 2023-03-07  1:30   ` Darrick J. Wong
  2023-03-07  1:30     ` [PATCH 01/14] xfs: document the motivation for online fsck design Darrick J. Wong
                       ` (13 more replies)
  15 siblings, 14 replies; 220+ messages in thread
From: Darrick J. Wong @ 2023-03-07  1:30 UTC (permalink / raw)
  To: djwong
  Cc: Allison Henderson, linux-xfs, willy, chandan.babu,
	allison.henderson, linux-fsdevel, hch, catherine.hoang, david

Hi all,

To prepare the XFS community and potential patch reviewers for the
upstream submission of the online fsck feature, I decided to write a
document capturing the broader picture behind the online repair
development effort.  The document begins by defining the problems that
online fsck aims to solve and outlining specific use cases for the
functionality.

Using that as a base, the rest of the design document presents the high
level algorithms that fulfill the goals set out at the start and the
interactions between the large pieces of the system.  Case studies round
out the design documentation by adding the details of exactly how
specific parts of the online fsck code integrate the algorithms with the
filesystem.

The goal of this effort is to help the XFS community understand how the
gigantic online repair patchset works.  The questions I submit to the
community reviewers are:

1. As you read the design doc (and later the code), do you feel that you
   understand what's going on well enough to try to fix a bug if you
   found one?

2. What sorts of interactions between systems (or between scrub and the
   rest of the kernel) am I missing?

3. Do you feel confident enough in the implementation as it is now that
   the benefits of merging the feature (as EXPERIMENTAL) outweigh any
   potential disruptions to XFS at large?

4. Are there problematic interactions between subsystems that ought to
   be cleared up before merging?

5. Can I just merge all of this?

I intend to commit this document to the kernel's documentation directory
when we start merging the patchset, albeit without the links to
git.kernel.org.  A much more readable version of this is posted at:
https://djwong.org/docs/xfs-online-fsck-design/

v2: add missing sections about: all the in-kernel data structures and
    new apis that the scrub and repair functions use; how xattrs and
    directories are checked; how space btree records are checked; and
    add more details to the parts where all these bits tie together.
    Proofread for verb tense inconsistencies and eliminate vague 'we'
    usage.  Move all the discussion of what we can do with pageable
    kernel memory into a single source file and section.  Document where
    log incompat feature locks fit into the locking model.

v3: resync with 6.0, fix a few typos, begin discussion of the merging
    plan for this megapatchset.  Bump to v24 to match the kernel code

v24.3: add review comments from Allison Henderson

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=docs-online-fsck-design
---
 Documentation/filesystems/index.rst                |    1 
 .../filesystems/xfs-online-fsck-design.rst         | 5315 ++++++++++++++++++++
 .../filesystems/xfs-self-describing-metadata.rst   |    1 
 3 files changed, 5317 insertions(+)
 create mode 100644 Documentation/filesystems/xfs-online-fsck-design.rst


^ permalink raw reply	[flat|nested] 220+ messages in thread

* [PATCH 01/14] xfs: document the motivation for online fsck design
  2023-03-07  1:30   ` Darrick J. Wong
@ 2023-03-07  1:30     ` Darrick J. Wong
  2023-03-07  1:31     ` [PATCH 02/14] xfs: document the general theory underlying " Darrick J. Wong
                       ` (12 subsequent siblings)
  13 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2023-03-07  1:30 UTC (permalink / raw)
  To: djwong
  Cc: linux-xfs, willy, chandan.babu, allison.henderson, linux-fsdevel,
	hch, catherine.hoang, david

From: Darrick J. Wong <djwong@kernel.org>

Start the first chapter of the online fsck design documentation.
This covers the motivations for creating this in the first place.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 Documentation/filesystems/index.rst                |    1 
 .../filesystems/xfs-online-fsck-design.rst         |  212 ++++++++++++++++++++
 2 files changed, 213 insertions(+)
 create mode 100644 Documentation/filesystems/xfs-online-fsck-design.rst


diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst
index bee63d42e5ec..fbb2b5ada95b 100644
--- a/Documentation/filesystems/index.rst
+++ b/Documentation/filesystems/index.rst
@@ -123,4 +123,5 @@ Documentation for filesystem implementations.
    vfat
    xfs-delayed-logging-design
    xfs-self-describing-metadata
+   xfs-online-fsck-design
    zonefs
diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst
new file mode 100644
index 000000000000..07c7b4cde18f
--- /dev/null
+++ b/Documentation/filesystems/xfs-online-fsck-design.rst
@@ -0,0 +1,212 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. _xfs_online_fsck_design:
+
+..
+        Mapping of heading styles within this document:
+        Heading 1 uses "====" above and below
+        Heading 2 uses "===="
+        Heading 3 uses "----"
+        Heading 4 uses "````"
+        Heading 5 uses "^^^^"
+        Heading 6 uses "~~~~"
+        Heading 7 uses "...."
+
+        Sections are manually numbered because apparently that's what everyone
+        does in the kernel.
+
+======================
+XFS Online Fsck Design
+======================
+
+This document captures the design of the online filesystem check feature for
+XFS.
+The purpose of this document is threefold:
+
+- To help kernel distributors understand exactly what the XFS online fsck
+  feature is, and issues about which they should be aware.
+
+- To help people reading the code to familiarize themselves with the relevant
+  concepts and design points before they start digging into the code.
+
+- To help developers maintaining the system by capturing the reasons
+  supporting higher level decision making.
+
+As the online fsck code is merged, the links in this document to topic branches
+will be replaced with links to code.
+
+This document is licensed under the terms of the GNU Public License, v2.
+The primary author is Darrick J. Wong.
+
+This design document is split into seven parts.
+Part 1 defines what fsck tools are and the motivations for writing a new one.
+Parts 2 and 3 present a high level overview of how the online fsck process works
+and how it is tested to ensure correct functionality.
+Part 4 discusses the user interface and the intended usage modes of the new
+program.
+Parts 5 and 6 show off the high level components and how they fit together, and
+then present case studies of how each repair function actually works.
+Part 7 sums up what has been discussed so far and speculates about what else
+might be built atop online fsck.
+
+.. contents:: Table of Contents
+   :local:
+
+1. What is a Filesystem Check?
+==============================
+
+A Unix filesystem has four main responsibilities:
+
+- Provide a hierarchy of names through which application programs can associate
+  arbitrary blobs of data for any length of time,
+
+- Virtualize physical storage media across those names,
+
+- Retrieve the named data blobs at any time, and
+
+- Examine resource usage.
+
+Metadata directly supporting these functions (e.g. files, directories, space
+mappings) are sometimes called primary metadata.
+Secondary metadata (e.g. reverse mapping and directory parent pointers) support
+operations internal to the filesystem, such as internal consistency checking
+and reorganization.
+Summary metadata, as the name implies, condense information contained in
+primary metadata for performance reasons.
+
+The filesystem check (fsck) tool examines all the metadata in a filesystem
+to look for errors.
+In addition to looking for obvious metadata corruptions, fsck also
+cross-references different types of metadata records with each other to look
+for inconsistencies.
+People do not like losing data, so most fsck tools also contain some ability
+to correct any problems found.
+As a word of caution -- the primary goal of most Linux fsck tools is to restore
+the filesystem metadata to a consistent state, not to maximize the data
+recovered.
+That precedent will not be challenged here.
+
+Filesystems of the 20th century generally lacked any redundancy in the ondisk
+format, which means that fsck can only respond to errors by erasing files until
+errors are no longer detected.
+More recent filesystem designs contain enough redundancy in their metadata that
+it is now possible to regenerate data structures when non-catastrophic errors
+occur; this capability aids both strategies.
+
++--------------------------------------------------------------------------+
+| **Note**:                                                                |
++--------------------------------------------------------------------------+
+| System administrators avoid data loss by increasing the number of        |
+| separate storage systems through the creation of backups; and they avoid |
+| downtime by increasing the redundancy of each storage system through the |
+| creation of RAID arrays.                                                 |
+| fsck tools address only the first problem.                               |
++--------------------------------------------------------------------------+
+
+TLDR; Show Me the Code!
+-----------------------
+
+Code is posted to the kernel.org git trees as follows:
+`kernel changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-symlink>`_,
+`userspace changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-scan-service>`_, and
+`QA test changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-dirs>`_.
+Each kernel patchset adding an online repair function will use the same branch
+name across the kernel, xfsprogs, and fstests git repos.
+
+Existing Tools
+--------------
+
+The online fsck tool described here will be the third tool in the history of
+XFS (on Linux) to check and repair filesystems.
+Two programs precede it:
+
+The first program, ``xfs_check``, was created as part of the XFS debugger
+(``xfs_db``) and can only be used with unmounted filesystems.
+It walks all metadata in the filesystem looking for inconsistencies in the
+metadata, though it lacks any ability to repair what it finds.
+Due to its high memory requirements and inability to repair things, this
+program is now deprecated and will not be discussed further.
+
+The second program, ``xfs_repair``, was created to be faster and more robust
+than the first program.
+Like its predecessor, it can only be used with unmounted filesystems.
+It uses extent-based in-memory data structures to reduce memory consumption,
+and tries to schedule readahead IO appropriately to reduce I/O waiting time
+while it scans the metadata of the entire filesystem.
+The most important feature of this tool is its ability to respond to
+inconsistencies in file metadata and the directory tree by erasing things as
+needed to eliminate problems.
+Space usage metadata are rebuilt from the observed file metadata.
+
+Problem Statement
+-----------------
+
+The current XFS tools leave several problems unsolved:
+
+1. **User programs** suddenly **lose access** to the filesystem when unexpected
+   shutdowns occur as a result of silent corruptions in the metadata.
+   These occur **unpredictably** and often without warning.
+
+2. **Users** experience a **total loss of service** during the recovery period
+   after an **unexpected shutdown** occurs.
+
+3. **Users** experience a **total loss of service** if the filesystem is taken
+   offline to **look for problems** proactively.
+
+4. **Data owners** cannot **check the integrity** of their stored data without
+   reading all of it.
+   This may expose them to substantial billing costs when a linear media scan
+   performed by the storage system administrator might suffice.
+
+5. **System administrators** cannot **schedule** a maintenance window to deal
+   with corruptions if they **lack the means** to assess filesystem health
+   while the filesystem is online.
+
+6. **Fleet monitoring tools** cannot **automate periodic checks** of filesystem
+   health when doing so requires **manual intervention** and downtime.
+
+7. **Users** can be tricked into **doing things they do not desire** when
+   malicious actors **exploit quirks of Unicode** to place misleading names
+   in directories.
+
+Given this definition of the problems to be solved and the actors who would
+benefit, the proposed solution is a third fsck tool that acts on a running
+filesystem.
+
+This new third program has three components: an in-kernel facility to check
+metadata, an in-kernel facility to repair metadata, and a userspace driver
+program to drive fsck activity on a live filesystem.
+``xfs_scrub`` is the name of the driver program.
+The rest of this document presents the goals and use cases of the new fsck
+tool, describes its major design points in connection to those goals, and
+discusses the similarities and differences with existing tools.
+
++--------------------------------------------------------------------------+
+| **Note**:                                                                |
++--------------------------------------------------------------------------+
+| Throughout this document, the existing offline fsck tool can also be     |
+| referred to by its current name "``xfs_repair``".                        |
+| The userspace driver program for the new online fsck tool can be         |
+| referred to as "``xfs_scrub``".                                          |
+| The kernel portion of online fsck that validates metadata is called      |
+| "online scrub", and the kernel portion that fixes metadata is called     |
+| "online repair".                                                         |
++--------------------------------------------------------------------------+
+
+The naming hierarchy is broken up into objects known as directories and files,
+and the physical space is split into pieces known as allocation groups.
+Sharding enables better performance on highly parallel systems and helps to
+contain the damage when corruptions occur.
+The division of the filesystem into principal objects (allocation groups and
+inodes) means that there are ample opportunities to perform targeted checks and
+repairs on a subset of the filesystem.
+
+While checks and repairs proceed on one part of the filesystem, other parts
+continue processing IO requests.
+Even if a piece of filesystem metadata can only be regenerated by scanning the
+entire system, the scan can still be done in the background while other file
+operations continue.
+
+In summary, online fsck takes advantage of resource sharding and redundant
+metadata to enable targeted checking and repair operations while the system
+is running.
+This capability will be coupled to automatic system management so that
+autonomous self-healing of XFS maximizes service availability.


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 02/14] xfs: document the general theory underlying online fsck design
  2023-03-07  1:30   ` Darrick J. Wong
  2023-03-07  1:30     ` [PATCH 01/14] xfs: document the motivation for online fsck design Darrick J. Wong
@ 2023-03-07  1:31     ` Darrick J. Wong
  2023-03-07  1:31     ` [PATCH 03/14] xfs: document the testing plan for online fsck Darrick J. Wong
                       ` (11 subsequent siblings)
  13 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2023-03-07  1:31 UTC (permalink / raw)
  To: djwong
  Cc: linux-xfs, willy, chandan.babu, allison.henderson, linux-fsdevel,
	hch, catherine.hoang, david

From: Darrick J. Wong <djwong@kernel.org>

Start the second chapter of the online fsck design documentation.
This covers the general theory underlying how online fsck works.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 .../filesystems/xfs-online-fsck-design.rst         |  404 ++++++++++++++++++++
 1 file changed, 404 insertions(+)


diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst
index 07c7b4cde18f..0846935325b2 100644
--- a/Documentation/filesystems/xfs-online-fsck-design.rst
+++ b/Documentation/filesystems/xfs-online-fsck-design.rst
@@ -210,3 +210,407 @@ metadata to enable targeted checking and repair operations while the system
 is running.
 This capability will be coupled to automatic system management so that
 autonomous self-healing of XFS maximizes service availability.
+
+2. Theory of Operation
+======================
+
+Because it is necessary for online fsck to lock and scan live metadata objects,
+online fsck consists of three separate code components.
+The first is the userspace driver program ``xfs_scrub``, which is responsible
+for identifying individual metadata items, scheduling work items for them,
+reacting to the outcomes appropriately, and reporting results to the system
+administrator.
+The second and third are in the kernel, which implements functions to check
+and repair each type of online fsck work item.
+
++------------------------------------------------------------------+
+| **Note**:                                                        |
++------------------------------------------------------------------+
+| For brevity, this document shortens the phrase "online fsck work |
+| item" to "scrub item".                                           |
++------------------------------------------------------------------+
+
+Scrub item types are delineated in a manner consistent with the Unix design
+philosophy, which is to say that each item should handle one aspect of a
+metadata structure, and handle it well.
+
+Scope
+-----
+
+In principle, online fsck should be able to check and to repair everything that
+the offline fsck program can handle.
+However, online fsck cannot be running 100% of the time, which means that
+latent errors may creep in after a scrub completes.
+If these errors cause the next mount to fail, offline fsck is the only
+solution.
+This limitation means that maintenance of the offline fsck tool will continue.
+A second limitation of online fsck is that it must follow the same resource
+sharing and lock acquisition rules as the regular filesystem.
+This means that scrub cannot take *any* shortcuts to save time, because doing
+so could lead to concurrency problems.
+In other words, online fsck is not a complete replacement for offline fsck, and
+a complete run of online fsck may take longer than offline fsck.
+However, both of these limitations are acceptable tradeoffs to satisfy the
+different motivations of online fsck, which are to **minimize system downtime**
+and to **increase predictability of operation**.
+
+.. _scrubphases:
+
+Phases of Work
+--------------
+
+The userspace driver program ``xfs_scrub`` splits the work of checking and
+repairing an entire filesystem into seven phases.
+Each phase concentrates on checking specific types of scrub items and depends
+on the success of all previous phases.
+The seven phases are as follows:
+
+1. Collect geometry information about the mounted filesystem and computer,
+   discover the online fsck capabilities of the kernel, and open the
+   underlying storage devices.
+
+2. Check allocation group metadata, all realtime volume metadata, and all quota
+   files.
+   Each metadata structure is scheduled as a separate scrub item.
+   If corruption is found in the inode header or inode btree and ``xfs_scrub``
+   is permitted to perform repairs, then those scrub items are repaired to
+   prepare for phase 3.
+   Repairs are implemented by using the information in the scrub item to
+   resubmit the kernel scrub call with the repair flag enabled; this is
+   discussed in the next section.
+   Optimizations and all other repairs are deferred to phase 4.
+
+3. Check all metadata of every file in the filesystem.
+   Each metadata structure is also scheduled as a separate scrub item.
+   If repairs are needed and ``xfs_scrub`` is permitted to perform repairs,
+   and there were no problems detected during phase 2, then those scrub items
+   are repaired immediately.
+   Optimizations, deferred repairs, and unsuccessful repairs are deferred to
+   phase 4.
+
+4. All remaining repairs and scheduled optimizations are performed during this
+   phase, if the caller permits them.
+   Before starting repairs, the summary counters are checked and any necessary
+   repairs are performed so that subsequent repairs will not fail the resource
+   reservation step due to wildly incorrect summary counters.
+   Unsuccessful repairs are requeued as long as forward progress on repairs is
+   made somewhere in the filesystem.
+   Free space in the filesystem is trimmed at the end of phase 4 if the
+   filesystem is clean.
+
+5. By the start of this phase, all primary and secondary filesystem metadata
+   must be correct.
+   Summary counters such as the free space counts and quota resource counts
+   are checked and corrected.
+   Directory entry names and extended attribute names are checked for
+   suspicious entries such as control characters or confusing Unicode sequences
+   appearing in names.
+
+6. If the caller asks for a media scan, read all allocated and written data
+   file extents in the filesystem.
+   The ability to use hardware-assisted data file integrity checking is new
+   to online fsck; neither of the previous tools has this capability.
+   If media errors occur, they will be mapped to the owning files and reported.
+
+7. Re-check the summary counters and present the caller with a summary of
+   space usage and file counts.
+
+Steps for Each Scrub Item
+-------------------------
+
+The kernel scrub code uses a three-step strategy for checking and repairing
+the one aspect of a metadata object represented by a scrub item:
+
+1. The scrub item of interest is checked for corruptions; opportunities for
+   optimization; and for values that are directly controlled by the system
+   administrator but look suspicious.
+   If the item is not corrupt or does not need optimization, resources are
+   released and the positive scan results are returned to userspace.
+   If the item is corrupt or could be optimized but the caller does not permit
+   this, resources are released and the negative scan results are returned to
+   userspace.
+   Otherwise, the kernel moves on to the second step.
+
+2. The repair function is called to rebuild the data structure.
+   Repair functions generally choose to rebuild a structure from other metadata
+   rather than try to salvage the existing structure.
+   If the repair fails, the scan results from the first step are returned to
+   userspace.
+   Otherwise, the kernel moves on to the third step.
+
+3. In the third step, the kernel runs the same checks over the new metadata
+   item to assess the efficacy of the repairs.
+   The results of the reassessment are returned to userspace.
+
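+Viewed from userspace, this three-step strategy might be driven with a call
+sequence like the rough sketch below.
+The structure layout, flag values, and ioctl number shown here are abbreviated
+copies of the scrub UAPI and may be inexact; the definitions in the kernel's
+``xfs_fs.h`` header are authoritative.
+
+.. code-block:: c
+
+	/*
+	 * Illustrative sketch only: check one scrub item, and if the kernel
+	 * reports corruption, resubmit the same item with the repair flag
+	 * set.  After repairing, the kernel re-checks the structure and
+	 * reports the final state in sm_flags.  Consult xfs_fs.h for the
+	 * real definitions.
+	 */
+	#include <stdint.h>
+	#include <sys/ioctl.h>
+	#include <linux/ioctl.h>
+
+	struct xfs_scrub_metadata {
+		uint32_t	sm_type;	/* what to check */
+		uint32_t	sm_flags;	/* in: behavior; out: results */
+		uint64_t	sm_ino;		/* inode number, if applicable */
+		uint32_t	sm_gen;		/* inode generation, if applicable */
+		uint32_t	sm_agno;	/* allocation group, if applicable */
+		uint64_t	sm_reserved[5];	/* pad to 64 bytes */
+	};
+
+	#define XFS_IOC_SCRUB_METADATA	_IOWR('X', 60, struct xfs_scrub_metadata)
+	#define XFS_SCRUB_IFLAG_REPAIR	(1u << 0)	/* in: permit repairs */
+	#define XFS_SCRUB_OFLAG_CORRUPT	(1u << 1)	/* out: needs repair */
+
+	static int scrub_and_maybe_repair(int fd, uint32_t type, uint32_t agno)
+	{
+		struct xfs_scrub_metadata sm = {
+			.sm_type = type,
+			.sm_agno = agno,
+		};
+
+		/* Step 1: check only. */
+		if (ioctl(fd, XFS_IOC_SCRUB_METADATA, &sm) < 0)
+			return -1;
+		if (!(sm.sm_flags & XFS_SCRUB_OFLAG_CORRUPT))
+			return 0;
+
+		/* Steps 2 and 3: repair, then the kernel re-checks the item. */
+		sm.sm_flags = XFS_SCRUB_IFLAG_REPAIR;
+		if (ioctl(fd, XFS_IOC_SCRUB_METADATA, &sm) < 0)
+			return -1;
+		return (sm.sm_flags & XFS_SCRUB_OFLAG_CORRUPT) ? -1 : 0;
+	}
+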
+Classification of Metadata
+--------------------------
+
+Each type of metadata object (and therefore each type of scrub item) is
+classified as follows:
+
+Primary Metadata
+````````````````
+
+Metadata structures in this category should be most familiar to filesystem
+users either because they are directly created by the user or they index
+objects created by the user.
+Most filesystem objects fall into this class:
+
+- Free space and reference count information
+
+- Inode records and indexes
+
+- Storage mapping information for file data
+
+- Directories
+
+- Extended attributes
+
+- Symbolic links
+
+- Quota limits
+
+Scrub obeys the same rules as regular filesystem accesses for resource and lock
+acquisition.
+
+Primary metadata objects are the simplest for scrub to process.
+The principal filesystem object (either an allocation group or an inode) that
+owns the item being scrubbed is locked to guard against concurrent updates.
+The check function examines every record associated with the type for obvious
+errors and cross-references healthy records against other metadata to look for
+inconsistencies.
+Repairs for this class of scrub item are simple, since the repair function
+starts by holding all the resources acquired in the previous step.
+The repair function scans available metadata as needed to record all the
+observations needed to complete the structure.
+Next, it stages the observations in a new ondisk structure and commits it
+atomically to complete the repair.
+Finally, the storage from the old data structure is carefully reaped.
+
+Because ``xfs_scrub`` locks a primary object for the duration of the repair,
+this is effectively an offline repair operation performed on a subset of the
+filesystem.
+This minimizes the complexity of the repair code because it is not necessary to
+handle concurrent updates from other threads, nor is it necessary to access
+any other part of the filesystem.
+As a result, indexed structures can be rebuilt very quickly, and programs
+trying to access the damaged structure will be blocked until repairs complete.
+The only infrastructure needed by the repair code are the staging area for
+observations and a means to write new structures to disk.
+Despite these limitations, the advantage that online repair holds is clear:
+targeted work on individual shards of the filesystem avoids total loss of
+service.
+
+This mechanism is described in section 2.1 ("Off-Line Algorithm") of
+V. Srinivasan and M. J. Carey, `"Performance of On-Line Index Construction
+Algorithms" <https://minds.wisconsin.edu/bitstream/handle/1793/59524/TR1047.pdf>`_,
+*Extending Database Technology*, pp. 293-309, 1992.
+
+Most primary metadata repair functions stage their intermediate results in an
+in-memory array prior to formatting the new ondisk structure, which is very
+similar to the list-based algorithm discussed in section 2.3 ("List-Based
+Algorithms") of Srinivasan.
+However, any data structure builder that maintains a resource lock for the
+duration of the repair is *always* an offline algorithm.
+
+Secondary Metadata
+``````````````````
+
+Metadata structures in this category reflect records found in primary metadata,
+but are only needed for online fsck or for reorganization of the filesystem.
+
+Secondary metadata include:
+
+- Reverse mapping information
+
+- Directory parent pointers
+
+This class of metadata is difficult for scrub to process because scrub attaches
+to the secondary object but needs to check primary metadata, which runs counter
+to the usual order of resource acquisition.
+Frequently, this means that full filesystem scans are necessary to rebuild the
+metadata.
+Check functions can be limited in scope to reduce runtime.
+Repairs, however, require a full scan of primary metadata, which can take a
+long time to complete.
+Under these conditions, ``xfs_scrub`` cannot lock resources for the entire
+duration of the repair.
+
+Instead, repair functions set up an in-memory staging structure to store
+observations.
+Depending on the requirements of the specific repair function, the staging
+index will either have the same format as the ondisk structure or a design
+specific to that repair function.
+The next step is to release all locks and start the filesystem scan.
+When the repair scanner needs to record an observation, the staging data are
+locked long enough to apply the update.
+While the filesystem scan is in progress, the repair function hooks the
+filesystem so that it can apply pending filesystem updates to the staging
+information.
+Once the scan is done, the owning object is re-locked, the live data is used to
+write a new ondisk structure, and the repairs are committed atomically.
+The hooks are disabled and the staging area is freed.
+Finally, the storage from the old data structure is carefully reaped.
+
+Introducing concurrency helps online repair avoid various locking problems, but
+comes at a high cost to code complexity.
+Live filesystem code has to be hooked so that the repair function can observe
+updates in progress.
+The staging area has to become a fully functional parallel structure so that
+updates can be merged from the hooks.
+Finally, the hook, the filesystem scan, and the inode locking model must be
+sufficiently well integrated that a hook event can decide if a given update
+should be applied to the staging structure.
+
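+A loose sketch of this coordination follows.
+Every name in it is hypothetical shorthand -- the real staging structures,
+notifier hooks, and scan cursors are kernel-internal and described in later
+sections -- but it shows one simple way a hook event might decide whether an
+update belongs in the staging data.
+
+.. code-block:: c
+
+	/*
+	 * Hypothetical sketch: a live scan feeds a staging index while hooks
+	 * capture concurrent updates.  scan_cursor marks how far the scanner
+	 * has progressed; hook events at or below the cursor are applied to
+	 * the staging index, while events beyond it will be observed by the
+	 * scanner itself when it gets there.
+	 */
+	#include <pthread.h>
+	#include <stdint.h>
+
+	struct staging_index;		/* opaque in this sketch */
+	void staging_insert(struct staging_index *si, uint64_t key, uint64_t val);
+
+	struct live_scan {
+		struct staging_index	*staging;
+		pthread_mutex_t		lock;
+		uint64_t		scan_cursor;	/* highest key scanned */
+	};
+
+	/* Hook callback: fold in a live update the scanner already passed. */
+	static void hook_update(struct live_scan *ls, uint64_t key, uint64_t val)
+	{
+		pthread_mutex_lock(&ls->lock);
+		if (key <= ls->scan_cursor)
+			staging_insert(ls->staging, key, val);
+		pthread_mutex_unlock(&ls->lock);
+	}
+
+	/* Scanner: record an observation and advance the cursor. */
+	static void scan_observe(struct live_scan *ls, uint64_t key, uint64_t val)
+	{
+		pthread_mutex_lock(&ls->lock);
+		staging_insert(ls->staging, key, val);
+		if (key > ls->scan_cursor)
+			ls->scan_cursor = key;
+		pthread_mutex_unlock(&ls->lock);
+	}
+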
+In theory, the scrub implementation could apply these same techniques for
+primary metadata, but doing so would make it massively more complex and less
+performant.
+Programs attempting to access the damaged structures are not blocked from
+operation, which may cause application failure or an unplanned filesystem
+shutdown.
+
+Inspiration for the secondary metadata repair strategy was drawn from section
+2.4 of Srinivasan above, and sections 2 ("NSF: Index Build Without Side-File")
+and 3.1.1 ("Duplicate Key Insert Problem") in C. Mohan, `"Algorithms for
+Creating Indexes for Very Large Tables Without Quiescing Updates"
+<https://dl.acm.org/doi/10.1145/130283.130337>`_, 1992.
+
+The sidecar index mentioned above bears some resemblance to the side file
+method mentioned in Srinivasan and Mohan.
+Their method consists of an index builder that extracts relevant record data to
+build the new structure as quickly as possible; and an auxiliary structure that
+captures all updates that would be committed to the index by other threads were
+the new index already online.
+After the index building scan finishes, the updates recorded in the side file
+are applied to the new index.
+To avoid conflicts between the index builder and other writer threads, the
+builder maintains a publicly visible cursor that tracks the progress of the
+scan through the record space.
+To avoid duplication of work between the side file and the index builder, side
+file updates are elided when the record ID for the update is greater than the
+cursor position within the record ID space.
+
+To minimize changes to the rest of the codebase, XFS online repair keeps the
+replacement index hidden until it's completely ready to go.
+In other words, there is no attempt to expose the keyspace of the new index
+while repair is running.
+The complexity of such an approach would be very high and perhaps more
+appropriate to building *new* indices.
+
+**Future Work Question**: Can the full scan and live update code used to
+facilitate a repair also be used to implement a comprehensive check?
+
+*Answer*: In theory, yes.  Check would be much stronger if each scrub function
+employed these live scans to build a shadow copy of the metadata and then
+compared the shadow records to the ondisk records.
+However, doing that is a fair amount more work than what the checking
+functions do now, and it would increase the runtime of those scrub functions.
+The live scans and hooks were also developed much later than the checking
+code.
+
+Summary Information
+```````````````````
+
+Metadata structures in this last category summarize the contents of primary
+metadata records.
+These are often used to speed up resource usage queries, and are many times
+smaller than the primary metadata which they represent.
+
+Examples of summary information include:
+
+- Summary counts of free space and inodes
+
+- File link counts from directories
+
+- Quota resource usage counts
+
+Check and repair require full filesystem scans, but resource and lock
+acquisition follow the same paths as regular filesystem accesses.
+
+The superblock summary counters have special requirements due to the underlying
+implementation of the incore counters, and will be treated separately.
+Check and repair of the other types of summary counters (quota resource counts
+and file link counts) employ the same filesystem scanning and hooking
+techniques as outlined above, but because the underlying data are sets of
+integer counters, the staging data need not be a fully functional mirror of the
+ondisk structure.
+
+Inspiration for quota and file link count repair strategies were drawn from
+sections 2.12 ("Online Index Operations") through 2.14 ("Incremental View
+Maintenance") of G. Graefe, `"Concurrent Queries and Updates in Summary Views
+and Their Indexes"
+<http://www.odbms.org/wp-content/uploads/2014/06/Increment-locks.pdf>`_, 2011.
+
+Since quotas are non-negative integer counts of resource usage, online
+quotacheck can use the incremental view deltas described in section 2.14 to
+track pending changes to the block and inode usage counts in each transaction,
+and commit those changes to a dquot side file when the transaction commits.
+Delta tracking is necessary for dquots because the index builder scans inodes,
+whereas the data structure being rebuilt is an index of dquots.
+Link count checking combines the view deltas and commit step into one because
+it sets attributes of the objects being scanned instead of writing them to a
+separate data structure.
+Each online fsck function will be discussed as case studies later in this
+document.
+
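+As a very rough sketch of the incremental-delta idea for online quotacheck,
+consider the following; all of the names here are hypothetical, and the real
+implementation folds deltas into an in-kernel shadow dquot index rather than
+a flat table.
+
+.. code-block:: c
+
+	/*
+	 * Hypothetical sketch of incremental view deltas for online
+	 * quotacheck.  Each transaction records the block and inode usage
+	 * changes it is about to make; when the transaction commits, a hook
+	 * folds the delta into a shadow table that repair later compares
+	 * against (or writes over) the ondisk dquots.
+	 */
+	#include <pthread.h>
+	#include <stdint.h>
+
+	struct shadow_dquot {
+		int64_t		blocks;		/* observed block usage */
+		int64_t		inodes;		/* observed inode usage */
+	};
+
+	struct dquot_delta {
+		uint32_t	id;		/* user/group/project id */
+		int64_t		blocks;		/* blocks added or removed */
+		int64_t		inodes;		/* inodes created or freed */
+	};
+
+	/* Hypothetical shadow table, indexed by quota id. */
+	extern struct shadow_dquot shadow_table[];
+	static pthread_mutex_t shadow_lock = PTHREAD_MUTEX_INITIALIZER;
+
+	/* Commit hook: fold one transaction's pending delta into the shadow. */
+	static void commit_dquot_delta(const struct dquot_delta *d)
+	{
+		pthread_mutex_lock(&shadow_lock);
+		shadow_table[d->id].blocks += d->blocks;
+		shadow_table[d->id].inodes += d->inodes;
+		pthread_mutex_unlock(&shadow_lock);
+	}
+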
+Risk Management
+---------------
+
+During the development of online fsck, several risk factors were identified
+that may make the feature unsuitable for certain distributors and users.
+Steps can be taken to mitigate or eliminate those risks, though at a cost to
+functionality.
+
+- **Decreased performance**: Adding metadata indices to the filesystem
+  increases the time cost of persisting changes to disk, and the reverse space
+  mapping and directory parent pointers are no exception.
+  System administrators who require the maximum performance can disable the
+  reverse mapping features at format time, though this choice dramatically
+  reduces the ability of online fsck to find inconsistencies and repair them.
+
+- **Incorrect repairs**: As with all software, there might be defects in the
+  software that result in incorrect repairs being written to the filesystem.
+  Systematic fuzz testing (detailed in the next section) is employed by the
+  authors to find bugs early, but it might not catch everything.
+  The kernel build system provides Kconfig options (``CONFIG_XFS_ONLINE_SCRUB``
+  and ``CONFIG_XFS_ONLINE_REPAIR``) to enable distributors to choose not to
+  accept this risk.
+  The xfsprogs build system has a configure option (``--enable-scrub=no``) that
+  disables building of the ``xfs_scrub`` binary, though this is not a risk
+  mitigation if the kernel functionality remains enabled.
+
+- **Inability to repair**: Sometimes, a filesystem is too badly damaged to be
+  repairable.
+  If the keyspaces of several metadata indices overlap in some manner but a
+  coherent narrative cannot be formed from records collected, then the repair
+  fails.
+  To reduce the chance that a repair will fail with a dirty transaction and
+  render the filesystem unusable, the online repair functions have been
+  designed to stage and validate all new records before committing the new
+  structure.
+
+- **Misbehavior**: Online fsck requires many privileges -- raw IO to block
+  devices, opening files by handle, ignoring Unix discretionary access control,
+  and the ability to perform administrative changes.
+  Running this automatically in the background scares people, so the systemd
+  background service is configured to run with only the privileges required.
+  Obviously, this cannot address certain problems like the kernel crashing or
+  deadlocking, but it should be sufficient to prevent the scrub process from
+  escaping and reconfiguring the system.
+  The cron job does not have this protection.
+
+- **Fuzz Kiddiez**: There are many people now who seem to think that running
+  automated fuzz testing of ondisk artifacts to find mischievous behavior and
+  spraying exploit code onto the public mailing list for instant zero-day
+  disclosure is somehow of some social benefit.
+  In the view of this author, the benefit is realized only when the fuzz
+  operators help to **fix** the flaws, but this opinion apparently is not
+  widely shared among security "researchers".
+  The XFS maintainers' continuing ability to manage these events presents an
+  ongoing risk to the stability of the development process.
+  Automated testing should front-load some of the risk while the feature is
+  considered EXPERIMENTAL.
+
+Many of these risks are inherent to software programming.
+Despite this, it is hoped that this new functionality will prove useful in
+reducing unexpected downtime.


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 03/14] xfs: document the testing plan for online fsck
  2023-03-07  1:30   ` Darrick J. Wong
  2023-03-07  1:30     ` [PATCH 01/14] xfs: document the motivation for online fsck design Darrick J. Wong
  2023-03-07  1:31     ` [PATCH 02/14] xfs: document the general theory underlying " Darrick J. Wong
@ 2023-03-07  1:31     ` Darrick J. Wong
  2023-03-07  1:31     ` [PATCH 04/14] xfs: document the user interface " Darrick J. Wong
                       ` (10 subsequent siblings)
  13 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2023-03-07  1:31 UTC (permalink / raw)
  To: djwong
  Cc: Allison Henderson, linux-xfs, willy, chandan.babu,
	allison.henderson, linux-fsdevel, hch, catherine.hoang, david

From: Darrick J. Wong <djwong@kernel.org>

Start the third chapter of the online fsck design documentation.  This
covers the testing plan to make sure that both online and offline fsck
can detect arbitrary problems and correct them without making things
worse.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
---
 .../filesystems/xfs-online-fsck-design.rst         |  186 ++++++++++++++++++++
 1 file changed, 186 insertions(+)


diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst
index 0846935325b2..ed9b83c4dbf7 100644
--- a/Documentation/filesystems/xfs-online-fsck-design.rst
+++ b/Documentation/filesystems/xfs-online-fsck-design.rst
@@ -614,3 +614,189 @@ functionality.
 Many of these risks are inherent to software programming.
 Despite this, it is hoped that this new functionality will prove useful in
 reducing unexpected downtime.
+
+3. Testing Plan
+===============
+
+As stated before, fsck tools have three main goals:
+
+1. Detect inconsistencies in the metadata;
+
+2. Eliminate those inconsistencies; and
+
+3. Minimize further loss of data.
+
+Demonstrations of correct operation are necessary to build users' confidence
+that the software behaves within expectations.
+Unfortunately, it was not really feasible to perform regular exhaustive testing
+of every aspect of a fsck tool until the introduction of low-cost virtual
+machines with high-IOPS storage.
+With ample hardware availability in mind, the testing strategy for the online
+fsck project involves differential analysis against the existing fsck tools and
+systematic testing of every attribute of every type of metadata object.
+Testing can be split into four major categories, as discussed below.
+
+Integrated Testing with fstests
+-------------------------------
+
+The primary goal of any free software QA effort is to make testing as
+inexpensive and widespread as possible to maximize the scaling advantages of
+community.
+In other words, testing should maximize the breadth of filesystem configuration
+scenarios and hardware setups.
+This improves code quality by enabling the authors of online fsck to find and
+fix bugs early, and helps developers of new features to find integration
+issues earlier in their development effort.
+
+The Linux filesystem community shares a common QA testing suite,
+`fstests <https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/>`_, for
+functional and regression testing.
+Even before development work began on online fsck, fstests (when run on XFS)
+would run both the ``xfs_check`` and ``xfs_repair -n`` commands on the test and
+scratch filesystems between each test.
+This provides a level of assurance that the kernel and the fsck tools stay in
+alignment about what constitutes consistent metadata.
+During development of the online checking code, fstests was modified to run
+``xfs_scrub -n`` between each test to ensure that the new checking code
+produces the same results as the two existing fsck tools.
+
+To start development of online repair, fstests was modified to run
+``xfs_repair`` to rebuild the filesystem's metadata indices between tests.
+This ensures that offline repair does not crash, leave a corrupt filesystem
+after it exits, or trigger complaints from the online check.
+This also established a baseline for what can and cannot be repaired offline.
+To complete the first phase of development of online repair, fstests was
+modified to be able to run ``xfs_scrub`` in a "force rebuild" mode.
+This enables a comparison of the effectiveness of online repair as compared to
+the existing offline repair tools.
+
+General Fuzz Testing of Metadata Blocks
+---------------------------------------
+
+XFS benefits greatly from having a very robust debugging tool, ``xfs_db``.
+
+Before development of online fsck even began, a set of fstests were created
+to test the rather common fault that entire metadata blocks get corrupted.
+This required the creation of fstests library code that can create a filesystem
+containing every possible type of metadata object.
+Next, individual test cases were created to create a test filesystem, identify
+a single block of a specific type of metadata object, trash it with the
+existing ``blocktrash`` command in ``xfs_db``, and test the reaction of a
+particular metadata validation strategy.
+
+This earlier test suite enabled XFS developers to test the ability of the
+in-kernel validation functions and the ability of the offline fsck tool to
+detect and eliminate the inconsistent metadata.
+This part of the test suite was extended to cover online fsck in exactly the
+same manner.
+
+In other words, for a given fstests filesystem configuration:
+
+* For each metadata object existing on the filesystem:
+
+  * Write garbage to it
+
+  * Test the reactions of:
+
+    1. The kernel verifiers to stop obviously bad metadata
+    2. Offline repair (``xfs_repair``) to detect and fix
+    3. Online repair (``xfs_scrub``) to detect and fix
+
+Targeted Fuzz Testing of Metadata Records
+-----------------------------------------
+
+The testing plan for online fsck includes extending the existing fs testing
+infrastructure to provide a much more powerful facility: targeted fuzz testing
+of every metadata field of every metadata object in the filesystem.
+``xfs_db`` can modify every field of every metadata structure in every
+block in the filesystem to simulate the effects of memory corruption and
+software bugs.
+Given that fstests already contains the ability to create a filesystem
+containing every metadata format known to the filesystem, ``xfs_db`` can be
+used to perform exhaustive fuzz testing!
+
+For a given fstests filesystem configuration:
+
+* For each metadata object existing on the filesystem...
+
+  * For each record inside that metadata object...
+
+    * For each field inside that record...
+
+      * For each conceivable type of transformation that can be applied to a bit field...
+
+        1. Clear all bits
+        2. Set all bits
+        3. Toggle the most significant bit
+        4. Toggle the middle bit
+        5. Toggle the least significant bit
+        6. Add a small quantity
+        7. Subtract a small quantity
+        8. Randomize the contents
+
+        * ...test the reactions of:
+
+          1. The kernel verifiers to stop obviously bad metadata
+          2. Offline checking (``xfs_repair -n``)
+          3. Offline repair (``xfs_repair``)
+          4. Online checking (``xfs_scrub -n``)
+          5. Online repair (``xfs_scrub``)
+          6. Both repair tools (``xfs_scrub`` and then ``xfs_repair`` if online repair doesn't succeed)
+
+This is quite the combinatoric explosion!
+
+Fortunately, having this much test coverage makes it easy for XFS developers to
+check the responses of XFS' fsck tools.
+Since the introduction of the fuzz testing framework, these tests have been
+used to discover incorrect repair code and missing functionality for entire
+classes of metadata objects in ``xfs_repair``.
+The enhanced testing was used to finalize the deprecation of ``xfs_check`` by
+confirming that ``xfs_repair`` could detect at least as many corruptions as
+the older tool.
+
+These tests have been very valuable for ``xfs_scrub`` in the same ways -- they
+allow the online fsck developers to compare online fsck against offline fsck,
+and they enable XFS developers to find deficiencies in the code base.
+
+Proposed patchsets include
+`general fuzzer improvements
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzzer-improvements>`_,
+`fuzzing baselines
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzz-baseline>`_,
+and `improvements in fuzz testing comprehensiveness
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=more-fuzz-testing>`_.
+
+Stress Testing
+--------------
+
+A requirement unique to online fsck is the ability to operate on a filesystem
+concurrently with regular workloads.
+Although it is of course impossible to run ``xfs_scrub`` with *zero* observable
+impact on the running system, the online repair code should never introduce
+inconsistencies into the filesystem metadata, and regular workloads should
+never notice resource starvation.
+To verify that these conditions are being met, fstests has been enhanced in
+the following ways:
+
+* For each scrub item type, create a test to exercise checking that item type
+  while running ``fsstress``.
+* For each scrub item type, create a test to exercise repairing that item type
+  while running ``fsstress``.
+* Race ``fsstress`` and ``xfs_scrub -n`` to ensure that checking the whole
+  filesystem doesn't cause problems.
+* Race ``fsstress`` and ``xfs_scrub`` in force-rebuild mode to ensure that
+  force-repairing the whole filesystem doesn't cause problems.
+* Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while
+  freezing and thawing the filesystem.
+* Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while
+  remounting the filesystem read-only and read-write.
+* The same, but running ``fsx`` instead of ``fsstress``.  (Not done yet?)
+
+Success is defined by the ability to run all of these tests without observing
+any unexpected filesystem shutdowns due to corrupted metadata, kernel hang
+check warnings, or any other sort of mischief.
+
+Proposed patchsets include `general stress testing
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=race-scrub-and-mount-state-changes>`_
+and the `evolution of existing per-function stress testing
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=refactor-scrub-stress>`_.


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 04/14] xfs: document the user interface for online fsck
  2023-03-07  1:30   ` Darrick J. Wong
                       ` (2 preceding siblings ...)
  2023-03-07  1:31     ` [PATCH 03/14] xfs: document the testing plan for online fsck Darrick J. Wong
@ 2023-03-07  1:31     ` Darrick J. Wong
  2023-03-07  1:31     ` [PATCH 05/14] xfs: document the filesystem metadata checking strategy Darrick J. Wong
                       ` (9 subsequent siblings)
  13 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2023-03-07  1:31 UTC (permalink / raw)
  To: djwong
  Cc: linux-xfs, willy, chandan.babu, allison.henderson, linux-fsdevel,
	hch, catherine.hoang, david

From: Darrick J. Wong <djwong@kernel.org>

Start the fourth chapter of the online fsck design documentation, which
discusses the user interface and the background scrubbing service.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 .../filesystems/xfs-online-fsck-design.rst         |  113 ++++++++++++++++++++
 1 file changed, 113 insertions(+)


diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst
index ed9b83c4dbf7..1411c09b9677 100644
--- a/Documentation/filesystems/xfs-online-fsck-design.rst
+++ b/Documentation/filesystems/xfs-online-fsck-design.rst
@@ -800,3 +800,116 @@ Proposed patchsets include `general stress testing
 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=race-scrub-and-mount-state-changes>`_
 and the `evolution of existing per-function stress testing
 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=refactor-scrub-stress>`_.
+
+4. User Interface
+=================
+
+The primary user of online fsck is the system administrator, just like offline
+repair.
+Online fsck presents two modes of operation to administrators:
+A foreground CLI process for online fsck on demand, and a background service
+that performs autonomous checking and repair.
+
+Checking on Demand
+------------------
+
+For administrators who want the absolute freshest information about the
+metadata in a filesystem, ``xfs_scrub`` can be run as a foreground process on
+a command line.
+The program checks every piece of metadata in the filesystem while the
+administrator waits for the results to be reported, just like the existing
+``xfs_repair`` tool.
+Both tools share a ``-n`` option to perform a read-only scan, and a ``-v``
+option to increase the verbosity of the information reported.
+
+A new feature of ``xfs_scrub`` is the ``-x`` option, which employs the error
+correction capabilities of the hardware to check data file contents.
+The media scan is not enabled by default because it may dramatically increase
+program runtime and consume a lot of bandwidth on older storage hardware.
+
+The output of a foreground invocation is captured in the system log.
+
+The ``xfs_scrub_all`` program walks the list of mounted filesystems and
+initiates ``xfs_scrub`` for each of them in parallel.
+It serializes scans for any filesystems that resolve to the same top level
+kernel block device to prevent resource overconsumption.
+
+Background Service
+------------------
+
+To reduce the workload of system administrators, the ``xfs_scrub`` package
+provides a suite of `systemd <https://systemd.io/>`_ timers and services that
+run online fsck automatically on weekends by default.
+The background service configures scrub to run with as little privilege as
+possible, the lowest CPU and IO priority, and in a CPU-constrained single
+threaded mode.
+This can be tuned by the systemd administrator at any time to suit the latency
+and throughput requirements of customer workloads.
+
+The output of the background service is also captured in the system log.
+If desired, reports of failures (either due to inconsistencies or mere runtime
+errors) can be emailed automatically by setting the ``EMAIL_ADDR`` environment
+variable in the following service files:
+
+* ``xfs_scrub_fail@.service``
+* ``xfs_scrub_media_fail@.service``
+* ``xfs_scrub_all_fail.service``
+
+The decision to enable the background scan is left to the system administrator.
+This can be done by enabling either of the following services:
+
+* ``xfs_scrub_all.timer`` on systemd systems
+* ``xfs_scrub_all.cron`` on non-systemd systems
+
+This automatic weekly scan is configured out of the box to perform an
+additional media scan of all file data once per month.
+This is less foolproof than, say, storing file data block checksums, but much
+more performant if application software provides its own integrity checking,
+redundancy can be provided elsewhere above the filesystem, or the storage
+device's integrity guarantees are deemed sufficient.
+
+The systemd unit file definitions have been subjected to a security audit
+(as of systemd 249) to ensure that the xfs_scrub processes have as little
+access to the rest of the system as possible.
+This was performed via ``systemd-analyze security``, after which privileges
+were restricted to the minimum required, sandboxing was set up to the maximal
+extent possible with system call filtering; and access to the
+filesystem tree was restricted to the minimum needed to start the program and
+access the filesystem being scanned.
+The service definition files restrict CPU usage to 80% of one CPU core, and
+apply as nice of a priority to IO and CPU scheduling as possible.
+This measure was taken to minimize delays in the rest of the filesystem.
+No such hardening has been performed for the cron job.
+
+Proposed patchset:
+`Enabling the xfs_scrub background service
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-scan-service>`_.
+
+Health Reporting
+----------------
+
+XFS caches a summary of each filesystem's health status in memory.
+The information is updated whenever ``xfs_scrub`` is run, or whenever
+inconsistencies are detected in the filesystem metadata during regular
+operations.
+System administrators should use the ``health`` command of ``xfs_spaceman`` to
+download this information into a human-readable format.
+If problems have been observed, the administrator can schedule a reduced
+service window to run the online repair tool to correct the problem.
+Failing that, the administrator can decide to schedule a maintenance window to
+run the traditional offline repair tool to correct the problem.
+
+**Future Work Question**: Should the health reporting integrate with the new
+inotify fs error notification system?
+Would it be helpful for sysadmins to have a daemon to listen for corruption
+notifications and initiate a repair?
+
+*Answer*: These questions remain unanswered, but should be a part of the
+conversation with early adopters and potential downstream users of XFS.
+
+Proposed patchsets include
+`wiring up health reports to correction returns
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=corruption-health-reports>`_
+and
+`preservation of sickness info during memory reclaim
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=indirect-health-reporting>`_.


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 05/14] xfs: document the filesystem metadata checking strategy
  2023-03-07  1:30   ` Darrick J. Wong
                       ` (3 preceding siblings ...)
  2023-03-07  1:31     ` [PATCH 04/14] xfs: document the user interface " Darrick J. Wong
@ 2023-03-07  1:31     ` Darrick J. Wong
  2023-03-07  1:31     ` [PATCH 06/14] xfs: document how online fsck deals with eventual consistency Darrick J. Wong
                       ` (8 subsequent siblings)
  13 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2023-03-07  1:31 UTC (permalink / raw)
  To: djwong
  Cc: linux-xfs, willy, chandan.babu, allison.henderson, linux-fsdevel,
	hch, catherine.hoang, david

From: Darrick J. Wong <djwong@kernel.org>

Begin the fifth chapter of the online fsck design documentation, where
we discuss the details of the data structures and algorithms used by the
kernel to examine filesystem metadata and cross-reference it around the
filesystem.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 .../filesystems/xfs-online-fsck-design.rst         |  587 ++++++++++++++++++++
 .../filesystems/xfs-self-describing-metadata.rst   |    1 
 2 files changed, 588 insertions(+)


diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst
index 1411c09b9677..4a19c70434aa 100644
--- a/Documentation/filesystems/xfs-online-fsck-design.rst
+++ b/Documentation/filesystems/xfs-online-fsck-design.rst
@@ -913,3 +913,590 @@ Proposed patchsets include
 and
 `preservation of sickness info during memory reclaim
 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=indirect-health-reporting>`_.
+
+5. Kernel Algorithms and Data Structures
+========================================
+
+This section discusses the key algorithms and data structures of the kernel
+code that provide the ability to check and repair metadata while the system
+is running.
+The first chapters in this section reveal the pieces that provide the
+foundation for checking metadata.
+The remainder of this section presents the mechanisms through which XFS
+regenerates itself.
+
+Self Describing Metadata
+------------------------
+
+Starting with XFS version 5 in 2012, XFS updated the format of nearly every
+ondisk block header to record a magic number, a checksum, a universally
+"unique" identifier (UUID), an owner code, the ondisk address of the block,
+and a log sequence number.
+When loading a block buffer from disk, the magic number, UUID, owner, and
+ondisk address confirm that the retrieved block matches the specific owner of
+the current filesystem, and that the information contained in the block is
+supposed to be found at the ondisk address.
+The first three components enable checking tools to disregard alleged metadata
+that doesn't belong to the filesystem, and the fourth component enables the
+filesystem to detect lost writes.
+
+Whenever a filesystem operation modifies a block, the change is submitted
+to the log as part of a transaction.
+The log then processes these transactions, marking them done once they are
+safely persisted to storage.
+The logging code maintains the checksum and the log sequence number of the last
+transactional update.
+Checksums are useful for detecting torn writes and other discrepancies that can
+be introduced between the computer and its storage devices.
+Sequence number tracking enables log recovery to avoid applying out of date
+log updates to the filesystem.
+
+These two features improve overall runtime resiliency by providing a means for
+the filesystem to detect obvious corruption when reading metadata blocks from
+disk, but these buffer verifiers cannot provide any consistency checking
+between metadata structures.
+
+For more information, please see the documentation for
+Documentation/filesystems/xfs-self-describing-metadata.rst
+
+Reverse Mapping
+---------------
+
+The original design of XFS (circa 1993) is an improvement upon 1980s Unix
+filesystem design.
+In those days, storage density was expensive, CPU time was scarce, and
+excessive seek time could kill performance.
+For performance reasons, filesystem authors were reluctant to add redundancy to
+the filesystem, even at the cost of data integrity.
+Filesystem designers in the early 21st century chose different strategies to
+increase internal redundancy -- either storing nearly identical copies of
+metadata, or more space-efficient encoding techniques.
+
+For XFS, a different redundancy strategy was chosen to modernize the design:
+a secondary space usage index that maps allocated disk extents back to their
+owners.
+By adding a new index, the filesystem retains most of its ability to scale
+well to heavily threaded workloads involving large datasets, since the primary
+file metadata (the directory tree, the file block map, and the allocation
+groups) remain unchanged.
+Like any system that improves redundancy, the reverse-mapping feature increases
+overhead costs for space mapping activities.
+However, it has two critical advantages: first, the reverse index is key to
+enabling online fsck and other requested functionality such as free space
+defragmentation, better media failure reporting, and filesystem shrinking.
+Second, the different ondisk storage format of the reverse mapping btree
+defeats device-level deduplication because the filesystem requires real
+redundancy.
+
++--------------------------------------------------------------------------+
+| **Sidebar**:                                                             |
++--------------------------------------------------------------------------+
+| A criticism of adding the secondary index is that it does nothing to     |
+| improve the robustness of user data storage itself.                      |
+| This is a valid point, but adding a new index for file data block        |
+| checksums increases write amplification by turning data overwrites into  |
+| copy-writes, which age the filesystem prematurely.                       |
+| In keeping with thirty years of precedent, users who want file data      |
+| integrity can supply as powerful a solution as they require.             |
+| As for metadata, the complexity of adding a new secondary index of space |
+| usage is much less than adding volume management and storage device      |
+| mirroring to XFS itself.                                                 |
+| Perfection of RAID and volume management are best left to existing       |
+| layers in the kernel.                                                    |
++--------------------------------------------------------------------------+
+
+The information captured in a reverse space mapping record is as follows:
+
+.. code-block:: c
+
+	struct xfs_rmap_irec {
+	    xfs_agblock_t    rm_startblock;   /* extent start block */
+	    xfs_extlen_t     rm_blockcount;   /* extent length */
+	    uint64_t         rm_owner;        /* extent owner */
+	    uint64_t         rm_offset;       /* offset within the owner */
+	    unsigned int     rm_flags;        /* state flags */
+	};
+
+The first two fields capture the location and size of the physical space,
+in units of filesystem blocks.
+The owner field tells scrub which metadata structure or file inode has been
+assigned this space.
+For space allocated to files, the offset field tells scrub where the space was
+mapped within the file fork.
+Finally, the flags field provides extra information about the space usage --
+is this an attribute fork extent?  A file mapping btree extent?  Or an
+unwritten data extent?
+
+Online filesystem checking judges the consistency of each primary metadata
+record by comparing its information against all other space indices.
+The reverse mapping index plays a key role in the consistency checking process
+because it contains a centralized alternate copy of all space allocation
+information.
+Program runtime and ease of resource acquisition are the only real limits to
+what online checking can consult.
+For example, a file data extent mapping can be checked against:
+
+* The absence of an entry in the free space information.
+* The absence of an entry in the inode index.
+* The absence of an entry in the reference count data if the file is not
+  marked as having shared extents.
+* The correspondence of an entry in the reverse mapping information.
+
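+The shape of such a cross-reference might look like the sketch below.
+The ``query_*`` helpers are hypothetical stand-ins for range queries against
+the free space, inode, refcount, and reverse mapping btrees; the real checks
+run in the kernel with the AG headers and the inode locked.
+
+.. code-block:: c
+
+	/*
+	 * Schematic cross-reference of one file data extent mapping.  The
+	 * query_*() helpers are hypothetical; each one stands in for a btree
+	 * range query against one of the AG's space metadata indexes.
+	 */
+	#include <stdbool.h>
+	#include <stdint.h>
+
+	struct file_mapping {
+		uint64_t	owner;		/* inode number */
+		uint64_t	offset;		/* file offset, in blocks */
+		uint32_t	startblock;	/* AG block number */
+		uint32_t	blockcount;	/* length, in blocks */
+		bool		shared;		/* is this extent reflinked? */
+	};
+
+	/* Hypothetical helpers; each returns true if a matching record exists. */
+	bool query_free_space(uint32_t bno, uint32_t len);
+	bool query_inode_chunks(uint32_t bno, uint32_t len);
+	bool query_refcount(uint32_t bno, uint32_t len);
+	bool query_rmap(uint32_t bno, uint32_t len, uint64_t owner, uint64_t off);
+
+	static bool xref_file_mapping(const struct file_mapping *m)
+	{
+		/* Mapped space must not also be marked free. */
+		if (query_free_space(m->startblock, m->blockcount))
+			return false;
+		/* Mapped space must not overlap inode chunks. */
+		if (query_inode_chunks(m->startblock, m->blockcount))
+			return false;
+		/* Unshared files must not appear in the refcount data. */
+		if (!m->shared && query_refcount(m->startblock, m->blockcount))
+			return false;
+		/* A corresponding reverse mapping record must exist. */
+		return query_rmap(m->startblock, m->blockcount, m->owner,
+				  m->offset);
+	}
+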
+There are several observations to make about reverse mapping indices:
+
+1. Reverse mappings can provide a positive affirmation of correctness if any of
+   the above primary metadata are in doubt.
+   The checking code for most primary metadata follows a path similar to the
+   one outlined above.
+
+2. Proving the consistency of secondary metadata with the primary metadata is
+   difficult because that requires a full scan of all primary space metadata,
+   which is very time intensive.
+   For example, checking a reverse mapping record for a file extent mapping
+   btree block requires locking the file and searching the entire btree to
+   confirm the block.
+   Instead, scrub relies on rigorous cross-referencing during the primary space
+   mapping structure checks.
+
+3. Consistency scans must use non-blocking lock acquisition primitives if the
+   required locking order is not the same order used by regular filesystem
+   operations.
+   For example, if the filesystem normally takes a file ILOCK before taking
+   the AGF buffer lock but scrub wants to take a file ILOCK while holding
+   an AGF buffer lock, scrub cannot block on that second acquisition.
+   This means that forward progress during this part of a scan of the reverse
+   mapping data cannot be guaranteed if system load is heavy.
+
+In summary, reverse mappings play a key role in reconstruction of primary
+metadata.
+The details of how these records are staged, written to disk, and committed
+into the filesystem are covered in subsequent sections.
+
+Checking and Cross-Referencing
+------------------------------
+
+The first step of checking a metadata structure is to examine every record
+contained within the structure and its relationship with the rest of the
+system.
+XFS contains multiple layers of checking to try to prevent inconsistent
+metadata from wreaking havoc on the system.
+Each of these layers contributes information that helps the kernel to make
+the following decisions about the health of a metadata structure:
+
+- Is a part of this structure obviously corrupt (``XFS_SCRUB_OFLAG_CORRUPT``) ?
+- Is this structure inconsistent with the rest of the system
+  (``XFS_SCRUB_OFLAG_XCORRUPT``) ?
+- Is there so much damage around the filesystem that cross-referencing is not
+  possible (``XFS_SCRUB_OFLAG_XFAIL``) ?
+- Can the structure be optimized to improve performance or reduce the size of
+  metadata (``XFS_SCRUB_OFLAG_PREEN``) ?
+- Does the structure contain data that is not inconsistent but deserves review
+  by the system administrator (``XFS_SCRUB_OFLAG_WARNING``) ?
+
+The following sections describe how the metadata scrubbing process works.
+
+Metadata Buffer Verification
+````````````````````````````
+
+The lowest layer of metadata protection in XFS is the set of metadata
+verifiers built into the buffer cache.
+These functions perform inexpensive internal consistency checking of the block
+itself, and answer these questions:
+
+- Does the block belong to this filesystem?
+
+- Does the block belong to the structure that asked for the read?
+  This assumes that metadata blocks only have one owner, which is always true
+  in XFS.
+
+- Is the type of data stored in the block within a reasonable range of what
+  scrub is expecting?
+
+- Does the physical location of the block match the location it was read from?
+
+- Does the block checksum match the data?
+
+The scope of the protections here is very limited -- verifiers can only
+establish that the filesystem code is reasonably free of gross corruption bugs
+and that the storage system is reasonably competent at retrieval.
+Corruption problems observed at runtime cause the generation of health reports,
+failed system calls, and in the extreme case, filesystem shutdowns if the
+corrupt metadata forces the cancellation of a dirty transaction.
+
+Every online fsck scrubbing function is expected to read every ondisk metadata
+block of a structure in the course of checking the structure.
+Corruption problems observed during a check are immediately reported to
+userspace as corruption; during a cross-reference, they are reported as a
+failure to cross-reference once the full examination is complete.
+Reads satisfied by a buffer already in cache (and hence already verified)
+bypass these checks.
+
+Internal Consistency Checks
+```````````````````````````
+
+After the buffer cache, the next level of metadata protection is the internal
+record verification code built into the filesystem.
+These checks are split between the buffer verifiers, the in-filesystem users of
+the buffer cache, and the scrub code itself, depending on the amount of higher
+level context required.
+The scope of checking is still internal to the block.
+These higher level checking functions answer these questions:
+
+- Does the type of data stored in the block match what scrub is expecting?
+
+- Does the block belong to the owning structure that asked for the read?
+
+- If the block contains records, do the records fit within the block?
+
+- If the block tracks internal free space information, is it consistent with
+  the record areas?
+
+- Are the records contained inside the block free of obvious corruptions?
+
+Record checks in this category are more rigorous and more time-intensive.
+For example, block pointers and inumbers are checked to ensure that they point
+within the dynamically allocated parts of an allocation group and within
+the filesystem.
+Names are checked for invalid characters, and flags are checked for invalid
+combinations.
+Other record attributes are checked for sensible values.
+Btree records spanning an interval of the btree keyspace are checked for
+correct order and lack of mergeability (except for file fork mappings).
+For performance reasons, regular code may skip some of these checks unless
+debugging is enabled or a write is about to occur.
+Scrub functions, of course, must check all possible problems.
+
+Validation of Userspace-Controlled Record Attributes
+````````````````````````````````````````````````````
+
+Various pieces of filesystem metadata are directly controlled by userspace.
+Because of this, validation work cannot be more precise than checking
+that a value is within the possible range.
+These fields include:
+
+- Superblock fields controlled by mount options
+- Filesystem labels
+- File timestamps
+- File permissions
+- File size
+- File flags
+- Names present in directory entries, extended attribute keys, and filesystem
+  labels
+- Extended attribute key namespaces
+- Extended attribute values
+- File data block contents
+- Quota limits
+- Quota timer expiration (if resource usage exceeds the soft limit)
+
+Cross-Referencing Space Metadata
+````````````````````````````````
+
+After internal block checks, the next higher level of checking is
+cross-referencing records between metadata structures.
+For regular runtime code, the cost of these checks is considered to be
+prohibitively expensive, but as scrub is dedicated to rooting out
+inconsistencies, it must pursue all avenues of inquiry.
+The exact set of cross-referencing is highly dependent on the context of the
+data structure being checked.
+
+The XFS btree code has keyspace scanning functions that online fsck uses to
+cross reference one structure with another.
+Specifically, scrub can scan the key space of an index to determine if that
+keyspace is fully, sparsely, or not at all mapped to records.
+For the reverse mapping btree, it is possible to mask parts of the key for the
+purposes of performing a keyspace scan so that scrub can decide if the rmap
+btree contains records mapping a certain extent of physical space without the
+sparseness of the rest of the rmap keyspace getting in the way.
+
+Btree blocks undergo the following checks before cross-referencing:
+
+- Does the type of data stored in the block match what scrub is expecting?
+
+- Does the block belong to the owning structure that asked for the read?
+
+- Do the records fit within the block?
+
+- Are the records contained inside the block free of obvious corruptions?
+
+- Are the name hashes in the correct order?
+
+- Do node pointers within the btree point to valid block addresses for the type
+  of btree?
+
+- Do child pointers point towards the leaves?
+
+- Do sibling pointers point across the same level?
+
+- For each node block record, does the record key accurately reflect the
+  contents of the child block?
+
+Space allocation records are cross-referenced as follows:
+
+1. Any space mentioned by any metadata structure is cross-referenced as
+   follows:
+
+   - Does the reverse mapping index list only the appropriate owner as the
+     owner of each block?
+
+   - Are none of the blocks claimed as free space?
+
+   - If these aren't file data blocks, are none of the blocks claimed as space
+     shared by different owners?
+
+2. Btree blocks are cross-referenced as follows:
+
+   - Everything in class 1 above.
+
+   - If there's a parent node block, do the keys listed for this block match the
+     keyspace of this block?
+
+   - Do the sibling pointers point to valid blocks?  Of the same level?
+
+   - Do the child pointers point to valid blocks?  Of the next level down?
+
+3. Free space btree records are cross-referenced as follows:
+
+   - Everything in class 1 and 2 above.
+
+   - Does the reverse mapping index list no owners of this space?
+
+   - Is this space not claimed by the inode index for inodes?
+
+   - Is it not mentioned by the reference count index?
+
+   - Is there a matching record in the other free space btree?
+
+4. Inode btree records are cross-referenced as follows:
+
+   - Everything in class 1 and 2 above.
+
+   - Is there a matching record in the free inode btree?
+
+   - Do cleared bits in the holemask correspond with inode clusters?
+
+   - Do set bits in the freemask correspond with inode records with zero link
+     count?
+
+5. Inode records are cross-referenced as follows:
+
+   - Everything in class 1.
+
+   - Do all the fields that summarize information about the file forks actually
+     match those forks?
+
+   - Does each inode with zero link count correspond to a record in the free
+     inode btree?
+
+6. File fork space mapping records are cross-referenced as follows:
+
+   - Everything in class 1 and 2 above.
+
+   - Is this space not mentioned by the inode btrees?
+
+   - If this is a CoW fork mapping, does it correspond to a CoW entry in the
+     reference count btree?
+
+7. Reference count records are cross-referenced as follows:
+
+   - Everything in class 1 and 2 above.
+
+   - Within the space subkeyspace of the rmap btree (that is to say, all
+     records mapped to a particular space extent and ignoring the owner info),
+     are there the same number of reverse mapping records for each block as the
+     reference count record claims?
+
+Proposed patchsets are the series to find gaps in
+`refcount btree
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-refcount-gaps>`_,
+`inode btree
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-inobt-gaps>`_, and
+`rmap btree
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-rmapbt-gaps>`_ records;
+to find
+`mergeable records
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-mergeable-records>`_;
+and to
+`improve cross referencing with rmap
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-strengthen-rmap-checking>`_
+before starting a repair.
+
+Checking Extended Attributes
+````````````````````````````
+
+Extended attributes implement a key-value store that enables fragments of data
+to be attached to any file.
+Both the kernel and userspace can access the keys and values, subject to
+namespace and privilege restrictions.
+Most typically these fragments are metadata about the file -- origins, security
+contexts, user-supplied labels, indexing information, etc.
+
+Names can be as long as 255 bytes and can exist in several different
+namespaces.
+Values can be as large as 64KB.
+A file's extended attributes are stored in blocks mapped by the attr fork.
+The mappings point to leaf blocks, remote value blocks, or dabtree blocks.
+Block 0 in the attribute fork is always the top of the structure, but otherwise
+each of the three types of blocks can be found at any offset in the attr fork.
+Leaf blocks contain attribute key records that point to the name and the value.
+Names are always stored elsewhere in the same leaf block.
+Values that are less than 3/4 the size of a filesystem block are also stored
+elsewhere in the same leaf block.
+Remote value blocks contain values that are too large to fit inside a leaf.
+If the leaf information exceeds a single filesystem block, a dabtree (also
+rooted at block 0) is created to map hashes of the attribute names to leaf
+blocks in the attr fork.
+
+Checking an extended attribute structure is not so straightforward due to the
+lack of separation between attr blocks and index blocks.
+Scrub must read each block mapped by the attr fork and ignore the non-leaf
+blocks:
+
+1. Walk the dabtree in the attr fork (if present) to ensure that there are no
+   irregularities in the blocks or dabtree mappings that do not point to
+   attr leaf blocks.
+
+2. Walk the blocks of the attr fork looking for leaf blocks.
+   For each entry inside a leaf:
+
+   a. Validate that the name does not contain invalid characters.
+
+   b. Read the attr value.
+      This performs a named lookup of the attr name to ensure the correctness
+      of the dabtree.
+      If the value is stored in a remote block, this also validates the
+      integrity of the remote value block.
+
+Checking and Cross-Referencing Directories
+``````````````````````````````````````````
+
+The filesystem directory tree is a directed acyclic graph structure, with files
+constituting the nodes, and directory entries (dirents) constituting the edges.
+Directories are a special type of file containing a set of mappings from a
+255-byte sequence (name) to an inumber.
+These are called directory entries, or dirents for short.
+Each directory file must have exactly one directory pointing to the file.
+A root directory points to itself.
+Directory entries point to files of any type.
+Each non-directory file may have multiple directories pointing to it.
+
+In XFS, directories are implemented as a file containing up to three 32GB
+partitions.
+The first partition contains directory entry data blocks.
+Each data block contains variable-sized records associating a user-provided
+name with an inumber and, optionally, a file type.
+If the directory entry data grows beyond one block, the second partition (which
+exists as post-EOF extents) is populated with a block containing free space
+information and an index that maps hashes of the dirent names to directory data
+blocks in the first partition.
+This makes directory name lookups very fast.
+If this second partition grows beyond one block, the third partition is
+populated with a linear array of free space information for faster
+expansions.
+If the free space has been separated and the second partition grows again
+beyond one block, then a dabtree is used to map hashes of dirent names to
+directory data blocks.
+
+Checking a directory is pretty straightforward:
+
+1. Walk the dabtree in the second partition (if present) to ensure that there
+   are no irregularities in the blocks or dabtree mappings that do not point to
+   dirent blocks.
+
+2. Walk the blocks of the first partition looking for directory entries.
+   Each dirent is checked as follows:
+
+   a. Does the name contain no invalid characters?
+
+   b. Does the inumber correspond to an actual, allocated inode?
+
+   c. Does the child inode have a nonzero link count?
+
+   d. If a file type is included in the dirent, does it match the type of the
+      inode?
+
+   e. If the child is a subdirectory, does the child's dotdot pointer point
+      back to the parent?
+
+   f. If the directory has a second partition, perform a named lookup of the
+      dirent name to ensure the correctness of the dabtree.
+
+3. Walk the free space list in the third partition (if present) to ensure that
+   the free spaces it describes are really unused.
+
+Checking operations involving :ref:`parents <dirparent>` and
+:ref:`file link counts <nlinks>` are discussed in more detail in later
+sections.
+
+Checking Directory/Attribute Btrees
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+As stated in previous sections, the directory/attribute btree (dabtree) index
+maps user-provided names to improve lookup times by avoiding linear scans.
+Internally, it maps a 32-bit hash of the name to a block offset within the
+appropriate file fork.
+
+The internal structure of a dabtree closely resembles the btrees that record
+fixed-size metadata records -- each dabtree block contains a magic number, a
+checksum, sibling pointers, a UUID, a tree level, and a log sequence number.
+The format of leaf and node records is the same -- each entry points to the
+next level down in the hierarchy, with dabtree node records pointing to dabtree
+leaf blocks, and dabtree leaf records pointing to non-dabtree blocks elsewhere
+in the fork.
+
+Checking and cross-referencing the dabtree is very similar to what is done for
+space btrees:
+
+- Does the type of data stored in the block match what scrub is expecting?
+
+- Does the block belong to the owning structure that asked for the read?
+
+- Do the records fit within the block?
+
+- Are the records contained inside the block free of obvious corruptions?
+
+- Are the name hashes in the correct order?
+
+- Do node pointers within the dabtree point to valid fork offsets for dabtree
+  blocks?
+
+- Do leaf pointers within the dabtree point to valid fork offsets for directory
+  or attr leaf blocks?
+
+- Do child pointers point towards the leaves?
+
+- Do sibling pointers point across the same level?
+
+- For each dabtree node record, does the record key accurately reflect the
+  contents of the child dabtree block?
+
+- For each dabtree leaf record, does the record key accurately reflect the
+  contents of the directory or attr block?
+
+Cross-Referencing Summary Counters
+``````````````````````````````````
+
+XFS maintains three classes of summary counters: available resources, quota
+resource usage, and file link counts.
+
+In theory, the amount of available resources (data blocks, inodes, realtime
+extents) can be found by walking the entire filesystem.
+This would make for very slow reporting, so a transactional filesystem can
+maintain summaries of this information in the superblock.
+Cross-referencing these values against the filesystem metadata should be a
+simple matter of walking the free space and inode metadata in each AG and the
+realtime bitmap, but there are complications that will be discussed in
+:ref:`more detail <fscounters>` later.
+
+:ref:`Quota usage <quotacheck>` and :ref:`file link count <nlinks>`
+checking are sufficiently complicated to warrant separate sections.
+
+Post-Repair Reverification
+``````````````````````````
+
+After performing a repair, the checking code is run a second time to validate
+the new structure, and the results of the health assessment are recorded
+internally and returned to the calling process.
+This step is critical for enabling the system administrator to monitor the
+status of the filesystem and the progress of any repairs.
+For developers, it is a useful means to judge the efficacy of error detection
+and correction in the online and offline checking tools.
diff --git a/Documentation/filesystems/xfs-self-describing-metadata.rst b/Documentation/filesystems/xfs-self-describing-metadata.rst
index b79dbf36dc94..a10c4ae6955e 100644
--- a/Documentation/filesystems/xfs-self-describing-metadata.rst
+++ b/Documentation/filesystems/xfs-self-describing-metadata.rst
@@ -1,4 +1,5 @@
 .. SPDX-License-Identifier: GPL-2.0
+.. _xfs_self_describing_metadata:
 
 ============================
 XFS Self Describing Metadata


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 06/14] xfs: document how online fsck deals with eventual consistency
  2023-03-07  1:30   ` Darrick J. Wong
                       ` (4 preceding siblings ...)
  2023-03-07  1:31     ` [PATCH 05/14] xfs: document the filesystem metadata checking strategy Darrick J. Wong
@ 2023-03-07  1:31     ` Darrick J. Wong
  2023-03-07  1:31     ` [PATCH 07/14] xfs: document pageable kernel memory Darrick J. Wong
                       ` (7 subsequent siblings)
  13 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2023-03-07  1:31 UTC (permalink / raw)
  To: djwong
  Cc: linux-xfs, willy, chandan.babu, allison.henderson, linux-fsdevel,
	hch, catherine.hoang, david

From: Darrick J. Wong <djwong@kernel.org>

Writes to an XFS filesystem employ an eventual consistency update model
to break up complex multistep metadata updates into small chained
transactions.  This is generally good for performance and scalability
because XFS doesn't need to prepare for enormous transactions, but it
also means that online fsck must be careful not to attempt a fsck action
unless it can be shown that there are no other threads processing a
transaction chain.  This part of the design documentation covers the
thinking behind the consistency model and how scrub deals with it.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 .../filesystems/xfs-online-fsck-design.rst         |  332 ++++++++++++++++++++
 1 file changed, 332 insertions(+)


diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst
index 4a19c70434aa..e095264b591e 100644
--- a/Documentation/filesystems/xfs-online-fsck-design.rst
+++ b/Documentation/filesystems/xfs-online-fsck-design.rst
@@ -1500,3 +1500,335 @@ This step is critical for enabling system administrator to monitor the status
 of the filesystem and the progress of any repairs.
 For developers, it is a useful means to judge the efficacy of error detection
 and correction in the online and offline checking tools.
+
+Eventual Consistency vs. Online Fsck
+------------------------------------
+
+Complex operations can make modifications to multiple per-AG data structures
+with a chain of transactions.
+These chains, once committed to the log, are restarted during log recovery if
+the system crashes while processing the chain.
+Because the AG header buffers are unlocked between transactions within a chain,
+online checking must coordinate with chained operations that are in progress to
+avoid incorrectly detecting inconsistencies due to pending chains.
+Furthermore, online repair must not run when operations are pending because
+the metadata are temporarily inconsistent with each other, and rebuilding is
+not possible.
+
+Only online fsck has this requirement of total consistency of AG metadata, and
+such checks should be relatively rare compared to filesystem change
+operations.
+Online fsck coordinates with transaction chains as follows:
+
+* For each AG, maintain a count of intent items targeting that AG.
+  The count should be bumped whenever a new item is added to the chain.
+  The count should be dropped when the filesystem has locked the AG header
+  buffers and finished the work.
+
+* When online fsck wants to examine an AG, it should lock the AG header
+  buffers to quiesce all transaction chains that want to modify that AG.
+  If the count is zero, proceed with the checking operation.
+  If it is nonzero, cycle the buffer locks to allow the chain to make forward
+  progress.
+
+This may lead to online fsck taking a long time to complete, but regular
+filesystem updates take precedence over background checking activity.
+Details about the discovery of this situation are presented in the
+:ref:`next section <chain_coordination>`, and details about the solution
+are presented :ref:`after that<intent_drains>`.
+
+.. _chain_coordination:
+
+Discovery of the Problem
+````````````````````````
+
+Midway through the development of online scrubbing, the fsstress tests
+uncovered a misinteraction between online fsck and compound transaction chains
+created by other writer threads that resulted in false reports of metadata
+inconsistency.
+The root cause of these reports is the eventual consistency model introduced by
+the expansion of deferred work items and compound transaction chains when
+reverse mapping and reflink were introduced.
+
+Originally, transaction chains were added to XFS to avoid deadlocks when
+unmapping space from files.
+Deadlock avoidance rules require that AGs only be locked in increasing order,
+which makes it impossible (say) to use a single transaction to free a space
+extent in AG 7 and then try to free a now superfluous block mapping btree block
+in AG 3.
+To avoid these kinds of deadlocks, XFS creates Extent Freeing Intent (EFI) log
+items to commit to freeing some space in one transaction while deferring the
+actual metadata updates to a fresh transaction.
+The transaction sequence looks like this:
+
+1. The first transaction contains a physical update to the file's block mapping
+   structures to remove the mapping from the btree blocks.
+   It then attaches to the in-memory transaction an action item to schedule
+   deferred freeing of space.
+   Concretely, each transaction maintains a list of ``struct
+   xfs_defer_pending`` objects, each of which maintains a list of ``struct
+   xfs_extent_free_item`` objects.
+   Returning to the example above, the action item tracks the freeing of both
+   the unmapped space from AG 7 and the block mapping btree (BMBT) block from
+   AG 3.
+   Deferred frees recorded in this manner are committed in the log by creating
+   an EFI log item from the ``struct xfs_extent_free_item`` object and
+   attaching the log item to the transaction.
+   When the log is persisted to disk, the EFI item is written into the ondisk
+   transaction record.
+   EFIs can list up to 16 extents to free, all sorted in AG order.
+
+2. The second transaction contains a physical update to the free space btrees
+   of AG 3 to release the former BMBT block and a second physical update to the
+   free space btrees of AG 7 to release the unmapped file space.
+   Observe that the physical updates are resequenced in the correct order
+   when possible.
+   Attached to the transaction is an extent free done (EFD) log item.
+   The EFD contains a pointer to the EFI logged in transaction #1 so that log
+   recovery can tell if the EFI needs to be replayed.
+
+If the system goes down after transaction #1 is written back to the filesystem
+but before #2 is committed, a scan of the filesystem metadata would show
+inconsistent filesystem metadata because there would not appear to be any owner
+of the unmapped space.
+Happily, log recovery corrects this inconsistency for us -- when recovery finds
+an intent log item but does not find a corresponding intent done item, it will
+reconstruct the incore state of the intent item and finish it.
+In the example above, the log must replay both frees described in the recovered
+EFI to complete the recovery phase.
+
+There are subtleties to XFS' transaction chaining strategy to consider:
+
+* Log items must be added to a transaction in the correct order to prevent
+  conflicts with principal objects that are not held by the transaction.
+  In other words, all per-AG metadata updates for an unmapped block must be
+  completed before the last update to free the extent, and extents should not
+  be reallocated until that last update commits to the log.
+
+* AG header buffers are released between each transaction in a chain.
+  This means that other threads can observe an AG in an intermediate state,
+  but as long as the first subtlety is handled, this should not affect the
+  correctness of filesystem operations.
+
+* Unmounting the filesystem flushes all pending work to disk, which means that
+  offline fsck never sees the temporary inconsistencies caused by deferred
+  work item processing.
+
+In this manner, XFS employs a form of eventual consistency to avoid deadlocks
+and increase parallelism.
+
+During the design phase of the reverse mapping and reflink features, it was
+decided that it was impractical to cram all the reverse mapping updates for a
+single filesystem change into a single transaction because a single file
+mapping operation can explode into many small updates:
+
+* The block mapping update itself
+* A reverse mapping update for the block mapping update
+* Fixing the freelist
+* A reverse mapping update for the freelist fix
+
+* A shape change to the block mapping btree
+* A reverse mapping update for the btree update
+* Fixing the freelist (again)
+* A reverse mapping update for the freelist fix
+
+* An update to the reference counting information
+* A reverse mapping update for the refcount update
+* Fixing the freelist (a third time)
+* A reverse mapping update for the freelist fix
+
+* Freeing any space that was unmapped and not owned by any other file
+* Fixing the freelist (a fourth time)
+* A reverse mapping update for the freelist fix
+
+* Freeing the space used by the block mapping btree
+* Fixing the freelist (a fifth time)
+* A reverse mapping update for the freelist fix
+
+Free list fixups are not usually needed more than once per AG per transaction
+chain, but it is theoretically possible if space is very tight.
+For copy-on-write updates this is even worse, because this must be done once to
+remove the space from a staging area and again to map it into the file!
+
+To deal with this explosion in a calm manner, XFS expands its use of deferred
+work items to cover most reverse mapping updates and all refcount updates.
+This reduces the worst case size of transaction reservations by breaking the
+work into a long chain of small updates, which increases the degree of eventual
+consistency in the system.
+Again, this generally isn't a problem because XFS orders its deferred work
+items carefully to avoid resource reuse conflicts between unsuspecting threads.
+
+However, online fsck changes the rules -- remember that although physical
+updates to per-AG structures are coordinated by locking the buffers for AG
+headers, buffer locks are dropped between transactions.
+Once scrub acquires resources and takes locks for a data structure, it must do
+all the validation work without releasing the lock.
+If the main lock for a space btree is an AG header buffer lock, scrub may have
+interrupted another thread that is midway through finishing a chain.
+For example, if a thread performing a copy-on-write has completed a reverse
+mapping update but not the corresponding refcount update, the two AG btrees
+will appear inconsistent to scrub and an observation of corruption will be
+recorded.  This observation will not be correct.
+If a repair is attempted in this state, the results will be catastrophic!
+
+Several other solutions to this problem were evaluated upon discovery of this
+flaw and rejected:
+
+1. Add a higher level lock to allocation groups and require writer threads to
+   acquire the higher level lock in AG order before making any changes.
+   This would be very difficult to implement in practice because it is
+   difficult to determine which locks need to be obtained, and in what order,
+   without simulating the entire operation.
+   Performing a dry run of a file operation to discover necessary locks would
+   make the filesystem very slow.
+
+2. Make the deferred work coordinator code aware of consecutive intent items
+   targeting the same AG and have it hold the AG header buffers locked across
+   the transaction roll between updates.
+   This would introduce a lot of complexity into the coordinator since it is
+   only loosely coupled with the actual deferred work items.
+   It would also fail to solve the problem because deferred work items can
+   generate new deferred subtasks, but all subtasks must be complete before
+   work can start on a new sibling task.
+
+3. Teach online fsck to walk all transactions waiting for whichever lock(s)
+   protect the data structure being scrubbed to look for pending operations.
+   The checking and repair operations must factor these pending operations into
+   the evaluations being performed.
+   This solution is a nonstarter because it is *extremely* invasive to the main
+   filesystem.
+
+.. _intent_drains:
+
+Intent Drains
+`````````````
+
+Online fsck uses an atomic intent item counter and lock cycling to coordinate
+with transaction chains.
+There are two key properties to the drain mechanism.
+First, the counter is incremented when a deferred work item is *queued* to a
+transaction, and it is decremented after the associated intent done log item is
+*committed* to another transaction.
+The second property is that deferred work can be added to a transaction without
+holding an AG header lock, but per-AG work items cannot be marked done without
+locking that AG header buffer to log the physical updates and the intent done
+log item.
+The first property enables scrub to yield to running transaction chains, which
+is an explicit deprioritization of online fsck to benefit file operations.
+The second property of the drain is key to the correct coordination of scrub,
+since scrub will always be able to decide if a conflict is possible.
+
+For regular filesystem code, the drain works as follows:
+
+1. Call the appropriate subsystem function to add a deferred work item to a
+   transaction.
+
+2. The function calls ``xfs_drain_bump`` to increase the counter.
+
+3. When the deferred item manager wants to finish the deferred work item, it
+   calls ``->finish_item`` to complete it.
+
+4. The ``->finish_item`` implementation logs some changes and calls
+   ``xfs_drain_drop`` to decrease the sloppy counter and wake up any threads
+   waiting on the drain.
+
+5. The subtransaction commits, which unlocks the resource associated with the
+   intent item.
+
+For scrub, the drain works as follows:
+
+1. Lock the resource(s) associated with the metadata being scrubbed.
+   For example, a scan of the refcount btree would lock the AGI and AGF header
+   buffers.
+
+2. If the counter is zero (``xfs_drain_busy`` returns false), there are no
+   chains in progress and the operation may proceed.
+
+3. Otherwise, release the resources grabbed in step 1.
+
+4. Wait for the intent counter to reach zero (``xfs_drain_intents``), then go
+   back to step 1 unless a signal has been caught.
+
+To avoid polling in step 4, the drain provides a waitqueue for scrub threads to
+be woken up whenever the intent count drops to zero.
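+
+A minimal sketch of such a drain is an atomic counter paired with a waitqueue;
+the structure below is illustrative only, though the helper names match the
+ones used above:
+
+.. code-block:: c
+
+	/* Illustrative sketch, not the actual kernel implementation. */
+	struct xfs_drain {
+		atomic_t		dr_count;	/* pending intent items */
+		struct wait_queue_head	dr_waiters;	/* waiting scrubbers */
+	};
+
+	/* Called when a deferred work item is queued to a transaction. */
+	static inline void xfs_drain_bump(struct xfs_drain *dr)
+	{
+		atomic_inc(&dr->dr_count);
+	}
+
+	/* Called after the intent done item is committed. */
+	static inline void xfs_drain_drop(struct xfs_drain *dr)
+	{
+		if (atomic_dec_and_test(&dr->dr_count))
+			wake_up(&dr->dr_waiters);
+	}
+
+	/* Does scrub need to yield to a running transaction chain? */
+	static inline bool xfs_drain_busy(struct xfs_drain *dr)
+	{
+		return atomic_read(&dr->dr_count) > 0;
+	}
+
+	/* Scrub sleeps here after releasing the AG header buffers. */
+	static inline int xfs_drain_intents(struct xfs_drain *dr)
+	{
+		return wait_event_killable(dr->dr_waiters,
+				!xfs_drain_busy(dr));
+	}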
+
+The proposed patchset is the
+`scrub intent drain series
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-drain-intents>`_.
+
+.. _jump_labels:
+
+Static Keys (aka Jump Label Patching)
+`````````````````````````````````````
+
+Online fsck for XFS separates the regular filesystem from the checking and
+repair code as much as possible.
+However, there are a few parts of online fsck (such as the intent drains, and
+later, live update hooks) where it is useful for the online fsck code to know
+what's going on in the rest of the filesystem.
+Since it is not expected that online fsck will be constantly running in the
+background, it is very important to minimize the runtime overhead imposed by
+these hooks when online fsck is compiled into the kernel but not actively
+running on behalf of userspace.
+Taking locks in the hot path of a writer thread to access a data structure only
+to find that no further action is necessary is expensive -- on the author's
+computer, this has an overhead of 40-50ns per access.
+Fortunately, the kernel supports dynamic code patching, which enables XFS to
+replace a static branch to hook code with ``nop`` sleds when online fsck isn't
+running.
+This sled has an overhead of however long it takes the instruction decoder to
+skip past the sled, which seems to be on the order of less than 1ns and
+does not access memory outside of instruction fetching.
+
+When online fsck enables the static key, the sled is replaced with an
+unconditional branch to call the hook code.
+The switchover is quite expensive (~22000ns) but is paid entirely by the
+program that invoked online fsck, and can be amortized if multiple threads
+enter online fsck at the same time, or if multiple filesystems are being
+checked at the same time.
+Changing the branch direction requires taking the CPU hotplug lock, and since
+CPU initialization requires memory allocation, online fsck must be careful not
+to change a static key while holding any locks or resources that could be
+accessed in the memory reclaim paths.
+To minimize contention on the CPU hotplug lock, care should be taken not to
+enable or disable static keys unnecessarily.
+
+Because static keys are intended to minimize hook overhead for regular
+filesystem operations when xfs_scrub is not running, the intended usage
+patterns are as follows (a sketch follows the list):
+
+- The hooked part of XFS should declare a static-scoped static key that
+  defaults to false.
+  The ``DEFINE_STATIC_KEY_FALSE`` macro takes care of this.
+  The static key itself should be declared as a ``static`` variable.
+
+- When deciding to invoke code that's only used by scrub, the regular
+  filesystem should call the ``static_branch_unlikely`` predicate to avoid the
+  scrub-only hook code if the static key is not enabled.
+
+- The regular filesystem should export helper functions that call
+  ``static_branch_inc`` to enable and ``static_branch_dec`` to disable the
+  static key.
+  Wrapper functions make it easy to compile out the relevant code if the kernel
+  distributor turns off online fsck at build time.
+
+- Scrub functions wanting to turn on scrub-only XFS functionality should call
+  the ``xchk_fsgates_enable`` from the setup function to enable a specific
+  hook.
+  This must be done before obtaining any resources that are used by memory
+  reclaim.
+  Callers had better be sure they really need the functionality gated by the
+  static key; the ``TRY_HARDER`` flag is useful here.
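+
+Put together, the pattern looks approximately like this sketch (all names here
+are hypothetical):
+
+.. code-block:: c
+
+	/* Illustrative sketch of a static key protecting a scrub-only hook. */
+	static DEFINE_STATIC_KEY_FALSE(xfs_example_hooks_switch);
+
+	/* Scrub-only work that should stay out of the hot path. */
+	static void xfs_example_scrub_hook(void)
+	{
+		/* ...notify a running scrub about this update... */
+	}
+
+	/* Hot path in the regular filesystem. */
+	void xfs_example_hot_path(void)
+	{
+		if (static_branch_unlikely(&xfs_example_hooks_switch))
+			xfs_example_scrub_hook();
+	}
+
+	/* Wrappers exported to the scrub code, per the pattern above. */
+	void xfs_example_hooks_add(void)
+	{
+		static_branch_inc(&xfs_example_hooks_switch);
+	}
+
+	void xfs_example_hooks_del(void)
+	{
+		static_branch_dec(&xfs_example_hooks_switch);
+	}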
+
+Online scrub has resource acquisition helpers (e.g. ``xchk_perag_lock``) to
+handle locking AGI and AGF buffers for all scrubber functions.
+If it detects a conflict between scrub and the running transactions, it will
+try to wait for intents to complete.
+If the caller of the helper has not enabled the static key, the helper will
+return -EDEADLOCK, which should result in the scrub being restarted with the
+``TRY_HARDER`` flag set.
+The scrub setup function should detect that flag, enable the static key, and
+try the scrub again.
+Scrub teardown disables all static keys obtained by ``xchk_fsgates_enable``.
+
+For more information, please see the kernel documentation of
+Documentation/staging/static-keys.rst.


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 07/14] xfs: document pageable kernel memory
  2023-03-07  1:30   ` Darrick J. Wong
                       ` (5 preceding siblings ...)
  2023-03-07  1:31     ` [PATCH 06/14] xfs: document how online fsck deals with eventual consistency Darrick J. Wong
@ 2023-03-07  1:31     ` Darrick J. Wong
  2023-03-07  1:31     ` [PATCH 08/14] xfs: document btree bulk loading Darrick J. Wong
                       ` (6 subsequent siblings)
  13 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2023-03-07  1:31 UTC (permalink / raw)
  To: djwong
  Cc: linux-xfs, willy, chandan.babu, allison.henderson, linux-fsdevel,
	hch, catherine.hoang, david

From: Darrick J. Wong <djwong@kernel.org>

Add a discussion of pageable kernel memory, since online fsck needs
quite a bit more memory than most other parts of the filesystem to stage
records and other information.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 .../filesystems/xfs-online-fsck-design.rst         |  500 ++++++++++++++++++++
 1 file changed, 500 insertions(+)


diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst
index e095264b591e..21f0638ab69d 100644
--- a/Documentation/filesystems/xfs-online-fsck-design.rst
+++ b/Documentation/filesystems/xfs-online-fsck-design.rst
@@ -413,6 +413,8 @@ Algorithms") of Srinivasan.
 However, any data structure builder that maintains a resource lock for the
 duration of the repair is *always* an offline algorithm.
 
+.. _secondary_metadata:
+
 Secondary Metadata
 ``````````````````
 
@@ -1832,3 +1834,501 @@ Scrub teardown disables all static keys obtained by ``xchk_fsgates_enable``.
 
 For more information, please see the kernel documentation of
 Documentation/staging/static-keys.rst.
+
+.. _xfile:
+
+Pageable Kernel Memory
+----------------------
+
+Some online checking functions work by scanning the filesystem to build a
+shadow copy of an ondisk metadata structure in memory and comparing the two
+copies.
+For online repair to rebuild a metadata structure, it must compute the record
+set that will be stored in the new structure before it can persist that new
+structure to disk.
+Ideally, repairs complete with a single atomic commit that introduces
+a new data structure.
+To meet these goals, the kernel needs to collect a large amount of information
+in a place that doesn't require the correct operation of the filesystem.
+
+Kernel memory isn't suitable because:
+
+* Allocating a contiguous region of memory to create a C array is very
+  difficult, especially on 32-bit systems.
+
+* Linked lists of records introduce double pointer overhead which is very high
+  and eliminate the possibility of indexed lookups.
+
+* Kernel memory is pinned, which can drive the system into OOM conditions.
+
+* The system might not have sufficient memory to stage all the information.
+
+At any given time, online fsck does not need to keep the entire record set in
+memory, which means that individual records can be paged out if necessary.
+Continued development of online fsck demonstrated that the ability to perform
+indexed data storage would also be very useful.
+Fortunately, the Linux kernel already has a facility for byte-addressable and
+pageable storage: tmpfs.
+In-kernel graphics drivers (most notably i915) take advantage of tmpfs files
+to store intermediate data that doesn't need to be in memory at all times, so
+that usage precedent is already established.
+Hence, the ``xfile`` was born!
+
++--------------------------------------------------------------------------+
+| **Historical Sidebar**:                                                  |
++--------------------------------------------------------------------------+
+| The first edition of online repair inserted records into a new btree as  |
+| it found them, which failed because the filesystem could shut down with  |
+| a half-built data structure that would be live after recovery finished.  |
+|                                                                          |
+| The second edition solved the half-rebuilt structure problem by storing  |
+| everything in memory, but frequently ran the system out of memory.       |
+|                                                                          |
+| The third edition solved the OOM problem by using linked lists, but the  |
+| memory overhead of the list pointers was extreme.                        |
++--------------------------------------------------------------------------+
+
+xfile Access Models
+```````````````````
+
+A survey of the intended uses of xfiles suggested these use cases:
+
+1. Arrays of fixed-sized records (space management btrees, directory and
+   extended attribute entries)
+
+2. Sparse arrays of fixed-sized records (quotas and link counts)
+
+3. Large binary objects (BLOBs) of variable sizes (directory and extended
+   attribute names and values)
+
+4. Staging btrees in memory (reverse mapping btrees)
+
+5. Arbitrary contents (realtime space management)
+
+To support the first four use cases, high level data structures wrap the xfile
+to share functionality between online fsck functions.
+The rest of this section discusses the interfaces that the xfile presents to
+four of those five higher level data structures.
+The fifth use case is discussed in the :ref:`realtime summary <rtsummary>` case
+study.
+
+The most general storage interface supported by the xfile enables the reading
+and writing of arbitrary quantities of data at arbitrary offsets in the xfile.
+This capability is provided by ``xfile_pread`` and ``xfile_pwrite`` functions,
+which behave similarly to their userspace counterparts.
+XFS is very record-based, which suggests that the ability to load and store
+complete records is important.
+To support these cases, a pair of ``xfile_obj_load`` and ``xfile_obj_store``
+functions are provided to read and persist objects into an xfile.
+They are internally the same as pread and pwrite, except that they treat any
+error as an out of memory error.
+For online repair, squashing error conditions in this manner is an acceptable
+behavior because the only reaction is to abort the operation back to userspace.
+All five xfile use cases can be serviced by these four functions.
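+
+For example, a function staging fixed-size records might use the object calls
+as in the following sketch.
+Here ``xf`` is the xfile in use, the record type is hypothetical, and the
+prototypes are assumed to take a buffer, a length, and a byte offset; the
+exact signatures may differ:
+
+.. code-block:: c
+
+	struct example_rec	rec = { };	/* hypothetical record type */
+	uint64_t		idx = 42;	/* hypothetical array index */
+	loff_t			pos = idx * sizeof(rec);
+	int			error;
+
+	/* Persist one record at a fixed offset in the xfile. */
+	error = xfile_obj_store(xf, &rec, sizeof(rec), pos);
+	if (error)
+		return error;	/* failures are reported as out of memory */
+
+	/* ...much later, recall the record from the same offset. */
+	error = xfile_obj_load(xf, &rec, sizeof(rec), pos);
+	if (error)
+		return error;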
+
+However, no discussion of file access idioms is complete without answering the
+question, "But what about mmap?"
+It is convenient to access storage directly with pointers, just like userspace
+code does with regular memory.
+Online fsck must not drive the system into OOM conditions, which means that
+xfiles must be responsive to memory reclamation.
+tmpfs can only push a pagecache folio to the swap cache if the folio is neither
+pinned nor locked, which means the xfile must not pin too many folios.
+
+Short term direct access to xfile contents is done by locking the pagecache
+folio and mapping it into kernel address space.
+Programmatic access (e.g. pread and pwrite) uses this mechanism.
+Folio locks are not supposed to be held for long periods of time, so long
+term direct access to xfile contents is done by bumping the folio refcount,
+mapping it into kernel address space, and dropping the folio lock.
+These long term users *must* be responsive to memory reclaim by hooking into
+the shrinker infrastructure to know when to release folios.
+
+The ``xfile_get_page`` and ``xfile_put_page`` functions are provided to
+retrieve the (locked) folio that backs part of an xfile and to release it.
+The only users of these folio lease functions are the xfarray
+:ref:`sorting<xfarray_sort>` algorithms and the :ref:`in-memory
+btrees<xfbtree>`.
+
+xfile Access Coordination
+`````````````````````````
+
+For security reasons, xfiles must be owned privately by the kernel.
+They are marked ``S_PRIVATE`` to prevent interference from the security system,
+must never be mapped into process file descriptor tables, and their pages must
+never be mapped into userspace processes.
+
+To avoid locking recursion issues with the VFS, all accesses to the shmfs file
+are performed by manipulating the page cache directly.
+xfile writers call the ``->write_begin`` and ``->write_end`` functions of the
+xfile's address space to grab writable pages, copy the caller's buffer into the
+page, and release the pages.
+xfile readers call ``shmem_read_mapping_page_gfp`` to grab pages directly
+before copying the contents into the caller's buffer.
+In other words, xfiles ignore the VFS read and write code paths to avoid
+having to create a dummy ``struct kiocb`` and to avoid taking inode and
+freeze locks.
+tmpfs cannot be frozen, and xfiles must not be exposed to userspace.
+
+If an xfile is shared between threads to stage repairs, the caller must provide
+its own locks to coordinate access.
+For example, if a scrub function stores scan results in an xfile and needs
+other threads to provide updates to the scanned data, the scrub function must
+provide a lock for all threads to share.
+
+.. _xfarray:
+
+Arrays of Fixed-Sized Records
+`````````````````````````````
+
+In XFS, each type of indexed space metadata (free space, inodes, reference
+counts, file fork space, and reverse mappings) consists of a set of fixed-size
+records indexed with a classic B+ tree.
+Directories have a set of fixed-size dirent records that point to the names,
+and extended attributes have a set of fixed-size attribute keys that point to
+names and values.
+Quota counters and file link counters index records with numbers.
+During a repair, scrub needs to stage new records during the gathering step and
+retrieve them during the btree building step.
+
+Although this requirement can be satisfied by calling the read and write
+methods of the xfile directly, it is simpler for callers for there to be a
+higher level abstraction to take care of computing array offsets, to provide
+iterator functions, and to deal with sparse records and sorting.
+The ``xfarray`` abstraction presents a linear array for fixed-size records atop
+the byte-accessible xfile.
+
+.. _xfarray_access_patterns:
+
+Array Access Patterns
+^^^^^^^^^^^^^^^^^^^^^
+
+Array access patterns in online fsck tend to fall into three categories.
+Iteration of records is assumed to be necessary for all cases and will be
+covered in the next section.
+
+The first type of caller handles records that are indexed by position.
+Gaps may exist between records, and a record may be updated multiple times
+during the collection step.
+In other words, these callers want a sparse linearly addressed table file.
+The typical use cases are quota records or file link count records.
+Access to array elements is performed programmatically via ``xfarray_load`` and
+``xfarray_store`` functions, which wrap the similarly-named xfile functions to
+provide loading and storing of array elements at arbitrary array indices.
+Gaps are defined to be null records, and null records are defined to be a
+sequence of all zero bytes.
+Null records are detected by calling ``xfarray_element_is_null``.
+They are created either by calling ``xfarray_unset`` to null out an existing
+record or by never storing anything to an array index.
+
+The second type of caller handles records that are not indexed by position
+and do not require multiple updates to a record.
+The typical use case here is rebuilding space btrees and key/value btrees.
+These callers can add records to the array without caring about array indices
+via the ``xfarray_append`` function, which stores a record at the end of the
+array.
+For callers that require records to be presentable in a specific order (e.g.
+rebuilding btree data), the ``xfarray_sort`` function can sort the records;
+this function will be covered later.
+
+The third type of caller is a bag, which is useful for counting records.
+The typical use case here is constructing space extent reference counts from
+reverse mapping information.
+Records can be put in the bag in any order, they can be removed from the bag
+at any time, and uniqueness of records is left to callers.
+The ``xfarray_store_anywhere`` function is used to insert a record in any
+null record slot in the bag; and the ``xfarray_unset`` function removes a
+record from the bag.
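+
+A bag-style caller might look like this sketch; ``array``, ``new_rec``, the
+record type, the helper predicate, and the exact prototypes are all
+illustrative:
+
+.. code-block:: c
+
+	struct example_rec	rec;
+	xfarray_idx_t		i;
+	int			error;
+
+	/* Toss the new record into any convenient null slot. */
+	error = xfarray_store_anywhere(array, &new_rec);
+	if (error)
+		return error;
+
+	/* Later, retire the records that are no longer needed. */
+	foreach_xfarray_idx(array, i) {
+		error = xfarray_load(array, i, &rec);
+		if (error)
+			return error;
+		if (xfarray_element_is_null(array, &rec))
+			continue;
+		if (example_rec_is_done(&rec)) {
+			error = xfarray_unset(array, i);
+			if (error)
+				return error;
+		}
+	}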
+
+The proposed patchset is the
+`big in-memory array
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=big-array>`_.
+
+Iterating Array Elements
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+Most users of the xfarray require the ability to iterate the records stored in
+the array.
+Callers can probe every possible array index with the following:
+
+.. code-block:: c
+
+	xfarray_idx_t i;
+	foreach_xfarray_idx(array, i) {
+	    xfarray_load(array, i, &rec);
+
+	    /* do something with rec */
+	}
+
+All users of this idiom must be prepared to handle null records or must already
+know that there aren't any.
+
+For xfarray users that want to iterate a sparse array, the ``xfarray_iter``
+function ignores indices in the xfarray that have never been written to by
+calling ``xfile_seek_data`` (which internally uses ``SEEK_DATA``) to skip areas
+of the array that are not populated with memory pages.
+Once it finds a page, it will skip the zeroed areas of the page.
+
+.. code-block:: c
+
+	xfarray_idx_t i = XFARRAY_CURSOR_INIT;
+	while ((ret = xfarray_iter(array, &i, &rec)) == 1) {
+	    /* do something with rec */
+	}
+
+.. _xfarray_sort:
+
+Sorting Array Elements
+^^^^^^^^^^^^^^^^^^^^^^
+
+During the fourth demonstration of online repair, a community reviewer remarked
+that for performance reasons, online repair ought to load batches of records
+into btree record blocks instead of inserting records into a new btree one at a
+time.
+The btree insertion code in XFS is responsible for maintaining correct ordering
+of the records, so naturally the xfarray must also support sorting the record
+set prior to bulk loading.
+
+Case Study: Sorting xfarrays
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The sorting algorithm used in the xfarray is actually a combination of adaptive
+quicksort and a heapsort subalgorithm in the spirit of
+`Sedgewick <https://algs4.cs.princeton.edu/23quicksort/>`_ and
+`pdqsort <https://github.com/orlp/pdqsort>`_, with customizations for the Linux
+kernel.
+To sort records in a reasonably short amount of time, ``xfarray`` takes
+advantage of the binary subpartitioning offered by quicksort, but it also uses
+heapsort to hedge against performance collapse if the chosen quicksort pivots
+are poor.
+Both algorithms are (in general) O(n * lg(n)), but there is a wide performance
+gulf between the two implementations.
+
+The Linux kernel already contains a reasonably fast implementation of heapsort.
+It only operates on regular C arrays, which limits the scope of its usefulness.
+There are two key places where the xfarray uses it:
+
+* Sorting any record subset backed by a single xfile page.
+
+* Loading a small number of xfarray records from potentially disparate parts
+  of the xfarray into a memory buffer, and sorting the buffer.
+
+In other words, ``xfarray`` uses heapsort to constrain the nested recursion of
+quicksort, thereby mitigating quicksort's worst runtime behavior.
+
+Choosing a quicksort pivot is a tricky business.
+A good pivot splits the set to sort in half, leading to the divide and conquer
+behavior that is crucial to O(n * lg(n)) performance.
+A poor pivot barely splits the subset at all, leading to O(n\ :sup:`2`)
+runtime.
+The xfarray sort routine tries to avoid picking a bad pivot by sampling nine
+records into a memory buffer and using the kernel heapsort to identify the
+median of the nine.
+
+Most modern quicksort implementations employ Tukey's "ninther" to select a
+pivot from a classic C array.
+Typical ninther implementations pick three unique triads of records, sort each
+of the triads, and then sort the middle value of each triad to determine the
+ninther value.
+As stated previously, however, xfile accesses are not entirely cheap.
+It turned out to be much more performant to read the nine elements into a
+memory buffer, run the kernel's in-memory heapsort on the buffer, and choose
+the 4th element of that buffer as the pivot.
+Tukey's ninthers are described in J. W. Tukey, `The ninther, a technique for
+low-effort robust (resistant) location in large samples`, in *Contributions to
+Survey Sampling and Applied Statistics*, edited by H. David, (Academic Press,
+1978), pp. 251–257.
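+
+A sketch of that pivot selection follows; the record type, the comparison
+function, and the ``array``, ``lo``, and ``hi`` variables are illustrative
+stand-ins for the sort's internal state:
+
+.. code-block:: c
+
+	struct example_rec	samples[9];
+	struct example_rec	pivot;
+	uint64_t		step = (hi - lo) / 8;
+	int			error;
+	int			i;
+
+	/* Load nine records spread evenly across the subset being sorted. */
+	for (i = 0; i < 9; i++) {
+		error = xfarray_load(array, lo + (i * step), &samples[i]);
+		if (error)
+			return error;
+	}
+
+	/* Sort the samples with the kernel heapsort... */
+	sort(samples, 9, sizeof(samples[0]), example_rec_cmp, NULL);
+
+	/* ...and use the middle sample as the quicksort pivot. */
+	pivot = samples[4];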
+
+The partitioning of quicksort is fairly textbook -- rearrange the record
+subset around the pivot, then set up the current and next stack frames to
+sort with the larger and the smaller halves of the pivot, respectively.
+This keeps the stack space requirements to log2(record count).
+
+As a final performance optimization, the hi and lo scanning phase of quicksort
+keeps examined xfile pages mapped in the kernel for as long as possible to
+reduce map/unmap cycles.
+Surprisingly, this reduces overall sort runtime by nearly half again after
+accounting for the application of heapsort directly onto xfile pages.
+
+Blob Storage
+````````````
+
+Extended attributes and directories add an additional requirement for staging
+records: arbitrary byte sequences of finite length.
+Each directory entry record needs to store the entry name,
+and each extended attribute needs to store both the attribute name and value.
+The names, keys, and values can consume a large amount of memory, so the
+``xfblob`` abstraction was created to simplify management of these blobs
+atop an xfile.
+
+Blob arrays provide ``xfblob_load`` and ``xfblob_store`` functions to retrieve
+and persist objects.
+The store function returns a magic cookie for every object that it persists.
+Later, callers provide this cookie to ``xfblob_load`` to recall the object.
+The ``xfblob_free`` function frees a specific blob, and the ``xfblob_truncate``
+function frees them all because compaction is not needed.
+
+The details of repairing directories and extended attributes will be discussed
+in a subsequent section about atomic extent swapping.
+However, it should be noted that these repair functions only use blob storage
+to cache a small number of entries before adding them to a temporary ondisk
+file, which is why compaction is not required.
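+
+A sketch of the flow for staging a directory entry name follows; ``blob``,
+``name``, ``name_len``, and ``name_buf`` are assumed context, and the
+prototypes shown are assumptions rather than the definitive interface:
+
+.. code-block:: c
+
+	xfblob_cookie	name_cookie;
+	int		error;
+
+	/* Persist the name and remember the cookie that identifies it. */
+	error = xfblob_store(blob, &name_cookie, name, name_len);
+	if (error)
+		return error;
+
+	/* ...later, use the cookie to recall the name. */
+	error = xfblob_load(blob, name_cookie, name_buf, name_len);
+	if (error)
+		return error;
+
+	/* Release the blob once the name has been copied elsewhere. */
+	error = xfblob_free(blob, name_cookie);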
+
+The proposed patchset is at the start of the
+`extended attribute repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-xattrs>`_ series.
+
+.. _xfbtree:
+
+In-Memory B+Trees
+`````````````````
+
+The chapter about :ref:`secondary metadata<secondary_metadata>` mentioned that
+checking and repairing of secondary metadata commonly requires coordination
+between a live metadata scan of the filesystem and writer threads that are
+updating that metadata.
+Keeping the scan data up to date requires the ability to propagate
+metadata updates from the filesystem into the data being collected by the scan.
+This *can* be done by appending concurrent updates into a separate log file and
+applying them before writing the new metadata to disk, but this leads to
+unbounded memory consumption if the rest of the system is very busy.
+Another option is to skip the side-log and commit live updates from the
+filesystem directly into the scan data, which trades more overhead for a lower
+maximum memory requirement.
+In both cases, the data structure holding the scan results must support indexed
+access to perform well.
+
+Given that indexed lookups of scan data are required for both strategies, online
+fsck employs the second strategy of committing live updates directly into
+scan data.
+Because xfarrays are not indexed and do not enforce record ordering, they
+are not suitable for this task.
+Conveniently, however, XFS has a library to create and maintain ordered reverse
+mapping records: the existing rmap btree code!
+If only there was a means to create one in memory.
+
+Recall that the :ref:`xfile <xfile>` abstraction represents memory pages as a
+regular file, which means that the kernel can create byte or block addressable
+virtual address spaces at will.
+The XFS buffer cache specializes in abstracting IO to block-oriented address
+spaces, which means that adaptation of the buffer cache to interface with
+xfiles enables reuse of the entire btree library.
+Btrees built atop an xfile are collectively known as ``xfbtrees``.
+The next few sections describe how they actually work.
+
+The proposed patchset is the
+`in-memory btree
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=in-memory-btrees>`_
+series.
+
+Using xfiles as a Buffer Cache Target
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Two modifications are necessary to support xfiles as a buffer cache target.
+The first is to make it possible for the ``struct xfs_buftarg`` structure to
+host the ``struct xfs_buf`` rhashtable, because normally those are held by a
+per-AG structure.
+The second change is to modify the buffer ``ioapply`` function to "read" cached
+pages from the xfile and "write" cached pages back to the xfile.
+Concurrent access to individual buffers is controlled by the ``xfs_buf`` lock,
+since the xfile does not provide any locking on its own.
+With this adaptation in place, users of the xfile-backed buffer cache use
+exactly the same APIs as users of the disk-backed buffer cache.
+The separation between xfile and buffer cache implies higher memory usage since
+they do not share pages, but this property could some day enable transactional
+updates to an in-memory btree.
+Today, however, it simply eliminates the need for new code.
+
+Space Management with an xfbtree
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Space management for an xfile is very simple -- each btree block is one memory
+page in size.
+These blocks use the same header format as an on-disk btree, but the in-memory
+block verifiers ignore the checksums, assuming that xfile memory is no more
+corruption-prone than regular DRAM.
+Reusing existing code here is more important than absolute memory efficiency.
+
+The very first block of an xfile backing an xfbtree contains a header block.
+The header describes the owner, height, and the block number of the root
+xfbtree block.
+
+To allocate a btree block, use ``xfile_seek_data`` to find a gap in the file.
+If there are no gaps, create one by extending the length of the xfile.
+Preallocate space for the block with ``xfile_prealloc``, and hand back the
+location.
+To free an xfbtree block, use ``xfile_discard`` (which internally uses
+``FALLOC_FL_PUNCH_HOLE``) to remove the memory page from the xfile.
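+
+A rough sketch of block allocation and freeing follows; the helper names come
+from the description above, though the exact calling conventions are
+illustrative::
+
+        /* Look for a hole in the xfile; extend the file if none is found. */
+        for (pos = 0; pos < xfile_size; pos += block_size)
+                if (xfile_seek_data(xfile, pos) != pos)
+                        break;                  /* found a gap at pos */
+
+        /* Back the chosen block with memory and hand it to the btree. */
+        error = xfile_prealloc(xfile, pos, block_size);
+
+        /* Freeing the block later punches the page back out of the xfile. */
+        xfile_discard(xfile, pos, block_size);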
+
+Populating an xfbtree
+^^^^^^^^^^^^^^^^^^^^^
+
+An online fsck function that wants to create an xfbtree should proceed as
+follows:
+
+1. Call ``xfile_create`` to create an xfile.
+
+2. Call ``xfs_alloc_memory_buftarg`` to create a buffer cache target structure
+   pointing to the xfile.
+
+3. Pass the buffer cache target, buffer ops, and other information to
+   ``xfbtree_create`` to write an initial tree header and root block to the
+   xfile.
+   Each btree type should define a wrapper that passes necessary arguments to
+   the creation function.
+   For example, rmap btrees define ``xfs_rmapbt_mem_create`` to take care of
+   all the necessary details for callers.
+   A ``struct xfbtree`` object will be returned.
+
+4. Pass the xfbtree object to the btree cursor creation function for the
+   btree type.
+   Following the example above, ``xfs_rmapbt_mem_cursor`` takes care of this
+   for callers.
+
+5. Pass the btree cursor to the regular btree functions to make queries against
+   and to update the in-memory btree.
+   For example, a btree cursor for an rmap xfbtree can be passed to the
+   ``xfs_rmap_*`` functions just like any other btree cursor.
+   See the :ref:`next section<xfbtree_commit>` for information on dealing with
+   xfbtree updates that are logged to a transaction.
+
+6. When finished, delete the btree cursor, destroy the xfbtree object, free the
+   buffer target, and destroy the xfile to release all resources.
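+
+Putting these steps together, the lifecycle of an rmap xfbtree might be
+sketched as follows.
+Error handling is omitted and the exact function signatures are illustrative::
+
+        /* Steps 1-3: create the xfile, buffer target, and in-memory btree. */
+        error = xfile_create("rmap staging", &xfile);
+        error = xfs_alloc_memory_buftarg(mp, xfile, &btp);
+        error = xfs_rmapbt_mem_create(mp, btp, &xfbt);
+
+        /* Steps 4-5: query and update it through a regular btree cursor. */
+        cur = xfs_rmapbt_mem_cursor(mp, tp, xfbt);
+        error = xfs_rmap_map_raw(cur, &rmap_record);
+        xfs_btree_del_cursor(cur, error);
+
+        /* Step 6: tear everything down in reverse order. */
+        xfbtree_destroy(xfbt);
+        xfs_free_buftarg(btp);
+        xfile_destroy(xfile);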
+
+.. _xfbtree_commit:
+
+Committing Logged xfbtree Buffers
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Although it is a clever hack to reuse the rmap btree code to handle the staging
+structure, the ephemeral nature of the in-memory btree block storage presents
+some challenges of its own.
+The XFS transaction manager must not commit buffer log items for buffers backed
+by an xfile because the log format does not understand updates for devices
+other than the data device.
+An ephemeral xfbtree probably will not exist by the time the AIL checkpoints
+log transactions back into the filesystem, and certainly won't exist during
+log recovery.
+For these reasons, any code updating an xfbtree in transaction context must
+remove the buffer log items from the transaction and write the updates into the
+backing xfile before committing or cancelling the transaction.
+
+The ``xfbtree_trans_commit`` and ``xfbtree_trans_cancel`` functions implement
+this functionality as follows:
+
+1. Find each buffer log item whose buffer targets the xfile.
+
+2. Record the dirty/ordered status of the log item.
+
+3. Detach the log item from the buffer.
+
+4. Queue the buffer to a special delwri list.
+
+5. Clear the transaction dirty flag if the only dirty log items were the ones
+   that were detached in step 3.
+
+6. Submit the delwri list to commit the changes to the xfile, if the updates
+   are being committed.
+
+After removing xfile logged buffers from the transaction in this manner, the
+transaction can be committed or cancelled.
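+
+In pseudocode, the commit path might be sketched like this; it mirrors the
+numbered steps above rather than the exact kernel implementation::
+
+        for each log item attached to the transaction:           /* step 1 */
+                if the item is not a buffer backed by the xfile:
+                        continue
+                note whether the item was dirty or ordered       /* step 2 */
+                detach the log item from the buffer              /* step 3 */
+                add the buffer to the xfbtree delwri list        /* step 4 */
+
+        if no other dirty log items remain:                      /* step 5 */
+                clear the transaction dirty flag
+
+        if committing:                                           /* step 6 */
+                submit the delwri list to write the xfile pages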


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 08/14] xfs: document btree bulk loading
  2023-03-07  1:30   ` Darrick J. Wong
                       ` (6 preceding siblings ...)
  2023-03-07  1:31     ` [PATCH 07/14] xfs: document pageable kernel memory Darrick J. Wong
@ 2023-03-07  1:31     ` Darrick J. Wong
  2023-03-07  1:31     ` [PATCH 09/14] xfs: document online file metadata repair code Darrick J. Wong
                       ` (5 subsequent siblings)
  13 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2023-03-07  1:31 UTC (permalink / raw)
  To: djwong
  Cc: linux-xfs, willy, chandan.babu, allison.henderson, linux-fsdevel,
	hch, catherine.hoang, david

From: Darrick J. Wong <djwong@kernel.org>

Add a discussion of the btree bulk loading code, which makes it easy to
take an in-memory recordset and write it out to disk in an efficient
manner.  This also enables atomic switchover from the old to the new
structure with minimal potential for leaking the old blocks.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 .../filesystems/xfs-online-fsck-design.rst         |  665 ++++++++++++++++++++
 1 file changed, 665 insertions(+)


diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst
index 21f0638ab69d..2baea7673498 100644
--- a/Documentation/filesystems/xfs-online-fsck-design.rst
+++ b/Documentation/filesystems/xfs-online-fsck-design.rst
@@ -2332,3 +2332,668 @@ this functionality as follows:
 
 After removing xfile logged buffers from the transaction in this manner, the
 transaction can be committed or cancelled.
+
+Bulk Loading of Ondisk B+Trees
+------------------------------
+
+As mentioned previously, early iterations of online repair built new btree
+structures by creating a new btree and adding observations individually.
+Loading a btree one record at a time had a slight advantage of not requiring
+the incore records to be sorted prior to commit, but was very slow and leaked
+blocks if the system went down during a repair.
+Loading records one at a time also meant that repair could not control the
+loading factor of the blocks in the new btree.
+
+Fortunately, the venerable ``xfs_repair`` tool had a more efficient means for
+rebuilding a btree index from a collection of records -- bulk btree loading.
+This was implemented rather inefficiently code-wise, since ``xfs_repair``
+had separate copy-pasted implementations for each btree type.
+
+To prepare for online fsck, each of the four bulk loaders was studied, notes
+were taken, and the four were refactored into a single generic btree bulk
+loading mechanism.
+Those notes in turn have been refreshed and are presented below.
+
+Geometry Computation
+````````````````````
+
+The zeroth step of bulk loading is to assemble the entire record set that will
+be stored in the new btree, and sort the records.
+Next, call ``xfs_btree_bload_compute_geometry`` to compute the shape of the
+btree from the record set, the type of btree, and any load factor preferences.
+This information is required for resource reservation.
+
+First, the geometry computation computes the minimum and maximum records that
+will fit in a leaf block from the size of a btree block and the size of the
+block header.
+Roughly speaking, the maximum number of records is::
+
+        maxrecs = (block_size - header_size) / record_size
+
+The XFS design specifies that btree blocks should be merged when possible,
+which means the minimum number of records is half of maxrecs::
+
+        minrecs = maxrecs / 2
+
+The next variable to determine is the desired loading factor.
+This must be at least minrecs and no more than maxrecs.
+Choosing minrecs is undesirable because it wastes half the block.
+Choosing maxrecs is also undesirable because adding a single record to each
+newly rebuilt leaf block will cause a tree split, which causes a noticeable
+drop in performance immediately afterwards.
+The default loading factor was chosen to be 75% of maxrecs, which provides a
+reasonably compact structure without any immediate split penalties::
+
+        default_load_factor = (maxrecs + minrecs) / 2
+
+If space is tight, the loading factor will be set to maxrecs to try to avoid
+running out of space::
+
+        leaf_load_factor = enough space ? default_load_factor : maxrecs
+
+Load factor is computed for btree node blocks using the combined size of the
+btree key and pointer as the record size::
+
+        maxrecs = (block_size - header_size) / (key_size + ptr_size)
+        minrecs = maxrecs / 2
+        node_load_factor = enough space ? default_load_factor : maxrecs
+
+Once that's done, the number of leaf blocks required to store the record set
+can be computed as::
+
+        leaf_blocks = ceil(record_count / leaf_load_factor)
+
+The number of node blocks needed to point to the next level down in the tree
+is computed as::
+
+        n_blocks = (n == 0 ? leaf_blocks : node_blocks[n])
+        node_blocks[n + 1] = ceil(n_blocks / node_load_factor)
+
+The entire computation is performed recursively until the current level only
+needs one block.
+The resulting geometry is as follows:
+
+- For AG-rooted btrees, this level is the root level, so the height of the new
+  tree is ``level + 1`` and the space needed is the summation of the number of
+  blocks on each level.
+
+- For inode-rooted btrees where the records in the top level do not fit in the
+  inode fork area, the height is ``level + 2``, the space needed is the
+  summation of the number of blocks on each level, and the inode fork points to
+  the root block.
+
+- For inode-rooted btrees where the records in the top level can be stored in
+  the inode fork area, then the root block can be stored in the inode, the
+  height is ``level + 1``, and the space needed is one less than the summation
+  of the number of blocks on each level.
+  This only becomes relevant when non-bmap btrees gain the ability to root in
+  an inode, which is a future patchset and only included here for completeness.
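+
+As a worked example with deliberately simple (hypothetical) numbers, suppose a
+block holds at most 100 records at every level, so the default load factor is
+75 records per block, and suppose there are 10,000 records to load::
+
+        leaf_blocks    = ceil(10000 / 75) = 134
+        node_blocks[1] = ceil(134 / 75)   = 2
+        node_blocks[2] = ceil(2 / 75)     = 1     <- only one block: the root
+
+For an AG-rooted btree, the height of the new tree would be 3 and the space
+reservation would be 134 + 2 + 1 = 137 blocks.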
+
+.. _newbt:
+
+Reserving New B+Tree Blocks
+```````````````````````````
+
+Once repair knows the number of blocks needed for the new btree, it allocates
+those blocks using the free space information.
+Each reserved extent is tracked separately by the btree builder state data.
+To improve crash resilience, the reservation code also logs an Extent Freeing
+Intent (EFI) item in the same transaction as each space allocation and attaches
+its in-memory ``struct xfs_extent_free_item`` object to the space reservation.
+If the system goes down, log recovery will use the unfinished EFIs to free the
+unused space, leaving the filesystem unchanged.
+
+Each time the btree builder claims a block for the btree from a reserved
+extent, it updates the in-memory reservation to reflect the claimed space.
+Block reservation tries to allocate as much contiguous space as possible to
+reduce the number of EFIs in play.
+
+While repair is writing these new btree blocks, the EFIs created for the space
+reservations pin the tail of the ondisk log.
+It's possible that other parts of the system will remain busy and push the head
+of the log towards the pinned tail.
+To avoid livelocking the filesystem, the EFIs must not pin the tail of the log
+for too long.
+To alleviate this problem, the dynamic relogging capability of the deferred ops
+mechanism is reused here to commit a transaction at the log head containing an
+EFD for the old EFI and a new EFI at the head.
+This enables the log to release the old EFI to keep the log moving forwards.
+
+EFIs have a role to play during the commit and reaping phases; please see the
+next section and the section about :ref:`reaping<reaping>` for more details.
+
+Proposed patchsets are the
+`bitmap rework
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-bitmap-rework>`_
+and the
+`preparation for bulk loading btrees
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-prep-for-bulk-loading>`_.
+
+
+Writing the New Tree
+````````````````````
+
+This part is pretty simple -- the btree builder (``xfs_btree_bulkload``) claims
+a block from the reserved list, writes the new btree block header, fills the
+rest of the block with records, and adds the new leaf block to a list of
+written blocks::
+
+  ┌────┐
+  │leaf│
+  │RRR │
+  └────┘
+
+Sibling pointers are set every time a new block is added to the level::
+
+  ┌────┐ ┌────┐ ┌────┐ ┌────┐
+  │leaf│→│leaf│→│leaf│→│leaf│
+  │RRR │←│RRR │←│RRR │←│RRR │
+  └────┘ └────┘ └────┘ └────┘
+
+When it finishes writing the record leaf blocks, it moves on to the node
+blocks.
+To fill a node block, it walks each block in the next level down in the tree
+to compute the relevant keys and write them into the parent node::
+
+      ┌────┐       ┌────┐
+      │node│──────→│node│
+      │PP  │←──────│PP  │
+      └────┘       └────┘
+      ↙   ↘         ↙   ↘
+  ┌────┐ ┌────┐ ┌────┐ ┌────┐
+  │leaf│→│leaf│→│leaf│→│leaf│
+  │RRR │←│RRR │←│RRR │←│RRR │
+  └────┘ └────┘ └────┘ └────┘
+
+When it reaches the root level, it is ready to commit the new btree!::
+
+          ┌─────────┐
+          │  root   │
+          │   PP    │
+          └─────────┘
+          ↙         ↘
+      ┌────┐       ┌────┐
+      │node│──────→│node│
+      │PP  │←──────│PP  │
+      └────┘       └────┘
+      ↙   ↘         ↙   ↘
+  ┌────┐ ┌────┐ ┌────┐ ┌────┐
+  │leaf│→│leaf│→│leaf│→│leaf│
+  │RRR │←│RRR │←│RRR │←│RRR │
+  └────┘ └────┘ └────┘ └────┘
+
+The first step to commit the new btree is to persist the btree blocks to disk
+synchronously.
+This is a little complicated because a new btree block could have been freed
+in the recent past, so the builder must use ``xfs_buf_delwri_queue_here`` to
+remove the (stale) buffer from the AIL before it can write the new blocks
+to disk.
+Blocks are queued for IO using a delwri list and written in one large batch
+with ``xfs_buf_delwri_submit``.
+
+Once the new blocks have been persisted to disk, control returns to the
+individual repair function that called the bulk loader.
+The repair function must log the location of the new root in a transaction,
+clean up the space reservations that were made for the new btree, and reap the
+old metadata blocks:
+
+1. Commit the location of the new btree root.
+
+2. For each incore reservation:
+
+   a. Log Extent Freeing Done (EFD) items for all the space that was consumed
+      by the btree builder.  The new EFDs must point to the EFIs attached to
+      the reservation to prevent log recovery from freeing the new blocks.
+
+   b. For unclaimed portions of incore reservations, create a regular deferred
+      extent free work item to free the unused space later in the
+      transaction chain.
+
+   c. The EFDs and EFIs logged in steps 2a and 2b must not overrun the
+      reservation of the committing transaction.
+      If the btree loading code suspects this might be about to happen, it must
+      call ``xrep_defer_finish`` to clear out the deferred work and obtain a
+      fresh transaction.
+
+3. Clear out the deferred work a second time to finish the commit and clean
+   the repair transaction.
+
+The transaction rolling in steps 2c and 3 represents a weakness in the repair
+algorithm, because a log flush and a crash before the end of the reap step can
+result in space leaking.
+Online repair functions minimize the chances of this occurring by using very
+large transactions, each of which can accommodate many thousands of block
+freeing instructions.
+Repair moves on to reaping the old blocks, which will be presented in a
+subsequent :ref:`section<reaping>` after a few case studies of bulk loading.
+
+Case Study: Rebuilding the Inode Index
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The high level process to rebuild the inode index btree is:
+
+1. Walk the reverse mapping records to generate ``struct xfs_inobt_rec``
+   records from the inode chunk information and a bitmap of the old inode btree
+   blocks.
+
+2. Append the records to an xfarray in inode order.
+
+3. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number
+   of blocks needed for the inode btree.
+   If the free space inode btree is enabled, call it again to estimate the
+   geometry of the finobt.
+
+4. Allocate the number of blocks computed in the previous step.
+
+5. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and
+   generate the internal node blocks.
+   If the free space inode btree is enabled, call it again to load the finobt.
+
+6. Commit the location of the new btree root block(s) to the AGI.
+
+7. Reap the old btree blocks using the bitmap created in step 1.
+
+Details are as follows.
+
+The inode btree maps inumbers to the ondisk location of the associated
+inode records, which means that the inode btrees can be rebuilt from the
+reverse mapping information.
+Reverse mapping records with an owner of ``XFS_RMAP_OWN_INOBT`` mark the
+location of the old inode btree blocks.
+Each reverse mapping record with an owner of ``XFS_RMAP_OWN_INODES`` marks the
+location of at least one inode cluster buffer.
+A cluster is the smallest number of ondisk inodes that can be allocated or
+freed in a single transaction; it is never smaller than 1 fs block or 4 inodes.
+
+For the space represented by each inode cluster, ensure that there are no
+records in the free space btrees nor any records in the reference count btree.
+If there are, the space metadata inconsistencies are reason enough to abort the
+operation.
+Otherwise, read each cluster buffer to check that its contents appear to be
+ondisk inodes and to decide if the inode is allocated
+(``xfs_dinode.i_mode != 0``) or free (``xfs_dinode.i_mode == 0``).
+Accumulate the results of successive inode cluster buffer reads until there is
+enough information to fill a single inode chunk record, which is 64 consecutive
+numbers in the inumber keyspace.
+If the chunk is sparse, the chunk record may include holes.
+
+Once the repair function accumulates one chunk's worth of data, it calls
+``xfarray_append`` to add the inode btree record to the xfarray.
+This xfarray is walked twice during the btree creation step -- once to populate
+the inode btree with all inode chunk records, and a second time to populate the
+free inode btree with records for chunks that have free non-sparse inodes.
+The number of records for the inode btree is the number of xfarray records,
+but the record count for the free inode btree has to be computed as inode chunk
+records are stored in the xfarray.
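+
+A condensed sketch of the record generation loop follows; this is pseudocode
+that restates the description above rather than the actual implementation::
+
+        for each reverse mapping record in this AG:
+                if owner == XFS_RMAP_OWN_INOBT:
+                        set the extent in the old-inobt-block bitmap
+                if owner != XFS_RMAP_OWN_INODES:
+                        continue
+                confirm that no free space or refcount records overlap
+                for each inode cluster buffer in the extent:
+                        read the cluster and note each inode's i_mode
+                        when 64 consecutive inumbers have been observed:
+                                xfarray_append(inobt_records, &chunk_record)
+                                reset the chunk accumulator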
+
+The proposed patchset is the
+`AG btree repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
+series.
+
+Case Study: Rebuilding the Space Reference Counts
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Reverse mapping records are used to rebuild the reference count information.
+Reference counts are required for correct operation of copy on write for shared
+file data.
+Imagine the reverse mapping entries as rectangles representing extents of
+physical blocks, and that the rectangles can be laid down to allow them to
+overlap each other.
+From the diagram below, it is apparent that a reference count record must start
+or end wherever the height of the stack changes.
+In other words, the record emission stimulus is level-triggered::
+
+                        █    ███
+              ██      █████ ████   ███        ██████
+        ██   ████     ███████████ ████     █████████
+        ████████████████████████████████ ███████████
+        ^ ^  ^^ ^^    ^ ^^ ^^^  ^^^^  ^ ^^ ^  ^     ^
+        2 1  23 21    3 43 234  2123  1 01 2  3     0
+
+The ondisk reference count btree does not store the refcount == 0 cases because
+the free space btree already records which blocks are free.
+Extents being used to stage copy-on-write operations should be the only records
+with refcount == 1.
+Single-owner file blocks aren't recorded in either the free space or the
+reference count btrees.
+
+The high level process to rebuild the reference count btree is:
+
+1. Walk the reverse mapping records to generate ``struct xfs_refcount_irec``
+   records for any space having more than one reverse mapping and add them to
+   the xfarray.
+   Any records owned by ``XFS_RMAP_OWN_COW`` are also added to the xfarray
+   because these are extents allocated to stage a copy on write operation and
+   are tracked in the refcount btree.
+
+   Use any records owned by ``XFS_RMAP_OWN_REFC`` to create a bitmap of old
+   refcount btree blocks.
+
+2. Sort the records in physical extent order, putting the CoW staging extents
+   at the end of the xfarray.
+   This matches the sorting order of records in the refcount btree.
+
+3. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number
+   of blocks needed for the new tree.
+
+4. Allocate the number of blocks computed in the previous step.
+
+5. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and
+   generate the internal node blocks.
+
+6. Commit the location of new btree root block to the AGF.
+
+7. Reap the old btree blocks using the bitmap created in step 1.
+
+Details are as follows; the same algorithm is used by ``xfs_repair`` to
+generate refcount information from reverse mapping records.
+
+- Until the reverse mapping btree runs out of records:
+
+  - Retrieve the next record from the btree and put it in a bag.
+
+  - Collect all records with the same starting block from the btree and put
+    them in the bag.
+
+  - While the bag isn't empty:
+
+    - Among the mappings in the bag, compute the lowest block number where the
+      reference count changes.
+      This position will be either the starting block number of the next
+      unprocessed reverse mapping or the next block after the shortest mapping
+      in the bag.
+
+    - Remove all mappings from the bag that end at this position.
+
+    - Collect all reverse mappings that start at this position from the btree
+      and put them in the bag.
+
+    - If the size of the bag changed and is greater than one, create a new
+      refcount record associating the block number range that we just walked to
+      the size of the bag.
+
+The bag-like structure in this case is a type 2 xfarray as discussed in the
+:ref:`xfarray access patterns<xfarray_access_patterns>` section.
+Reverse mappings are added to the bag using ``xfarray_store_anywhere`` and
+removed via ``xfarray_unset``.
+Bag members are examined through ``xfarray_iter`` loops.
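+
+A rough sketch of the bag manipulations, using the xfarray calls named above
+(the calling conventions shown here are illustrative)::
+
+        /* Add a reverse mapping to the bag; the array picks the slot. */
+        error = xfarray_store_anywhere(bag, &rmap_rec);
+
+        /* Examine every mapping currently in the bag. */
+        while ((error = xfarray_iter(bag, &cur, &rmap_rec)) == 1) {
+                /* ...find the lowest block where the refcount changes... */
+        }
+
+        /* Remove a mapping that ends at the current position. */
+        error = xfarray_unset(bag, cur);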
+
+The proposed patchset is the
+`AG btree repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
+series.
+
+Case Study: Rebuilding File Fork Mapping Indices
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The high level process to rebuild a data/attr fork mapping btree is:
+
+1. Walk the reverse mapping records to generate ``struct xfs_bmbt_rec``
+   records from the reverse mapping records for that inode and fork.
+   Append these records to an xfarray.
+   Compute the bitmap of the old bmap btree blocks from the ``BMBT_BLOCK``
+   records.
+
+2. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number
+   of blocks needed for the new tree.
+
+3. Sort the records in file offset order.
+
+4. If the extent records would fit in the inode fork immediate area, commit the
+   records to that immediate area and skip to step 8.
+
+5. Allocate the number of blocks computed in the previous step.
+
+6. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and
+   generate the internal node blocks.
+
+7. Commit the new btree root block to the inode fork immediate area.
+
+8. Reap the old btree blocks using the bitmap created in step 1.
+
+There are some complications here:
+First, it's possible to move the fork offset to adjust the sizes of the
+immediate areas if the data and attr forks are not both in BMBT format.
+Second, if there are sufficiently few fork mappings, it may be possible to use
+EXTENTS format instead of BMBT, which may require a conversion.
+Third, the incore extent map must be reloaded carefully to avoid disturbing
+any delayed allocation extents.
+
+The proposed patchset is the
+`file mapping repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-file-mappings>`_
+series.
+
+.. _reaping:
+
+Reaping Old Metadata Blocks
+---------------------------
+
+Whenever online fsck builds a new data structure to replace one that is
+suspect, there is a question of how to find and dispose of the blocks that
+belonged to the old structure.
+The laziest method of course is not to deal with them at all, but this slowly
+leads to service degradations as space leaks out of the filesystem.
+Hopefully, someone will schedule a rebuild of the free space information to
+plug all those leaks.
+Offline repair rebuilds all space metadata after recording the usage of
+the files and directories that it decides not to clear, hence it can build new
+structures in the discovered free space and avoid the question of reaping.
+
+As part of a repair, online fsck relies heavily on the reverse mapping records
+to find space that is owned by the corresponding rmap owner yet truly free.
+Cross referencing rmap records with other rmap records is necessary because
+there may be other data structures that also think they own some of those
+blocks (e.g. crosslinked trees).
+Permitting the block allocator to hand them out again will not push the system
+towards consistency.
+
+For space metadata, the process of finding extents to dispose of generally
+follows this format:
+
+1. Create a bitmap of space used by data structures that must be preserved.
+   The space reservations used to create the new metadata can be used here if
+   the same rmap owner code is used to denote all of the objects being rebuilt.
+
+2. Survey the reverse mapping data to create a bitmap of space owned by the
+   same ``XFS_RMAP_OWN_*`` number for the metadata that is being preserved.
+
+3. Use the bitmap disunion operator to subtract (1) from (2).
+   The remaining set bits represent candidate extents that could be freed.
+   The process moves on to step 4 below.
+
+Repairs for file-based metadata such as extended attributes, directories,
+symbolic links, quota files and realtime bitmaps are performed by building a
+new structure attached to a temporary file and swapping the forks.
+Afterward, the mappings in the old file fork are the candidate blocks for
+disposal.
+
+The process for disposing of old extents is as follows:
+
+4. For each candidate extent, count the number of reverse mapping records for
+   the first block in that extent that do not have the same rmap owner as the
+   data structure being repaired.
+
+   - If zero, the block has a single owner and can be freed.
+
+   - If not, the block is part of a crosslinked structure and must not be
+     freed.
+
+5. Starting with the next block in the extent, figure out how many more blocks
+   have the same zero/nonzero other owner status as that first block.
+
+6. If the region is crosslinked, delete the reverse mapping entry for the
+   structure being repaired and move on to the next region.
+
+7. If the region is to be freed, mark any corresponding buffers in the buffer
+   cache as stale to prevent log writeback.
+
+8. Free the region and move on.
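+
+In pseudocode, the disposal loop might be sketched as follows; this merely
+restates steps 4-8 above::
+
+        for each candidate extent:
+                count rmap records for the first block that belong
+                        to other owners                           /* step 4 */
+                extend the region while the zero/nonzero count
+                        state stays the same                      /* step 5 */
+                if the region is crosslinked:
+                        remove only this owner's rmap record      /* step 6 */
+                else:
+                        invalidate buffers for the region         /* step 7 */
+                        free the region                           /* step 8 */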
+
+However, there is one complication to this procedure.
+Transactions are of finite size, so the reaping process must be careful to roll
+the transactions to avoid overruns.
+Overruns come from two sources:
+
+a. EFIs logged on behalf of space that is no longer occupied
+
+b. Log items for buffer invalidations
+
+This is also a window in which a crash during the reaping process can leak
+blocks.
+As stated earlier, online repair functions use very large transactions to
+minimize the chances of this occurring.
+
+The proposed patchset is the
+`preparation for bulk loading btrees
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-prep-for-bulk-loading>`_
+series.
+
+Case Study: Reaping After a Regular Btree Repair
+````````````````````````````````````````````````
+
+Old reference count and inode btrees are the easiest to reap because they have
+rmap records with special owner codes: ``XFS_RMAP_OWN_REFC`` for the refcount
+btree, and ``XFS_RMAP_OWN_INOBT`` for the inode and free inode btrees.
+Creating a list of extents to reap the old btree blocks is quite simple,
+conceptually:
+
+1. Lock the relevant AGI/AGF header buffers to prevent allocation and frees.
+
+2. For each reverse mapping record with an rmap owner corresponding to the
+   metadata structure being rebuilt, set the corresponding range in a bitmap.
+
+3. Walk the current data structures that have the same rmap owner.
+   For each block visited, clear that range in the above bitmap.
+
+4. Each set bit in the bitmap represents a block that could be a block from the
+   old data structures and hence is a candidate for reaping.
+   In other words, ``(rmap_records_owned_by & ~blocks_reachable_by_walk)``
+   are the blocks that might be freeable.
+
+If it is possible to maintain the AGF lock throughout the repair (which is the
+common case), then step 2 can be performed at the same time as the reverse
+mapping record walk that creates the records for the new btree.
+
+Case Study: Rebuilding the Free Space Indices
+`````````````````````````````````````````````
+
+The high level process to rebuild the free space indices is:
+
+1. Walk the reverse mapping records to generate ``struct xfs_alloc_rec_incore``
+   records from the gaps in the reverse mapping btree.
+
+2. Append the records to an xfarray.
+
+3. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number
+   of blocks needed for each new tree.
+
+4. Allocate the number of blocks computed in the previous step from the free
+   space information collected.
+
+5. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and
+   generate the internal node blocks for the free space by length index.
+   Call it again for the free space by block number index.
+
+6. Commit the locations of the new btree root blocks to the AGF.
+
+7. Reap the old btree blocks by looking for space that is not recorded by the
+   reverse mapping btree, the new free space btrees, or the AGFL.
+
+Repairing the free space btrees has three key complications over a regular
+btree repair:
+
+First, free space is not explicitly tracked in the reverse mapping records.
+Hence, the new free space records must be inferred from gaps in the physical
+space component of the keyspace of the reverse mapping btree.
+
+Second, free space repairs cannot use the common btree reservation code because
+new blocks are reserved out of the free space btrees.
+This is impossible when repairing the free space btrees themselves.
+However, repair holds the AGF buffer lock for the duration of the free space
+index reconstruction, so it can use the collected free space information to
+supply the blocks for the new free space btrees.
+It is not necessary to back each reserved extent with an EFI because the new
+free space btrees are constructed in what the ondisk filesystem thinks is
+unowned space.
+However, if reserving blocks for the new btrees from the collected free space
+information changes the number of free space records, repair must re-estimate
+the new free space btree geometry with the new record count until the
+reservation is sufficient.
+As part of committing the new btrees, repair must ensure that reverse mappings
+are created for the reserved blocks and that unused reserved blocks are
+inserted into the free space btrees.
+Deferred rmap and freeing operations are used to ensure that this transition
+is atomic, similar to the other btree repair functions.
+
+Third, finding the blocks to reap after the repair is not overly
+straightforward.
+Blocks for the free space btrees and the reverse mapping btrees are supplied by
+the AGFL.
+Blocks put onto the AGFL have reverse mapping records with the owner
+``XFS_RMAP_OWN_AG``.
+This ownership is retained when blocks move from the AGFL into the free space
+btrees or the reverse mapping btrees.
+When repair walks reverse mapping records to synthesize free space records, it
+creates a bitmap (``ag_owner_bitmap``) of all the space claimed by
+``XFS_RMAP_OWN_AG`` records.
+The repair context maintains a second bitmap corresponding to the rmap btree
+blocks and the AGFL blocks (``rmap_agfl_bitmap``).
+When the walk is complete, the bitmap disunion operation ``(ag_owner_bitmap &
+~rmap_agfl_bitmap)`` computes the extents that are used by the old free space
+btrees.
+These blocks can then be reaped using the methods outlined above.
+
+The proposed patchset is the
+`AG btree repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
+series.
+
+.. _rmap_reap:
+
+Case Study: Reaping After Repairing Reverse Mapping Btrees
+``````````````````````````````````````````````````````````
+
+Old reverse mapping btrees are less difficult to reap after a repair.
+As mentioned in the previous section, blocks on the AGFL, blocks in the two
+free space btrees, and blocks in the reverse mapping btree all have reverse
+mapping records with ``XFS_RMAP_OWN_AG`` as the owner.
+The full process of gathering reverse mapping records and building a new btree
+are described in the case study of
+:ref:`live rebuilds of rmap data <rmap_repair>`, but a crucial point from that
+discussion is that the new rmap btree will not contain any records for the old
+rmap btree, nor will the old btree blocks be tracked in the free space btrees.
+The list of candidate reaping blocks is computed by setting the bits
+corresponding to the gaps in the new rmap btree records, and then clearing the
+bits corresponding to extents in the free space btrees and the current AGFL
+blocks.
+The result ``(new_rmapbt_gaps & ~(agfl | bnobt_records))`` are reaped using the
+methods outlined above.
+
+The rest of the process of rebuilding the reverse mapping btree is discussed
+in a separate :ref:`case study<rmap_repair>`.
+
+The proposed patchset is the
+`AG btree repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
+series.
+
+Case Study: Rebuilding the AGFL
+```````````````````````````````
+
+The allocation group free block list (AGFL) is repaired as follows:
+
+1. Create a bitmap for all the space that the reverse mapping data claims is
+   owned by ``XFS_RMAP_OWN_AG``.
+
+2. Subtract the space used by the two free space btrees and the rmap btree.
+
+3. Subtract any space that the reverse mapping data claims is owned by any
+   other owner, to avoid re-adding crosslinked blocks to the AGFL.
+
+4. Once the AGFL is full, reap any leftover blocks.
+
+5. The next operation to fix the freelist will right-size the list.
+
+See `fs/xfs/scrub/agheader_repair.c <https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/xfs/scrub/agheader_repair.c>`_ for more details.


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 09/14] xfs: document online file metadata repair code
  2023-03-07  1:30   ` Darrick J. Wong
                       ` (7 preceding siblings ...)
  2023-03-07  1:31     ` [PATCH 08/14] xfs: document btree bulk loading Darrick J. Wong
@ 2023-03-07  1:31     ` Darrick J. Wong
  2023-03-07  1:31     ` [PATCH 10/14] xfs: document full filesystem scans for online fsck Darrick J. Wong
                       ` (4 subsequent siblings)
  13 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2023-03-07  1:31 UTC (permalink / raw)
  To: djwong
  Cc: linux-xfs, willy, chandan.babu, allison.henderson, linux-fsdevel,
	hch, catherine.hoang, david

From: Darrick J. Wong <djwong@kernel.org>

Add to the fifth chapter of the online fsck design documentation, where
we discuss the details of the data structures and algorithms used by the
kernel to repair file metadata.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 .../filesystems/xfs-online-fsck-design.rst         |  154 ++++++++++++++++++++
 1 file changed, 154 insertions(+)


diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst
index 2baea7673498..83602fac7c5a 100644
--- a/Documentation/filesystems/xfs-online-fsck-design.rst
+++ b/Documentation/filesystems/xfs-online-fsck-design.rst
@@ -2997,3 +2997,157 @@ The allocation group free block list (AGFL) is repaired as follows:
 5. The next operation to fix the freelist will right-size the list.
 
 See `fs/xfs/scrub/agheader_repair.c <https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/xfs/scrub/agheader_repair.c>`_ for more details.
+
+Inode Record Repairs
+--------------------
+
+Inode records must be handled carefully, because they have both ondisk records
+("dinodes") and an in-memory ("cached") representation.
+There is a very high potential for cache coherency issues if online fsck is not
+careful to access the ondisk metadata *only* when the ondisk metadata is so
+badly damaged that the filesystem cannot load the in-memory representation.
+When online fsck wants to open a damaged file for scrubbing, it must use
+specialized resource acquisition functions that return either the in-memory
+representation *or* a lock on whichever object is necessary to prevent any
+update to the ondisk location.
+
+The only repairs that should be made to the ondisk inode buffers are whatever
+is necessary to get the in-core structure loaded.
+This means fixing whatever is caught by the inode cluster buffer and inode fork
+verifiers, and retrying the ``iget`` operation.
+If the second ``iget`` fails, the repair has failed.
+
+Once the in-memory representation is loaded, repair can lock the inode and can
+subject it to comprehensive checks, repairs, and optimizations.
+Most inode attributes are easy to check and constrain, or are user-controlled
+arbitrary bit patterns; these are both easy to fix.
+Dealing with the data and attr fork extent counts and the file block counts is
+more complicated, because computing the correct value requires traversing the
+forks, or if that fails, leaving the fields invalid and waiting for the fork
+fsck functions to run.
+
+The proposed patchset is the
+`inode
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-inodes>`_
+repair series.
+
+Quota Record Repairs
+--------------------
+
+Similar to inodes, quota records ("dquots") also have both ondisk records and
+an in-memory representation, and hence are subject to the same cache coherency
+issues.
+Somewhat confusingly, both are known as dquots in the XFS codebase.
+
+The only repairs that should be made to the ondisk quota record buffers are
+whatever is necessary to get the in-core structure loaded.
+Once the in-memory representation is loaded, the only attributes needing
+checking are obviously bad limits and timer values.
+
+Quota usage counters are checked, repaired, and discussed separately in the
+section about :ref:`live quotacheck <quotacheck>`.
+
+The proposed patchset is the
+`quota
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quota>`_
+repair series.
+
+.. _fscounters:
+
+Freezing to Fix Summary Counters
+--------------------------------
+
+Filesystem summary counters track availability of filesystem resources such
+as free blocks, free inodes, and allocated inodes.
+This information could be compiled by walking the free space and inode indexes,
+but this is a slow process, so XFS maintains a copy in the ondisk superblock
+that should reflect the ondisk metadata, at least when the filesystem has been
+unmounted cleanly.
+For performance reasons, XFS also maintains incore copies of those counters,
+which are key to enabling resource reservations for active transactions.
+Writer threads reserve the worst-case quantities of resources from the
+incore counter and give back whatever they don't use at commit time.
+It is therefore only necessary to serialize on the superblock when the
+superblock is being committed to disk.
+
+The lazy superblock counter feature introduced in XFS v5 took this even further
+by training log recovery to recompute the summary counters from the AG headers,
+which eliminated the need for most transactions even to touch the superblock.
+The only time XFS commits the summary counters is at filesystem unmount.
+To reduce contention even further, the incore counter is implemented as a
+percpu counter, which means that each CPU is allocated a batch of blocks from a
+global incore counter and can satisfy small allocations from the local batch.
+
+The high-performance nature of the summary counters makes it difficult for
+online fsck to check them, since there is no way to quiesce a percpu counter
+while the system is running.
+Although online fsck can read the filesystem metadata to compute the correct
+values of the summary counters, there's no way to hold the value of a percpu
+counter stable, so it's quite possible that the counter will be out of date by
+the time the walk is complete.
+Earlier versions of online scrub would return to userspace with an incomplete
+scan flag, but this is not a satisfying outcome for a system administrator.
+For repairs, the in-memory counters must be stabilized while walking the
+filesystem metadata to get an accurate reading and install it in the percpu
+counter.
+
+To satisfy this requirement, online fsck must prevent other programs in the
+system from initiating new writes to the filesystem, it must disable background
+garbage collection threads, and it must wait for existing writer programs to
+exit the kernel.
+Once that has been established, scrub can walk the AG free space indexes, the
+inode btrees, and the realtime bitmap to compute the correct value of all
+four summary counters.
+This is very similar to a filesystem freeze, though not all of the pieces are
+necessary:
+
+- The final freeze state is set one higher than ``SB_FREEZE_COMPLETE`` to
+  prevent other threads from thawing the filesystem, or other scrub threads
+  from initiating another fscounters freeze.
+
+- It does not quiesce the log.
+
+With this code in place, it is now possible to pause the filesystem for just
+long enough to check and correct the summary counters.
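+
+At a high level, a counter repair might therefore be sketched as follows; the
+helper names here are illustrative, not the actual function names::
+
+        freeze the filesystem against new writers and background work
+        wait for existing writer threads to leave the kernel
+
+        for each AG:
+                add free space btree totals to fdblocks
+                add inode btree totals to icount and ifree
+        walk the realtime bitmap to compute frextents
+
+        install the computed values in the percpu counters
+        thaw the filesystem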
+
++--------------------------------------------------------------------------+
+| **Historical Sidebar**:                                                  |
++--------------------------------------------------------------------------+
+| The initial implementation used the actual VFS filesystem freeze         |
+| mechanism to quiesce filesystem activity.                                |
+| With the filesystem frozen, it is possible to resolve the counter values |
+| with exact precision, but there are many problems with calling the VFS   |
+| methods directly:                                                        |
+|                                                                          |
+| - Other programs can unfreeze the filesystem without our knowledge.      |
+|   This leads to incorrect scan results and incorrect repairs.            |
+|                                                                          |
+| - Adding an extra lock to prevent others from thawing the filesystem     |
+|   required the addition of a ``->freeze_super`` function to wrap         |
+|   ``freeze_fs()``.                                                       |
+|   This in turn caused other subtle problems because it turns out that    |
+|   the VFS ``freeze_super`` and ``thaw_super`` functions can drop the     |
+|   last reference to the VFS superblock, and any subsequent access        |
+|   becomes a UAF bug!                                                     |
+|   This can happen if the filesystem is unmounted while the underlying    |
+|   block device has frozen the filesystem.                                |
+|   This problem could be solved by grabbing extra references to the       |
+|   superblock, but it felt suboptimal given the other inadequacies of     |
+|   this approach.                                                         |
+|                                                                          |
+| - The log need not be quiesced to check the summary counters, but a VFS  |
+|   freeze initiates one anyway.                                           |
+|   This adds unnecessary runtime to live fscounter fsck operations.       |
+|                                                                          |
+| - Quiescing the log means that XFS flushes the (possibly incorrect)      |
+|   counters to disk as part of cleaning the log.                          |
+|                                                                          |
+| - A bug in the VFS meant that freeze could complete even when            |
+|   sync_filesystem fails to flush the filesystem and returns an error.    |
+|   This bug was fixed in Linux 5.17.                                      |
++--------------------------------------------------------------------------+
+
+The proposed patchset is the
+`summary counter cleanup
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-fscounters>`_
+series.


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 10/14] xfs: document full filesystem scans for online fsck
  2023-03-07  1:30   ` Darrick J. Wong
                       ` (8 preceding siblings ...)
  2023-03-07  1:31     ` [PATCH 09/14] xfs: document online file metadata repair code Darrick J. Wong
@ 2023-03-07  1:31     ` Darrick J. Wong
  2023-03-07  1:31     ` [PATCH 11/14] xfs: document metadata file repair Darrick J. Wong
                       ` (3 subsequent siblings)
  13 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2023-03-07  1:31 UTC (permalink / raw)
  To: djwong
  Cc: linux-xfs, willy, chandan.babu, allison.henderson, linux-fsdevel,
	hch, catherine.hoang, david

From: Darrick J. Wong <djwong@kernel.org>

Certain parts of the online fsck code need to scan every file in the
entire filesystem.  It is not acceptable to block the entire filesystem
while this happens, which means that we need to be clever in allowing
scans to coordinate with ongoing filesystem updates.  We also need to
hook the filesystem so that regular updates propagate to the staging
records.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 .../filesystems/xfs-online-fsck-design.rst         |  681 ++++++++++++++++++++
 1 file changed, 681 insertions(+)


diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst
index 83602fac7c5a..ef19b4debc62 100644
--- a/Documentation/filesystems/xfs-online-fsck-design.rst
+++ b/Documentation/filesystems/xfs-online-fsck-design.rst
@@ -3151,3 +3151,684 @@ The proposed patchset is the
 `summary counter cleanup
 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-fscounters>`_
 series.
+
+Full Filesystem Scans
+---------------------
+
+Certain types of metadata can only be checked by walking every file in the
+entire filesystem to record observations and comparing the observations against
+what's recorded on disk.
+Like every other type of online repair, repairs are made by writing those
+observations to disk in a replacement structure and committing it atomically.
+However, it is not practical to shut down the entire filesystem to examine
+hundreds of billions of files because the downtime would be excessive.
+Therefore, online fsck must build the infrastructure to manage a live scan of
+all the files in the filesystem.
+There are two questions that need to be solved to perform a live walk:
+
+- How does scrub manage the scan while it is collecting data?
+
+- How does the scan keep abreast of changes being made to the system by other
+  threads?
+
+.. _iscan:
+
+Coordinated Inode Scans
+```````````````````````
+
+In the original Unix filesystems of the 1970s, each directory entry contained
+an index number (*inumber*) which was used as an index into an ondisk array
+(*itable*) of fixed-size records (*inodes*) describing a file's attributes and
+its data block mapping.
+This system is described by J. Lions, `"inode (5659)"
+<http://www.lemis.com/grog/Documentation/Lions/>`_ in *Lions' Commentary on
+UNIX, 6th Edition*, (Dept. of Computer Science, the University of New South
+Wales, November 1977), pp. 18-2; and later by D. Ritchie and K. Thompson,
+`"Implementation of the File System"
+<https://archive.org/details/bstj57-6-1905/page/n8/mode/1up>`_, from *The UNIX
+Time-Sharing System*, (The Bell System Technical Journal, July 1978), pp.
+1913-4.
+
+XFS retains most of this design, except now inumbers are search keys over all
+the space in the data section of the filesystem.
+They form a continuous keyspace that can be expressed as a 64-bit integer,
+though the inodes themselves are sparsely distributed within the keyspace.
+Scans proceed in a linear fashion across the inumber keyspace, starting from
+``0x0`` and ending at ``0xFFFFFFFFFFFFFFFF``.
+Naturally, a scan through a keyspace requires a scan cursor object to track the
+scan progress.
+Because this keyspace is sparse, this cursor contains two parts.
+The first part of this scan cursor object tracks the inode that will be
+examined next; call this the examination cursor.
+Somewhat less obviously, the scan cursor object must also track which parts of
+the keyspace have already been visited, which is critical for deciding if a
+concurrent filesystem update needs to be incorporated into the scan data.
+Call this the visited inode cursor.
+
+Advancing the scan cursor is a multi-step process encapsulated in
+``xchk_iscan_iter``:
+
+1. Lock the AGI buffer of the AG containing the inode pointed to by the visited
+   inode cursor.
+   This guarantees that inodes in this AG cannot be allocated or freed while
+   advancing the cursor.
+
+2. Use the per-AG inode btree to look up the next inumber after the one that
+   was just visited, since it may not be keyspace adjacent.
+
+3. If there are no more inodes left in this AG:
+
+   a. Move the examination cursor to the point of the inumber keyspace that
+      corresponds to the start of the next AG.
+
+   b. Adjust the visited inode cursor to indicate that it has "visited" the
+      last possible inode in the current AG's inode keyspace.
+      XFS inumbers are segmented, so the cursor needs to be marked as having
+      visited the entire keyspace up to just before the start of the next AG's
+      inode keyspace.
+
+   c. Unlock the AGI and return to step 1 if there are unexamined AGs in the
+      filesystem.
+
+   d. If there are no more AGs to examine, set both cursors to the end of the
+      inumber keyspace.
+      The scan is now complete.
+
+4. Otherwise, there is at least one more inode to scan in this AG:
+
+   a. Move the examination cursor ahead to the next inode marked as allocated
+      by the inode btree.
+
+   b. Adjust the visited inode cursor to point to the inode just prior to where
+      the examination cursor is now.
+      Because the scanner holds the AGI buffer lock, no inodes could have been
+      created in the part of the inode keyspace that the visited inode cursor
+      just advanced.
+
+5. Get the incore inode for the inumber of the examination cursor.
+   By maintaining the AGI buffer lock until this point, the scanner knows that
+   it was safe to advance the examination cursor across the entire keyspace,
+   and that it has stabilized this next inode so that it cannot disappear from
+   the filesystem until the scan releases the incore inode.
+
+6. Drop the AGI lock and return the incore inode to the caller.
+
+Online fsck functions scan all files in the filesystem as follows:
+
+1. Start a scan by calling ``xchk_iscan_start``.
+
+2. Advance the scan cursor (``xchk_iscan_iter``) to get the next inode.
+   If one is provided:
+
+   a. Lock the inode to prevent updates during the scan.
+
+   b. Scan the inode.
+
+   c. While still holding the inode lock, adjust the visited inode cursor
+      (``xchk_iscan_mark_visited``) to point to this inode.
+
+   d. Unlock and release the inode.
+
+3. Repeat step 2 until the scan cursor reaches the end of the inumber
+   keyspace.
+
+4. Call ``xchk_iscan_teardown`` to complete the scan.
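+
+A caller-side sketch of the scan loop might look like this; the helpers are
+the ones named in this chapter, ``check_this_inode`` stands in for the actual
+scrubber, and the exact signatures are illustrative::
+
+        xchk_iscan_start(sc, interval, timeout, &iscan);
+
+        while ((error = xchk_iscan_iter(&iscan, &ip)) == 1) {
+                xfs_ilock(ip, XFS_ILOCK_EXCL);
+                error = check_this_inode(sc, ip);
+                xchk_iscan_mark_visited(&iscan, ip);
+                xfs_iunlock(ip, XFS_ILOCK_EXCL);
+                xchk_irele(sc, ip);
+                if (error)
+                        break;
+        }
+
+        xchk_iscan_teardown(&iscan);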
+
+There are subtleties with the inode cache that complicate grabbing the incore
+inode for the caller.
+Obviously, it is an absolute requirement that the inode metadata be consistent
+enough to load it into the inode cache.
+Second, if the incore inode is stuck in some intermediate state, the scan
+coordinator must release the AGI and push the main filesystem to get the inode
+back into a loadable state.
+
+The proposed patches are the
+`inode scanner
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iscan>`_
+series.
+The first user of the new functionality is the
+`online quotacheck
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quotacheck>`_
+series.
+
+Inode Management
+````````````````
+
+In regular filesystem code, references to allocated XFS incore inodes are
+always obtained (``xfs_iget``) outside of transaction context because the
+creation of the incore context for an existing file does not require metadata
+updates.
+However, it is important to note that references to incore inodes obtained as
+part of file creation must be performed in transaction context because the
+filesystem must ensure the atomicity of the ondisk inode btree index updates
+and the initialization of the actual ondisk inode.
+
+References to incore inodes are always released (``xfs_irele``) outside of
+transaction context because there are a handful of activities that might
+require ondisk updates:
+
+- The VFS may decide to kick off writeback as part of a ``DONTCACHE`` inode
+  release.
+
+- Speculative preallocations need to be unreserved.
+
+- An unlinked file may have lost its last reference, in which case the entire
+  file must be inactivated, which involves releasing all of its resources in
+  the ondisk metadata and freeing the inode.
+
+These activities are collectively called inode inactivation.
+Inactivation has two parts -- the VFS part, which initiates writeback on all
+dirty file pages, and the XFS part, which cleans up XFS-specific information
+and frees the inode if it was unlinked.
+If the inode is unlinked (or unconnected after a file handle operation), the
+kernel drops the inode into the inactivation machinery immediately.
+
+During normal operation, resource acquisition for an update follows this order
+to avoid deadlocks:
+
+1. Inode reference (``iget``).
+
+2. Filesystem freeze protection, if repairing (``mnt_want_write_file``).
+
+3. Inode ``IOLOCK`` (VFS ``i_rwsem``) lock to control file IO.
+
+4. Inode ``MMAPLOCK`` (page cache ``invalidate_lock``) lock for operations that
+   can update page cache mappings.
+
+5. Log feature enablement.
+
+6. Transaction log space grant.
+
+7. Space on the data and realtime devices for the transaction.
+
+8. Incore dquot references, if a file is being repaired.
+   Note that they are not locked, merely acquired.
+
+9. Inode ``ILOCK`` for file metadata updates.
+
+10. AG header buffer locks / Realtime metadata inode ILOCK.
+
+11. Realtime metadata buffer locks, if applicable.
+
+12. Extent mapping btree blocks, if applicable.
+
+Resources are often released in the reverse order, though this is not required.
+However, online fsck differs from regular XFS operations because it may examine
+an object that normally is acquired in a later stage of the locking order, and
+then decide to cross-reference the object with an object that is acquired
+earlier in the order.
+The next few sections detail the specific ways in which online fsck takes care
+to avoid deadlocks.
+
+iget and irele During a Scrub
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+An inode scan performed on behalf of a scrub operation runs in transaction
+context, and possibly with resources already locked and bound to it.
+This isn't much of a problem for ``iget`` since it can operate in the context
+of an existing transaction, as long as all of the bound resources are acquired
+before the inode reference in the regular filesystem.
+
+When the VFS ``iput`` function is given a linked inode with no other
+references, it normally puts the inode on an LRU list in the hope that it can
+save time if another process re-opens the file before the system runs out
+of memory and frees it.
+Filesystem callers can short-circuit the LRU process by setting a ``DONTCACHE``
+flag on the inode to cause the kernel to try to drop the inode into the
+inactivation machinery immediately.
+
+In the past, inactivation was always done from the process that dropped the
+inode, which was a problem for scrub because scrub may already hold a
+transaction, and XFS does not support nesting transactions.
+On the other hand, if there is no scrub transaction, it is desirable to drop
+otherwise unused inodes immediately to avoid polluting caches.
+To capture these nuances, the online fsck code has a separate ``xchk_irele``
+function to set or clear the ``DONTCACHE`` flag to get the required release
+behavior.
+
+Proposed patchsets include fixing
+`scrub iget usage
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iget-fixes>`_ and
+`dir iget usage
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-dir-iget-fixes>`_.
+
+Locking Inodes
+^^^^^^^^^^^^^^
+
+In regular filesystem code, the VFS and XFS will acquire multiple IOLOCK locks
+in a well-known order: parent → child when updating the directory tree, and
+in numerical order of the addresses of their ``struct inode`` objects otherwise.
+For regular files, the MMAPLOCK can be acquired after the IOLOCK to stop page
+faults.
+If two MMAPLOCKs must be acquired, they are acquired in numerical order of
+the addresses of their ``struct address_space`` objects.
+Due to the structure of existing filesystem code, IOLOCKs and MMAPLOCKs must be
+acquired before transactions are allocated.
+If two ILOCKs must be acquired, they are acquired in inumber order.
+
+Inode lock acquisition must be done carefully during a coordinated inode scan.
+Online fsck cannot abide these conventions, because for a directory tree
+scanner, the scrub process holds the IOLOCK of the file being scanned and it
+needs to take the IOLOCK of the file at the other end of the directory link.
+If the directory tree is corrupt because it contains a cycle, ``xfs_scrub``
+cannot use the regular inode locking functions without risking an ABBA
+deadlock.
+
+Solving both of these problems is straightforward -- any time online fsck
+needs to take a second lock of the same class, it uses trylock to avoid an ABBA
+deadlock.
+If the trylock fails, scrub drops all inode locks and uses trylock loops to
+(re)acquire all necessary resources.
+Trylock loops enable scrub to check for pending fatal signals, which is how
+scrub avoids deadlocking the filesystem or becoming an unresponsive process.
+However, trylock loops mean that online fsck must be prepared to measure the
+resource being scrubbed before and after the lock cycle to detect changes and
+react accordingly.
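+
+The pattern described above might look like the following sketch.
+Every name in it is illustrative -- these are not actual XFS or kernel scrub
+functions -- and it exists only to show the shape of a trylock loop that backs
+off completely and checks for fatal signals.
+
+.. code-block:: c
+
+	/* Illustrative only; the helpers are stand-ins, not real functions. */
+	static int example_lock_two(struct example_obj *a, struct example_obj *b)
+	{
+		while (!fatal_signal_pending(current)) {
+			lock_first(a);
+			if (trylock_second(b))
+				return 0;	/* both objects locked */
+
+			/* Contention: drop everything and try again. */
+			unlock_first(a);
+			backoff();
+		}
+		return -EINTR;			/* give up; do not deadlock */
+	}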
+
+.. _dirparent:
+
+Case Study: Finding a Directory Parent
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Consider the directory parent pointer repair code as an example.
+Online fsck must verify that the dotdot dirent of a directory points up to a
+parent directory, and that the parent directory contains exactly one dirent
+pointing down to the child directory.
+Fully validating this relationship (and repairing it if possible) requires a
+walk of every directory on the filesystem while holding the child locked, and
+while updates to the directory tree are being made.
+The coordinated inode scan provides a way to walk the filesystem without the
+possibility of missing an inode.
+The child directory is kept locked to prevent updates to the dotdot dirent, but
+if the scanner fails to lock a parent, it can drop and relock both the child
+and the prospective parent.
+If the dotdot entry changes while the directory is unlocked, then a move or
+rename operation must have changed the child's parentage, and the scan can
+exit early.
+
+The proposed patchset is the
+`directory repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-dirs>`_
+series.
+
+.. _fshooks:
+
+Filesystem Hooks
+`````````````````
+
+The second piece of support that online fsck functions need during a full
+filesystem scan is the ability to stay informed about updates being made by
+other threads in the filesystem, since comparisons against the past are useless
+in a dynamic environment.
+Two pieces of Linux kernel infrastructure enable online fsck to monitor regular
+filesystem operations: filesystem hooks and :ref:`static keys<jump_labels>`.
+
+Filesystem hooks convey information about an ongoing filesystem operation to
+a downstream consumer.
+In this case, the downstream consumer is always an online fsck function.
+Because multiple fsck functions can run in parallel, online fsck uses the Linux
+notifier call chain facility to dispatch updates to any number of interested
+fsck processes.
+Call chains are a dynamic list, which means that they can be configured at
+run time.
+Because these hooks are private to the XFS module, the information passed along
+contains exactly what the checking function needs to update its observations.
+
+The current implementation of XFS hooks uses SRCU notifier chains to reduce the
+impact to highly threaded workloads.
+Regular blocking notifier chains use a rwsem and seem to have a much lower
+overhead for single-threaded applications.
+However, it may turn out that the combination of blocking chains and static
+keys is more performant; more study is needed here.
+
+The following pieces are necessary to hook a certain point in the filesystem:
+
+- A ``struct xfs_hooks`` object must be embedded in a convenient place such as
+  a well-known incore filesystem object.
+
+- Each hook must define an action code and a structure containing more context
+  about the action.
+
+- Hook providers should provide appropriate wrapper functions and structs
+  around the ``xfs_hooks`` and ``xfs_hook`` objects to take advantage of type
+  checking to ensure correct usage.
+
+- A callsite in the regular filesystem code must be chosen to call
+  ``xfs_hooks_call`` with the action code and data structure.
+  This place should be adjacent to (and not earlier than) the place where
+  the filesystem update is committed to the transaction.
+  In general, when the filesystem calls a hook chain, it should be able to
+  handle sleeping and should not be vulnerable to memory reclaim or locking
+  recursion.
+  However, the exact requirements are very dependent on the context of the hook
+  caller and the callee.
+
+- The online fsck function should define a structure to hold scan data, a lock
+  to coordinate access to the scan data, and a ``struct xfs_hook`` object.
+  The scanner function and the regular filesystem code must acquire resources
+  in the same order; see the next section for details.
+
+- The online fsck code must contain a C function to catch the hook action code
+  and data structure.
+  If the object being updated has already been visited by the scan, then the
+  hook information must be applied to the scan data.
+
+- Prior to unlocking inodes to start the scan, online fsck must call
+  ``xfs_hooks_setup`` to initialize the ``struct xfs_hook``, and
+  ``xfs_hooks_add`` to enable the hook.
+
+- Online fsck must call ``xfs_hooks_del`` to disable the hook once the scan is
+  complete.
+
+The number of hooks should be kept to a minimum to reduce complexity.
+Static keys are used to reduce the overhead of filesystem hooks to nearly
+zero when online fsck is not running.
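+
+A hook provider and consumer might be shaped roughly like the sketch below.
+Only ``xfs_hooks_setup``, ``xfs_hooks_add``, ``xfs_hooks_call``, and
+``xfs_hooks_del`` are named in this document; every other identifier here is
+hypothetical, and even the named functions may have different signatures in
+the proposed patches.
+
+.. code-block:: c
+
+	/* Hypothetical context passed from the hooked filesystem function
+	 * to the scrub hook function. */
+	struct xfs_example_update_params {
+		struct xfs_inode	*ip;
+		int			delta;
+	};
+
+	/* Callsite in the regular filesystem code, adjacent to the point
+	 * where the update is committed to the transaction. */
+	static void xfs_example_update_hook(struct xfs_mount *mp,
+					    struct xfs_example_update_params *p)
+	{
+		xfs_hooks_call(&mp->m_example_hooks, XFS_EXAMPLE_UPDATE, p);
+	}
+
+	/* Scrub enables its hook before unlocking inodes to start the scan... */
+	static int xchk_example_hook_setup(struct xchk_example *xe)
+	{
+		xfs_hooks_setup(&xe->hook, xchk_example_live_update);
+		return xfs_hooks_add(&xe->mp->m_example_hooks, &xe->hook);
+	}
+
+	/* ...and disables it once the scan is complete. */
+	static void xchk_example_hook_teardown(struct xchk_example *xe)
+	{
+		xfs_hooks_del(&xe->mp->m_example_hooks, &xe->hook);
+	}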
+
+.. _liveupdate:
+
+Live Updates During a Scan
+``````````````````````````
+
+The code paths of the online fsck scanning code and the :ref:`hooked<fshooks>`
+filesystem code look like this::
+
+            other program
+                  ↓
+            inode lock ←────────────────────┐
+                  ↓                         │
+            AG header lock                  │
+                  ↓                         │
+            filesystem function             │
+                  ↓                         │
+            notifier call chain             │    same
+                  ↓                         ├─── inode
+            scrub hook function             │    lock
+                  ↓                         │
+            scan data mutex ←──┐    same    │
+                  ↓            ├─── scan    │
+            update scan data   │    lock    │
+                  ↑            │            │
+            scan data mutex ←──┘            │
+                  ↑                         │
+            inode lock ←────────────────────┘
+                  ↑
+            scrub function
+                  ↑
+            inode scanner
+                  ↑
+            xfs_scrub
+
+These rules must be followed to ensure correct interactions between the
+checking code and the code making an update to the filesystem:
+
+- Prior to invoking the notifier call chain, the filesystem function being
+  hooked must acquire the same lock that the scrub scanning function acquires
+  to scan the inode.
+
+- The scanning function and the scrub hook function must coordinate access to
+  the scan data by acquiring a lock on the scan data.
+
+- The scrub hook function must not add the live update information to the scan
+  observations unless the inode being updated has already been scanned.
+  The scan coordinator has a helper predicate (``xchk_iscan_want_live_update``)
+  for this.
+
+- Scrub hook functions must not change the caller's state, including the
+  transaction that it is running.
+  They must not acquire any resources that might conflict with the filesystem
+  function being hooked.
+
+- The hook function can abort the inode scan to avoid breaking the other rules.
+
+The inode scan APIs are pretty simple:
+
+- ``xchk_iscan_start`` starts a scan
+
+- ``xchk_iscan_iter`` grabs a reference to the next inode in the scan or
+  returns zero if there is nothing left to scan
+
+- ``xchk_iscan_want_live_update`` to decide if an inode has already been
+  visited in the scan.
+  This is critical for hook functions to decide if they need to update the
+  in-memory scan information.
+
+- ``xchk_iscan_mark_visited`` to mark an inode as having been visited in the
+  scan
+
+- ``xchk_iscan_teardown`` to finish the scan
+
+This functionality is also a part of the
+`inode scanner
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iscan>`_
+series.
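+
+Put together, a scrubber might drive the scan with a loop like this sketch.
+The ``xchk_iscan_*`` and ``xchk_irele`` functions are the ones listed above,
+but the argument lists and return conventions shown here are assumptions, as
+is the ``xchk_example_collect`` helper.
+
+.. code-block:: c
+
+	static int xchk_example_walk_inodes(struct xfs_scrub *sc)
+	{
+		struct xchk_iscan	iscan;
+		struct xfs_inode	*ip;
+		int			error;
+
+		xchk_iscan_start(sc, &iscan);
+
+		/* Assume a positive return means an inode was grabbed. */
+		while ((error = xchk_iscan_iter(&iscan, &ip)) == 1) {
+			xfs_ilock(ip, XFS_ILOCK_SHARED);
+			error = xchk_example_collect(sc, ip);
+			xfs_iunlock(ip, XFS_ILOCK_SHARED);
+
+			xchk_iscan_mark_visited(&iscan, ip);
+			xchk_irele(sc, ip);
+			if (error)
+				break;
+		}
+
+		xchk_iscan_teardown(&iscan);
+		return error;
+	}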
+
+.. _quotacheck:
+
+Case Study: Quota Counter Checking
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+It is useful to compare the mount time quotacheck code to the online repair
+quotacheck code.
+Mount time quotacheck does not have to contend with concurrent operations, so
+it does the following:
+
+1. Make sure the ondisk dquots are in good enough shape that all the incore
+   dquots will actually load, and zero the resource usage counters in the
+   ondisk buffer.
+
+2. Walk every inode in the filesystem.
+   Add each file's resource usage to the incore dquot.
+
+3. Walk each incore dquot.
+   If the incore dquot is not being flushed, add the ondisk buffer backing the
+   incore dquot to a delayed write (delwri) list.
+
+4. Write the buffer list to disk.
+
+Like most online fsck functions, online quotacheck can't write to regular
+filesystem objects until the newly collected metadata reflect all filesystem
+state.
+Therefore, online quotacheck records file resource usage to a shadow dquot
+index implemented with a sparse ``xfarray``, and only writes to the real dquots
+once the scan is complete.
+Handling transactional updates is tricky because quota resource usage updates
+are handled in phases to minimize contention on dquots:
+
+1. The inodes involved are joined and locked to a transaction.
+
+2. For each dquot attached to the file:
+
+   a. The dquot is locked.
+
+   b. A quota reservation is added to the dquot's resource usage.
+      The reservation is recorded in the transaction.
+
+   c. The dquot is unlocked.
+
+3. Changes in actual quota usage are tracked in the transaction.
+
+4. At transaction commit time, each dquot is examined again:
+
+   a. The dquot is locked again.
+
+   b. Quota usage changes are logged and unused reservation is given back to
+      the dquot.
+
+   c. The dquot is unlocked.
+
+For online quotacheck, hooks are placed in steps 2 and 4.
+The step 2 hook creates a shadow version of the transaction dquot context
+(``dqtrx``) that operates in a similar manner to the regular code.
+The step 4 hook commits the shadow ``dqtrx`` changes to the shadow dquots.
+Notice that both hooks are called with the inode locked, which is how the
+live update coordinates with the inode scanner.
+
+The quotacheck scan looks like this:
+
+1. Set up a coordinated inode scan.
+
+2. For each inode returned by the inode scan iterator:
+
+   a. Grab and lock the inode.
+
+   b. Determine that inode's resource usage (data blocks, inode counts,
+      realtime blocks) and add that to the shadow dquots for the user, group,
+      and project ids associated with the inode.
+
+   c. Unlock and release the inode.
+
+3. For each dquot in the system:
+
+   a. Grab and lock the dquot.
+
+   b. Check the dquot against the shadow dquots created by the scan and updated
+      by the live hooks.
+
+Live updates are key to being able to walk every quota record without
+needing to hold any locks for a long duration.
+If repairs are desired, the real and shadow dquots are locked and their
+resource counts are set to the values in the shadow dquot.
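+
+The per-inode accounting in step 2b might look like the sketch below.
+The record layout and the xfarray helper calls are assumptions standing in
+for the sparse shadow dquot index described above.
+
+.. code-block:: c
+
+	/* Hypothetical shadow dquot record, one per quota id. */
+	struct xqcheck_dquot {
+		uint64_t	icount;		/* inodes charged to this id */
+		uint64_t	bcount;		/* data device blocks */
+		uint64_t	rtbcount;	/* realtime device blocks */
+	};
+
+	/* Called once per user, group, and project id attached to a file. */
+	static int xqcheck_record_usage(struct xfarray *shadow, uint32_t id,
+					uint64_t dblocks, uint64_t rtblocks)
+	{
+		struct xqcheck_dquot	xcdq = { };
+		int			error;
+
+		/* Assume untouched ids read back as zeroed records. */
+		error = xfarray_load(shadow, id, &xcdq);
+		if (error && error != -ENODATA)
+			return error;
+
+		xcdq.icount++;
+		xcdq.bcount += dblocks;
+		xcdq.rtbcount += rtblocks;
+
+		return xfarray_store(shadow, id, &xcdq);
+	}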
+
+The proposed patchset is the
+`online quotacheck
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quotacheck>`_
+series.
+
+.. _nlinks:
+
+Case Study: File Link Count Checking
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+File link count checking also uses live update hooks.
+The coordinated inode scanner is used to visit all directories on the
+filesystem, and per-file link count records are stored in a sparse ``xfarray``
+indexed by inumber.
+During the scanning phase, each entry in a directory generates observation
+data as follows:
+
+1. If the entry is a dotdot (``'..'``) entry of the root directory, the
+   directory's parent link count is bumped because the root directory's dotdot
+   entry is self referential.
+
+2. If the entry is a dotdot entry of a subdirectory, the parent's backref
+   count is bumped.
+
+3. If the entry is neither a dot nor a dotdot entry, the target file's parent
+   count is bumped.
+
+4. If the target is a subdirectory, the parent's child link count is bumped.
+
+A crucial point to understand about how the link count inode scanner interacts
+with the live update hooks is that the scan cursor tracks which *parent*
+directories have been scanned.
+In other words, the live updates ignore any update about ``A → B`` when A has
+not been scanned, even if B has been scanned.
+Furthermore, a subdirectory A with a dotdot entry pointing back to B is
+accounted as a backref counter in the shadow data for A, since child dotdot
+entries affect the parent's link count.
+Live update hooks are carefully placed in all parts of the filesystem that
+create, change, or remove directory entries, since those operations involve
+bumplink and droplink.
+
+For any file, the correct link count is the number of parents plus the number
+of child subdirectories.
+Non-directories never have children of any kind.
+The backref information is used to detect inconsistencies in the number of
+links pointing to child subdirectories and the number of dotdot entries
+pointing back.
+
+After the scan completes, the link count of each file can be checked by locking
+both the inode and the shadow data, and comparing the link counts.
+A second coordinated inode scan cursor is used for comparisons.
+Live updates are key to being able to walk every inode without needing to hold
+any locks between inodes.
+If repairs are desired, the inode's link count is set to the value in the
+shadow information.
+If no parents are found, the file must be :ref:`reparented <orphanage>` to the
+orphanage to prevent the file from being lost forever.
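+
+The shadow data might be expressed as a small per-file record like the sketch
+below; the structure and helpers are illustrative, not the actual patchset
+definitions.
+
+.. code-block:: c
+
+	/* Hypothetical per-file shadow link count record. */
+	struct xchk_nlink {
+		uint32_t	parents;	/* dirents pointing to this file */
+		uint32_t	backrefs;	/* dotdot entries pointing back here */
+		uint32_t	children;	/* subdirectories under this dir */
+	};
+
+	/* The correct link count is parents plus child subdirectories. */
+	static inline uint32_t xchk_nlink_total(const struct xchk_nlink *nl)
+	{
+		return nl->parents + nl->children;
+	}
+
+	/*
+	 * Each child subdirectory should contribute both a child link (from
+	 * this directory's dirent) and a backref (from the child's dotdot
+	 * entry); a mismatch points to an inconsistency.
+	 */
+	static inline bool xchk_nlink_children_consistent(const struct xchk_nlink *nl)
+	{
+		return nl->backrefs == nl->children;
+	}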
+
+The proposed patchset is the
+`file link count repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-nlinks>`_
+series.
+
+.. _rmap_repair:
+
+Case Study: Rebuilding Reverse Mapping Records
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Most repair functions follow the same pattern: lock filesystem resources,
+walk the surviving ondisk metadata looking for replacement metadata records,
+and use an :ref:`in-memory array <xfarray>` to store the gathered observations.
+The primary advantage of this approach is the simplicity and modularity of the
+repair code -- code and data are entirely contained within the scrub module,
+do not require hooks in the main filesystem, and are usually the most efficient
+in memory use.
+A secondary advantage of this repair approach is atomicity -- once the kernel
+decides a structure is corrupt, no other threads can access the metadata until
+the kernel finishes repairing and revalidating the metadata.
+
+For repairs confined to a single shard of the filesystem, these advantages
+outweigh the delays inherent in locking the shard while it is being repaired.
+Unfortunately, repairs to the reverse mapping btree cannot use the "standard"
+btree repair strategy because it must scan every space mapping of every fork of
+every file in the filesystem, and the filesystem cannot stop.
+Therefore, rmap repair foregoes atomicity between scrub and repair.
+It combines a :ref:`coordinated inode scanner <iscan>`, :ref:`live update hooks
+<liveupdate>`, and an :ref:`in-memory rmap btree <xfbtree>` to complete the
+scan for reverse mapping records.
+
+1. Set up an xfbtree to stage rmap records.
+
+2. While holding the locks on the AGI and AGF buffers acquired during the
+   scrub, generate reverse mappings for all AG metadata: inodes, btrees, CoW
+   staging extents, and the internal log.
+
+3. Set up an inode scanner.
+
+4. Hook into rmap updates for the AG being repaired so that the live scan data
+   can receive updates to the rmap btree from the rest of the filesystem during
+   the file scan.
+
+5. For each space mapping found in either fork of each file scanned,
+   decide if the mapping matches the AG of interest.
+   If so:
+
+   a. Create a btree cursor for the in-memory btree.
+
+   b. Use the rmap code to add the record to the in-memory btree.
+
+   c. Use the :ref:`special commit function <xfbtree_commit>` to write the
+      xfbtree changes to the xfile.
+
+6. For each live update received via the hook, decide if the owner has already
+   been scanned.
+   If so, apply the live update into the scan data:
+
+   a. Create a btree cursor for the in-memory btree.
+
+   b. Replay the operation into the in-memory btree.
+
+   c. Use the :ref:`special commit function <xfbtree_commit>` to write the
+      xfbtree changes to the xfile.
+      This is performed with an empty transaction to avoid changing the
+      caller's state.
+
+7. When the inode scan finishes, create a new scrub transaction and relock the
+   two AG headers.
+
+8. Compute the new btree geometry using the number of rmap records in the
+   shadow btree, like all other btree rebuilding functions.
+
+9. Allocate the number of blocks computed in the previous step.
+
+10. Perform the usual btree bulk loading and commit to install the new rmap
+    btree.
+
+11. Reap the old rmap btree blocks as discussed in the case study about how
+    to :ref:`reap after rmap btree repair <rmap_reap>`.
+
+12. Free the xfbtree now that it is no longer needed.
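+
+The decision in step 6 -- whether a live rmap update should be applied to the
+scan data -- might look like the sketch below.
+Only ``xchk_iscan_want_live_update`` is named in this document; the
+surrounding structure, fields, and helpers are assumptions.
+
+.. code-block:: c
+
+	static int xrep_rmap_live_update(struct xrep_rmap *rr,
+					 const struct xrep_rmap_update *p)
+	{
+		int	error;
+
+		/* Ignore mappings for AGs other than the one being repaired. */
+		if (p->agno != rr->agno)
+			return 0;
+
+		/* Ignore owners that the inode scan has not reached yet. */
+		if (!xchk_iscan_want_live_update(&rr->iscan, p->owner))
+			return 0;
+
+		/* Replay the map or unmap into the in-memory rmap btree. */
+		mutex_lock(&rr->lock);
+		error = xrep_rmap_stash_update(rr, p);
+		mutex_unlock(&rr->lock);
+		return error;
+	}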
+
+The proposed patchset is the
+`rmap repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-rmap-btree>`_
+series.


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 11/14] xfs: document metadata file repair
  2023-03-07  1:30   ` Darrick J. Wong
                       ` (9 preceding siblings ...)
  2023-03-07  1:31     ` [PATCH 10/14] xfs: document full filesystem scans for online fsck Darrick J. Wong
@ 2023-03-07  1:31     ` Darrick J. Wong
  2023-03-07  1:31     ` [PATCH 12/14] xfs: document directory tree repairs Darrick J. Wong
                       ` (2 subsequent siblings)
  13 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2023-03-07  1:31 UTC (permalink / raw)
  To: djwong
  Cc: linux-xfs, willy, chandan.babu, allison.henderson, linux-fsdevel,
	hch, catherine.hoang, david

From: Darrick J. Wong <djwong@kernel.org>

File-based metadata (such as xattrs and directories) can be extremely
large.  To reduce the memory requirements and maximize code reuse, it is
very convenient to create a temporary file, use the regular dir/attr
code to store salvaged information, and then atomically swap the extents
between the file being repaired and the temporary file.  Record the high
level concepts behind how temporary files and atomic content swapping
should work, and then present some case studies of what the actual
repair functions do.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 .../filesystems/xfs-online-fsck-design.rst         |  536 ++++++++++++++++++++
 1 file changed, 536 insertions(+)


diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst
index ef19b4debc62..275eca9b531e 100644
--- a/Documentation/filesystems/xfs-online-fsck-design.rst
+++ b/Documentation/filesystems/xfs-online-fsck-design.rst
@@ -3389,6 +3389,8 @@ Proposed patchsets include fixing
 `dir iget usage
 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-dir-iget-fixes>`_.
 
+.. _ilocking:
+
 Locking Inodes
 ^^^^^^^^^^^^^^
 
@@ -3832,3 +3834,537 @@ The proposed patchset is the
 `rmap repair
 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-rmap-btree>`_
 series.
+
+Staging Repairs with Temporary Files on Disk
+--------------------------------------------
+
+XFS stores a substantial amount of metadata in file forks: directories,
+extended attributes, symbolic link targets, free space bitmaps and summary
+information for the realtime volume, and quota records.
+File forks map 64-bit logical file fork space extents to physical storage space
+extents, similar to how a memory management unit maps 64-bit virtual addresses
+to physical memory addresses.
+Therefore, file-based tree structures (such as directories and extended
+attributes) use blocks mapped in the file fork offset address space that point
+to other blocks mapped within that same address space, and file-based linear
+structures (such as bitmaps and quota records) compute array element offsets in
+the file fork offset address space.
+
+Because file forks can consume as much space as the entire filesystem, repairs
+cannot be staged in memory, even when a paging scheme is available.
+Therefore, online repair of file-based metadata creates a temporary file in
+the XFS filesystem, writes a new structure at the correct offsets into the
+temporary file, and atomically swaps the fork mappings (and hence the fork
+contents) to commit the repair.
+Once the repair is complete, the old fork can be reaped as necessary; if the
+system goes down during the reap, the iunlink code will delete the blocks
+during log recovery.
+
+**Note**: All space usage and inode indices in the filesystem *must* be
+consistent to use a temporary file safely!
+This dependency is the reason why online repair can only use pageable kernel
+memory to stage ondisk space usage information.
+
+Swapping metadata extents with a temporary file requires the owner field of the
+block headers to match the file being repaired and not the temporary file.  The
+directory, extended attribute, and symbolic link functions were all modified to
+allow callers to specify owner numbers explicitly.
+
+There is a downside to the reaping process -- if the system crashes during the
+reap phase and the fork extents are crosslinked, the iunlink processing will
+fail because freeing space will find the extra reverse mappings and abort.
+
+Temporary files created for repair are similar to ``O_TMPFILE`` files created
+by userspace.
+They are not linked into a directory and the entire file will be reaped when
+the last reference to the file is lost.
+The key differences are that these files must have no access permission outside
+the kernel at all, they must be specially marked to prevent them from being
+opened by handle, and they must never be linked into the directory tree.
+
++--------------------------------------------------------------------------+
+| **Historical Sidebar**:                                                  |
++--------------------------------------------------------------------------+
+| In the initial iteration of file metadata repair, the damaged metadata   |
+| blocks would be scanned for salvageable data; the extents in the file    |
+| fork would be reaped; and then a new structure would be built in its     |
+| place.                                                                   |
+| This strategy did not survive the introduction of the atomic repair      |
+| requirement expressed earlier in this document.                          |
+|                                                                          |
+| The second iteration explored building a second structure at a high      |
+| offset in the fork from the salvage data, reaping the old extents, and   |
+| using a ``COLLAPSE_RANGE`` operation to slide the new extents into       |
+| place.                                                                   |
+|                                                                          |
+| This had many drawbacks:                                                 |
+|                                                                          |
+| - Array structures are linearly addressed, and the regular filesystem    |
+|   codebase does not have the concept of a linear offset that could be    |
+|   applied to the record offset computation to build an alternate copy.   |
+|                                                                          |
+| - Extended attributes are allowed to use the entire attr fork offset     |
+|   address space.                                                         |
+|                                                                          |
+| - Even if repair could build an alternate copy of a data structure in a  |
+|   different part of the fork address space, the atomic repair commit     |
+|   requirement means that online repair would have to be able to perform  |
+|   a log assisted ``COLLAPSE_RANGE`` operation to ensure that the old     |
+|   structure was completely replaced.                                     |
+|                                                                          |
+| - A crash after construction of the secondary tree but before the range  |
+|   collapse would leave unreachable blocks in the file fork.              |
+|   This would likely confuse things further.                              |
+|                                                                          |
+| - Reaping blocks after a repair is not a simple operation, and           |
+|   initiating a reap operation from a restarted range collapse operation  |
+|   during log recovery is daunting.                                       |
+|                                                                          |
+| - Directory entry blocks and quota records record the file fork offset   |
+|   in the header area of each block.                                      |
+|   An atomic range collapse operation would have to rewrite this part of  |
+|   each block header.                                                     |
+|   Rewriting a single field in block headers is not a huge problem, but   |
+|   it's something to be aware of.                                         |
+|                                                                          |
+| - Each block in a directory or extended attributes btree index contains  |
+|   sibling and child block pointers.                                      |
+|   Were the atomic commit to use a range collapse operation, each block   |
+|   would have to be rewritten very carefully to preserve the graph        |
+|   structure.                                                             |
+|   Doing this as part of a range collapse means rewriting a large number  |
+|   of blocks repeatedly, which is not conducive to quick repairs.         |
+|                                                                          |
+| This led to the introduction of temporary file staging.                  |
++--------------------------------------------------------------------------+
+
+Using a Temporary File
+``````````````````````
+
+Online repair code should use the ``xrep_tempfile_create`` function to create a
+temporary file inside the filesystem.
+This allocates an inode, marks the in-core inode private, and attaches it to
+the scrub context.
+These files are hidden from userspace, may not be added to the directory tree,
+and must be kept private.
+
+Temporary files only use two inode locks: the IOLOCK and the ILOCK.
+The MMAPLOCK is not needed here, because there must not be page faults from
+userspace for data fork blocks.
+The usage patterns of these two locks are the same as for any other XFS file --
+access to file data are controlled via the IOLOCK, and access to file metadata
+are controlled via the ILOCK.
+Locking helpers are provided so that the temporary file and its lock state can
+be cleaned up by the scrub context.
+To comply with the nested locking strategy laid out in the :ref:`inode
+locking<ilocking>` section, it is recommended that scrub functions use the
+``xrep_tempfile_ilock*_nowait`` lock helpers.
+
+Data can be written to a temporary file by two means:
+
+1. ``xrep_tempfile_copyin`` can be used to set the contents of a regular
+   temporary file from an xfile.
+
+2. The regular directory, symbolic link, and extended attribute functions can
+   be used to write to the temporary file.
+
+Once a good copy of a data file has been constructed in a temporary file, it
+must be conveyed to the file being repaired, which is the topic of the next
+section.
+
+The proposed patches are in the
+`repair temporary files
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-tempfiles>`_
+series.
+
+Atomic Extent Swapping
+----------------------
+
+Once repair builds a temporary file with a new data structure written into
+it, it must commit the new changes into the existing file.
+It is not possible to swap the inumbers of two files, so instead the new
+metadata must replace the old.
+This suggests the need for the ability to swap extents, but the existing extent
+swapping code used by the file defragmenting tool ``xfs_fsr`` is not sufficient
+for online repair because:
+
+a. When the reverse-mapping btree is enabled, the swap code must keep the
+   reverse mapping information up to date with every exchange of mappings.
+   Therefore, it can only exchange one mapping per transaction, and each
+   transaction is independent.
+
+b. Reverse-mapping is critical for the operation of online fsck, so the old
+   defragmentation code (which swapped entire extent forks in a single
+   operation) is not useful here.
+
+c. Defragmentation is assumed to occur between two files with identical
+   contents.
+   For this use case, an incomplete exchange will not result in a user-visible
+   change in file contents, even if the operation is interrupted.
+
+d. Online repair needs to swap the contents of two files that are by definition
+   *not* identical.
+   For directory and xattr repairs, the user-visible contents might be the
+   same, but the contents of individual blocks may be very different.
+
+e. Old blocks in the file may be cross-linked with another structure and must
+   not reappear if the system goes down mid-repair.
+
+These problems are overcome by creating a new deferred operation and a new type
+of log intent item to track the progress of an operation to exchange two file
+ranges.
+The new deferred operation type chains together the same transactions used by
+the reverse-mapping extent swap code.
+The new log item records the progress of the exchange to ensure that once an
+exchange begins, it will always run to completion, even if there are
+interruptions.
+The new ``XFS_SB_FEAT_INCOMPAT_LOG_ATOMIC_SWAP`` log-incompatible feature flag
+in the superblock protects these new log item records from being replayed on
+old kernels.
+
+The proposed patchset is the
+`atomic extent swap
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=atomic-file-updates>`_
+series.
+
++--------------------------------------------------------------------------+
+| **Sidebar: Using Log-Incompatible Feature Flags**                        |
++--------------------------------------------------------------------------+
+| Starting with XFS v5, the superblock contains a                          |
+| ``sb_features_log_incompat`` field to indicate that the log contains     |
+| records that might not be readable by all kernels that could mount this  |
+| filesystem.                                                              |
+| In short, log incompat features protect the log contents against kernels |
+| that will not understand the contents.                                   |
+| Unlike the other superblock feature bits, log incompat bits are          |
+| ephemeral because an empty (clean) log does not need protection.         |
+| The log cleans itself after its contents have been committed into the    |
+| filesystem, either as part of an unmount or because the system is        |
+| otherwise idle.                                                          |
+| Because upper level code can be working on a transaction at the same     |
+| time that the log cleans itself, it is necessary for upper level code to |
+| communicate to the log when it is going to use a log incompatible        |
+| feature.                                                                 |
+|                                                                          |
+| The log coordinates access to incompatible features through the use of   |
+| one ``struct rw_semaphore`` for each feature.                            |
+| The log cleaning code tries to take this rwsem in exclusive mode to      |
+| clear the bit; if the lock attempt fails, the feature bit remains set.   |
+| Filesystem code signals its intention to use a log incompat feature in a |
+| transaction by calling ``xlog_use_incompat_feat``, which takes the rwsem |
+| in shared mode.                                                          |
+| The code supporting a log incompat feature should create wrapper         |
+| functions to obtain the log feature and call                             |
+| ``xfs_add_incompat_log_feature`` to set the feature bits in the primary  |
+| superblock.                                                              |
+| The superblock update is performed transactionally, so the wrapper to    |
+| obtain log assistance must be called just prior to the creation of the   |
+| transaction that uses the functionality.                                 |
+| For a file operation, this step must happen after taking the IOLOCK      |
+| and the MMAPLOCK, but before allocating the transaction.                 |
+| When the transaction is complete, the ``xlog_drop_incompat_feat``        |
+| function is called to release the feature.                               |
+| The feature bit will not be cleared from the superblock until the log    |
+| becomes clean.                                                           |
+|                                                                          |
+| Log-assisted extended attribute updates and atomic extent swaps both use |
+| log incompat features and provide convenience wrappers around the        |
+| functionality.                                                           |
++--------------------------------------------------------------------------+
+
+Mechanics of an Atomic Extent Swap
+``````````````````````````````````
+
+Swapping entire file forks is a complex task.
+The goal is to exchange all file fork mappings between two file fork offset
+ranges.
+There are likely to be many extent mappings in each fork, and the edges of
+the mappings aren't necessarily aligned.
+Furthermore, there may be other updates that need to happen after the swap,
+such as exchanging file sizes, inode flags, or conversion of fork data to local
+format.
+This is roughly the format of the new deferred extent swap work item:
+
+.. code-block:: c
+
+	struct xfs_swapext_intent {
+	    /* Inodes participating in the operation. */
+	    struct xfs_inode    *sxi_ip1;
+	    struct xfs_inode    *sxi_ip2;
+
+	    /* File offset range information. */
+	    xfs_fileoff_t       sxi_startoff1;
+	    xfs_fileoff_t       sxi_startoff2;
+	    xfs_filblks_t       sxi_blockcount;
+
+	    /* Set these file sizes after the operation, unless negative. */
+	    xfs_fsize_t         sxi_isize1;
+	    xfs_fsize_t         sxi_isize2;
+
+	    /* XFS_SWAP_EXT_* log operation flags */
+	    uint64_t            sxi_flags;
+	};
+
+The new log intent item contains enough information to track two logical fork
+offset ranges: ``(inode1, startoff1, blockcount)`` and ``(inode2, startoff2,
+blockcount)``.
+Each step of a swap operation exchanges the largest file range mapping possible
+from one file to the other.
+After each step in the swap operation, the two startoff fields are incremented
+and the blockcount field is decremented to reflect the progress made.
+The flags field captures behavioral parameters such as swapping the attr fork
+instead of the data fork and other work to be done after the extent swap.
+The two isize fields are used to swap the file size at the end of the operation
+if the file data fork is the target of the swap operation.
+
+When the extent swap is initiated, the sequence of operations is as follows:
+
+1. Create a deferred work item for the extent swap.
+   At the start, it should contain the entirety of the file ranges to be
+   swapped.
+
+2. Call ``xfs_defer_finish`` to process the exchange.
+   This is encapsulated in ``xrep_tempswap_contents`` for scrub operations.
+   This will log an extent swap intent item to the transaction for the deferred
+   extent swap work item.
+
+3. Until ``sxi_blockcount`` of the deferred extent swap work item is zero,
+
+   a. Read the block maps of both file ranges starting at ``sxi_startoff1`` and
+      ``sxi_startoff2``, respectively, and compute the longest extent that can
+      be swapped in a single step.
+      This is the minimum of the two ``br_blockcount`` values in the mappings.
+      Keep advancing through the file forks until at least one of the mappings
+      contains written blocks.
+      Mutual holes, unwritten extents, and extent mappings to the same physical
+      space are not exchanged.
+
+      For the next few steps, this document will refer to the mapping that came
+      from file 1 as "map1", and the mapping that came from file 2 as "map2".
+
+   b. Create a deferred block mapping update to unmap map1 from file 1.
+
+   c. Create a deferred block mapping update to unmap map2 from file 2.
+
+   d. Create a deferred block mapping update to map map1 into file 2.
+
+   e. Create a deferred block mapping update to map map2 into file 1.
+
+   f. Log the block, quota, and extent count updates for both files.
+
+   g. Extend the ondisk size of either file if necessary.
+
+   h. Log an extent swap done log item for the extent swap intent log item
+      that was read at the start of step 3.
+
+   i. Compute the amount of file range that has just been covered.
+      This quantity is ``(map1.br_startoff + map1.br_blockcount -
+      sxi_startoff1)``, because step 3a could have skipped holes.
+
+   j. Increase the starting offsets of ``sxi_startoff1`` and ``sxi_startoff2``
+      by the number of blocks computed in the previous step, and decrease
+      ``sxi_blockcount`` by the same quantity.
+      This advances the cursor.
+
+   k. Log a new extent swap intent log item reflecting the advanced state of
+      the work item.
+
+   l. Return the proper error code (EAGAIN) to the deferred operation manager
+      to inform it that there is more work to be done.
+      The operation manager completes the deferred work in steps 3b-3e before
+      moving back to the start of step 3.
+
+4. Perform any post-processing.
+   This will be discussed in more detail in subsequent sections.
+
+If the filesystem goes down in the middle of an operation, log recovery will
+find the most recent unfinished extent swap log intent item and restart from
+there.
+This is how extent swapping guarantees that an outside observer will either see
+the old broken structure or the new one, and never a mishmash of both.
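+
+The cursor arithmetic in steps 3i and 3j follows directly from the intent
+structure shown earlier.
+This sketch elides the mapping lookups and the deferred unmap/map operations,
+and the helper name is illustrative.
+
+.. code-block:: c
+
+	/* Advance the swap cursor after exchanging map1 and map2, where map1
+	 * is the mapping read from file 1 in step 3a. */
+	static void example_swapext_advance(struct xfs_swapext_intent *sxi,
+					    const struct xfs_bmbt_irec *map1)
+	{
+		xfs_filblks_t	advance;
+
+		/* Step 3i: include any holes skipped before map1. */
+		advance = map1->br_startoff + map1->br_blockcount -
+				sxi->sxi_startoff1;
+
+		/* Step 3j: move both offsets forward and shrink the work. */
+		sxi->sxi_startoff1 += advance;
+		sxi->sxi_startoff2 += advance;
+		sxi->sxi_blockcount -= advance;
+	}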
+
+Preparation for Extent Swapping
+```````````````````````````````
+
+There are a few things that need to be taken care of before initiating an
+atomic extent swap operation.
+First, regular files require the page cache to be flushed to disk before the
+operation begins, and directio writes to be quiesced.
+Like any filesystem operation, extent swapping must determine the maximum
+amount of disk space and quota that can be consumed on behalf of both files in
+the operation, and reserve that quantity of resources to avoid an unrecoverable
+out of space failure once it starts dirtying metadata.
+The preparation step scans the ranges of both files to estimate:
+
+- Data device blocks needed to handle the repeated updates to the fork
+  mappings.
+- Change in data and realtime block counts for both files.
+- Increase in quota usage for both files, if the two files do not share the
+  same set of quota ids.
+- The number of extent mappings that will be added to each file.
+- Whether or not there are partially written realtime extents.
+  User programs must never be able to access a realtime file extent that maps
+  to different extents on the realtime volume, which could happen if the
+  operation fails to run to completion.
+
+The need for precise estimation increases the run time of the swap operation,
+but it is very important to maintain correct accounting.
+The filesystem must not run completely out of free space, nor can the extent
+swap ever add more extent mappings to a fork than it can support.
+Regular users are required to abide by the quota limits, though metadata repairs
+may exceed quota to resolve inconsistent metadata elsewhere.
+
+Special Features for Swapping Metadata File Extents
+```````````````````````````````````````````````````
+
+Extended attributes, symbolic links, and directories can set the fork format to
+"local" and treat the fork as a literal area for data storage.
+Metadata repairs must take extra steps to support these cases:
+
+- If both forks are in local format and the fork areas are large enough, the
+  swap is performed by copying the incore fork contents, logging both forks,
+  and committing.
+  The atomic extent swap mechanism is not necessary, since this can be done
+  with a single transaction.
+
+- If both forks map blocks, then the regular atomic extent swap is used.
+
+- Otherwise, only one fork is in local format.
+  The contents of the local format fork are converted to a block to perform the
+  swap.
+  The conversion to block format must be done in the same transaction that
+  logs the initial extent swap intent log item.
+  The regular atomic extent swap is used to exchange the mappings.
+  Special flags are set on the swap operation so that the transaction can be
+  rolled one more time to convert the second file's fork back to local format
+  so that the second file will be ready to go as soon as the ILOCK is dropped.
+
+Extended attributes and directories stamp the owning inode into every block,
+but the buffer verifiers do not actually check the inode number!
+Although there is no verification, it is still important to maintain
+referential integrity, so prior to performing the extent swap, online repair
+builds every block in the new data structure with the owner field of the file
+being repaired.
+
+After a successful swap operation, the repair operation must reap the old fork
+blocks by processing each fork mapping through the standard :ref:`file extent
+reaping <reaping>` mechanism that is done post-repair.
+If the filesystem should go down during the reap part of the repair, the
+iunlink processing at the end of recovery will free both the temporary file and
+whatever blocks were not reaped.
+However, this iunlink processing omits the cross-link detection of online
+repair, and is not completely foolproof.
+
+Swapping Temporary File Extents
+```````````````````````````````
+
+To repair a metadata file, online repair proceeds as follows:
+
+1. Create a temporary repair file.
+
+2. Use the staging data to write out new contents into the temporary repair
+   file.
+   The same fork must be written to as is being repaired.
+
+3. Commit the scrub transaction, since the swap estimation step must be
+   completed before transaction reservations are made.
+
+4. Call ``xrep_tempswap_trans_alloc`` to allocate a new scrub transaction with
+   the appropriate resource reservations, locks, and fill out a ``struct
+   xfs_swapext_req`` with the details of the swap operation.
+
+5. Call ``xrep_tempswap_contents`` to swap the contents.
+
+6. Commit the transaction to complete the repair.
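+
+Steps 3 through 6 might be expressed as the following sketch.
+``xrep_tempswap_trans_alloc``, ``xrep_tempswap_contents``, and ``struct
+xfs_swapext_req`` are named above, but the argument lists and the commit call
+shown here are assumptions.
+
+.. code-block:: c
+
+	static int xrep_example_commit_repair(struct xfs_scrub *sc)
+	{
+		struct xfs_swapext_req	req;
+		int			error;
+
+		/* Step 4: estimate resources, allocate a new transaction,
+		 * take the proper locks, and fill out the swap request. */
+		error = xrep_tempswap_trans_alloc(sc, XFS_DATA_FORK, &req);
+		if (error)
+			return error;
+
+		/* Step 5: atomically exchange the fork mappings. */
+		error = xrep_tempswap_contents(sc, &req);
+		if (error)
+			return error;
+
+		/* Step 6: commit the scrub transaction to finish the repair. */
+		return xfs_trans_commit(sc->tp);
+	}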
+
+.. _rtsummary:
+
+Case Study: Repairing the Realtime Summary File
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+In the "realtime" section of an XFS filesystem, free space is tracked via a
+bitmap, similar to Unix FFS.
+Each bit in the bitmap represents one realtime extent, which is a multiple of
+the filesystem block size between 4KiB and 1GiB in size.
+The realtime summary file indexes the number of free extents of a given size to
+the offset of the block within the realtime free space bitmap where those free
+extents begin.
+In other words, the summary file helps the allocator find free extents by
+length, similar to what the free space by count (cntbt) btree does for the data
+section.
+
+The summary file itself is a flat file (with no block headers or checksums!)
+partitioned into ``log2(total rt extents)`` sections containing enough 32-bit
+counters to match the number of blocks in the rt bitmap.
+Each counter records the number of free extents that start in that bitmap block
+and can satisfy a power-of-two allocation request.
+
+To check the summary file against the bitmap:
+
+1. Take the ILOCK of both the realtime bitmap and summary files.
+
+2. For each free space extent recorded in the bitmap:
+
+   a. Compute the position in the summary file that contains a counter that
+      represents this free extent.
+
+   b. Read the counter from the xfile.
+
+   c. Increment it, and write it back to the xfile.
+
+3. Compare the contents of the xfile against the ondisk file.
+
+To repair the summary file, write the xfile contents into the temporary file
+and use atomic extent swap to commit the new contents.
+The temporary file is then reaped.
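+
+The counter position computed in step 2a follows directly from the layout
+described above: one 32-bit counter per (power-of-two size class, bitmap
+block) pair.
+The function and argument names in this sketch are illustrative.
+
+.. code-block:: c
+
+	/* Byte offset into the shadow summary xfile of the counter covering
+	 * free extents of length 2^log2len that start in rt bitmap block
+	 * bbno.  rbmblocks is the number of blocks in the rt bitmap. */
+	static uint64_t example_rtsummary_offset(unsigned int rbmblocks,
+						 unsigned int log2len,
+						 unsigned int bbno)
+	{
+		uint64_t	index;
+
+		index = (uint64_t)log2len * rbmblocks + bbno;
+		return index * sizeof(uint32_t);
+	}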
+
+The proposed patchset is the
+`realtime summary repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-rtsummary>`_
+series.
+
+Case Study: Salvaging Extended Attributes
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+In XFS, extended attributes are implemented as a namespaced name-value store.
+Values are limited in size to 64KiB, but there is no limit on the number of
+names.
+The attribute fork is unpartitioned, which means that the root of the attribute
+structure is always in logical block zero, but attribute leaf blocks, dabtree
+index blocks, and remote value blocks are intermixed.
+Attribute leaf blocks contain variable-sized records that associate
+user-provided names with the user-provided values.
+Values larger than a block are allocated separate extents and written there.
+If the leaf information expands beyond a single block, a directory/attribute
+btree (``dabtree``) is created to map hashes of attribute names to entries
+for fast lookup.
+
+Salvaging extended attributes is done as follows:
+
+1. Walk the attr fork mappings of the file being repaired to find the attribute
+   leaf blocks.
+   When one is found,
+
+   a. Walk the attr leaf block to find candidate keys.
+      When one is found,
+
+      1. Check the name for problems, and ignore the name if there are any.
+
+      2. Retrieve the value.
+         If that succeeds, add the name and value to the staging xfarray and
+         xfblob.
+
+2. If the memory usage of the xfarray and xfblob exceeds a certain amount of
+   memory or there are no more attr fork blocks to examine, unlock the file and
+   add the staged extended attributes to the temporary file.
+
+3. Use atomic extent swapping to exchange the new and old extended attribute
+   structures.
+   The old attribute blocks are now attached to the temporary file.
+
+4. Reap the temporary file.
+
+The proposed patchset is the
+`extended attribute repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-xattrs>`_
+series.


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 12/14] xfs: document directory tree repairs
  2023-03-07  1:30   ` Darrick J. Wong
                       ` (10 preceding siblings ...)
  2023-03-07  1:31     ` [PATCH 11/14] xfs: document metadata file repair Darrick J. Wong
@ 2023-03-07  1:31     ` Darrick J. Wong
  2023-03-07  1:32     ` [PATCH 13/14] xfs: document the userspace fsck driver program Darrick J. Wong
  2023-03-07  1:32     ` [PATCH 14/14] xfs: document future directions of online fsck Darrick J. Wong
  13 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2023-03-07  1:31 UTC (permalink / raw)
  To: djwong
  Cc: linux-xfs, willy, chandan.babu, allison.henderson, linux-fsdevel,
	hch, catherine.hoang, david

From: Darrick J. Wong <djwong@kernel.org>

Directory tree repairs are the least complete part of online fsck, due
to the lack of directory parent pointers.  However, even without that
feature, we can still make some corrections to the directory tree -- we
can salvage as many directory entries as we can from a damaged
directory, and we can reattach orphaned inodes to the lost+found, just
as xfs_repair does now.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 .../filesystems/xfs-online-fsck-design.rst         |  419 ++++++++++++++++++++
 1 file changed, 419 insertions(+)


diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst
index 275eca9b531e..12d3a2866151 100644
--- a/Documentation/filesystems/xfs-online-fsck-design.rst
+++ b/Documentation/filesystems/xfs-online-fsck-design.rst
@@ -2150,6 +2150,8 @@ reduce map/unmap cycles.
 Surprisingly, this reduces overall sort runtime by nearly half again after
 accounting for the application of heapsort directly onto xfile pages.
 
+.. _xfblob:
+
 Blob Storage
 ````````````
 
@@ -4368,3 +4370,420 @@ The proposed patchset is the
 `extended attribute repair
 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-xattrs>`_
 series.
+
+Fixing Directories
+------------------
+
+Fixing directories is difficult with currently available filesystem features,
+since directory entries are not redundant.
+The offline repair tool scans all inodes to find files with nonzero link count,
+and then it scans all directories to establish parentage of those linked files.
+Damaged files and directories are zapped, and files with no parent are
+moved to the ``/lost+found`` directory.
+It does not try to salvage anything.
+
+The best that online repair can do at this time is to read directory data
+blocks and salvage any dirents that look plausible, correct link counts, and
+move orphans back into the directory tree.
+The salvage process is discussed in the case study at the end of this section.
+The :ref:`file link count fsck <nlinks>` code takes care of fixing link counts
+and moving orphans to the ``/lost+found`` directory.
+
+Case Study: Salvaging Directories
+`````````````````````````````````
+
+Unlike extended attributes, directory blocks are all the same size, so
+salvaging directories is straightforward:
+
+1. Find the parent of the directory.
+   If the dotdot entry is readable, try to confirm that the alleged
+   parent has a child entry pointing back to the directory being repaired.
+   Otherwise, walk the filesystem to find it.
+
+2. Walk the first partition of the data fork of the directory to find the
+   directory entry data blocks.
+   When one is found,
+
+   a. Walk the directory data block to find candidate entries.
+      When an entry is found:
+
+      i. Check the name for problems, and ignore the name if there are any.
+
+      ii. Retrieve the inumber and grab the inode.
+          If that succeeds, add the name, inode number, and file type to the
+          staging xfarray and xfblob.
+
+3. If the memory usage of the xfarray and xfblob exceeds a certain amount of
+   memory or there are no more directory data blocks to examine, unlock the
+   directory and add the staged dirents into the temporary directory.
+   Truncate the staging files.
+
+4. Use atomic extent swapping to exchange the new and old directory structures.
+   The old directory blocks are now attached to the temporary file.
+
+5. Reap the temporary file.
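+
+The stashing in step 2.a.ii might look like the sketch below, assuming the
+xfarray and xfblob staging described earlier in this document.
+All of the names here are illustrative.
+
+.. code-block:: c
+
+	/* Hypothetical staging record for a salvaged directory entry. */
+	struct xrep_dirent {
+		xfblob_cookie	name_cookie;	/* where the name bytes live */
+		xfs_ino_t	ino;		/* inumber from the old dirent */
+		uint8_t		namelen;
+		uint8_t		ftype;		/* dirent file type */
+	};
+
+	static int xrep_dir_stash_dirent(struct xrep_dir *rd, const char *name,
+					 uint8_t namelen, xfs_ino_t ino,
+					 uint8_t ftype)
+	{
+		struct xrep_dirent	ent = {
+			.ino		= ino,
+			.namelen	= namelen,
+			.ftype		= ftype,
+		};
+		int			error;
+
+		/* Name bytes go in the xfblob; fixed-size fields go in the
+		 * xfarray for later replay into the temporary directory. */
+		error = xfblob_store(rd->dir_names, &ent.name_cookie, name,
+				namelen);
+		if (error)
+			return error;
+
+		return xfarray_append(rd->dir_entries, &ent);
+	}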
+
+**Future Work Question**: Should repair revalidate the dentry cache when
+rebuilding a directory?
+
+*Answer*: Yes, it should.
+
+In theory it is necessary to scan all dentry cache entries for a directory to
+ensure that one of the following apply:
+
+1. The cached dentry reflects an ondisk dirent in the new directory.
+
+2. The cached dentry no longer has a corresponding ondisk dirent in the new
+   directory and the dentry can be purged from the cache.
+
+3. The cached dentry no longer has an ondisk dirent but the dentry cannot be
+   purged.
+   This is the problem case.
+
+Unfortunately, the current dentry cache design doesn't provide a means to walk
+every child dentry of a specific directory, which makes this a hard problem.
+There is no known solution.
+
+The proposed patchset is the
+`directory repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-dirs>`_
+series.
+
+Parent Pointers
+```````````````
+
+A parent pointer is a piece of file metadata that enables a user to locate the
+file's parent directory without having to traverse the directory tree from the
+root.
+Without them, reconstruction of directory trees is hindered in much the same
+way that the historic lack of reverse space mapping information once hindered
+reconstruction of filesystem space metadata.
+The parent pointer feature, however, makes total directory reconstruction
+possible.
+
+XFS parent pointers include the dirent name and location of the entry within
+the parent directory.
+In other words, child files use extended attributes to store pointers to
+parents in the form ``(parent_inum, parent_gen, dirent_pos) → (dirent_name)``.
+The directory checking process can be strengthened to ensure that the target of
+each dirent also contains a parent pointer pointing back to the dirent.
+Likewise, each parent pointer can be checked by ensuring that the target of
+each parent pointer is a directory and that it contains a dirent matching
+the parent pointer.
+Both online and offline repair can use this strategy.
+
+**Note**: The ondisk format of parent pointers is not yet finalized.
+
++--------------------------------------------------------------------------+
+| **Historical Sidebar**:                                                  |
++--------------------------------------------------------------------------+
+| Directory parent pointers were first proposed as an XFS feature more     |
+| than a decade ago by SGI.                                                |
+| Each link from a parent directory to a child file is mirrored with an    |
+| extended attribute in the child that could be used to identify the       |
+| parent directory.                                                        |
+| Unfortunately, this early implementation had major shortcomings and was  |
+| never merged into Linux XFS:                                             |
+|                                                                          |
+| 1. The XFS codebase of the late 2000s did not have the infrastructure to |
+|    enforce strong referential integrity in the directory tree.           |
+|    It did not guarantee that a change in a forward link would always be  |
+|    followed up with the corresponding change to the reverse links.       |
+|                                                                          |
+| 2. Referential integrity was not integrated into offline repair.         |
+|    Checking and repairs were performed on mounted filesystems without    |
+|    taking any kernel or inode locks to coordinate access.                |
+|    It is not clear how this actually worked properly.                    |
+|                                                                          |
+| 3. The extended attribute did not record the name of the directory entry |
+|    in the parent, so the SGI parent pointer implementation cannot be     |
+|    used to reconnect the directory tree.                                 |
+|                                                                          |
+| 4. Extended attribute forks only support 65,536 extents, which means     |
+|    that parent pointer attribute creation is likely to fail at some      |
+|    point before the maximum file link count is achieved.                 |
+|                                                                          |
+| The original parent pointer design was too unstable for something like   |
+| a file system repair to depend on.                                       |
+| Allison Henderson, Chandan Babu, and Catherine Hoang are working on a    |
+| second implementation that solves all shortcomings of the first.         |
+| During 2022, Allison introduced log intent items to track physical       |
+| manipulations of the extended attribute structures.                      |
+| This solves the referential integrity problem by making it possible to   |
+| commit a dirent update and a parent pointer update in the same           |
+| transaction.                                                             |
+| Chandan increased the maximum extent counts of both data and attribute   |
+| forks, thereby ensuring that the extended attribute structure can grow   |
+| to handle the maximum hardlink count of any file.                        |
++--------------------------------------------------------------------------+
+
+Case Study: Repairing Directories with Parent Pointers
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Directory rebuilding uses a :ref:`coordinated inode scan <iscan>` and
+a :ref:`directory entry live update hook <liveupdate>` as follows:
+
+1. Set up a temporary directory for generating the new directory structure,
+   an xfblob for storing entry names, and an xfarray for stashing directory
+   updates.
+
+2. Set up an inode scanner and hook into the directory entry code to receive
+   updates on directory operations.
+
+3. For each parent pointer found in each file scanned, decide if the parent
+   pointer references the directory of interest.
+   If so:
+
+   a. Stash an addname entry for this dirent in the xfarray for later.
+
+   b. When finished scanning that file, flush the stashed updates to the
+      temporary directory.
+
+4. For each live directory update received via the hook, decide if the child
+   has already been scanned.
+   If so:
+
+   a. Stash an addname or removename entry for this dirent update in the
+      xfarray for later.
+      We cannot write directly to the temporary directory because hook
+      functions are not allowed to modify filesystem metadata.
+      Instead, we stash updates in the xfarray and rely on the scanner thread
+      to apply the stashed updates to the temporary directory.
+
+5. When the scan is complete, atomically swap the contents of the temporary
+   directory and the directory being repaired.
+   The temporary directory now contains the damaged directory structure.
+
+6. Reap the temporary directory.
+
+7. Update the dirent position field of parent pointers as necessary.
+   This may require the queuing of a substantial number of xattr log intent
+   items.
+
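+A sketch of the live-update stashing described in step 4 follows.
+The hook context structure, field names, and helper functions shown here
+(``iscan_has_visited``, ``xfblob_store``, ``xfarray_append``) are assumptions
+for illustration and differ from the actual patchset code:
+
+.. code-block:: c
+
+   /* One dirent update stashed for the scanner thread to replay. */
+   struct stashed_update {
+       xfblob_cookie name_cookie;  /* dirent name, stored in an xfblob */
+       xfs_ino_t     child_ino;
+       bool          add;          /* true: addname, false: removename */
+   };
+
+   /* Called from the dirent update hook; must not touch ondisk metadata. */
+   static int
+   stash_dirent_update(struct dir_rebuild *dr, const struct dir_update *p)
+   {
+       struct stashed_update upd = {
+           .child_ino = p->child_ino,
+           .add       = p->adding,
+       };
+       int error;
+
+       /* Updates for children the scan has not reached yet can be ignored. */
+       if (!iscan_has_visited(dr->iscan, p->child_ino))
+           return 0;
+
+       /* Stash the update; the scanner thread applies it to the tempdir. */
+       error = xfblob_store(dr->names, &upd.name_cookie, p->name, p->namelen);
+       if (error)
+           return error;
+       return xfarray_append(dr->updates, &upd);
+   }
+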
+The proposed patchset is the
+`parent pointers directory repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-online-dir-repair>`_
+series.
+
+**Unresolved Question**: How will repair ensure that the ``dirent_pos`` fields
+match in the reconstructed directory?
+
+*Answer*: There are a few ways to solve this problem:
+
+1. The field could be designated advisory, since the other three values are
+   sufficient to find the entry in the parent.
+   However, this makes indexed key lookup impossible while repairs are ongoing.
+
+2. We could allow creating directory entries at specified offsets, which solves
+   the referential integrity problem but runs the risk that dirent creation
+   will fail due to conflicts with the free space in the directory.
+
+   These conflicts could be resolved by appending the directory entry and
+   amending the xattr code to support updating an xattr key and reindexing the
+   dabtree, though this would have to be performed with the parent directory
+   still locked.
+
+3. Same as above, but remove the old parent pointer entry and add a new one
+   atomically.
+
+4. Change the ondisk xattr format to ``(parent_inum, name) → (parent_gen)``,
+   which would provide the attr name uniqueness that we require, without
+   forcing repair code to update the dirent position.
+   Unfortunately, this requires changes to the xattr code to support attr
+   names as long as 263 bytes.
+
+5. Change the ondisk xattr format to ``(parent_inum, hash(name)) →
+   (name, parent_gen)``.
+   If the hash is sufficiently resistant to collisions (e.g. sha256) then
+   this should provide the attr name uniqueness that we require.
+   Names shorter than 247 bytes could be stored directly.
+
+Discussion is ongoing under the `parent pointers patch deluge
+<https://www.spinics.net/lists/linux-xfs/msg69397.html>`_.
+
+Case Study: Repairing Parent Pointers
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Online reconstruction of a file's parent pointer information works similarly to
+directory reconstruction:
+
+1. Set up a temporary file for generating a new extended attribute structure,
+   an :ref:`xfblob <xfblob>` for storing parent pointer names, and an xfarray for
+   stashing parent pointer updates.
+
+2. Set up an inode scanner and hook into the directory entry code to receive
+   updates on directory operations.
+
+3. For each directory entry found in each directory scanned, decide if the
+   dirent references the file of interest.
+   If so:
+
+   a. Stash an addpptr entry for this parent pointer in the xfblob and xfarray
+      for later.
+
+   b. When finished scanning the directory, flush the stashed updates to the
+      temporary file.
+
+4. For each live directory update received via the hook, decide if the parent
+   has already been scanned.
+   If so:
+
+   a. Stash an addpptr or removepptr entry for this dirent update in the
+      xfarray for later.
+      We cannot write parent pointers directly to the temporary file because
+      hook functions are not allowed to modify filesystem metadata.
+      Instead, we stash updates in the xfarray and rely on the scanner thread
+      to apply the stashed parent pointer updates to the temporary file.
+
+5. Copy all non-parent pointer extended attributes to the temporary file.
+
+6. When the scan is complete, atomically swap the attribute fork of the
+   temporary file and the file being repaired.
+   The temporary file now contains the damaged extended attribute structure.
+
+7. Reap the temporary file.
+
+The proposed patchset is the
+`parent pointers repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-online-parent-repair>`_
+series.
+
+Digression: Offline Checking of Parent Pointers
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Examining parent pointers in offline repair works differently because corrupt
+files are erased long before directory tree connectivity checks are performed.
+Parent pointer checks are therefore a second pass to be added to the existing
+connectivity checks:
+
+1. After the set of surviving files has been established (i.e. phase 6),
+   walk the surviving directories of each AG in the filesystem.
+   This is already performed as part of the connectivity checks.
+
+2. For each directory entry found, record the name in an xfblob, and store
+   ``(child_ag_inum, parent_inum, parent_gen, dirent_pos)`` tuples in a
+   per-AG in-memory slab.
+
+3. For each AG in the filesystem,
+
+   a. Sort the per-AG tuples in order of child_ag_inum, parent_inum, and
+      dirent_pos.
+
+   b. For each inode in the AG,
+
+      1. Scan the inode for parent pointers.
+         Record the names in a per-file xfblob, and store ``(parent_inum,
+         parent_gen, dirent_pos)`` tuples in a per-file slab.
+
+      2. Sort the per-file tuples in order of parent_inum and dirent_pos.
+
+      3. Position one slab cursor at the start of the inode's records in the
+         per-AG tuple slab.
+         This should be trivial since the per-AG tuples are in child inumber
+         order.
+
+      4. Position a second slab cursor at the start of the per-file tuple slab.
+
+      5. Iterate the two cursors in lockstep, comparing the parent_inum and
+         dirent_pos fields of the records under each cursor.
+
+         a. Tuples in the per-AG list but not the per-file list are missing and
+            need to be written to the inode.
+
+         b. Tuples in the per-file list but not the per-AG list are dangling
+            and need to be removed from the inode.
+
+         c. For tuples in both lists, update the parent_gen and name components
+            of the parent pointer if necessary.
+
+4. Move on to examining link counts, as we do today.
+
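+The lockstep comparison in step 3.b.5 is essentially a merge join of two
+sorted record lists.
+A minimal self-contained sketch of that walk, with simplified record types
+standing in for the slab cursors and ``printf`` standing in for the actual
+repair actions, might look like this:
+
+.. code-block:: c
+
+   #include <stdio.h>
+   #include <stdint.h>
+
+   struct pptr_rec {
+       uint64_t parent_inum;
+       uint32_t dirent_pos;
+   };
+
+   static int pptr_cmp(const struct pptr_rec *a, const struct pptr_rec *b)
+   {
+       if (a->parent_inum != b->parent_inum)
+           return a->parent_inum < b->parent_inum ? -1 : 1;
+       if (a->dirent_pos != b->dirent_pos)
+           return a->dirent_pos < b->dirent_pos ? -1 : 1;
+       return 0;
+   }
+
+   /* ag[]: what the directories say; file[]: the inode's parent pointers. */
+   static void compare_pptrs(const struct pptr_rec *ag, size_t nr_ag,
+                             const struct pptr_rec *file, size_t nr_file)
+   {
+       size_t i = 0, j = 0;
+
+       while (i < nr_ag || j < nr_file) {
+           int cmp;
+
+           if (i >= nr_ag)
+               cmp = 1;        /* only per-file records remain */
+           else if (j >= nr_file)
+               cmp = -1;       /* only per-AG records remain */
+           else
+               cmp = pptr_cmp(&ag[i], &file[j]);
+
+           if (cmp < 0) {
+               /* missing: write this parent pointer to the inode */
+               printf("add pptr (%llu, %u)\n",
+                       (unsigned long long)ag[i].parent_inum,
+                       (unsigned)ag[i].dirent_pos);
+               i++;
+           } else if (cmp > 0) {
+               /* dangling: remove this parent pointer from the inode */
+               printf("remove pptr (%llu, %u)\n",
+                       (unsigned long long)file[j].parent_inum,
+                       (unsigned)file[j].dirent_pos);
+               j++;
+           } else {
+               /* in both lists: update parent_gen and name if necessary */
+               i++;
+               j++;
+           }
+       }
+   }
+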
+The proposed patchset is the
+`offline parent pointers repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=pptrs-repair>`_
+series.
+
+Rebuilding directories from parent pointers in offline repair is very
+challenging because offline repair currently uses a single-pass scan of the
+filesystem during phase 3 to decide which files are corrupt enough to be
+zapped.
+This scan would have to be converted into a multi-pass scan:
+
+1. The first pass of the scan zaps corrupt inodes, forks, and attributes
+   much as it does now.
+   Corrupt directories are noted but not zapped.
+
+2. The next pass records parent pointers pointing to the directories noted
+   as being corrupt in the first pass.
+   This second pass may have to happen after the phase 4 scan for duplicate
+   blocks, if phase 4 is also capable of zapping directories.
+
+3. The third pass resets corrupt directories to an empty shortform directory.
+   Free space metadata has not been ensured yet, so repair cannot yet use the
+   directory building code in libxfs.
+
+4. At the start of phase 6, space metadata have been rebuilt.
+   Use the parent pointer information recorded during step 2 to reconstruct
+   the dirents and add them to the now-empty directories.
+
+This code has not yet been constructed.
+
+.. _orphanage:
+
+The Orphanage
+-------------
+
+Filesystems present files as a directed, and hopefully acyclic, graph.
+In other words, a tree.
+The root of the filesystem is a directory, and each entry in a directory points
+downwards either to more subdirectories or to non-directory files.
+Unfortunately, a disruption in the directory graph pointers results in a
+disconnected graph, which makes files impossible to access via regular path
+resolution.
+
+Without parent pointers, the directory parent pointer online scrub code can
+detect a dotdot entry pointing to a parent directory that doesn't have a link
+back to the child directory, and the file link count checker can detect a file
+that isn't pointed to by any directory in the filesystem.
+If such a file has a positive link count, the file is an orphan.
+
+With parent pointers, directories can be rebuilt by scanning parent pointers
+and parent pointers can be rebuilt by scanning directories.
+This should reduce the incidence of files ending up in ``/lost+found``.
+
+When orphans are found, they should be reconnected to the directory tree.
+Offline fsck solves the problem by creating a directory ``/lost+found`` to
+serve as an orphanage, and linking orphan files into the orphanage by using the
+inumber as the name.
+Reparenting a file to the orphanage does not reset any of its permissions or
+ACLs.
+
+This process is more involved in the kernel than it is in userspace.
+The directory and file link count repair setup functions must use the regular
+VFS mechanisms to create the orphanage directory with all the necessary
+security attributes and dentry cache entries, just like a regular directory
+tree modification.
+
+Orphaned files are adopted by the orphanage as follows:
+
+1. Call ``xrep_orphanage_try_create`` at the start of the scrub setup function
+   to try to ensure that the lost and found directory actually exists.
+   This also attaches the orphanage directory to the scrub context.
+
+2. If the decision is made to reconnect a file, take the IOLOCK of both the
+   orphanage and the file being reattached.
+   The ``xrep_orphanage_iolock_two`` function follows the inode locking
+   strategy discussed earlier.
+
+3. Call ``xrep_orphanage_compute_blkres`` and ``xrep_orphanage_compute_name``
+   to compute the new name in the orphanage and the block reservation required.
+
+4. Use ``xrep_orphanage_adoption_prep`` to reserve resources to the repair
+   transaction.
+
+5. Call ``xrep_orphanage_adopt`` to reparent the orphaned file into the lost
+   and found, and update the kernel dentry cache.
+
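+A condensed sketch of the adoption sequence above follows.
+The context structure and the function prototypes shown here are assumed for
+illustration and do not match the patchset exactly:
+
+.. code-block:: c
+
+   /* Condensed sketch of adoption; prototypes are illustrative assumptions. */
+   static int adopt_orphan(struct xfs_scrub *sc, struct xfs_inode *orphan)
+   {
+       struct xrep_adoption adopt = { .sc = sc, .ip = orphan };
+       int error;
+
+       /* Steps 2-3: lock both inodes, then pick a name and reservation. */
+       xrep_orphanage_iolock_two(sc);
+       error = xrep_orphanage_compute_name(&adopt);
+       if (!error)
+           error = xrep_orphanage_compute_blkres(&adopt);
+
+       /* Steps 4-5: reserve transaction resources, then reparent. */
+       if (!error)
+           error = xrep_orphanage_adoption_prep(&adopt);
+       if (!error)
+           error = xrep_orphanage_adopt(&adopt);
+       return error;
+   }
+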
+The proposed patches are in the
+`orphanage adoption
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-orphanage>`_
+series.


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 13/14] xfs: document the userspace fsck driver program
  2023-03-07  1:30   ` Darrick J. Wong
                       ` (11 preceding siblings ...)
  2023-03-07  1:31     ` [PATCH 12/14] xfs: document directory tree repairs Darrick J. Wong
@ 2023-03-07  1:32     ` Darrick J. Wong
  2023-03-07  1:32     ` [PATCH 14/14] xfs: document future directions of online fsck Darrick J. Wong
  13 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2023-03-07  1:32 UTC (permalink / raw)
  To: djwong
  Cc: linux-xfs, willy, chandan.babu, allison.henderson, linux-fsdevel,
	hch, catherine.hoang, david

From: Darrick J. Wong <djwong@kernel.org>

Add the sixth chapter of the online fsck design documentation, where
we discuss the details of the data structures and algorithms used by the
driver program xfs_scrub.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 .../filesystems/xfs-online-fsck-design.rst         |  316 ++++++++++++++++++++
 1 file changed, 316 insertions(+)


diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst
index 12d3a2866151..7601f53aa4a3 100644
--- a/Documentation/filesystems/xfs-online-fsck-design.rst
+++ b/Documentation/filesystems/xfs-online-fsck-design.rst
@@ -315,6 +315,9 @@ The seven phases are as follows:
 7. Re-check the summary counters and present the caller with a summary of
    space usage and file counts.
 
+This allocation of responsibilities will be :ref:`revisited <scrubcheck>`
+later in this document.
+
 Steps for Each Scrub Item
 -------------------------
 
@@ -4787,3 +4790,316 @@ The proposed patches are in the
 `orphanage adoption
 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-orphanage>`_
 series.
+
+6. Userspace Algorithms and Data Structures
+===========================================
+
+This section discusses the key algorithms and data structures of the userspace
+program, ``xfs_scrub``, that provide the ability to drive metadata checks and
+repairs in the kernel, verify file data, and look for other potential problems.
+
+.. _scrubcheck:
+
+Checking Metadata
+-----------------
+
+Recall the :ref:`phases of fsck work<scrubphases>` outlined earlier.
+That structure follows naturally from the data dependencies designed into the
+filesystem from its beginnings in 1993.
+In XFS, there are several groups of metadata dependencies:
+
+a. Filesystem summary counts depend on consistency within the inode indices,
+   the allocation group space btrees, and the realtime volume space
+   information.
+
+b. Quota resource counts depend on consistency within the quota file data
+   forks, inode indices, inode records, and the forks of every file on the
+   system.
+
+c. The naming hierarchy depends on consistency within the directory and
+   extended attribute structures.
+   This includes file link counts.
+
+d. Directories, extended attributes, and file data depend on consistency within
+   the file forks that map directory and extended attribute data to physical
+   storage media.
+
+e. The file forks depend on consistency within inode records and the space
+   metadata indices of the allocation groups and the realtime volume.
+   This includes quota and realtime metadata files.
+
+f. Inode records depend on consistency within the inode metadata indices.
+
+g. Realtime space metadata depend on the inode records and data forks of the
+   realtime metadata inodes.
+
+h. The allocation group metadata indices (free space, inodes, reference count,
+   and reverse mapping btrees) depend on consistency within the AG headers and
+   between all the AG metadata btrees.
+
+i. ``xfs_scrub`` depends on the filesystem being mounted and kernel support
+   for online fsck functionality.
+
+Therefore, a metadata dependency graph is a convenient way to schedule checking
+operations in the ``xfs_scrub`` program:
+
+- Phase 1 checks that the provided path maps to an XFS filesystem and detects
+  the kernel's scrubbing abilities, which validates group (i).
+
+- Phase 2 scrubs groups (g) and (h) in parallel using a threaded workqueue.
+
+- Phase 3 scans inodes in parallel.
+  For each inode, groups (f), (e), and (d) are checked, in that order.
+
+- Phase 4 repairs everything in groups (i) through (d) so that phases 5 and 6
+  may run reliably.
+
+- Phase 5 starts by checking groups (b) and (c) in parallel before moving on
+  to checking names.
+
+- Phase 6 depends on groups (i) through (b) to find file data blocks to verify,
+  to read them, and to report which blocks of which files are affected.
+
+- Phase 7 checks group (a), having validated everything else.
+
+Notice that the data dependencies between groups are enforced by the structure
+of the program flow.
+
+Parallel Inode Scans
+--------------------
+
+An XFS filesystem can easily contain hundreds of millions of inodes.
+Given that XFS targets installations with large high-performance storage,
+it is desirable to scrub inodes in parallel to minimize runtime, particularly
+if the program has been invoked manually from a command line.
+This requires careful scheduling to keep the threads as evenly loaded as
+possible.
+
+Early iterations of the ``xfs_scrub`` inode scanner naïvely created a single
+workqueue and scheduled a single workqueue item per AG.
+Each workqueue item walked the inode btree (with ``XFS_IOC_INUMBERS``) to find
+inode chunks and then called bulkstat (``XFS_IOC_BULKSTAT``) to gather enough
+information to construct file handles.
+The file handle was then passed to a function to generate scrub items for each
+metadata object of each inode.
+This simple algorithm leads to thread balancing problems in phase 3 if the
+filesystem contains one AG with a few large sparse files and the rest of the
+AGs contain many smaller files.
+The inode scan dispatch function was not sufficiently granular; it should have
+been dispatching at the level of individual inodes, or, to constrain memory
+consumption, inode btree records.
+
+Thanks to Dave Chinner, bounded workqueues in userspace enable ``xfs_scrub`` to
+avoid this problem with ease by adding a second workqueue.
+Just like before, the first workqueue is seeded with one workqueue item per AG,
+and it uses INUMBERS to find inode btree chunks.
+The second workqueue, however, is configured with an upper bound on the number
+of items that can be waiting to be run.
+Each inode btree chunk found by the first workqueue's workers is queued to the
+second workqueue, and it is this second workqueue that queries BULKSTAT,
+creates a file handle, and passes it to a function to generate scrub items for
+each metadata object of each inode.
+If the second workqueue is too full, the workqueue add function blocks the
+first workqueue's workers until the backlog eases.
+This doesn't completely solve the balancing problem, but reduces it enough to
+move on to more pressing issues.
+
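+The essential mechanism is a bounded queue whose "add" function blocks the
+producer when the backlog grows too deep.
+The following self-contained sketch demonstrates that throttling behavior in
+plain pthreads; it is an illustration of the concept, not the actual libfrog
+workqueue code used by ``xfs_scrub``:
+
+.. code-block:: c
+
+   #include <pthread.h>
+
+   #define QUEUE_DEPTH 8   /* upper bound on queued inode chunks */
+
+   struct bounded_queue {
+       pthread_mutex_t lock;
+       pthread_cond_t  not_full;
+       pthread_cond_t  not_empty;
+       int             items[QUEUE_DEPTH];
+       unsigned int    head, tail, count;
+   };
+
+   static struct bounded_queue bq = {
+       .lock      = PTHREAD_MUTEX_INITIALIZER,
+       .not_full  = PTHREAD_COND_INITIALIZER,
+       .not_empty = PTHREAD_COND_INITIALIZER,
+   };
+
+   /* Producer (the INUMBERS walker): blocks while the queue is full. */
+   static void bq_add(struct bounded_queue *q, int item)
+   {
+       pthread_mutex_lock(&q->lock);
+       while (q->count == QUEUE_DEPTH)
+           pthread_cond_wait(&q->not_full, &q->lock);
+       q->items[q->tail] = item;
+       q->tail = (q->tail + 1) % QUEUE_DEPTH;
+       q->count++;
+       pthread_cond_signal(&q->not_empty);
+       pthread_mutex_unlock(&q->lock);
+   }
+
+   /* Consumer (a BULKSTAT worker): pulls the next inode chunk to process. */
+   static int bq_remove(struct bounded_queue *q)
+   {
+       int item;
+
+       pthread_mutex_lock(&q->lock);
+       while (q->count == 0)
+           pthread_cond_wait(&q->not_empty, &q->lock);
+       item = q->items[q->head];
+       q->head = (q->head + 1) % QUEUE_DEPTH;
+       q->count--;
+       pthread_cond_signal(&q->not_full);
+       pthread_mutex_unlock(&q->lock);
+       return item;
+   }
+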
+The proposed patchsets are the scrub
+`performance tweaks
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-performance-tweaks>`_
+and the
+`inode scan rebalance
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-iscan-rebalance>`_
+series.
+
+.. _scrubrepair:
+
+Scheduling Repairs
+------------------
+
+During phase 2, corruptions and inconsistencies reported in any AGI header or
+inode btree are repaired immediately, because phase 3 relies on proper
+functioning of the inode indices to find inodes to scan.
+Failed repairs are rescheduled to phase 4.
+Problems reported in any other space metadata are deferred to phase 4.
+Optimization opportunities are always deferred to phase 4, no matter their
+origin.
+
+During phase 3, corruptions and inconsistencies reported in any part of a
+file's metadata are repaired immediately if all space metadata were validated
+during phase 2.
+Repairs that fail, and problems that cannot be repaired immediately, are
+scheduled for phase 4.
+
+In the original design of ``xfs_scrub``, it was thought that repairs would be
+so infrequent that the ``struct xfs_scrub_metadata`` objects used to
+communicate with the kernel could also be used as the primary object to
+schedule repairs.
+With recent increases in the number of optimizations possible for a given
+filesystem object, it became much more memory-efficient to track all eligible
+repairs for a given filesystem object with a single repair item.
+Each repair item represents a single lockable object -- AGs, metadata files,
+individual inodes, or a class of summary information.
+
+Phase 4 is responsible for scheduling a lot of repair work in as quick a
+manner as is practical.
+The :ref:`data dependencies <scrubcheck>` outlined earlier still apply, which
+means that ``xfs_scrub`` must try to complete the repair work scheduled by
+phase 2 before trying repair work scheduled by phase 3.
+The repair process is as follows:
+
+1. Start a round of repair with a workqueue and enough workers to keep the CPUs
+   as busy as the user desires.
+
+   a. For each repair item queued by phase 2,
+
+      i.   Ask the kernel to repair everything listed in the repair item for a
+           given filesystem object.
+
+      ii.  Make a note if the kernel made any progress in reducing the number
+           of repairs needed for this object.
+
+      iii. If the object no longer requires repairs, revalidate all metadata
+           associated with this object.
+           If the revalidation succeeds, drop the repair item.
+           If not, requeue the item for more repairs.
+
+   b. If any repairs were made, jump back to 1a to retry all the phase 2 items.
+
+   c. For each repair item queued by phase 3,
+
+      i.   Ask the kernel to repair everything listed in the repair item for a
+           given filesystem object.
+
+      ii.  Make a note if the kernel made any progress in reducing the number
+           of repairs needed for this object.
+
+      iii. If the object no longer requires repairs, revalidate all metadata
+           associated with this object.
+           If the revalidation succeeds, drop the repair item.
+           If not, requeue the item for more repairs.
+
+   d. If any repairs were made, jump back to 1c to retry all the phase 3 items.
+
+2. If step 1 made any repair progress of any kind, jump back to step 1 to start
+   another round of repair.
+
+3. If there are items left to repair, run them all serially one more time.
+   Complain if the repairs were not successful, since this is the last chance
+   to repair anything.
+
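+The retry loop above can be summarized as follows.
+The ``repair_item_*`` helpers here are placeholders for illustration and are
+not the actual ``xfs_scrub`` function names:
+
+.. code-block:: c
+
+   /*
+    * Sketch of the phase 4 repair loop.  repair_item_list() asks the kernel
+    * to repair every item on the list, drops items that pass revalidation,
+    * and returns true if any progress was made.
+    */
+   static int repair_everything(struct scrub_ctx *ctx)
+   {
+       bool any_progress;
+
+       do {
+           any_progress = false;
+
+           /* Steps 1a-1b: phase 2 items, retried while progress is made. */
+           while (repair_item_list(ctx, &ctx->phase2_repairs))
+               any_progress = true;
+
+           /* Steps 1c-1d: then the phase 3 (per-file) items. */
+           while (repair_item_list(ctx, &ctx->phase3_repairs))
+               any_progress = true;
+       } while (any_progress);     /* step 2 */
+
+       /* Step 3: last chance; retry the leftovers serially and complain. */
+       return repair_item_list_serially(ctx);
+   }
+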
+Corruptions and inconsistencies encountered during phases 5 and 7 are repaired
+immediately.
+Corrupt file data blocks reported by phase 6 cannot be recovered by the
+filesystem.
+
+The proposed patchsets are the
+`repair warning improvements
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-better-repair-warnings>`_,
+refactoring of the
+`repair data dependency
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-repair-data-deps>`_
+and
+`object tracking
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-object-tracking>`_,
+and the
+`repair scheduling
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-repair-scheduling>`_
+improvement series.
+
+Checking Names for Confusable Unicode Sequences
+-----------------------------------------------
+
+If ``xfs_scrub`` succeeds in validating the filesystem metadata by the end of
+phase 4, it moves on to phase 5, which checks for suspicious looking names in
+the filesystem.
+These names consist of the filesystem label, names in directory entries, and
+the names of extended attributes.
+Like most Unix filesystems, XFS imposes the sparest of constraints on the
+contents of a name:
+
+- Slashes and null bytes are not allowed in directory entries.
+
+- Null bytes are not allowed in userspace-visible extended attributes.
+
+- Null bytes are not allowed in the filesystem label.
+
+Directory entries and attribute keys store the length of the name explicitly
+ondisk, which means that nulls are not name terminators.
+For this section, the term "naming domain" refers to any place where names are
+presented together -- all the names in a directory, or all the attributes of a
+file.
+
+Although the Unix naming constraints are very permissive, the reality of most
+modern-day Linux systems is that programs work with Unicode character code
+points to support international languages.
+These programs typically encode those code points in UTF-8 when interfacing
+with the C library because the kernel expects null-terminated names.
+In the common case, therefore, names found in an XFS filesystem are actually
+UTF-8 encoded Unicode data.
+
+To maximize its expressiveness, the Unicode standard defines separate code
+points for various characters that render similarly or identically in writing
+systems around the world.
+For example, the character "Cyrillic Small Letter A" U+0430 "а" often renders
+identically to "Latin Small Letter A" U+0061 "a".
+
+The standard also permits characters to be constructed in multiple ways --
+either by using a defined code point, or by combining one code point with
+various combining marks.
+For example, the character "Angstrom Sign" U+212B "Å" can also be expressed
+as "Latin Capital Letter A" U+0041 "A" followed by "Combining Ring Above"
+U+030A "◌̊".
+Both sequences render identically.
+
+Like the standards that preceded it, Unicode also defines various control
+characters to alter the presentation of text.
+For example, the character "Right-to-Left Override" U+202E can trick some
+programs into rendering "moo\\xe2\\x80\\xaegnp.txt" as "mootxt.png".
+A second category of rendering problems involves whitespace characters.
+If the character "Zero Width Space" U+200B is encountered in a file name, the
+name will render identically to a name that does not have the zero width
+space.
+
+If two names within a naming domain have different byte sequences but render
+identically, a user may be confused by them.
+The kernel, in its indifference to upper level encoding schemes, permits this.
+Most filesystem drivers persist the byte sequence names that are given to them
+by the VFS.
+
+Techniques for detecting confusable names are explained in great detail in
+sections 4 and 5 of the
+`Unicode Security Mechanisms <https://unicode.org/reports/tr39/>`_
+document.
+When ``xfs_scrub`` detects UTF-8 encoding in use on a system, it uses the
+Unicode normalization form NFD in conjunction with the confusable name
+detection component of
+`libicu <https://github.com/unicode-org/icu>`_
+to identify names within a directory or within a file's extended attributes that
+could be confused for each other.
+Names are also checked for control characters, non-rendering characters, and
+mixing of bidirectional characters.
+All of these potential issues are reported to the system administrator during
+phase 5.
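+
+As an illustration of the kind of check performed here, the following
+self-contained sketch asks ICU's spoof checker whether two UTF-8 names are
+confusable.
+This assumes the ICU C API from ``<unicode/uspoof.h>``; the real ``xfs_scrub``
+code instead examines every name in a naming domain at once rather than
+comparing a single pair:
+
+.. code-block:: c
+
+   /* Link with the ICU libraries, e.g. -licuuc -licui18n. */
+   #include <unicode/uspoof.h>
+
+   /* Returns 1 if the two names could be confused, 0 if not, -1 on error. */
+   int names_are_confusable(const char *name1, const char *name2)
+   {
+       UErrorCode status = U_ZERO_ERROR;
+       USpoofChecker *sc;
+       int32_t result;
+
+       sc = uspoof_open(&status);
+       if (U_FAILURE(status))
+           return -1;
+
+       /* A length of -1 means the name is NUL-terminated. */
+       result = uspoof_areConfusableUTF8(sc, name1, -1, name2, -1, &status);
+       uspoof_close(sc);
+       if (U_FAILURE(status))
+           return -1;
+
+       return result != 0;
+   }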
+
+Media Verification of File Data Extents
+---------------------------------------
+
+The system administrator can elect to initiate a media scan of all file data
+blocks.
+This scan runs after validation of all filesystem metadata (except for the
+summary counters) as phase 6.
+The scan starts by calling ``FS_IOC_GETFSMAP`` to scan the filesystem space map
+to find areas that are allocated to file data fork extents.
+Gaps between data fork extents that are smaller than 64k are treated as if
+they were data fork extents to reduce the command setup overhead.
+When the space map scan accumulates a region larger than 32MB, a media
+verification request is sent to the disk as a directio read of the raw block
+device.
+
+If the verification read fails, ``xfs_scrub`` retries with single-block reads
+to narrow the failure down to the specific region of the media and records it.
+When it has finished issuing verification requests, it again uses the space
+mapping ioctl to map the recorded media errors back to metadata structures
+and report what has been lost.
+For media errors in blocks owned by files, parent pointers can be used to
+construct file paths from inode numbers for user-friendly reporting.
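+
+A simplified sketch of the space map walk and media read loop follows.
+It uses the ``FS_IOC_GETFSMAP`` interface from ``<linux/fsmap.h>``, but omits
+the extent coalescing, O_DIRECT buffer alignment, request size capping, and
+single-block retry logic described above, so treat it as an outline rather
+than a working scrubber:
+
+.. code-block:: c
+
+   #include <stdio.h>
+   #include <stdlib.h>
+   #include <unistd.h>
+   #include <limits.h>
+   #include <sys/ioctl.h>
+   #include <linux/fsmap.h>
+
+   #define NR_RECS 128
+
+   /* fs_fd: an open file on the filesystem; disk_fd: the raw block device. */
+   static void scan_media(int fs_fd, int disk_fd)
+   {
+       struct fsmap_head *head;
+       struct fsmap *rec;
+       unsigned int i;
+
+       head = calloc(1, sizeof(*head) + NR_RECS * sizeof(struct fsmap));
+       if (!head)
+           return;
+       head->fmh_count = NR_RECS;
+
+       /* High key: everything to the end of the filesystem. */
+       head->fmh_keys[1].fmr_device = UINT_MAX;
+       head->fmh_keys[1].fmr_physical = ULLONG_MAX;
+       head->fmh_keys[1].fmr_owner = ULLONG_MAX;
+       head->fmh_keys[1].fmr_offset = ULLONG_MAX;
+       head->fmh_keys[1].fmr_flags = UINT_MAX;
+
+       while (!ioctl(fs_fd, FS_IOC_GETFSMAP, head) && head->fmh_entries) {
+           for (i = 0; i < head->fmh_entries; i++) {
+               rec = &head->fmh_recs[i];
+
+               /* Only file data fork extents are interesting here. */
+               if (rec->fmr_flags & (FMR_OF_SPECIAL_OWNER |
+                                     FMR_OF_ATTR_FORK |
+                                     FMR_OF_EXTENT_MAP))
+                   continue;
+
+               /* Read the raw extent; a failure means lost file data. */
+               char *buf = malloc(rec->fmr_length);
+               if (buf && pread(disk_fd, buf, rec->fmr_length,
+                                rec->fmr_physical) < 0)
+                   fprintf(stderr,
+                       "media error: inode %llu offset %llu\n",
+                       (unsigned long long)rec->fmr_owner,
+                       (unsigned long long)rec->fmr_offset);
+               free(buf);
+           }
+
+           if (head->fmh_recs[head->fmh_entries - 1].fmr_flags & FMR_OF_LAST)
+               break;
+           /* Resume the next query after the last record returned. */
+           head->fmh_keys[0] = head->fmh_recs[head->fmh_entries - 1];
+       }
+       free(head);
+   }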


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 14/14] xfs: document future directions of online fsck
  2023-03-07  1:30   ` Darrick J. Wong
                       ` (12 preceding siblings ...)
  2023-03-07  1:32     ` [PATCH 13/14] xfs: document the userspace fsck driver program Darrick J. Wong
@ 2023-03-07  1:32     ` Darrick J. Wong
  13 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2023-03-07  1:32 UTC (permalink / raw)
  To: djwong
  Cc: linux-xfs, willy, chandan.babu, allison.henderson, linux-fsdevel,
	hch, catherine.hoang, david

From: Darrick J. Wong <djwong@kernel.org>

Add the seventh and final chapter of the online fsck documentation,
where we talk about future functionality that can tie in with the
functionality provided by the online fsck patchset.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 .../filesystems/xfs-online-fsck-design.rst         |  210 ++++++++++++++++++++
 1 file changed, 210 insertions(+)


diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst
index 7601f53aa4a3..2dc27ed45d01 100644
--- a/Documentation/filesystems/xfs-online-fsck-design.rst
+++ b/Documentation/filesystems/xfs-online-fsck-design.rst
@@ -5103,3 +5103,213 @@ mapping ioctl to map the recorded media errors back to metadata structures
 and report what has been lost.
 For media errors in blocks owned by files, parent pointers can be used to
 construct file paths from inode numbers for user-friendly reporting.
+
+7. Conclusion and Future Work
+=============================
+
+It is hoped that the reader has followed the designs laid out in this document
+and now has some familiarity with how XFS performs online
+rebuilding of its metadata indices, and how filesystem users can interact with
+that functionality.
+Although the scope of this work is daunting, it is hoped that this guide will
+make it easier for code readers to understand what has been built, for whom it
+has been built, and why.
+Please feel free to contact the XFS mailing list with questions.
+
+FIEXCHANGE_RANGE
+----------------
+
+As discussed earlier, a second frontend to the atomic extent swap mechanism is
+a new ioctl call that userspace programs can use to commit updates to files
+atomically.
+This frontend has been out for review for several years now, though the
+necessary refinements to online repair and lack of customer demand mean that
+the proposal has not been pushed very hard.
+
+Extent Swapping with Regular User Files
+```````````````````````````````````````
+
+As mentioned earlier, XFS has long had the ability to swap extents between
+files, which is used almost exclusively by ``xfs_fsr`` to defragment files.
+The earliest form of this was the fork swap mechanism, where the entire
+contents of data forks could be exchanged between two files by exchanging the
+raw bytes in each inode fork's immediate area.
+When XFS v5 came along with self-describing metadata, this old mechanism grew
+some log support to continue rewriting the owner fields of BMBT blocks during
+log recovery.
+When the reverse mapping btree was later added to XFS, the only way to maintain
+the consistency of the fork mappings with the reverse mapping index was to
+develop an iterative mechanism that used deferred bmap and rmap operations to
+swap mappings one at a time.
+This mechanism is identical to steps 2-3 from the procedure above except for
+the new tracking items, because the atomic extent swap mechanism is an
+iteration of an existing mechanism and not something totally novel.
+For the narrow case of file defragmentation, the file contents must be
+identical, so the recovery guarantees are not much of a gain.
+
+Atomic extent swapping is much more flexible than the existing swapext
+implementations because it can guarantee that the caller never sees a mix of
+old and new contents even after a crash, and it can operate on two arbitrary
+file fork ranges.
+The extra flexibility enables several new use cases:
+
+- **Atomic commit of file writes**: A userspace process opens a file that it
+  wants to update.
+  Next, it opens a temporary file and calls the file clone operation to reflink
+  the first file's contents into the temporary file.
+  Writes to the original file should instead be written to the temporary file.
+  Finally, the process calls the atomic extent swap system call
+  (``FIEXCHANGE_RANGE``) to exchange the file contents, thereby committing all
+  of the updates to the original file, or none of them.
+
+.. _swapext_if_unchanged:
+
+- **Transactional file updates**: The same mechanism as above, but the caller
+  only wants the commit to occur if the original file's contents have not
+  changed.
+  To make this happen, the calling process snapshots the file modification and
+  change timestamps of the original file before reflinking its data to the
+  temporary file.
+  When the program is ready to commit the changes, it passes the timestamps
+  into the kernel as arguments to the atomic extent swap system call.
+  The kernel only commits the changes if the provided timestamps match the
+  original file.
+
+- **Emulation of atomic block device writes**: Export a block device with a
+  logical sector size matching the filesystem block size to force all writes
+  to be aligned to the filesystem block size.
+  Stage all writes to a temporary file, and when that is complete, call the
+  atomic extent swap system call with a flag to indicate that holes in the
+  temporary file should be ignored.
+  This emulates an atomic device write in software, and can support arbitrary
+  scattered writes.
+
+Vectorized Scrub
+----------------
+
+As it turns out, the :ref:`refactoring <scrubrepair>` of repair items mentioned
+earlier was a catalyst for enabling a vectorized scrub system call.
+Since 2018, the cost of making a kernel call has increased considerably on some
+systems because of mitigations for speculative execution attacks.
+This incentivizes program authors to make as few system calls as possible to
+reduce the number of times an execution path crosses a security boundary.
+
+With vectorized scrub, userspace pushes to the kernel the identity of a
+filesystem object, a list of scrub types to run against that object, and a
+simple representation of the data dependencies between the selected scrub
+types.
+The kernel executes as much of the caller's plan as it can until it hits a
+dependency that cannot be satisfied due to a corruption, and tells userspace
+how much was accomplished.
+It is hoped that ``io_uring`` will pick up enough of this functionality that
+online fsck can use that instead of adding a separate vectored scrub system
+call to XFS.
+
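+For illustration, the request that userspace pushes to the kernel might look
+roughly like the following.
+These structures are hypothetical and are not a description of any finalized
+ABI:
+
+.. code-block:: c
+
+   /* Hypothetical vectored scrub request; not a real kernel interface. */
+   struct scrub_vec {
+       __u32 sv_type;      /* which scrubber to run */
+       __u32 sv_flags;     /* in: control flags; out: corruption state */
+       __s32 sv_ret;       /* out: result of this scrubber */
+       __u32 sv_barrier;   /* stop here if prior vectors found corruption */
+   };
+
+   struct scrub_vec_head {
+       __u64 svh_ino;      /* identity of the object to scrub */
+       __u32 svh_gen;
+       __u32 svh_agno;
+       __u16 svh_nr;       /* number of elements in svh_vecs */
+       __u16 svh_flags;
+       __u32 svh_reserved;
+       struct scrub_vec svh_vecs[];    /* the caller's scrub plan, in order */
+   };
+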
+The relevant patchsets are the
+`kernel vectorized scrub
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=vectorized-scrub>`_
+and
+`userspace vectorized scrub
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=vectorized-scrub>`_
+series.
+
+Quality of Service Targets for Scrub
+------------------------------------
+
+One serious shortcoming of the online fsck code is that the amount of time that
+it can spend in the kernel holding resource locks is basically unbounded.
+Userspace is allowed to send a fatal signal to the process, which will cause
+``xfs_scrub`` to exit when it reaches a good stopping point, but there's no way
+for userspace to provide a time budget to the kernel.
+Given that the scrub codebase has helpers to detect fatal signals, it shouldn't
+be too much work to allow userspace to specify a timeout for a scrub/repair
+operation and abort the operation if it exceeds budget.
+However, most repair functions have the property that once they begin to touch
+ondisk metadata, the operation cannot be cancelled cleanly, after which a QoS
+timeout is no longer useful.
+
+Defragmenting Free Space
+------------------------
+
+Over the years, many XFS users have requested the creation of a program to
+clear a portion of the physical storage underlying a filesystem so that it
+becomes a contiguous chunk of free space.
+Call this free space defragmenter ``clearspace`` for short.
+
+The first piece the ``clearspace`` program needs is the ability to read the
+reverse mapping index from userspace.
+This already exists in the form of the ``FS_IOC_GETFSMAP`` ioctl.
+The second piece it needs is a new fallocate mode
+(``FALLOC_FL_MAP_FREE_SPACE``) that allocates the free space in a region and
+maps it to a file.
+Call this file the "space collector" file.
+The third piece is the ability to force an online repair.
+
+To clear all the metadata out of a portion of physical storage, clearspace
+uses the new fallocate map-freespace call to map any free space in that region
+to the space collector file.
+Next, clearspace finds all metadata blocks in that region by way of
+``GETFSMAP`` and issues forced repair requests on the data structure.
+This often results in the metadata being rebuilt somewhere that is not being
+cleared.
+After each relocation, clearspace calls the "map free space" function again to
+collect any newly freed space in the region being cleared.
+
+To clear all the file data out of a portion of the physical storage, clearspace
+uses the FSMAP information to find relevant file data blocks.
+Having identified a good target, it uses the ``FICLONERANGE`` call on that part
+of the file to try to share the physical space with a dummy file.
+Cloning the extent means that the original owners cannot overwrite the
+contents; any changes will be written somewhere else via copy-on-write.
+Clearspace makes its own copy of the frozen extent in an area that is not being
+cleared, and uses ``FIDEDUPERANGE`` (or the :ref:`atomic extent swap
+<swapext_if_unchanged>` feature) to change the target file's data extent
+mapping away from the area being cleared.
+When all other mappings have been moved, clearspace reflinks the space into the
+space collector file so that it becomes unavailable.
+
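+A sketch of the freeze-and-remap step for a single file extent follows, using
+the existing ``FICLONERANGE`` and ``FIDEDUPERANGE`` ioctls.
+The discovery loop, the copy step, the proposed fallocate mode, and all error
+handling are omitted, and offsets and lengths are assumed to be aligned to the
+filesystem block size:
+
+.. code-block:: c
+
+   #include <stdlib.h>
+   #include <sys/ioctl.h>
+   #include <linux/fs.h>
+
+   /*
+    * Move one extent of @victim_fd at @victim_off away from the region being
+    * cleared.  @work_fd freezes the old contents; @copy_fd holds a fresh copy
+    * of the data outside the region being cleared.
+    */
+   static int move_extent(int victim_fd, long long victim_off,
+                          int work_fd, int copy_fd, long long len)
+   {
+       struct file_clone_range clone = {
+           .src_fd      = victim_fd,
+           .src_offset  = victim_off,
+           .src_length  = len,
+           .dest_offset = 0,
+       };
+       struct file_dedupe_range *req;
+       int ret;
+
+       /* Freeze the contents so nobody can overwrite the old blocks. */
+       ret = ioctl(work_fd, FICLONERANGE, &clone);
+       if (ret)
+           return ret;
+
+       /*
+        * ...copy the frozen data into copy_fd (e.g. with copy_file_range),
+        * somewhere outside the region being cleared...
+        */
+
+       /* Remap the victim's extent to the new copy via deduplication. */
+       req = calloc(1, sizeof(*req) + sizeof(struct file_dedupe_range_info));
+       if (!req)
+           return -1;
+       req->src_offset = 0;        /* offset of the copy within copy_fd */
+       req->src_length = len;
+       req->dest_count = 1;
+       req->info[0].dest_fd = victim_fd;
+       req->info[0].dest_offset = victim_off;
+       ret = ioctl(copy_fd, FIDEDUPERANGE, req);
+       free(req);
+       return ret;
+   }
+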
+There are further optimizations that could apply to the above algorithm.
+To clear a piece of physical storage that has a high sharing factor, it is
+strongly desirable to retain this sharing factor.
+In fact, these extents should be moved first to maximize sharing factor after
+the operation completes.
+To make this work smoothly, clearspace needs a new ioctl
+(``FS_IOC_GETREFCOUNTS``) to report reference count information to userspace.
+With the refcount information exposed, clearspace can quickly find the longest,
+most shared data extents in the filesystem, and target them first.
+
+**Future Work Question**: How might the filesystem move inode chunks?
+
+*Answer*: To move inode chunks, Dave Chinner constructed a prototype program
+that creates a new file with the old contents and then locklessly runs around
+the filesystem updating directory entries.
+The operation cannot complete if the filesystem goes down.
+That problem isn't totally insurmountable: create an inode remapping table
+hidden behind a jump label, and a log item that tracks the kernel walking the
+filesystem to update directory entries.
+The trouble is, the kernel can't do anything about open files, since it cannot
+revoke them.
+
+**Future Work Question**: Can static keys be used to minimize the cost of
+supporting ``revoke()`` on XFS files?
+
+*Answer*: Yes.
+Until the first revocation, the bailout code need not be in the call path at
+all.
+
+The relevant patchsets are the
+`kernel freespace defrag
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=defrag-freespace>`_
+and
+`userspace freespace defrag
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=defrag-freespace>`_
+series.
+
+Shrinking Filesystems
+---------------------
+
+Removing the end of the filesystem ought to be a simple matter of evacuating
+the data and metadata at the end of the filesystem, and handing the freed space
+to the shrink code.
+That requires an evacuation of the space at the end of the filesystem, which is a
+use of free space defragmentation!


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* [PATCH 03/14] xfs: document the testing plan for online fsck
  2022-10-02 18:19 [PATCHSET v23.3 00/14] xfs: design documentation for online fsck Darrick J. Wong
@ 2022-10-02 18:19 ` Darrick J. Wong
  0 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-10-02 18:19 UTC (permalink / raw)
  To: djwong
  Cc: linux-xfs, willy, chandan.babu, allison.henderson, linux-fsdevel,
	hch, catherine.hoang, david

From: Darrick J. Wong <djwong@kernel.org>

Start the third chapter of the online fsck design documentation.  This
covers the testing plan to make sure that both online and offline fsck
can detect arbitrary problems and correct them without making things
worse.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 .../filesystems/xfs-online-fsck-design.rst         |  187 ++++++++++++++++++++
 1 file changed, 187 insertions(+)


diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst
index a03a7b9f0250..d630b6bdbe4a 100644
--- a/Documentation/filesystems/xfs-online-fsck-design.rst
+++ b/Documentation/filesystems/xfs-online-fsck-design.rst
@@ -563,3 +563,190 @@ functionality.
 Many of these risks are inherent to software programming.
 Despite this, it is hoped that this new functionality will prove useful in
 reducing unexpected downtime.
+
+3. Testing Plan
+===============
+
+As stated before, fsck tools have three main goals:
+
+1. Detect inconsistencies in the metadata;
+
+2. Eliminate those inconsistencies; and
+
+3. Minimize further loss of data.
+
+Demonstrations of correct operation are necessary to build users' confidence
+that the software behaves within expectations.
+Unfortunately, it was not really feasible to perform regular exhaustive testing
+of every aspect of a fsck tool until the introduction of low-cost virtual
+machines with high-IOPS storage.
+With ample hardware availability in mind, the testing strategy for the online
+fsck project involves differential analysis against the existing fsck tools and
+systematic testing of every attribute of every type of metadata object.
+Testing can be split into four major categories, as discussed below.
+
+Integrated Testing with fstests
+-------------------------------
+
+The primary goal of any free software QA effort is to make testing as
+inexpensive and widespread as possible to maximize the scaling advantages of
+the community.
+In other words, testing should maximize the breadth of filesystem configuration
+scenarios and hardware setups.
+This improves code quality by enabling the authors of online fsck to find and
+fix bugs early, and helps developers of new features to find integration
+issues earlier in their development effort.
+
+The Linux filesystem community shares a common QA testing suite,
+`fstests <https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/>`_, for
+functional and regression testing.
+Even before development work began on online fsck, fstests (when run on XFS)
+would run both the ``xfs_check`` and ``xfs_repair -n`` commands on the test and
+scratch filesystems between each test.
+This provides a level of assurance that the kernel and the fsck tools stay in
+alignment about what constitutes consistent metadata.
+During development of the online checking code, fstests was modified to run
+``xfs_scrub -n`` between each test to ensure that the new checking code
+produces the same results as the two existing fsck tools.
+
+To start development of online repair, fstests was modified to run
+``xfs_repair`` to rebuild the filesystem's metadata indices between tests.
+This ensures that offline repair does not crash, leave a corrupt filesystem
+after it exits, or trigger complaints from the online check.
+This also established a baseline for what can and cannot be repaired offline.
+To complete the first phase of development of online repair, fstests was
+modified to be able to run ``xfs_scrub`` in a "force rebuild" mode.
+This enables a comparison of the effectiveness of online repair against
+the existing offline repair tools.
+
+General Fuzz Testing of Metadata Blocks
+---------------------------------------
+
+XFS benefits greatly from having a very robust debugging tool, ``xfs_db``.
+
+Before development of online fsck even began, a set of fstests was created
+to test the rather common fault that entire metadata blocks get corrupted.
+This required the creation of fstests library code that can create a filesystem
+containing every possible type of metadata object.
+Next, individual test cases were created to create a test filesystem, identify
+a single block of a specific type of metadata object, trash it with the
+existing ``blocktrash`` command in ``xfs_db``, and test the reaction of a
+particular metadata validation strategy.
+
+This earlier test suite enabled XFS developers to test the ability of the
+in-kernel validation functions and the ability of the offline fsck tool to
+detect and eliminate the inconsistent metadata.
+This part of the test suite was extended to cover online fsck in exactly the
+same manner.
+
+In other words, for a given fstests filesystem configuration:
+
+* For each metadata object existing on the filesystem:
+
+  * Write garbage to it
+
+  * Test the reactions of:
+
+    1. The kernel verifiers to stop obviously bad metadata
+    2. Offline repair (``xfs_repair``) to detect and fix
+    3. Online repair (``xfs_scrub``) to detect and fix
+
+Targeted Fuzz Testing of Metadata Records
+-----------------------------------------
+
+A quick conversation with the other XFS developers revealed that the existing
+test infrastructure could be extended to provide a much more powerful
+facility: targeted fuzz testing of every metadata field of every metadata
+object in the filesystem.
+``xfs_db`` can modify every field of every metadata structure in every
+block in the filesystem to simulate the effects of memory corruption and
+software bugs.
+Given that fstests already contains the ability to create a filesystem
+containing every metadata format known to the filesystem, ``xfs_db`` can be
+used to perform exhaustive fuzz testing!
+
+For a given fstests filesystem configuration:
+
+* For each metadata object existing on the filesystem...
+
+  * For each record inside that metadata object...
+
+    * For each field inside that record...
+
+      * For each conceivable type of transformation that can be applied to a bit field...
+
+        1. Clear all bits
+        2. Set all bits
+        3. Toggle the most significant bit
+        4. Toggle the middle bit
+        5. Toggle the least significant bit
+        6. Add a small quantity
+        7. Subtract a small quantity
+        8. Randomize the contents
+
+        * ...test the reactions of:
+
+          1. The kernel verifiers to stop obviously bad metadata
+          2. Offline checking (``xfs_repair -n``)
+          3. Offline repair (``xfs_repair``)
+          4. Online checking (``xfs_scrub -n``)
+          5. Online repair (``xfs_scrub``)
+          6. Both repair tools (``xfs_scrub`` and then ``xfs_repair`` if online repair doesn't succeed)
+
+This is quite the combinatoric explosion!
+
+Fortunately, having this much test coverage makes it easy for XFS developers to
+check the responses of XFS' fsck tools.
+Since the introduction of the fuzz testing framework, these tests have been
+used to discover incorrect repair code and missing functionality for entire
+classes of metadata objects in ``xfs_repair``.
+The enhanced testing was used to finalize the deprecation of ``xfs_check`` by
+confirming that ``xfs_repair`` could detect at least as many corruptions as
+the older tool.
+
+These tests have been very valuable for ``xfs_scrub`` in the same ways -- they
+allow the online fsck developers to compare online fsck against offline fsck,
+and they enable XFS developers to find deficiencies in the code base.
+
+Proposed patchsets include
+`general fuzzer improvements
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzzer-improvements>`_,
+`fuzzing baselines
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzz-baseline>`_,
+and `improvements in fuzz testing comprehensiveness
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=more-fuzz-testing>`_.
+
+Stress Testing
+--------------
+
+A requirement unique to online fsck is the ability to operate on a filesystem
+concurrently with regular workloads.
+Although it is of course impossible to run ``xfs_scrub`` with *zero* observable
+impact on the running system, the online repair code should never introduce
+inconsistencies into the filesystem metadata, and regular workloads should
+never notice resource starvation.
+To verify that these conditions are being met, fstests has been enhanced in
+the following ways:
+
+* For each scrub item type, create a test to exercise checking that item type
+  while running ``fsstress``.
+* For each scrub item type, create a test to exercise repairing that item type
+  while running ``fsstress``.
+* Race ``fsstress`` and ``xfs_scrub -n`` to ensure that checking the whole
+  filesystem doesn't cause problems.
+* Race ``fsstress`` and ``xfs_scrub`` in force-rebuild mode to ensure that
+  force-repairing the whole filesystem doesn't cause problems.
+* Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while
+  freezing and thawing the filesystem.
+* Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while
+  remounting the filesystem read-only and read-write.
+* The same, but running ``fsx`` instead of ``fsstress``.  (Not done yet?)
+
+Success is defined by the ability to run all of these tests without observing
+any unexpected filesystem shutdowns due to corrupted metadata, kernel hang
+check warnings, or any other sort of mischief.
+
+Proposed patchsets include `general stress testing
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=race-scrub-and-mount-state-changes>`_
+and the `evolution of existing per-function stress testing
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=refactor-scrub-stress>`_.


^ permalink raw reply related	[flat|nested] 220+ messages in thread

* Re: [PATCH 03/14] xfs: document the testing plan for online fsck
  2022-08-11  0:09   ` Dave Chinner
@ 2022-08-16  2:18     ` Darrick J. Wong
  0 siblings, 0 replies; 220+ messages in thread
From: Darrick J. Wong @ 2022-08-16  2:18 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-xfs, willy, chandan.babu, allison.henderson, linux-fsdevel,
	hch, catherine.hoang

On Thu, Aug 11, 2022 at 10:09:45AM +1000, Dave Chinner wrote:
> On Sun, Aug 07, 2022 at 11:30:22AM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Start the third chapter of the online fsck design documentation.  This
> > covers the testing plan to make sure that both online and offline fsck
> > can detect arbitrary problems and correct them without making things
> > worse.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  .../filesystems/xfs-online-fsck-design.rst         |  187 ++++++++++++++++++++
> >  1 file changed, 187 insertions(+)
> 
> 
> ....
> > +Stress Testing
> > +--------------
> > +
> > +A unique requirement to online fsck is the ability to operate on a filesystem
> > +concurrently with regular workloads.
> > +Although it is of course impossible to run ``xfs_scrub`` with *zero* observable
> > +impact on the running system, the online repair code should never introduce
> > +inconsistencies into the filesystem metadata, and regular workloads should
> > +never notice resource starvation.
> > +To verify that these conditions are being met, fstests has been enhanced in
> > +the following ways:
> > +
> > +* For each scrub item type, create a test to exercise checking that item type
> > +  while running ``fsstress``.
> > +* For each scrub item type, create a test to exercise repairing that item type
> > +  while running ``fsstress``.
> > +* Race ``fsstress`` and ``xfs_scrub -n`` to ensure that checking the whole
> > +  filesystem doesn't cause problems.
> > +* Race ``fsstress`` and ``xfs_scrub`` in force-rebuild mode to ensure that
> > +  force-repairing the whole filesystem doesn't cause problems.
> > +* Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while
> > +  freezing and thawing the filesystem.
> > +* Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while
> > +  remounting the filesystem read-only and read-write.
> > +* The same, but running ``fsx`` instead of ``fsstress``.  (Not done yet?)
> 
> I had a thought when reading this that we want to ensure that online
> repair handles concurrent grow/shrink operations so that doesn't
> cause problems, as well as dealing with concurrent attempts to run
> independent online repair processes.
> 
> Not sure that comes under stress testing, but it was the "test while
> freeze/thaw" that triggered me to think of this, so that's where I'm
> commenting about it. :)

Hmm.  I hadn't really given that much thought.  Let me go add that to
the test suite and see how many daemons come pouring out...

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 220+ messages in thread

* Re: [PATCH 03/14] xfs: document the testing plan for online fsck
  2022-08-07 18:30 ` [PATCH 03/14] xfs: document the testing plan " Darrick J. Wong
@ 2022-08-11  0:09   ` Dave Chinner
  2022-08-16  2:18     ` Darrick J. Wong
  0 siblings, 1 reply; 220+ messages in thread
From: Dave Chinner @ 2022-08-11  0:09 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-xfs, willy, chandan.babu, allison.henderson, linux-fsdevel,
	hch, catherine.hoang

On Sun, Aug 07, 2022 at 11:30:22AM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Start the third chapter of the online fsck design documentation.  This
> covers the testing plan to make sure that both online and offline fsck
> can detect arbitrary problems and correct them without making things
> worse.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  .../filesystems/xfs-online-fsck-design.rst         |  187 ++++++++++++++++++++
>  1 file changed, 187 insertions(+)


....
> +Stress Testing
> +--------------
> +
> +A unique requirement to online fsck is the ability to operate on a filesystem
> +concurrently with regular workloads.
> +Although it is of course impossible to run ``xfs_scrub`` with *zero* observable
> +impact on the running system, the online repair code should never introduce
> +inconsistencies into the filesystem metadata, and regular workloads should
> +never notice resource starvation.
> +To verify that these conditions are being met, fstests has been enhanced in
> +the following ways:
> +
> +* For each scrub item type, create a test to exercise checking that item type
> +  while running ``fsstress``.
> +* For each scrub item type, create a test to exercise repairing that item type
> +  while running ``fsstress``.
> +* Race ``fsstress`` and ``xfs_scrub -n`` to ensure that checking the whole
> +  filesystem doesn't cause problems.
> +* Race ``fsstress`` and ``xfs_scrub`` in force-rebuild mode to ensure that
> +  force-repairing the whole filesystem doesn't cause problems.
> +* Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while
> +  freezing and thawing the filesystem.
> +* Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while
> +  remounting the filesystem read-only and read-write.
> +* The same, but running ``fsx`` instead of ``fsstress``.  (Not done yet?)

I had a thought when reading this that we want to ensure that online
repair handles concurrent grow/shrink operations so that doesn't
cause problems, as well as dealing with concurrent attempts to run
independent online repair processes.

Not sure that comes under stress testing, but it was the "test while
freeze/thaw" that triggered me to think of this, so that's where I'm
commenting about it. :)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 220+ messages in thread

* [PATCH 03/14] xfs: document the testing plan for online fsck
  2022-08-07 18:30 [PATCHSET v2 00/14] xfs: design documentation " Darrick J. Wong
@ 2022-08-07 18:30 ` Darrick J. Wong
  2022-08-11  0:09   ` Dave Chinner
  0 siblings, 1 reply; 220+ messages in thread
From: Darrick J. Wong @ 2022-08-07 18:30 UTC (permalink / raw)
  To: djwong
  Cc: linux-xfs, willy, chandan.babu, allison.henderson, linux-fsdevel,
	hch, catherine.hoang

From: Darrick J. Wong <djwong@kernel.org>

Start the third chapter of the online fsck design documentation.  This
covers the testing plan to make sure that both online and offline fsck
can detect arbitrary problems and correct them without making things
worse.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 .../filesystems/xfs-online-fsck-design.rst         |  187 ++++++++++++++++++++
 1 file changed, 187 insertions(+)


diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst
index a03a7b9f0250..d630b6bdbe4a 100644
--- a/Documentation/filesystems/xfs-online-fsck-design.rst
+++ b/Documentation/filesystems/xfs-online-fsck-design.rst
@@ -563,3 +563,190 @@ functionality.
 Many of these risks are inherent to software programming.
 Despite this, it is hoped that this new functionality will prove useful in
 reducing unexpected downtime.
+
+3. Testing Plan
+===============
+
+As stated before, fsck tools have three main goals:
+
+1. Detect inconsistencies in the metadata;
+
+2. Eliminate those inconsistencies; and
+
+3. Minimize further loss of data.
+
+Demonstrations of correct operation are necessary to build users' confidence
+that the software behaves within expectations.
+Unfortunately, it was not really feasible to perform regular exhaustive testing
+of every aspect of a fsck tool until the introduction of low-cost virtual
+machines with high-IOPS storage.
+With ample hardware availability in mind, the testing strategy for the online
+fsck project involves differential analysis against the existing fsck tools and
+systematic testing of every attribute of every type of metadata object.
+Testing can be split into four major categories, as discussed below.
+
+Integrated Testing with fstests
+-------------------------------
+
+The primary goal of any free software QA effort is to make testing as
+inexpensive and widespread as possible to maximize the scaling advantages of
+the community.
+In other words, testing should maximize the breadth of filesystem configuration
+scenarios and hardware setups.
+This improves code quality by enabling the authors of online fsck to find and
+fix bugs early, and helps developers of new features to find integration
+issues earlier in their development effort.
+
+The Linux filesystem community shares a common QA testing suite,
+`fstests <https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/>`_, for
+functional and regression testing.
+Even before development work began on online fsck, fstests (when run on XFS)
+would run both the ``xfs_check`` and ``xfs_repair -n`` commands on the test and
+scratch filesystems between each test.
+This provides a level of assurance that the kernel and the fsck tools stay in
+alignment about what constitutes consistent metadata.
+During development of the online checking code, fstests was modified to run
+``xfs_scrub -n`` between each test to ensure that the new checking code
+produces the same results as the two existing fsck tools.
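+
+Concretely, the between-test check sequence amounts to something like the
+following sketch; the real fstests harness handles mount state, output
+filtering, and error reporting::
+
+    # both fsck tools must agree that the filesystem is clean
+    umount $SCRATCH_MNT
+    xfs_repair -n $SCRATCH_DEV || echo "offline check found problems"
+    mount $SCRATCH_DEV $SCRATCH_MNT
+    xfs_scrub -n $SCRATCH_MNT || echo "online check found problems"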
+
+To start development of online repair, fstests was modified to run
+``xfs_repair`` to rebuild the filesystem's metadata indices between tests.
+This ensures that offline repair does not crash, leave a corrupt filesystem
+after it exits, or trigger complaints from the online check.
+This also established a baseline for what can and cannot be repaired offline.
+To complete the first phase of development of online repair, fstests was
+modified to be able to run ``xfs_scrub`` in a "force rebuild" mode.
+This enables a comparison of the effectiveness of online repair as compared to
+the existing offline repair tools.
+
+General Fuzz Testing of Metadata Blocks
+---------------------------------------
+
+XFS benefits greatly from having a very robust debugging tool, ``xfs_db``.
+
+Before development of online fsck even began, a set of fstests was created
+to test the rather common fault that entire metadata blocks get corrupted.
+This required the creation of fstests library code that can create a filesystem
+containing every possible type of metadata object.
+Next, individual test cases were written to create a test filesystem, identify
+a single block of a specific type of metadata object, trash it with the
+existing ``blocktrash`` command in ``xfs_db``, and test the reaction of a
+particular metadata validation strategy.
+
+This earlier test suite enabled XFS developers to test the ability of the
+in-kernel validation functions and the ability of the offline fsck tool to
+detect and eliminate the inconsistent metadata.
+This part of the test suite was extended to cover online fsck in exactly the
+same manner.
+
+In other words, for a given fstests filesystem configuration (one iteration
+of this loop is sketched after the list):
+
+* For each metadata object existing on the filesystem:
+
+  * Write garbage to it
+
+  * Test the reactions of:
+
+    1. The kernel verifiers to stop obviously bad metadata
+    2. Offline repair (``xfs_repair``) to detect and fix
+    3. Online repair (``xfs_scrub``) to detect and fix
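+
+A single iteration of this loop, reduced to its essence, might look like the
+following sketch, where ``$BLOCKTRASH_ARGS`` is a placeholder for whatever
+``blocktrash`` options the test chose; the real tests use fstests helpers and
+verify the outcome of every step::
+
+    # pick a metadata block, trash it, then see which layer notices
+    umount $SCRATCH_MNT
+    xfs_db -x -c 'blockget' -c "blocktrash $BLOCKTRASH_ARGS" $SCRATCH_DEV
+    xfs_repair -n $SCRATCH_DEV || echo "offline check tripped"
+    mount $SCRATCH_DEV $SCRATCH_MNT
+    xfs_scrub -n $SCRATCH_MNT || echo "online check tripped"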
+
+Targeted Fuzz Testing of Metadata Records
+-----------------------------------------
+
+A quick conversation with the other XFS developers revealed that the existing
+test infrastructure could be extended to provide a much more powerful
+facility: targeted fuzz testing of every metadata field of every metadata
+object in the filesystem.
+``xfs_db`` can modify every field of every metadata structure in every
+block in the filesystem to simulate the effects of memory corruption and
+software bugs.
+Given that fstests already contains the ability to create a filesystem
+containing every metadata format known to the filesystem, ``xfs_db`` can be
+used to perform exhaustive fuzz testing!
+
+For a given fstests filesystem configuration:
+
+* For each metadata object existing on the filesystem...
+
+  * For each record inside that metadata object...
+
+    * For each field inside that record...
+
+      * For each conceivable type of transformation that can be applied to a bit field...
+
+        1. Clear all bits
+        2. Set all bits
+        3. Toggle the most significant bit
+        4. Toggle the middle bit
+        5. Toggle the least significant bit
+        6. Add a small quantity
+        7. Subtract a small quantity
+        8. Randomize the contents
+
+        * ...test the reactions of:
+
+          1. The kernel verifiers to stop obviously bad metadata
+          2. Offline checking (``xfs_repair -n``)
+          3. Offline repair (``xfs_repair``)
+          4. Online checking (``xfs_scrub -n``)
+          5. Online repair (``xfs_scrub``)
+          6. Both repair tools (``xfs_scrub`` and then ``xfs_repair`` if online repair doesn't succeed)
+
+This is quite the combinatoric explosion!
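+
+By way of illustration, a single leaf of this loop might be exercised by hand
+roughly as follows; the inode number, field name, and verb are illustrative,
+and the exact ``fuzz`` command syntax should be confirmed against ``xfs_db``
+itself::
+
+    # toggle the middle bit of one inode's size field, then check reactions
+    umount $SCRATCH_MNT
+    xfs_db -x -c 'inode 128' -c 'fuzz core.size middlebit' $SCRATCH_DEV
+    xfs_repair -n $SCRATCH_DEV
+    mount $SCRATCH_DEV $SCRATCH_MNT
+    xfs_scrub -n $SCRATCH_MNT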
+
+Fortunately, having this much test coverage makes it easy for XFS developers to
+check the responses of XFS' fsck tools.
+Since the introduction of the fuzz testing framework, these tests have been
+used to discover incorrect repair code and missing functionality for entire
+classes of metadata objects in ``xfs_repair``.
+The enhanced testing was used to finalize the deprecation of ``xfs_check`` by
+confirming that ``xfs_repair`` could detect at least as many corruptions as
+the older tool.
+
+These tests have been very valuable for ``xfs_scrub`` in the same ways -- they
+allow the online fsck developers to compare online fsck against offline fsck,
+and they enable XFS developers to find deficiencies in the code base.
+
+Proposed patchsets include
+`general fuzzer improvements
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzzer-improvements>`_,
+`fuzzing baselines
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzz-baseline>`_,
+and `improvements in fuzz testing comprehensiveness
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=more-fuzz-testing>`_.
+
+Stress Testing
+--------------
+
+A unique requirement to online fsck is the ability to operate on a filesystem
+concurrently with regular workloads.
+Although it is of course impossible to run ``xfs_scrub`` with *zero* observable
+impact on the running system, the online repair code should never introduce
+inconsistencies into the filesystem metadata, and regular workloads should
+never notice resource starvation.
+To verify that these conditions are being met, fstests has been enhanced in
+the following ways:
+
+* For each scrub item type, create a test to exercise checking that item type
+  while running ``fsstress``.
+* For each scrub item type, create a test to exercise repairing that item type
+  while running ``fsstress``.
+* Race ``fsstress`` and ``xfs_scrub -n`` to ensure that checking the whole
+  filesystem doesn't cause problems.
+* Race ``fsstress`` and ``xfs_scrub`` in force-rebuild mode to ensure that
+  force-repairing the whole filesystem doesn't cause problems.
+* Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while
+  freezing and thawing the filesystem.
+* Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while
+  remounting the filesystem read-only and read-write.
+* The same, but running ``fsx`` instead of ``fsstress``.  (Not done yet?)
+
+Success is defined by the ability to run all of these tests without observing
+any unexpected filesystem shutdowns due to corrupted metadata, kernel hang
+check warnings, or any other sort of mischief.
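+
+The first race listed above, reduced to a by-hand sketch, might look like
+this; the real tests add setup, timeouts, result filtering, and cleanup::
+
+    # run fsstress in the background, scrub repeatedly until it finishes
+    fsstress -d $SCRATCH_MNT/stress -n 10000 -p 4 &
+    stress_pid=$!
+    while kill -0 $stress_pid 2>/dev/null; do
+        xfs_scrub -n $SCRATCH_MNT || break
+    done
+    wait $stress_pid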
+
+Proposed patchsets include `general stress testing
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=race-scrub-and-mount-state-changes>`_
+and the `evolution of existing per-function stress testing
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=refactor-scrub-stress>`_.


^ permalink raw reply related	[flat|nested] 220+ messages in thread

end of thread, other threads:[~2023-03-07  1:33 UTC | newest]

Thread overview: 220+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-12-30 21:13 [NYE DELUGE 1/4] xfs: all pending online scrub improvements Darrick J. Wong
2022-12-30 22:10 ` [PATCHSET v24.0 00/14] xfs: design documentation for online fsck Darrick J. Wong
2022-12-30 22:10   ` [PATCH 02/14] xfs: document the general theory underlying online fsck design Darrick J. Wong
2023-01-11  1:25     ` Allison Henderson
2023-01-11 23:39       ` Darrick J. Wong
2023-01-12  0:29         ` Dave Chinner
2023-01-18  0:03         ` Allison Henderson
2023-01-18  2:35           ` Darrick J. Wong
2022-12-30 22:10   ` [PATCH 01/14] xfs: document the motivation for " Darrick J. Wong
2023-01-07  5:01     ` Allison Henderson
2023-01-11 19:10       ` Darrick J. Wong
2023-01-18  0:03         ` Allison Henderson
2023-01-18  1:29           ` Darrick J. Wong
2023-01-12  0:10       ` Darrick J. Wong
2022-12-30 22:10   ` [PATCH 08/14] xfs: document btree bulk loading Darrick J. Wong
2023-02-09  5:47     ` Allison Henderson
2023-02-10  0:24       ` Darrick J. Wong
2023-02-16 15:46         ` Allison Henderson
2023-02-16 21:08           ` Darrick J. Wong
2022-12-30 22:10   ` [PATCH 04/14] xfs: document the user interface for online fsck Darrick J. Wong
2023-01-18  0:03     ` Allison Henderson
2023-01-18  2:42       ` Darrick J. Wong
2022-12-30 22:10   ` [PATCH 03/14] xfs: document the testing plan " Darrick J. Wong
2023-01-18  0:03     ` Allison Henderson
2023-01-18  2:38       ` Darrick J. Wong
2022-12-30 22:10   ` [PATCH 05/14] xfs: document the filesystem metadata checking strategy Darrick J. Wong
2023-01-21  1:38     ` Allison Henderson
2023-02-02 19:04       ` Darrick J. Wong
2023-02-09  5:41         ` Allison Henderson
2022-12-30 22:10   ` [PATCH 07/14] xfs: document pageable kernel memory Darrick J. Wong
2023-02-02  7:14     ` Allison Henderson
2023-02-02 23:14       ` Darrick J. Wong
2023-02-09  5:41         ` Allison Henderson
2023-02-09 23:14           ` Darrick J. Wong
2023-02-25  7:32             ` Allison Henderson
2022-12-30 22:10   ` [PATCH 09/14] xfs: document online file metadata repair code Darrick J. Wong
2022-12-30 22:10   ` [PATCH 06/14] xfs: document how online fsck deals with eventual consistency Darrick J. Wong
2023-01-05  9:08     ` Amir Goldstein
2023-01-05 19:40       ` Darrick J. Wong
2023-01-06  3:33         ` Amir Goldstein
2023-01-11 17:54           ` Darrick J. Wong
2023-01-31  6:11     ` Allison Henderson
2023-02-02 19:55       ` Darrick J. Wong
2023-02-09  5:41         ` Allison Henderson
2022-12-30 22:10   ` [PATCH 13/14] xfs: document the userspace fsck driver program Darrick J. Wong
2023-03-01  5:36     ` Allison Henderson
2023-03-02  0:27       ` Darrick J. Wong
2023-03-03 23:51         ` Allison Henderson
2023-03-04  2:25           ` Darrick J. Wong
2022-12-30 22:10   ` [PATCH 14/14] xfs: document future directions of online fsck Darrick J. Wong
2023-03-01  5:37     ` Allison Henderson
2023-03-02  0:39       ` Darrick J. Wong
2023-03-03 23:51         ` Allison Henderson
2023-03-04  2:28           ` Darrick J. Wong
2022-12-30 22:10   ` [PATCH 11/14] xfs: document metadata file repair Darrick J. Wong
2023-02-25  7:33     ` Allison Henderson
2023-03-01  2:42       ` Darrick J. Wong
2022-12-30 22:10   ` [PATCH 10/14] xfs: document full filesystem scans for online fsck Darrick J. Wong
2023-02-16 15:47     ` Allison Henderson
2023-02-16 22:48       ` Darrick J. Wong
2023-02-25  7:33         ` Allison Henderson
2023-03-01 22:09           ` Darrick J. Wong
2022-12-30 22:10   ` [PATCH 12/14] xfs: document directory tree repairs Darrick J. Wong
2023-01-14  2:32     ` [PATCH v24.2 " Darrick J. Wong
2023-02-03  2:12     ` [PATCH v24.3 " Darrick J. Wong
2023-02-25  7:33       ` Allison Henderson
2023-03-02  0:14         ` Darrick J. Wong
2023-03-03 23:50           ` Allison Henderson
2023-03-04  2:19             ` Darrick J. Wong
2023-03-07  1:30   ` [PATCHSET v24.3 00/14] xfs: design documentation for online fsck Darrick J. Wong
2023-03-07  1:30   ` Darrick J. Wong
2023-03-07  1:30     ` [PATCH 01/14] xfs: document the motivation for online fsck design Darrick J. Wong
2023-03-07  1:31     ` [PATCH 02/14] xfs: document the general theory underlying " Darrick J. Wong
2023-03-07  1:31     ` [PATCH 03/14] xfs: document the testing plan for online fsck Darrick J. Wong
2023-03-07  1:31     ` [PATCH 04/14] xfs: document the user interface " Darrick J. Wong
2023-03-07  1:31     ` [PATCH 05/14] xfs: document the filesystem metadata checking strategy Darrick J. Wong
2023-03-07  1:31     ` [PATCH 06/14] xfs: document how online fsck deals with eventual consistency Darrick J. Wong
2023-03-07  1:31     ` [PATCH 07/14] xfs: document pageable kernel memory Darrick J. Wong
2023-03-07  1:31     ` [PATCH 08/14] xfs: document btree bulk loading Darrick J. Wong
2023-03-07  1:31     ` [PATCH 09/14] xfs: document online file metadata repair code Darrick J. Wong
2023-03-07  1:31     ` [PATCH 10/14] xfs: document full filesystem scans for online fsck Darrick J. Wong
2023-03-07  1:31     ` [PATCH 11/14] xfs: document metadata file repair Darrick J. Wong
2023-03-07  1:31     ` [PATCH 12/14] xfs: document directory tree repairs Darrick J. Wong
2023-03-07  1:32     ` [PATCH 13/14] xfs: document the userspace fsck driver program Darrick J. Wong
2023-03-07  1:32     ` [PATCH 14/14] xfs: document future directions of online fsck Darrick J. Wong
2022-12-30 22:10 ` [PATCHSET v24.0 0/8] xfs: variable and naming cleanups for intent items Darrick J. Wong
2022-12-30 22:10   ` [PATCH 1/8] xfs: pass the xfs_bmbt_irec directly through the log intent code Darrick J. Wong
2022-12-30 22:10   ` [PATCH 2/8] xfs: fix confusing variable names in xfs_bmap_item.c Darrick J. Wong
2022-12-30 22:10   ` [PATCH 8/8] xfs: fix confusing variable names in xfs_refcount_item.c Darrick J. Wong
2022-12-30 22:10   ` [PATCH 3/8] xfs: pass xfs_extent_free_item directly through the log intent code Darrick J. Wong
2022-12-30 22:10   ` [PATCH 5/8] xfs: pass rmap space mapping " Darrick J. Wong
2022-12-30 22:10   ` [PATCH 4/8] xfs: fix confusing xfs_extent_item variable names Darrick J. Wong
2022-12-30 22:10   ` [PATCH 6/8] xfs: fix confusing variable names in xfs_rmap_item.c Darrick J. Wong
2022-12-30 22:10   ` [PATCH 7/8] xfs: pass refcount intent directly through the log intent code Darrick J. Wong
2022-12-30 22:11 ` [PATCHSET v24.0 0/5] xfs: make intent items take a perag reference Darrick J. Wong
2022-12-30 22:11   ` [PATCH 1/5] xfs: give xfs_bmap_intent its own " Darrick J. Wong
2022-12-30 22:11   ` [PATCH 5/5] xfs: give xfs_refcount_intent " Darrick J. Wong
2022-12-30 22:11   ` [PATCH 2/5] xfs: pass per-ag references to xfs_free_extent Darrick J. Wong
2022-12-30 22:11   ` [PATCH 3/5] xfs: give xfs_extfree_intent its own perag reference Darrick J. Wong
2022-12-30 22:11   ` [PATCH 4/5] xfs: give xfs_rmap_intent " Darrick J. Wong
2022-12-30 22:11 ` [PATCHSET v24.0 0/1] xfs: pass perag references around when possible Darrick J. Wong
2022-12-30 22:11   ` [PATCH 1/1] xfs: create a function to duplicate an active perag reference Darrick J. Wong
2022-12-30 22:11 ` [PATCHSET v24.0 0/5] xfs: drain deferred work items when scrubbing Darrick J. Wong
2022-12-30 22:11   ` [PATCH 3/5] xfs: clean up scrub context if scrub setup returns -EDEADLOCK Darrick J. Wong
2022-12-30 22:11   ` [PATCH 2/5] xfs: allow queued AG intents to drain before scrubbing Darrick J. Wong
2022-12-30 22:11   ` [PATCH 1/5] xfs: add a tracepoint to report incorrect extent refcounts Darrick J. Wong
2022-12-30 22:11   ` [PATCH 4/5] xfs: minimize overhead of drain wakeups by using jump labels Darrick J. Wong
2022-12-30 22:11   ` [PATCH 5/5] xfs: scrub should use ECHRNG to signal that the drain is needed Darrick J. Wong
2022-12-30 22:11 ` [PATCHSET v24.0 0/8] xfs: standardize btree record checking code Darrick J. Wong
2022-12-30 22:11   ` [PATCH 4/8] xfs: return a failure address from xfs_rmap_irec_offset_unpack Darrick J. Wong
2022-12-30 22:11   ` [PATCH 2/8] xfs: standardize ondisk to incore conversion for inode btrees Darrick J. Wong
2022-12-30 22:11   ` [PATCH 1/8] xfs: standardize ondisk to incore conversion for free space btrees Darrick J. Wong
2022-12-30 22:11   ` [PATCH 3/8] xfs: standardize ondisk to incore conversion for refcount btrees Darrick J. Wong
2022-12-30 22:11   ` [PATCH 7/8] xfs: complain about bad records in query_range helpers Darrick J. Wong
2022-12-30 22:11   ` [PATCH 6/8] xfs: standardize ondisk to incore conversion for bmap btrees Darrick J. Wong
2022-12-30 22:11   ` [PATCH 8/8] xfs: complain about bad file mapping records in the ondisk bmbt Darrick J. Wong
2022-12-30 22:11   ` [PATCH 5/8] xfs: standardize ondisk to incore conversion for rmap btrees Darrick J. Wong
2022-12-30 22:11 ` [PATCHSET v24.0 0/3] xfs: hoist scrub record checks into libxfs Darrick J. Wong
2022-12-30 22:11   ` [PATCH 3/3] xfs: hoist inode record alignment checks from scrub Darrick J. Wong
2022-12-30 22:11   ` [PATCH 2/3] xfs: hoist rmap record flag " Darrick J. Wong
2022-12-30 22:11   ` [PATCH 1/3] " Darrick J. Wong
2022-12-30 22:11 ` [PATCHSET v24.0 0/2] xfs: fix rmap btree key flag handling Darrick J. Wong
2022-12-30 22:11   ` [PATCH 1/2] xfs: fix rm_offset flag handling in rmap keys Darrick J. Wong
2022-12-30 22:11   ` [PATCH 2/2] xfs: detect unwritten bit set in rmapbt node block keys Darrick J. Wong
2022-12-30 22:11 ` [PATCHSET v24.0 0/2] xfs: enhance btree key scrubbing Darrick J. Wong
2022-12-30 22:11   ` [PATCH 2/2] xfs: always scrub record/key order of interior records Darrick J. Wong
2022-12-30 22:11   ` [PATCH 1/2] xfs: check btree keys reflect the child block Darrick J. Wong
2022-12-30 22:11 ` [PATCHSET v24.0 0/6] xfs: detect incorrect gaps in refcount btree Darrick J. Wong
2022-12-30 22:11   ` [PATCH 3/6] xfs: replace xfs_btree_has_record with a general keyspace scanner Darrick J. Wong
2022-12-30 22:11   ` [PATCH 2/6] xfs: refactor ->diff_two_keys callsites Darrick J. Wong
2022-12-30 22:11   ` [PATCH 1/6] xfs: refactor converting btree irec to btree key Darrick J. Wong
2022-12-30 22:11   ` [PATCH 5/6] xfs: check the reference counts of gaps in the refcount btree Darrick J. Wong
2022-12-30 22:11   ` [PATCH 4/6] xfs: implement masked btree key comparisons for _has_records scans Darrick J. Wong
2022-12-30 22:11   ` [PATCH 6/6] xfs: ensure that all metadata and data blocks are not cow staging extents Darrick J. Wong
2022-12-30 22:11 ` [PATCHSET v24.0 0/4] xfs: detect incorrect gaps in inode btree Darrick J. Wong
2022-12-30 22:11   ` [PATCH 2/4] xfs: clean up broken eearly-exit code in the inode btree scrubber Darrick J. Wong
2022-12-30 22:11   ` [PATCH 1/4] xfs: remove pointless shadow variable from xfs_difree_inobt Darrick J. Wong
2022-12-30 22:11   ` [PATCH 4/4] xfs: convert xfs_ialloc_has_inodes_at_extent to return keyfill scan results Darrick J. Wong
2022-12-30 22:11   ` [PATCH 3/4] xfs: directly cross-reference the inode btrees with each other Darrick J. Wong
2022-12-30 22:11 ` [PATCHSET v24.0 0/2] xfs: detect incorrect gaps in rmap btree Darrick J. Wong
2022-12-30 22:11   ` [PATCH 1/2] xfs: teach scrub to check for sole ownership of metadata objects Darrick J. Wong
2022-12-30 22:11   ` [PATCH 2/2] xfs: ensure that single-owner file blocks are not owned by others Darrick J. Wong
2022-12-30 22:11 ` [PATCHSET v24.0 0/4] xfs: fix iget/irele usage in online fsck Darrick J. Wong
2022-12-30 22:11   ` [PATCH 2/4] xfs: fix an inode lookup race in xchk_get_inode Darrick J. Wong
2022-12-30 22:11   ` [PATCH 1/4] xfs: manage inode DONTCACHE status at irele time Darrick J. Wong
2022-12-30 22:11   ` [PATCH 4/4] xfs: retain the AGI when we can't iget an inode to scrub the core Darrick J. Wong
2022-12-30 22:11   ` [PATCH 3/4] xfs: rename xchk_get_inode -> xchk_iget_for_scrubbing Darrick J. Wong
2022-12-30 22:11 ` [PATCHSET v24.0 0/3] xfs: fix iget usage in directory scrub Darrick J. Wong
2022-12-30 22:11   ` [PATCH 2/3] xfs: xfs_iget in the directory scrubber needs to use UNTRUSTED Darrick J. Wong
2022-12-30 22:11   ` [PATCH 1/3] xfs: make checking directory dotdot entries more reliable Darrick J. Wong
2022-12-30 22:11   ` [PATCH 3/3] xfs: always check the existence of a dirent's child inode Darrick J. Wong
2022-12-30 22:11 ` [PATCHSET v24.0 0/6] xfs: detect mergeable and overlapping btree records Darrick J. Wong
2022-12-30 22:11   ` [PATCH 1/6] xfs: change bmap scrubber to store the previous mapping Darrick J. Wong
2022-12-30 22:11   ` [PATCH 2/6] xfs: alert the user about data/attr fork mappings that could be merged Darrick J. Wong
2022-12-30 22:11   ` [PATCH 5/6] xfs: check overlapping rmap btree records Darrick J. Wong
2022-12-30 22:11   ` [PATCH 4/6] xfs: flag refcount btree records that could be merged Darrick J. Wong
2022-12-30 22:11   ` [PATCH 3/6] xfs: flag free space " Darrick J. Wong
2022-12-30 22:11   ` [PATCH 6/6] xfs: check for reverse mapping " Darrick J. Wong
2022-12-30 22:11 ` [PATCHSET v24.0 00/11] xfs: clean up memory management in xattr scrub Darrick J. Wong
2022-12-30 22:11   ` [PATCH 04/11] xfs: split freemap from xchk_xattr_buf.buf Darrick J. Wong
2022-12-30 22:11   ` [PATCH 06/11] xfs: split valuebuf " Darrick J. Wong
2022-12-30 22:11   ` [PATCH 02/11] xfs: don't shadow @leaf in xchk_xattr_block Darrick J. Wong
2022-12-30 22:11   ` [PATCH 01/11] xfs: xattr scrub should ensure one namespace bit per name Darrick J. Wong
2022-12-30 22:11   ` [PATCH 03/11] xfs: remove unnecessary dstmap in xattr scrubber Darrick J. Wong
2022-12-30 22:11   ` [PATCH 05/11] xfs: split usedmap from xchk_xattr_buf.buf Darrick J. Wong
2022-12-30 22:11   ` [PATCH 07/11] xfs: remove flags argument from xchk_setup_xattr_buf Darrick J. Wong
2022-12-30 22:11   ` [PATCH 09/11] xfs: check used space of shortform xattr structures Darrick J. Wong
2022-12-30 22:11   ` [PATCH 10/11] xfs: clean up xattr scrub initialization Darrick J. Wong
2022-12-30 22:11   ` [PATCH 11/11] xfs: only allocate free space bitmap for xattr scrub if needed Darrick J. Wong
2022-12-30 22:11   ` [PATCH 08/11] xfs: move xattr scrub buffer allocation to top level function Darrick J. Wong
2022-12-30 22:11 ` [PATCHSET v24.0 0/3] xfs: rework online fsck incore bitmap Darrick J. Wong
2022-12-30 22:11   ` [PATCH 2/3] xfs: drop the _safe behavior from the xbitmap foreach macro Darrick J. Wong
2022-12-30 22:11   ` [PATCH 3/3] xfs: convert xbitmap to interval tree Darrick J. Wong
2022-12-30 22:11   ` [PATCH 1/3] xfs: remove the for_each_xbitmap_ helpers Darrick J. Wong
2022-12-30 22:11 ` [PATCHSET v24.0 0/5] xfs: strengthen rmapbt scrubbing Darrick J. Wong
2022-12-30 22:11   ` [PATCH 1/5] xfs: introduce bitmap type for AG blocks Darrick J. Wong
2022-12-30 22:11   ` [PATCH 5/5] xfs: cross-reference rmap records with refcount btrees Darrick J. Wong
2022-12-30 22:11   ` [PATCH 4/5] xfs: cross-reference rmap records with inode btrees Darrick J. Wong
2022-12-30 22:11   ` [PATCH 3/5] xfs: cross-reference rmap records with free space btrees Darrick J. Wong
2022-12-30 22:11   ` [PATCH 2/5] xfs: cross-reference rmap records with ag btrees Darrick J. Wong
2022-12-30 22:12 ` [PATCHSET v24.0 0/4] xfs: fix rmap btree key flag handling Darrick J. Wong
2022-12-30 22:12   ` [PATCH 1/4] xfs: fix rm_offset flag handling in rmap keys Darrick J. Wong
2022-12-30 22:12   ` [PATCH 2/4] xfs_repair: check low keys of rmap btrees Darrick J. Wong
2022-12-30 22:12   ` [PATCH 4/4] xfs_db: expose the unwritten flag in rmapbt keys Darrick J. Wong
2022-12-30 22:12   ` [PATCH 3/4] xfs_repair: warn about unwritten bits set in rmap btree keys Darrick J. Wong
2022-12-30 22:12 ` [PATCHSET v24.0 00/16] fstests: refactor online fsck stress tests Darrick J. Wong
2022-12-30 22:12   ` [PATCH 03/16] xfs/422: rework feature detection so we only test-format scratch once Darrick J. Wong
2022-12-30 22:12   ` [PATCH 01/16] xfs/422: create a new test group for fsstress/repair racers Darrick J. Wong
2022-12-30 22:12   ` [PATCH 07/16] fuzzy: give each test local control over what scrub stress tests get run Darrick J. Wong
2022-12-30 22:12   ` [PATCH 04/16] fuzzy: clean up scrub stress programs quietly Darrick J. Wong
2022-12-30 22:12   ` [PATCH 05/16] fuzzy: rework scrub stress output filtering Darrick J. Wong
2022-12-30 22:12   ` [PATCH 06/16] fuzzy: explicitly check for common/inject in _require_xfs_stress_online_repair Darrick J. Wong
2022-12-30 22:12   ` [PATCH 02/16] xfs/422: move the fsstress/freeze/scrub racing logic to common/fuzzy Darrick J. Wong
2022-12-30 22:12   ` [PATCH 12/16] fuzzy: increase operation count for each fsstress invocation Darrick J. Wong
2023-01-13 19:55     ` Zorro Lang
2023-01-13 21:28       ` Darrick J. Wong
2022-12-30 22:12   ` [PATCH 11/16] fuzzy: clear out the scratch filesystem if it's too full Darrick J. Wong
2022-12-30 22:12   ` [PATCH 09/16] fuzzy: make scrub stress loop control more robust Darrick J. Wong
2022-12-30 22:12   ` [PATCH 08/16] fuzzy: test the scrub stress subcommands before looping Darrick J. Wong
2022-12-30 22:12   ` [PATCH 13/16] fuzzy: clean up frozen fses after scrub stress testing Darrick J. Wong
2022-12-30 22:12   ` [PATCH 10/16] fuzzy: abort scrub stress testing if the scratch fs went down Darrick J. Wong
2022-12-30 22:12   ` [PATCH 14/16] fuzzy: make freezing optional for scrub stress tests Darrick J. Wong
2022-12-30 22:12   ` [PATCH 15/16] fuzzy: allow substitution of AG numbers when configuring scrub stress test Darrick J. Wong
2022-12-30 22:12   ` [PATCH 16/16] fuzzy: delay the start of the scrub loop when stress-testing scrub Darrick J. Wong
2022-12-30 22:12 ` [PATCHSET v24.0 0/3] fstests: refactor GETFSMAP stress tests Darrick J. Wong
2022-12-30 22:12   ` [PATCH 1/3] fuzzy: enhance scrub stress testing to use fsx Darrick J. Wong
2023-01-05  5:49     ` Zorro Lang
2023-01-05 18:28       ` Darrick J. Wong
2023-01-05 18:28     ` [PATCH v24.1 " Darrick J. Wong
2022-12-30 22:12   ` [PATCH 3/3] xfs: race fsmap with readonly remounts to detect crash or livelock Darrick J. Wong
2022-12-30 22:12   ` [PATCH 2/3] fuzzy: refactor fsmap stress test to use our helper functions Darrick J. Wong
2022-12-30 22:13 ` [PATCHSET v24.0 0/2] fstests: race online scrub with mount state changes Darrick J. Wong
2022-12-30 22:13   ` [PATCH 1/2] xfs: stress test xfs_scrub(8) with fsstress Darrick J. Wong
2022-12-30 22:13   ` [PATCH 2/2] xfs: stress test xfs_scrub(8) with freeze and ro-remount loops Darrick J. Wong
2023-01-13 20:10 ` [NYE DELUGE 1/4] xfs: all pending online scrub improvements Zorro Lang
2023-01-13 21:28   ` Darrick J. Wong
  -- strict thread matches above, loose matches on Subject: below --
2022-10-02 18:19 [PATCHSET v23.3 00/14] xfs: design documentation for online fsck Darrick J. Wong
2022-10-02 18:19 ` [PATCH 03/14] xfs: document the testing plan " Darrick J. Wong
2022-08-07 18:30 [PATCHSET v2 00/14] xfs: design documentation " Darrick J. Wong
2022-08-07 18:30 ` [PATCH 03/14] xfs: document the testing plan " Darrick J. Wong
2022-08-11  0:09   ` Dave Chinner
2022-08-16  2:18     ` Darrick J. Wong
