From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: darrick.wong@oracle.com
Cc: linux-xfs@vger.kernel.org, linux-doc@vger.kernel.org, corbet@lwn.net
Subject: [PATCH 12/22] docs: add XFS reverse mapping structures to the DS&A book
Date: Wed, 03 Oct 2018 21:19:40 -0700
Message-ID: <153862678010.26427.10700488839888247014.stgit@magnolia>
In-Reply-To: <153862669110.26427.16504658853992750743.stgit@magnolia>

From: Darrick J. Wong <darrick.wong@oracle.com>

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 .../xfs-data-structures/allocation_groups.rst      |    2 
 .../filesystems/xfs-data-structures/rmapbt.rst     |  336 ++++++++++++++++++++
 2 files changed, 338 insertions(+)
 create mode 100644 Documentation/filesystems/xfs-data-structures/rmapbt.rst


diff --git a/Documentation/filesystems/xfs-data-structures/allocation_groups.rst b/Documentation/filesystems/xfs-data-structures/allocation_groups.rst
index 30d169ab5cc5..6c0ffd3a170b 100644
--- a/Documentation/filesystems/xfs-data-structures/allocation_groups.rst
+++ b/Documentation/filesystems/xfs-data-structures/allocation_groups.rst
@@ -1379,3 +1379,5 @@ response times that come from metadata operations.
 
 None of the XFS per-AG B+trees are involved with real time files. It is not
 possible for real time files to share data blocks.
+
+.. include:: rmapbt.rst
diff --git a/Documentation/filesystems/xfs-data-structures/rmapbt.rst b/Documentation/filesystems/xfs-data-structures/rmapbt.rst
new file mode 100644
index 000000000000..eefcee5d4e95
--- /dev/null
+++ b/Documentation/filesystems/xfs-data-structures/rmapbt.rst
@@ -0,0 +1,336 @@
+.. SPDX-License-Identifier: CC-BY-SA-4.0
+
+Reverse-Mapping B+tree
+~~~~~~~~~~~~~~~~~~~~~~
+
+If the feature is enabled, each allocation group has its own reverse
+block-mapping B+tree, which grows in the free space like the free space
+B+trees. As mentioned in the chapter about
+`reconstruction <#metadata-reconstruction>`__, this data structure is another piece of
+the puzzle necessary to reconstruct the data or attribute fork of a file from
+reverse-mapping records; we can also use it to double-check allocations to
+ensure that we are not accidentally cross-linking blocks, which can cause
+severe damage to the filesystem.
+
+This B+tree is only present if the XFS\_SB\_FEAT\_RO\_COMPAT\_RMAPBT feature
+is enabled. The feature requires a version 5 filesystem.
+
+Each record in the reverse-mapping B+tree has the following structure:
+
+.. code:: c
+
+    struct xfs_rmap_rec {
+         __be32                     rm_startblock;
+         __be32                     rm_blockcount;
+         __be64                     rm_owner;
+         __be64                     rm_fork:1;
+         __be64                     rm_bmbt:1;
+         __be64                     rm_unwritten:1;
+         __be64                     rm_unused:7;
+         __be64                     rm_offset:54;
+    };
+
+**rm\_startblock**
+    AG block number of this record.
+
+**rm\_blockcount**
+    The length of this extent.
+
+**rm\_owner**
+    A 64-bit number describing the owner of this extent. This is typically the
+    absolute inode number, but can also correspond to one of the following:
+
+.. list-table::
+   :widths: 28 52
+   :header-rows: 1
+
+   * - Owner value
+     - Description
+   * - XFS\_RMAP\_OWN\_NULL
+     - No owner. This should never appear on disk.
+
+   * - XFS\_RMAP\_OWN\_UNKNOWN
+     - Unknown owner; for EFI recovery. This should never appear on disk.
+
+   * - XFS\_RMAP\_OWN\_FS
+     - Allocation group headers.
+
+   * - XFS\_RMAP\_OWN\_LOG
+     - XFS log blocks.
+
+   * - XFS\_RMAP\_OWN\_AG
+     - Per-allocation group B+tree blocks. This means free space B+tree blocks,
+       blocks on the freelist, and reverse-mapping B+tree blocks.
+
+   * - XFS\_RMAP\_OWN\_INOBT
+     - Per-allocation group inode B+tree blocks. This includes free inode
+       B+tree blocks.
+
+   * - XFS\_RMAP\_OWN\_INODES
+     - Inode chunks.
+
+   * - XFS\_RMAP\_OWN\_REFC
+     - Per-allocation group refcount B+tree blocks. This is used for
+       reflink support.
+
+   * - XFS\_RMAP\_OWN\_COW
+     - Blocks that have been reserved for a copy-on-write operation that has
+       not completed.
+
+Table: Special owner values
+
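The special owner codes in the table above are stored in rm_owner as negative
numbers cast to 64 bits. A minimal sketch of mapping a code back to its name
follows. The numeric values are an assumption taken from the kernel's
xfs_format.h (the xfs_db example in this chapter only confirms -5 and -7), and
rmap_special_owner_name() is a hypothetical helper, not an XFS function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical lookup table; codes assumed from the kernel's xfs_format.h. */
struct rmap_owner_name {
    uint64_t    code;
    const char  *name;
};

static const struct rmap_owner_name rmap_special_owners[] = {
    { (uint64_t)-1LL, "XFS_RMAP_OWN_NULL" },
    { (uint64_t)-2LL, "XFS_RMAP_OWN_UNKNOWN" },
    { (uint64_t)-3LL, "XFS_RMAP_OWN_FS" },
    { (uint64_t)-4LL, "XFS_RMAP_OWN_LOG" },
    { (uint64_t)-5LL, "XFS_RMAP_OWN_AG" },
    { (uint64_t)-6LL, "XFS_RMAP_OWN_INOBT" },
    { (uint64_t)-7LL, "XFS_RMAP_OWN_INODES" },
    { (uint64_t)-8LL, "XFS_RMAP_OWN_REFC" },
    { (uint64_t)-9LL, "XFS_RMAP_OWN_COW" },
};

/*
 * Return the symbolic name of a special owner code, or NULL if the value
 * is not in the table and should be read as an inode number instead.
 */
static const char *rmap_special_owner_name(uint64_t owner)
{
    size_t i;
    size_t n = sizeof(rmap_special_owners) / sizeof(rmap_special_owners[0]);

    for (i = 0; i < n; i++) {
        if (rmap_special_owners[i].code == owner)
            return rmap_special_owners[i].name;
    }
    return NULL;
}
```

With these assumed values, an owner of -7 resolves to XFS_RMAP_OWN_INODES,
while an ordinary inode number such as 259615 yields NULL.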
+**rm\_fork**
+    If rm\_owner describes an inode, this can be 1 if this record is for an
+    attribute fork.
+
+**rm\_bmbt**
+    If rm\_owner describes an inode, this can be 1 to signify that this record
+    is for a block map B+tree block. In this case, rm\_offset has no meaning.
+
+**rm\_unwritten**
+    A flag indicating that the extent is unwritten. This corresponds to the
+    flag in the `extent record <#data-extents>`__ format which means
+    XFS\_EXT\_UNWRITTEN.
+
+**rm\_offset**
+    The 54-bit logical file block offset, if rm\_owner describes an inode.
+    Meaningless otherwise.
+
+    **Note**
+
+    The single-bit flag values rm\_unwritten, rm\_fork, and rm\_bmbt are
+    packed into the larger fields in the C structure definition.
+
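As the note says, the flag bits share a 64-bit word with rm_offset. The sketch
below decodes one 24-byte on-disk record from a raw big-endian buffer. It
assumes the bitfields in the structure above are listed from the most
significant bit down (fork at bit 63, bmbt at bit 62, unwritten at bit 61,
offset in the low 54 bits); the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed bit positions within the 64-bit offset word. */
#define RMAP_OFF_ATTR_FORK   (UINT64_C(1) << 63)
#define RMAP_OFF_BMBT_BLOCK  (UINT64_C(1) << 62)
#define RMAP_OFF_UNWRITTEN   (UINT64_C(1) << 61)
#define RMAP_OFF_MASK        ((UINT64_C(1) << 54) - 1)

/* Hypothetical in-memory (host-endian) form of a reverse-mapping record. */
struct rmap_irec {
    uint32_t startblock;
    uint32_t blockcount;
    uint64_t owner;
    uint64_t offset;
    bool     attr_fork;
    bool     bmbt_block;
    bool     unwritten;
};

/* Read a big-endian 32-bit value from a byte buffer. */
static uint32_t get_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Read a big-endian 64-bit value from a byte buffer. */
static uint64_t get_be64(const uint8_t *p)
{
    return ((uint64_t)get_be32(p) << 32) | get_be32(p + 4);
}

/* Decode one 24-byte record: startblock, blockcount, owner, offset word. */
static struct rmap_irec rmap_rec_decode(const uint8_t *buf)
{
    struct rmap_irec r;
    uint64_t off;

    assert(buf != NULL);
    r.startblock = get_be32(buf);
    r.blockcount = get_be32(buf + 4);
    r.owner = get_be64(buf + 8);
    off = get_be64(buf + 16);
    r.offset = off & RMAP_OFF_MASK;            /* low 54 bits */
    r.attr_fork = (off & RMAP_OFF_ATTR_FORK) != 0;
    r.bmbt_block = (off & RMAP_OFF_BMBT_BLOCK) != 0;
    r.unwritten = (off & RMAP_OFF_UNWRITTEN) != 0;
    return r;
}
```

Reading the buffer byte by byte avoids any dependence on host endianness or on
how a particular compiler lays out bitfields.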
+The key has the following structure:
+
+.. code:: c
+
+    struct xfs_rmap_key {
+         __be32                     rm_startblock;
+         __be64                     rm_owner;
+         __be64                     rm_fork:1;
+         __be64                     rm_bmbt:1;
+         __be64                     rm_reserved:1;
+         __be64                     rm_unused:7;
+         __be64                     rm_offset:54;
+    };
+
+For the reverse-mapping B+tree on a filesystem that supports sharing of file
+data blocks, the key definition is larger than the usual AG block number. On a
+classic XFS filesystem, each block has only one owner, which means that
+rm\_startblock is sufficient to uniquely identify each record. However, shared
+block support (reflink) on XFS breaks that assumption; now filesystem blocks
+can be linked to any logical block offset of any file inode. Therefore, the
+key must include the owner and offset information to preserve the one-to-one
+relation between key and record.
+
+-  As the reverse mapping is AG relative, all the block numbers are only
+   32 bits.
+
+-  The bb\_magic value is "RMB3" (0x524d4233).
+
+-  The xfs\_btree\_sblock\_t header is used for intermediate B+tree nodes as
+   well as the leaves.
+
+-  Each pointer is associated with two keys. The first of these is the "low
+   key", which is the key of the smallest record accessible through the
+   pointer. This low key has the same meaning as the key in all other btrees.
+   The second is the "high key", which is the largest key that can be used to
+   access any record underneath the pointer. Recall that each record in the
+   reverse-mapping B+tree describes an interval of physical blocks mapped to
+   an interval of logical file block offsets; therefore, a range of keys can
+   be used to find a given record.
+
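The low and high keys described in the last bullet can be derived from a
record as sketched below: the low key copies the record's start, and the high
key advances both the physical block and the logical offset by blockcount - 1.
This is a sketch under stated assumptions, not the kernel's code. It uses
host-endian structures with hypothetical names, and it assumes that special
owners have the top bit of rm_owner set and that the offset is left alone for
non-inode owners and for block map B+tree blocks, where rm_offset has no
meaning.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical host-endian record and key, flags already unpacked. */
struct rmap_rec {
    uint32_t startblock;
    uint32_t blockcount;
    uint64_t owner;
    uint64_t offset;
    bool     bmbt_block;
};

struct rmap_key {
    uint32_t startblock;
    uint64_t owner;
    uint64_t offset;
};

/* Assumption: special (non-inode) owners have bit 63 set. */
static bool rmap_owner_is_inode(uint64_t owner)
{
    return (owner >> 63) == 0;
}

/* Low key: the first physical block and logical offset of the record. */
static struct rmap_key rmap_low_key(const struct rmap_rec *rec)
{
    struct rmap_key key;

    assert(rec != NULL);
    key.startblock = rec->startblock;
    key.owner = rec->owner;
    key.offset = rec->offset;
    return key;
}

/* High key: the last physical block and logical offset of the record. */
static struct rmap_key rmap_high_key(const struct rmap_rec *rec)
{
    struct rmap_key key;
    uint32_t adj;

    assert(rec != NULL && rec->blockcount > 0);
    adj = rec->blockcount - 1;
    key.startblock = rec->startblock + adj;
    key.owner = rec->owner;
    key.offset = rec->offset;
    /* Only file data and attr extents have a meaningful logical offset. */
    if (rmap_owner_is_inode(rec->owner) && !rec->bmbt_block)
        key.offset += adj;
    return key;
}
```

For a record starting at AG block 24 with 525 blocks, owner 5761, and offset 0,
this gives a low key of (24, 5761, 0) and a high key of (548, 5761, 524).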
+xfs\_db rmapbt Example
+^^^^^^^^^^^^^^^^^^^^^^
+
+This example shows a reverse-mapping B+tree from a freshly populated root
+filesystem:
+
+::
+
+    xfs_db> agf 0
+    xfs_db> addr rmaproot
+    xfs_db> p
+    magic = 0x524d4233
+    level = 1
+    numrecs = 43
+    leftsib = null
+    rightsib = null
+    bno = 56
+    lsn = 0x3000004c8
+    uuid = 1977221d-8345-464e-b1f4-aa2ea36895f4
+    owner = 0
+    crc = 0x7cf8be6f (correct)
+    keys[1-43] = [startblock,owner,offset,attrfork,bmbtblock,startblock_hi,owner_hi,
+             offset_hi,attrfork_hi,bmbtblock_hi]
+            1:[0,-3,0,0,0,351,4418,66,0,0]
+            2:[417,285,0,0,0,827,4419,2,0,0]
+            3:[829,499,0,0,0,2352,573,55,0,0]
+            4:[1292,710,0,0,0,32168,262923,47,0,0]
+            5:[32215,-5,0,0,0,34655,2365,3411,0,0]
+            6:[34083,1161,0,0,0,34895,265220,1,0,1]
+            7:[34896,256191,0,0,0,36522,-9,0,0,0]
+            ...
+            41:[50998,326734,0,0,0,51430,-5,0,0,0]
+            42:[51431,327010,0,0,0,51600,325722,11,0,0]
+            43:[51611,327112,0,0,0,94063,23522,28375272,0,0]
+    ptrs[1-43] = 1:5 2:6 3:8 4:9 5:10 6:11 7:418 ... 41:46377 42:48784 43:49522
+
+We arbitrarily pick pointer 17 to traverse downwards:
+
+::
+
+    xfs_db> addr ptrs[17]
+    xfs_db> p
+    magic = 0x524d4233
+    level = 0
+    numrecs = 168
+    leftsib = 36284
+    rightsib = 37617
+    bno = 294760
+    lsn = 0x200002761
+    uuid = 1977221d-8345-464e-b1f4-aa2ea36895f4
+    owner = 0
+    crc = 0x2dad3fbe (correct)
+    recs[1-168] = [startblock,blockcount,owner,offset,extentflag,attrfork,bmbtblock]
+            1:[40326,1,259615,0,0,0,0] 2:[40327,1,-5,0,0,0,0]
+            3:[40328,2,259618,0,0,0,0] 4:[40330,1,259619,0,0,0,0]
+            ...
+            127:[40540,1,324266,0,0,0,0] 128:[40541,1,324266,8388608,0,0,0]
+            129:[40542,2,324266,1,0,0,0] 130:[40544,32,-7,0,0,0,0]
+
+Several interesting things pop out here. The first record shows that inode
+259,615 has mapped AG block 40,326 at offset 0. We confirm this by looking at
+the block map for that inode:
+
+::
+
+    xfs_db> inode 259615
+    xfs_db> bmap
+    data offset 0 startblock 40326 (0/40326) count 1 flag 0
+
+Next, notice records 127 and 128, which describe neighboring AG blocks that
+are mapped to non-contiguous logical blocks in inode 324,266. Given the
+logical offset of 8,388,608 we surmise that this is a leaf directory, but let
+us confirm:
+
+::
+
+    xfs_db> inode 324266
+    xfs_db> p core.mode
+    core.mode = 040755
+    xfs_db> bmap
+    data offset 0 startblock 40540 (0/40540) count 1 flag 0
+    data offset 1 startblock 40542 (0/40542) count 2 flag 0
+    data offset 3 startblock 40576 (0/40576) count 1 flag 0
+    data offset 8388608 startblock 40541 (0/40541) count 1 flag 0
+    xfs_db> dblock 0
+    xfs_db> p dhdr.hdr.magic
+    dhdr.hdr.magic = 0x58444433
+    xfs_db> dblock 8388608
+    xfs_db> p lhdr.info.hdr.magic
+    lhdr.info.hdr.magic = 0x3df1
+
+Indeed, this inode 324,266 appears to be a leaf directory, as it has regular
+directory data blocks at low offsets, and a single leaf block.
+
+Notice further the two reverse-mapping records with negative owners. An owner
+of -7 corresponds to XFS\_RMAP\_OWN\_INODES, which is an inode chunk, and an
+owner code of -5 corresponds to XFS\_RMAP\_OWN\_AG, which covers free space
+B+tree blocks, blocks on the freelist, and reverse-mapping B+tree blocks.
+Let’s see if block 40,544 is part of an inode chunk:
+
+::
+
+    xfs_db> blockget
+    xfs_db> fsblock 40544
+    xfs_db> blockuse
+    block 40544 (0/40544) type inode
+    xfs_db> stack
+    1:
+            byte offset 166068224, length 4096
+            buffer block 324352 (fsbno 40544), 8 bbs
+            inode 324266, dir inode 324266, type data
+    xfs_db> type inode
+    xfs_db> p
+    core.magic = 0x494e
+
+Our suspicions are confirmed. Let’s also see which per-AG structure owns
+block 40,327:
+
+::
+
+    xfs_db> fsblock 40327
+    xfs_db> blockuse
+    block 40327 (0/40327) type btrmap
+    xfs_db> type rmapbt
+    xfs_db> p
+    magic = 0x524d4233
+
+As you can see, the reverse block-mapping B+tree is an important secondary
+metadata structure, which can be used to reconstruct damaged primary metadata.
+Now let’s look at an extended rmap btree:
+
+::
+
+    xfs_db> agf 0
+    xfs_db> addr rmaproot
+    xfs_db> p
+    magic = 0x34524d42
+    level = 1
+    numrecs = 5
+    leftsib = null
+    rightsib = null
+    bno = 6368
+    lsn = 0x100000d1b
+    uuid = 400f0928-6b88-4c37-af1e-cef1f8911f3f
+    owner = 0
+    crc = 0x8d4ace05 (correct)
+    keys[1-5] = [startblock,owner,offset,attrfork,bmbtblock,startblock_hi,owner_hi,offset_hi,attrfork_hi,bmbtblock_hi]
+    1:[0,-3,0,0,0,705,132,681,0,0]
+    2:[24,5761,0,0,0,548,5761,524,0,0]
+    3:[24,5929,0,0,0,380,5929,356,0,0]
+    4:[24,6097,0,0,0,212,6097,188,0,0]
+    5:[24,6277,0,0,0,807,-7,0,0,0]
+    ptrs[1-5] = 1:5 2:771 3:9 4:10 5:11
+
+The second pointer stores both the low key [24,5761,0,0,0] and the high key
+[548,5761,524,0,0], which means that we can expect block 771 to contain
+records starting at physical block 24, inode 5761, offset zero; and that one
+of the records can be used to find a reverse mapping for physical block 548,
+inode 5761, and offset 524:
+
+::
+
+    xfs_db> addr ptrs[2]
+    xfs_db> p
+    magic = 0x34524d42
+    level = 0
+    numrecs = 168
+    leftsib = 5
+    rightsib = 9
+    bno = 6168
+    lsn = 0x100000d1b
+    uuid = 400f0928-6b88-4c37-af1e-cef1f8911f3f
+    owner = 0
+    crc = 0xd58eff0e (correct)
+    recs[1-168] = [startblock,blockcount,owner,offset,extentflag,attrfork,bmbtblock]
+    1:[24,525,5761,0,0,0,0]
+    2:[24,524,5762,0,0,0,0]
+    3:[24,523,5763,0,0,0,0]
+    ...
+    166:[24,360,5926,0,0,0,0]
+    167:[24,359,5927,0,0,0,0]
+    168:[24,358,5928,0,0,0,0]
+
+Observe that the first record in the block starts at physical block 24, inode
+5761, offset zero, just as we expected. Note that this first record is also
+indexed by the highest key as provided in the node block; physical block 548,
+inode 5761, offset 524 is the very last block mapped by this record.
+Furthermore, note that record 168, despite being the last record in this
+block, has a lower maximum key (physical block 381, inode 5928, offset 357)
+than the first record.
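
The ordering observed here can be sketched as a key comparison followed by a
maximum over the per-record high keys. This is an illustration only. It
assumes keys are compared field by field in the order startblock, owner,
offset, with the owner treated as an unsigned 64-bit value, and all names are
hypothetical. The sample values come from records 1 and 2 of the leaf above:
each high key is the record's start plus blockcount - 1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical host-endian key: physical block, owner, logical offset. */
struct rmap_key {
    uint32_t startblock;
    uint64_t owner;
    uint64_t offset;
};

/* Compare two keys; returns <0, 0 or >0 like strcmp(). */
static int rmap_key_cmp(const struct rmap_key *a, const struct rmap_key *b)
{
    assert(a != NULL && b != NULL);
    if (a->startblock != b->startblock)
        return a->startblock < b->startblock ? -1 : 1;
    if (a->owner != b->owner)
        return a->owner < b->owner ? -1 : 1;
    if (a->offset != b->offset)
        return a->offset < b->offset ? -1 : 1;
    return 0;
}

/*
 * The high key stored next to a pointer is the largest high key of any
 * record underneath it, which need not belong to the last record.
 */
static struct rmap_key rmap_max_high_key(const struct rmap_key *highs,
                                         size_t n)
{
    struct rmap_key max;
    size_t i;

    assert(highs != NULL && n > 0);
    max = highs[0];
    for (i = 1; i < n; i++) {
        if (rmap_key_cmp(&highs[i], &max) > 0)
            max = highs[i];
    }
    return max;
}
```

Record 1 ([24,525,5761,0]) has the high key (548, 5761, 524) and record 2
([24,524,5762,0]) has (547, 5762, 523). Because startblock is compared first,
record 2 sorts after record 1 by low key yet has the smaller high key, so the
maximum over the block remains (548, 5761, 524).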
