[PATCH 18/22] docs: add XFS data extent map doc to the DS&A book

From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: darrick.wong@oracle.com
Cc: linux-xfs@vger.kernel.org, linux-doc@vger.kernel.org, corbet@lwn.net
Subject: [PATCH 18/22] docs: add XFS data extent map doc to the DS&A book
Date: Wed, 03 Oct 2018 21:20:18 -0700	[thread overview]
Message-ID: <153862681852.26427.2638721124503707692.stgit@magnolia> (raw)
In-Reply-To: <153862669110.26427.16504658853992750743.stgit@magnolia>

From: Darrick J. Wong <darrick.wong@oracle.com>

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 .../xfs-data-structures/data_extents.rst           |  337 ++++++++++++++++++++
 .../filesystems/xfs-data-structures/dynamic.rst    |    1 
 2 files changed, 338 insertions(+)
 create mode 100644 Documentation/filesystems/xfs-data-structures/data_extents.rst

diff --git a/Documentation/filesystems/xfs-data-structures/data_extents.rst b/Documentation/filesystems/xfs-data-structures/data_extents.rst
new file mode 100644
index 000000000000..a410397e9892
--- /dev/null
+++ b/Documentation/filesystems/xfs-data-structures/data_extents.rst
@@ -0,0 +1,337 @@
+.. SPDX-License-Identifier: CC-BY-SA-4.0
+
+Data Extents
+------------
+
+XFS manages space using extents, which are defined as a starting location and
+length. A fork in an XFS inode maps a logical offset to a space extent. This
+enables a file’s extent map to support sparse files (i.e. "holes" in the
+file). A flag is also used to specify if the extent has been preallocated but
+has not yet been written (unwritten extent).
+
+A file can have more than one extent if one chunk of contiguous disk space is
+not available for the file. As a file grows, the XFS space allocator will
+attempt to keep space contiguous and to merge extents. If more than one file
+is being allocated space in the same AG at the same time, multiple extents for
+the files will occur as the extent allocations interleave. The effect of this
+can vary depending on the extent allocator used in the XFS driver.
+
+An extent is 128 bits in size and uses the following packed layout:
+
+.. figure:: images/31.png
+   :alt: Extent record format
+
+   Extent record format
+
+The extent is represented by the xfs\_bmbt\_rec structure which uses a big
+endian format on-disk. In-core management of extents use the xfs\_bmbt\_irec
+structure which is the unpacked version of xfs\_bmbt\_rec:
+
+.. code:: c
+
+    struct xfs_bmbt_irec {
+         xfs_fileoff_t             br_startoff;
+         xfs_fsblock_t             br_startblock;
+         xfs_filblks_t             br_blockcount;
+         xfs_exntst_t              br_state;
+    };
+
+**br\_startoff**
+    Logical block offset of this mapping.
+
+**br\_startblock**
+    Filesystem block of this mapping.
+
+**br\_blockcount**
+    The length of this mapping.
+
+**br\_state**
+    The extent br\_state field uses the following enum declaration:
+
+.. code:: c
+
+    typedef enum {
+         XFS_EXT_NORM,
+         XFS_EXT_UNWRITTEN,
+         XFS_EXT_INVALID
+    } xfs_exntst_t;
+
+Some other points about extents:
+
+-  The xfs\_bmbt\_rec\_32\_t and xfs\_bmbt\_rec\_64\_t structures were
+   effectively the same as xfs\_bmbt\_rec\_t, just different representations
+   of the same 128 bits in on-disk big endian format. xfs\_bmbt\_rec\_32\_t
+   was removed and xfs\_bmbt\_rec\_64\_t renamed to xfs\_bmbt\_rec\_t some
+   time ago.
+
+-  When a file is created and written to, XFS will endeavour to keep the
+   extents within the same AG as the inode. It may use a different AG if the
+   AG is busy or there is no space left in it.
+
+-  If a file is zero bytes long, it will have no extents and di\_nblocks and
+   di\_nexents will be zero. Any file with data will have at least one extent,
+   and each extent can use from 1 to over 2 million blocks (2:sup:`21`) on the
+   filesystem. For a default 4KB block size filesystem, a single extent can be
+   up to 8GB in length.
+
+The following two subsections cover the two methods of storing extent
+information for a file. The first is the fastest and simplest where the inode
+completely contains an extent array to the file’s data. The second is slower
+and more complex B+tree which can handle thousands to millions of extents
+efficiently.
+
+Extent List
+~~~~~~~~~~~
+
+If the entire extent list is short enough to fit within the inode’s fork
+region, we say that the fork is in "extent list" format. This is the most
+optimal in terms of speed and resource consumption. The trade-off is the file
+can only have a few extents before the inode runs out of space.
+
+The data fork of the inode contains an array of extents; the size of the array
+is determined by the inode’s di\_nextents value.
+
+.. figure:: images/32.png
+   :alt: Inode data fork extent layout
+
+   Inode data fork extent layout
+
+The number of extents that can fit in the inode depends on the inode size and
+di\_forkoff. For a default 256 byte inode with no extended attributes, a file
+can have up to 9 extents with this format. On a default v5 filesystem with 512
+byte inodes, a file can have up to 21 extents with this format. Beyond that,
+extents have to use the B+tree format.
+
+xfs\_db Inode Data Fork Extents Example
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+An 8MB file with one extent:
+
+::
+
+    xfs_db> inode <inode#>
+    xfs_db> p
+    core.magic = 0x494e
+    core.mode = 0100644
+    core.version = 1
+    core.format = 2 (extents)
+    ...
+    core.size = 8294400
+    core.nblocks = 2025
+    core.extsize = 0
+    core.nextents = 1
+    core.naextents = 0
+    core.forkoff = 0
+    ...
+    u.bmx[0] = [startoff,startblock,blockcount,extentflag]
+        0:[0,25356,2025,0]
+
+A 24MB file with three extents:
+
+::
+
+    xfs_db> inode <inode#>
+    xfs_db> p
+    ...
+    core.format = 2 (extents)
+    ...
+    core.size = 24883200
+    core.nblocks = 6075
+    core.nextents = 3
+    ...
+    u.bmx[0-2] = [startoff,startblock,blockcount,extentflag]
+        0:[0,27381,2025,0]
+        1:[2025,31431,2025,0]
+        2:[4050,35481,2025,0]
+
+Raw disk version of the inode with the third extent highlighted (di\_u starts
+at offset 0x64):
+
+::
+
+    xfs_db> type text
+    xfs_db> p
+    00: 49 4e 81 a4 01 02 00 01 00 00 00 00 00 00 00 00 IN..............
+    10: 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 01 ................
+    20: 44 b6 88 dd 2f 8a ed d0 44 b6 88 f7 10 8c 5b de D.......D.......
+    30: 44 b6 88 f7 10 8c 5b d0 00 00 00 00 01 7b b0 00 D...............
+    40: 00 00 00 00 00 00 17 bb 00 00 00 00 00 00 00 03 ................
+    50: 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00 00 ................
+    60: ff ff ff ff 00 00 00 00 00 00 00 00 00 00 00 0d ................
+    70: 5e a0 07 e9 00 00 00 00 00 0f d2 00 00 00 00 0f ................
+    80: 58 e0 07 e9 00 00 00 00 00 1f a4 00 00 00 00 11 X...............
+    90: 53 20 07 e9 00 00 00 00 00 00 00 00 00 00 00 00 S...............
+    a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
+    be: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
+    co: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
+    do: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
+    e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
+    fo: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
+
+We can expand the highlighted section into the following bit array from MSB to
+LSB with the file offset and the block count highlighted:
+
+::
+
+    127-96:  0000 0000 0000 0000  0000 0000 0000 0000
+     95-64:  0000 0000 0001 1111  1010 0100 0000 0000
+     63-32:  0000 0000 0000 0000  0000 0000 0000 1111
+     31-0 :  0101 1000 1110 0000  0000 0111 1110 1001
+
+    Grouping by highlights we get:
+       file offset = 0x0fd2 (4050)
+       start block = 0x7ac7 (31431)
+       block count = 0x07e9 (2025)
+
+A 4MB file with two extents and a hole in the middle, the first extent
+containing 64KB of data, the second about 4MB in containing 32KB (write 64KB,
+lseek 4MB, write 32KB operations):
+
+::
+
+    xfs_db> inode <inode#>
+    xfs_db> p
+    ...
+    core.format = 2 (extents)
+    ...
+    core.size = 4063232
+    core.nblocks = 24
+    core.nextents = 2
+    ...
+    u.bmx[0-1] = [startoff,startblock,blockcount,extentflag]
+        0:[0,37506,16,0]
+        1:[984,37522,8,0]
+
+B+tree Extent List
+~~~~~~~~~~~~~~~~~~
+
+To manage extent maps that cannot fit in the inode fork area, XFS uses `long
+format B+trees <#long-format-b-trees>`__. The root node of the B+tree is stored
+in the inode’s data fork. All block pointers for extent B+trees are 64-bit
+filesystem block numbers.
+
+For a single level B+tree, the root node points to the B+tree’s leaves. Each
+leaf occupies one filesystem block and contains a header and an array of
+extents sorted by the file’s offset. Each leaf has left and right (or backward
+and forward) block pointers to adjacent leaves. For a standard 4KB filesystem
+block, a leaf can contain up to 254 extents before a B+tree rebalance is
+triggered.
+
+For a multi-level B+tree, the root node points to other B+tree nodes which
+eventually point to the extent leaves. B+tree keys are based on the file’s
+offset and have pointers to the next level down. Nodes at each level in the
+B+tree also have pointers to the adjacent nodes.
+
+The base B+tree node is used for extents, directories and extended attributes.
+The structures used for an inode’s B+tree root are:
+
+.. code:: c
+
+    struct xfs_bmdr_block {
+         __be16                     bb_level;
+         __be16                     bb_numrecs;
+    };
+    struct xfs_bmbt_key {
+         xfs_fileoff_t              br_startoff;
+    };
+    typedef xfs_fsblock_t xfs_bmbt_ptr_t, xfs_bmdr_ptr_t;
+
+-  On disk, the B+tree node starts with the xfs\_bmdr\_block\_t header
+   followed by an array of xfs\_bmbt\_key\_t values and then an array of
+   xfs\_bmbt\_ptr\_t values. The size of both arrays is specified by the
+   header’s bb\_numrecs value.
+
+-  The root node in the inode can only contain up to 9 key/pointer pairs for a
+   standard 256 byte inode before a new level of nodes is added between the
+   root and the leaves. This will be less if di\_forkoff is not zero (i.e.
+   attributes are in use on the inode).
+
+-  The magic number for a BMBT block is "BMAP" (0x424d4150).  On a v5
+   filesystem, this is "BMA3" (0x424d4133).
+
+-  For intermediate nodes, the data following xfs\_btree\_lblock is the same
+   as the root node: array of xfs\_bmbt\_key value followed by an array of
+   xfs\_bmbt\_ptr\_t values that starts halfway through the block (offset
+   0x808 for a 4096 byte filesystem block).
+
+-  For leaves, an array of xfs\_bmbt\_rec extents follow the
+   xfs\_btree\_lblock header.
+
+-  Nodes and leaves use the same value for bb\_magic.
+
+-  The bb\_level value determines if the node is an intermediate node or a
+   leaf. Leaves have a bb\_level of zero, nodes are one or greater.
+
+-  Intermediate nodes, like leaves, can contain up to 254 pointers to leaf
+   blocks for a standard 4KB filesystem block size as both the keys and
+   pointers are 64 bits in size.
+
+.. ifconfig:: builder != 'latex'
+
+   .. figure:: images/35.png
+      :alt: Single level extent B+tree
+
+      Single level extent B+tree
+
+   .. figure:: images/36.png
+      :alt: Multiple level extent B+tree
+
+      Multiple level extent B+tree
+
+.. ifconfig:: builder == 'latex'
+
+   .. figure:: images/35.png
+      :scale: 45%
+      :alt: Single level extent B+tree
+
+      Single level extent B+tree
+
+   .. figure:: images/36.png
+      :scale: 40%
+      :alt: Multiple level extent B+tree
+
+      Multiple level extent B+tree
+
+xfs\_db bmbt Example
+^^^^^^^^^^^^^^^^^^^^
+
+In this example, we dissect the data fork of a VM image that is sufficiently
+sparse and interleaved to have become a B+tree.
+
+::
+
+    xfs_db> inode 132
+    xfs_db> p
+    core.magic = 0x494e
+    core.mode = 0100600
+    core.version = 3
+    core.format = 3 (btree)
+    ...
+    u3.bmbt.level = 1
+    u3.bmbt.numrecs = 3
+    u3.bmbt.keys[1-3] = [startoff] 1:[0] 2:[9072] 3:[13136]
+    u3.bmbt.ptrs[1-3] = 1:8568 2:8569 3:8570
+
+As you can see, the block map B+tree is rooted in the inode. This tree has two
+levels, so let’s go down a level to look at the records:
+
+::
+
+    xfs_db> addr u3.bmbt.ptrs[1]
+    xfs_db> p
+    magic = 0x424d4133
+    level = 0
+    numrecs = 251
+    leftsib = null
+    rightsib = 8569
+    bno = 68544
+    lsn = 0x100000006
+    uuid = 9579903c-333f-4673-a7d4-3254c05816ea
+    owner = 132
+    crc = 0xc61513dc (correct)
+    recs[1-251] = [startoff,startblock,blockcount,extentflag]
+            1:[0,8520,48,0] 2:[48,4421,16,0] 3:[80,9136,16,0] 4:[96,8569,16,0]
+            5:[144,8601,32,0] 6:[192,8637,16,0] 7:[240,8680,16,0] 8:[288,9870,16,0]
+            9:[320,9920,16,0] 10:[336,9950,16,0] 11:[384,4004,32,0]
+            12:[432,6771,16,0] 13:[480,2702,16,0] 14:[528,8420,16,0]
+            ...
diff --git a/Documentation/filesystems/xfs-data-structures/dynamic.rst b/Documentation/filesystems/xfs-data-structures/dynamic.rst
index 945b07be2034..5ba6f3940808 100644
--- a/Documentation/filesystems/xfs-data-structures/dynamic.rst
+++ b/Documentation/filesystems/xfs-data-structures/dynamic.rst
@@ -4,3 +4,4 @@ Dynamic Allocated Structures
 ============================
 
 .. include:: ondisk_inode.rst
+.. include:: data_extents.rst