All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: darrick.wong@oracle.com
Cc: linux-xfs@vger.kernel.org, linux-doc@vger.kernel.org, corbet@lwn.net
Subject: [PATCH 18/22] docs: add XFS data extent map doc to the DS&A book
Date: Wed, 03 Oct 2018 21:20:18 -0700	[thread overview]
Message-ID: <153862681852.26427.2638721124503707692.stgit@magnolia> (raw)
In-Reply-To: <153862669110.26427.16504658853992750743.stgit@magnolia>

From: Darrick J. Wong <darrick.wong@oracle.com>

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 .../xfs-data-structures/data_extents.rst           |  337 ++++++++++++++++++++
 .../filesystems/xfs-data-structures/dynamic.rst    |    1 
 2 files changed, 338 insertions(+)
 create mode 100644 Documentation/filesystems/xfs-data-structures/data_extents.rst


diff --git a/Documentation/filesystems/xfs-data-structures/data_extents.rst b/Documentation/filesystems/xfs-data-structures/data_extents.rst
new file mode 100644
index 000000000000..a410397e9892
--- /dev/null
+++ b/Documentation/filesystems/xfs-data-structures/data_extents.rst
@@ -0,0 +1,337 @@
+.. SPDX-License-Identifier: CC-BY-SA-4.0
+
+Data Extents
+------------
+
+XFS manages space using extents, which are defined as a starting location and
+length. A fork in an XFS inode maps a logical offset to a space extent. This
+enables a file’s extent map to support sparse files (i.e. "holes" in the
+file). A flag is also used to specify if the extent has been preallocated but
+has not yet been written (unwritten extent).
+
+A file can have more than one extent if one chunk of contiguous disk space is
+not available for the file. As a file grows, the XFS space allocator will
+attempt to keep space contiguous and to merge extents. If more than one file
+is being allocated space in the same AG at the same time, multiple extents for
+the files will occur as the extent allocations interleave. The effect of this
+can vary depending on the extent allocator used in the XFS driver.
+
+An extent is 128 bits in size and uses the following packed layout:
+
+.. figure:: images/31.png
+   :alt: Extent record format
+
+   Extent record format
+
+The extent is represented by the xfs\_bmbt\_rec structure which uses a big
+endian format on-disk. In-core management of extents use the xfs\_bmbt\_irec
+structure which is the unpacked version of xfs\_bmbt\_rec:
+
+.. code:: c
+
+    struct xfs_bmbt_irec {
+         xfs_fileoff_t             br_startoff;
+         xfs_fsblock_t             br_startblock;
+         xfs_filblks_t             br_blockcount;
+         xfs_exntst_t              br_state;
+    };
+
+**br\_startoff**
+    Logical block offset of this mapping.
+
+**br\_startblock**
+    Filesystem block of this mapping.
+
+**br\_blockcount**
+    The length of this mapping.
+
+**br\_state**
+    The extent br\_state field uses the following enum declaration:
+
+.. code:: c
+
+    typedef enum {
+         XFS_EXT_NORM,
+         XFS_EXT_UNWRITTEN,
+         XFS_EXT_INVALID
+    } xfs_exntst_t;
+
+Some other points about extents:
+
+-  The xfs\_bmbt\_rec\_32\_t and xfs\_bmbt\_rec\_64\_t structures were
+   effectively the same as xfs\_bmbt\_rec\_t, just different representations
+   of the same 128 bits in on-disk big endian format. xfs\_bmbt\_rec\_32\_t
+   was removed and xfs\_bmbt\_rec\_64\_t renamed to xfs\_bmbt\_rec\_t some
+   time ago.
+
+-  When a file is created and written to, XFS will endeavour to keep the
+   extents within the same AG as the inode. It may use a different AG if the
+   AG is busy or there is no space left in it.
+
+-  If a file is zero bytes long, it will have no extents and di\_nblocks and
+   di\_nexents will be zero. Any file with data will have at least one extent,
+   and each extent can use from 1 to over 2 million blocks (2:sup:`21`) on the
+   filesystem. For a default 4KB block size filesystem, a single extent can be
+   up to 8GB in length.
+
+The following two subsections cover the two methods of storing extent
+information for a file. The first is the fastest and simplest where the inode
+completely contains an extent array to the file’s data. The second is slower
+and more complex B+tree which can handle thousands to millions of extents
+efficiently.
+
+Extent List
+~~~~~~~~~~~
+
+If the entire extent list is short enough to fit within the inode’s fork
+region, we say that the fork is in "extent list" format. This is the most
+optimal in terms of speed and resource consumption. The trade-off is the file
+can only have a few extents before the inode runs out of space.
+
+The data fork of the inode contains an array of extents; the size of the array
+is determined by the inode’s di\_nextents value.
+
+.. figure:: images/32.png
+   :alt: Inode data fork extent layout
+
+   Inode data fork extent layout
+
+The number of extents that can fit in the inode depends on the inode size and
+di\_forkoff. For a default 256 byte inode with no extended attributes, a file
+can have up to 9 extents with this format. On a default v5 filesystem with 512
+byte inodes, a file can have up to 21 extents with this format. Beyond that,
+extents have to use the B+tree format.
+
+xfs\_db Inode Data Fork Extents Example
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+An 8MB file with one extent:
+
+::
+
+    xfs_db> inode <inode#>
+    xfs_db> p
+    core.magic = 0x494e
+    core.mode = 0100644
+    core.version = 1
+    core.format = 2 (extents)
+    ...
+    core.size = 8294400
+    core.nblocks = 2025
+    core.extsize = 0
+    core.nextents = 1
+    core.naextents = 0
+    core.forkoff = 0
+    ...
+    u.bmx[0] = [startoff,startblock,blockcount,extentflag]
+        0:[0,25356,2025,0]
+
+A 24MB file with three extents:
+
+::
+
+    xfs_db> inode <inode#>
+    xfs_db> p
+    ...
+    core.format = 2 (extents)
+    ...
+    core.size = 24883200
+    core.nblocks = 6075
+    core.nextents = 3
+    ...
+    u.bmx[0-2] = [startoff,startblock,blockcount,extentflag]
+        0:[0,27381,2025,0]
+        1:[2025,31431,2025,0]
+        2:[4050,35481,2025,0]
+
+Raw disk version of the inode with the third extent highlighted (di\_u starts
+at offset 0x64):
+
+::
+
+    xfs_db> type text
+    xfs_db> p
+    00: 49 4e 81 a4 01 02 00 01 00 00 00 00 00 00 00 00 IN..............
+    10: 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 01 ................
+    20: 44 b6 88 dd 2f 8a ed d0 44 b6 88 f7 10 8c 5b de D.......D.......
+    30: 44 b6 88 f7 10 8c 5b d0 00 00 00 00 01 7b b0 00 D...............
+    40: 00 00 00 00 00 00 17 bb 00 00 00 00 00 00 00 03 ................
+    50: 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00 00 ................
+    60: ff ff ff ff 00 00 00 00 00 00 00 00 00 00 00 0d ................
+    70: 5e a0 07 e9 00 00 00 00 00 0f d2 00 00 00 00 0f ................
+    80: 58 e0 07 e9 00 00 00 00 00 1f a4 00 00 00 00 11 X...............
+    90: 53 20 07 e9 00 00 00 00 00 00 00 00 00 00 00 00 S...............
+    a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
+    be: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
+    co: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
+    do: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
+    e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
+    fo: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
+
+We can expand the highlighted section into the following bit array from MSB to
+LSB with the file offset and the block count highlighted:
+
+::
+
+    127-96:  0000 0000 0000 0000  0000 0000 0000 0000
+     95-64:  0000 0000 0001 1111  1010 0100 0000 0000
+     63-32:  0000 0000 0000 0000  0000 0000 0000 1111
+     31-0 :  0101 1000 1110 0000  0000 0111 1110 1001
+
+    Grouping by highlights we get:
+       file offset = 0x0fd2 (4050)
+       start block = 0x7ac7 (31431)
+       block count = 0x07e9 (2025)
+
+A 4MB file with two extents and a hole in the middle, the first extent
+containing 64KB of data, the second about 4MB in containing 32KB (write 64KB,
+lseek 4MB, write 32KB operations):
+
+::
+
+    xfs_db> inode <inode#>
+    xfs_db> p
+    ...
+    core.format = 2 (extents)
+    ...
+    core.size = 4063232
+    core.nblocks = 24
+    core.nextents = 2
+    ...
+    u.bmx[0-1] = [startoff,startblock,blockcount,extentflag]
+        0:[0,37506,16,0]
+        1:[984,37522,8,0]
+
+B+tree Extent List
+~~~~~~~~~~~~~~~~~~
+
+To manage extent maps that cannot fit in the inode fork area, XFS uses `long
+format B+trees <#long-format-b-trees>`__. The root node of the B+tree is stored
+in the inode’s data fork. All block pointers for extent B+trees are 64-bit
+filesystem block numbers.
+
+For a single level B+tree, the root node points to the B+tree’s leaves. Each
+leaf occupies one filesystem block and contains a header and an array of
+extents sorted by the file’s offset. Each leaf has left and right (or backward
+and forward) block pointers to adjacent leaves. For a standard 4KB filesystem
+block, a leaf can contain up to 254 extents before a B+tree rebalance is
+triggered.
+
+For a multi-level B+tree, the root node points to other B+tree nodes which
+eventually point to the extent leaves. B+tree keys are based on the file’s
+offset and have pointers to the next level down. Nodes at each level in the
+B+tree also have pointers to the adjacent nodes.
+
+The base B+tree node is used for extents, directories and extended attributes.
+The structures used for an inode’s B+tree root are:
+
+.. code:: c
+
+    struct xfs_bmdr_block {
+         __be16                     bb_level;
+         __be16                     bb_numrecs;
+    };
+    struct xfs_bmbt_key {
+         xfs_fileoff_t              br_startoff;
+    };
+    typedef xfs_fsblock_t xfs_bmbt_ptr_t, xfs_bmdr_ptr_t;
+
+-  On disk, the B+tree node starts with the xfs\_bmdr\_block\_t header
+   followed by an array of xfs\_bmbt\_key\_t values and then an array of
+   xfs\_bmbt\_ptr\_t values. The size of both arrays is specified by the
+   header’s bb\_numrecs value.
+
+-  The root node in the inode can only contain up to 9 key/pointer pairs for a
+   standard 256 byte inode before a new level of nodes is added between the
+   root and the leaves. This will be less if di\_forkoff is not zero (i.e.
+   attributes are in use on the inode).
+
+-  The magic number for a BMBT block is "BMAP" (0x424d4150).  On a v5
+   filesystem, this is "BMA3" (0x424d4133).
+
+-  For intermediate nodes, the data following xfs\_btree\_lblock is the same
+   as the root node: array of xfs\_bmbt\_key value followed by an array of
+   xfs\_bmbt\_ptr\_t values that starts halfway through the block (offset
+   0x808 for a 4096 byte filesystem block).
+
+-  For leaves, an array of xfs\_bmbt\_rec extents follow the
+   xfs\_btree\_lblock header.
+
+-  Nodes and leaves use the same value for bb\_magic.
+
+-  The bb\_level value determines if the node is an intermediate node or a
+   leaf. Leaves have a bb\_level of zero, nodes are one or greater.
+
+-  Intermediate nodes, like leaves, can contain up to 254 pointers to leaf
+   blocks for a standard 4KB filesystem block size as both the keys and
+   pointers are 64 bits in size.
+
+.. ifconfig:: builder != 'latex'
+
+   .. figure:: images/35.png
+      :alt: Single level extent B+tree
+
+      Single level extent B+tree
+
+   .. figure:: images/36.png
+      :alt: Multiple level extent B+tree
+
+      Multiple level extent B+tree
+
+.. ifconfig:: builder == 'latex'
+
+   .. figure:: images/35.png
+      :scale: 45%
+      :alt: Single level extent B+tree
+
+      Single level extent B+tree
+
+   .. figure:: images/36.png
+      :scale: 40%
+      :alt: Multiple level extent B+tree
+
+      Multiple level extent B+tree
+
+xfs\_db bmbt Example
+^^^^^^^^^^^^^^^^^^^^
+
+In this example, we dissect the data fork of a VM image that is sufficiently
+sparse and interleaved to have become a B+tree.
+
+::
+
+    xfs_db> inode 132
+    xfs_db> p
+    core.magic = 0x494e
+    core.mode = 0100600
+    core.version = 3
+    core.format = 3 (btree)
+    ...
+    u3.bmbt.level = 1
+    u3.bmbt.numrecs = 3
+    u3.bmbt.keys[1-3] = [startoff] 1:[0] 2:[9072] 3:[13136]
+    u3.bmbt.ptrs[1-3] = 1:8568 2:8569 3:8570
+
+As you can see, the block map B+tree is rooted in the inode. This tree has two
+levels, so let’s go down a level to look at the records:
+
+::
+
+    xfs_db> addr u3.bmbt.ptrs[1]
+    xfs_db> p
+    magic = 0x424d4133
+    level = 0
+    numrecs = 251
+    leftsib = null
+    rightsib = 8569
+    bno = 68544
+    lsn = 0x100000006
+    uuid = 9579903c-333f-4673-a7d4-3254c05816ea
+    owner = 132
+    crc = 0xc61513dc (correct)
+    recs[1-251] = [startoff,startblock,blockcount,extentflag]
+            1:[0,8520,48,0] 2:[48,4421,16,0] 3:[80,9136,16,0] 4:[96,8569,16,0]
+            5:[144,8601,32,0] 6:[192,8637,16,0] 7:[240,8680,16,0] 8:[288,9870,16,0]
+            9:[320,9920,16,0] 10:[336,9950,16,0] 11:[384,4004,32,0]
+            12:[432,6771,16,0] 13:[480,2702,16,0] 14:[528,8420,16,0]
+            ...
diff --git a/Documentation/filesystems/xfs-data-structures/dynamic.rst b/Documentation/filesystems/xfs-data-structures/dynamic.rst
index 945b07be2034..5ba6f3940808 100644
--- a/Documentation/filesystems/xfs-data-structures/dynamic.rst
+++ b/Documentation/filesystems/xfs-data-structures/dynamic.rst
@@ -4,3 +4,4 @@ Dynamic Allocated Structures
 ============================
 
 .. include:: ondisk_inode.rst
+.. include:: data_extents.rst

  parent reply	other threads:[~2018-10-04 11:11 UTC|newest]

Thread overview: 31+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-10-04  4:18 [PATCH v2 00/22] xfs-4.20: major documentation surgery Darrick J. Wong
2018-10-04  4:18 ` [PATCH 01/22] docs: add skeleton of XFS Data Structures and Algorithms book Darrick J. Wong
2018-10-04  4:18 ` [PATCH 03/22] docs: add XFS self-describing metadata integrity doc to DS&A book Darrick J. Wong
2018-10-04  4:18 ` [PATCH 04/22] docs: add XFS delayed logging design " Darrick J. Wong
2018-10-04  4:18 ` [PATCH 05/22] docs: add XFS shared data block chapter " Darrick J. Wong
2018-10-04  4:19 ` [PATCH 06/22] docs: add XFS online repair " Darrick J. Wong
2018-10-04  4:19 ` [PATCH 07/22] docs: add XFS common types and magic numbers " Darrick J. Wong
2018-10-04  4:19 ` [PATCH 08/22] docs: add XFS testing chapter to the " Darrick J. Wong
2018-10-04  4:19 ` [PATCH 09/22] docs: add XFS btrees " Darrick J. Wong
2018-10-04  4:19 ` [PATCH 10/22] docs: add XFS dir/attr btree structure " Darrick J. Wong
2018-10-04  4:19 ` [PATCH 11/22] docs: add XFS allocation group metadata " Darrick J. Wong
2018-10-04  4:19 ` [PATCH 12/22] docs: add XFS reverse mapping structures " Darrick J. Wong
2018-10-04  4:19 ` [PATCH 13/22] docs: add XFS refcount btree structure to " Darrick J. Wong
2018-10-04  4:19 ` [PATCH 14/22] docs: add XFS log to the " Darrick J. Wong
2018-10-04  4:19 ` [PATCH 15/22] docs: add XFS internal inodes " Darrick J. Wong
2018-10-04  4:20 ` [PATCH 16/22] docs: add preliminary XFS realtime rmapbt structures " Darrick J. Wong
2018-10-04  4:20 ` [PATCH 17/22] docs: add XFS inode format " Darrick J. Wong
2018-10-04  4:20 ` Darrick J. Wong [this message]
2018-10-04  4:20 ` [PATCH 19/22] docs: add XFS directory structure " Darrick J. Wong
2018-10-04  4:20 ` [PATCH 20/22] docs: add XFS extended attributes structures " Darrick J. Wong
2018-10-04  4:20 ` [PATCH 21/22] docs: add XFS symlink " Darrick J. Wong
2018-10-04  4:20 ` [PATCH 22/22] docs: add XFS metadump structure to " Darrick J. Wong
2018-10-06  0:51 ` [PATCH v2 00/22] xfs-4.20: major documentation surgery Dave Chinner
2018-10-06  1:01   ` Jonathan Corbet
2018-10-06  1:09     ` Dave Chinner
2018-10-06 13:29   ` Matthew Wilcox
2018-10-06 14:10     ` Jonathan Corbet
2018-10-11 17:27   ` Jonathan Corbet
2018-10-12  1:33     ` Dave Chinner
2018-10-15  9:55     ` Christoph Hellwig
2018-10-15 14:28       ` Jonathan Corbet

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=153862681852.26427.2638721124503707692.stgit@magnolia \
    --to=darrick.wong@oracle.com \
    --cc=corbet@lwn.net \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.