All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: david@fromorbit.com, darrick.wong@oracle.com
Cc: linux-xfs@vger.kernel.org, xfs@oss.sgi.com
Subject: [PATCH 4/7] xfsdocs: move the discussions of short and long format btrees to a separate chapter
Date: Thu, 25 Aug 2016 16:27:23 -0700	[thread overview]
Message-ID: <147216764310.32447.6110430430725995824.stgit@birch.djwong.org> (raw)
In-Reply-To: <147216761636.32447.4229640006064129056.stgit@birch.djwong.org>

Move the discussion of short and long format btrees into a separate
chapter.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 .../allocation_groups.asciidoc                     |   59 ------
 design/XFS_Filesystem_Structure/btrees.asciidoc    |  196 ++++++++++++++++++++
 .../XFS_Filesystem_Structure/data_extents.asciidoc |   72 +------
 .../xfs_filesystem_structure.asciidoc              |    2 
 4 files changed, 204 insertions(+), 125 deletions(-)
 create mode 100644 design/XFS_Filesystem_Structure/btrees.asciidoc


diff --git a/design/XFS_Filesystem_Structure/allocation_groups.asciidoc b/design/XFS_Filesystem_Structure/allocation_groups.asciidoc
index 0633175..55bbc50 100644
--- a/design/XFS_Filesystem_Structure/allocation_groups.asciidoc
+++ b/design/XFS_Filesystem_Structure/allocation_groups.asciidoc
@@ -612,65 +612,6 @@ Checksum of the AGF sector.
 *agf_spare2*::
 Empty space in the unlogged part of the AGF sector.
 
-[[Short_Format_Btrees]]
-=== Short Format B+trees
-
-Each allocation group uses a ``short format'' B+tree to index various
-information about the allocation group.  The structure is called short format
-because all block pointers are AG block numbers.  The trees use the following
-header:
-
-[source, c]
-----
-struct xfs_btree_sblock {
-     __be32                    bb_magic;
-     __be16                    bb_level;
-     __be16                    bb_numrecs;
-     __be32                    bb_leftsib;
-     __be32                    bb_rightsib;
-
-     /* version 5 filesystem fields start here */
-     __be64                    bb_blkno;
-     __be64                    bb_lsn;
-     uuid_t                    bb_uuid;
-     __be32                    bb_owner;
-     __le32                    bb_crc;
-};
-----
-
-*bb_magic*::
-Specifies the magic number for the per-AG B+tree block.
-
-*bb_level*::
-The level of the tree in which this block is found.  If this value is 0, this
-is a leaf block and contains records; otherwise, it is a node block and
-contains keys and pointers.
-
-*bb_numrecs*::
-Number of records in this block.
-
-*bb_leftsib*::
-AG block number of the left sibling of this B+tree node.
-
-*bb_rightsib*::
-AG block number of the right sibling of this B+tree node.
-
-*bb_blkno*::
-FS block number of this B+tree block.
-
-*bb_lsn*::
-Log sequence number of the last write to this block.
-
-*bb_uuid*::
-The UUID of this block, which must match either +sb_uuid+ or +sb_meta_uuid+
-depending on which features are set.
-
-*bb_owner*::
-The AG number that this B+tree block ought to be in.
-
-*bb_crc*::
-Checksum of the B+tree block.
-
 [[AG_Free_Space_Btrees]]
 === AG Free Space B+trees
 
diff --git a/design/XFS_Filesystem_Structure/btrees.asciidoc b/design/XFS_Filesystem_Structure/btrees.asciidoc
new file mode 100644
index 0000000..306e061
--- /dev/null
+++ b/design/XFS_Filesystem_Structure/btrees.asciidoc
@@ -0,0 +1,196 @@
+= B+trees
+
+XFS uses b+trees to index all metadata records.  This well known data structure
+is used to provide efficient random and sequential access to metadata records
+while minimizing seek times.  There are two btree formats: a short format
+for records pertaining to a single allocation group, since all block pointers
+in an AG are 32-bits in size; and a long format for records pertaining to a
+file, since file data can have 64-bit block offsets.  Each b+tree block is
+either a leaf node containing records, or an internal node containing keys and
+pointers to other b+tree blocks.  The tree consists of a root block which may
+point to some number of other blocks; blocks in the bottom level of the b+tree
+contains only records.
+
+Leaf blocks of both types of b+trees have the same general format: a header
+describing the data in the block, and an array of records.  The specific header
+formats are given in the next two sections, and the record format is provided
+by the b+tree client itself.  The generic b+tree code does not have any
+specific knowledge of the record format.
+
+----
++--------+------------+------------+
+| header |   record   | records... |
++--------+------------+------------+
+----
+
+Internal node blocks of both types of b+trees also have the same general
+format: a header describing the data in the block, an array of keys, and an
+array of pointers.  Each pointer may be associated with one or two keys.  The
+first key uniquely identifies the first record accessible via the leftmost path
+down the branch of the tree.
+
+If the records in a b+tree are indexed by an interval, then a range of keys can
+uniquely identify a single record.  For example, if a record covers blocks
+12-16, then any one of the keys 12, 13, 14, 15, or 16 return the same record.
+In this case, the key for the record describing "12-16" is 12.  If none of the
+records overlap, we only need to store one key.
+
+This is the format of a standard b+tree node:
+
+----
++--------+---------+---------+---------+---------+
+| header |   key   | keys... |   ptr   | ptrs... |
++--------+---------+---------+---------+---------+
+----
+
+If the b+tree records do not overlap, performing a b+tree lookup is simple.
+Start with the root.  If it is a leaf block, perform a binary search of the
+records until we find the record with a lower key than our search key.  If the
+block is a node block, perform a binary search of the keys until we find a
+key lower than our search key, then follow the pointer to the next block.
+Repeat until we find a record.
+
+However, if b+tree records contain intervals and are allowed to overlap, the
+internal nodes of the b+tree become larger:
+
+----
++--------+---------+----------+---------+-------------+---------+---------+
+| header | low key | high key | low key | high key... |   ptr   | ptrs... |
++--------+---------+----------+---------+-------------+---------+---------+
+----
+
+The low keys are exactly the same as the keys in the non-overlapping b+tree.
+High keys, however, are a little different.  Recall that a record with a key
+consisting of an interval can be referenced by a number of keys.  Since the low
+key of a record indexes the low end of that key range, the high key indexes the
+high end of the key range.  Returning to the example above, the high key for
+the record describing "12-16" is 16.  The high key recorded in a b+tree node
+is the largest of the high keys of all records accessible under the subtree
+rooted by the pointer.  For a level 1 node, this is the largest high key in
+the pointed-to leaf node; for any other node, this is the largest of the high
+keys in the pointed-to node.
+
+Nodes and leaves use the same magic numbers.
+
+[[Short_Format_Btrees]]
+== Short Format B+trees
+
+Each allocation group uses a ``short format'' B+tree to index various
+information about the allocation group.  The structure is called short format
+because all block pointers are AG block numbers.  The trees use the following
+header:
+
+[source, c]
+----
+struct xfs_btree_sblock {
+     __be32                    bb_magic;
+     __be16                    bb_level;
+     __be16                    bb_numrecs;
+     __be32                    bb_leftsib;
+     __be32                    bb_rightsib;
+
+     /* version 5 filesystem fields start here */
+     __be64                    bb_blkno;
+     __be64                    bb_lsn;
+     uuid_t                    bb_uuid;
+     __be32                    bb_owner;
+     __le32                    bb_crc;
+};
+----
+
+*bb_magic*::
+Specifies the magic number for the per-AG B+tree block.
+
+*bb_level*::
+The level of the tree in which this block is found.  If this value is 0, this
+is a leaf block and contains records; otherwise, it is a node block and
+contains keys and pointers.
+
+*bb_numrecs*::
+Number of records in this block.
+
+*bb_leftsib*::
+AG block number of the left sibling of this B+tree node.
+
+*bb_rightsib*::
+AG block number of the right sibling of this B+tree node.
+
+*bb_blkno*::
+FS block number of this B+tree block.
+
+*bb_lsn*::
+Log sequence number of the last write to this block.
+
+*bb_uuid*::
+The UUID of this block, which must match either +sb_uuid+ or +sb_meta_uuid+
+depending on which features are set.
+
+*bb_owner*::
+The AG number that this B+tree block ought to be in.
+
+*bb_crc*::
+Checksum of the B+tree block.
+
+[[Long_Format_Btrees]]
+== Long Format B+trees
+
+Long format B+trees are similar to short format B+trees, except that their
+block pointers are 64-bit filesystem block numbers instead of 32-bit AG block
+numbers.  Because of this, long format b+trees can be (and usually are) rooted
+in an inode's data or attribute fork.  The nodes and leaves of this B+tree use
+the +xfs_btree_lblock+ declaration:
+
+[source, c]
+----
+struct xfs_btree_lblock {
+     __be32                    bb_magic;
+     __be16                    bb_level;
+     __be16                    bb_numrecs;
+     __be64                    bb_leftsib;
+     __be64                    bb_rightsib;
+
+     /* version 5 filesystem fields start here */
+     __be64                    bb_blkno;
+     __be64                    bb_lsn;
+     uuid_t                    bb_uuid;
+     __be64                    bb_owner;
+     __le32                    bb_crc;
+     __be32                    bb_pad;
+};
+----
+
+*bb_magic*::
+Specifies the magic number for the btree block.
+
+*bb_level*::
+The level of the tree in which this block is found.  If this value is 0, this
+is a leaf block and contains records; otherwise, it is a node block and
+contains keys and pointers.
+
+*bb_numrecs*::
+Number of records in this block.
+
+*bb_leftsib*::
+FS block number of the left sibling of this B+tree node.
+
+*bb_rightsib*::
+FS block number of the right sibling of this B+tree node.
+
+*bb_blkno*::
+FS block number of this B+tree block.
+
+*bb_lsn*::
+Log sequence number of the last write to this block.
+
+*bb_uuid*::
+The UUID of this block, which must match either +sb_uuid+ or +sb_meta_uuid+
+depending on which features are set.
+
+*bb_owner*::
+The AG number that this B+tree block ought to be in.
+
+*bb_crc*::
+Checksum of the B+tree block.
+
+*bb_pad*::
+Pads the structure to 64 bytes.
diff --git a/design/XFS_Filesystem_Structure/data_extents.asciidoc b/design/XFS_Filesystem_Structure/data_extents.asciidoc
index a39045d..4f1109b 100644
--- a/design/XFS_Filesystem_Structure/data_extents.asciidoc
+++ b/design/XFS_Filesystem_Structure/data_extents.asciidoc
@@ -203,9 +203,10 @@ u.bmx[0-1] = [startoff,startblock,blockcount,extentflag]
 [[Btree_Extent_List]]
 == B+tree Extent List
 
-To manage extent maps that cannot fit in the inode fork area, XFS uses long
-format B+trees.  The root node of the B+tree is stored in the inode's data
-fork.  All block pointers for extent B+trees are 64-bit absolute block numbers.
+To manage extent maps that cannot fit in the inode fork area, XFS uses
+xref:Long_Format_Btrees[long format B+trees].  The root node of the B+tree is
+stored in the inode's data fork.  All block pointers for extent B+trees are
+64-bit filesystem block numbers.
 
 For a single level B+tree, the root node points to the B+tree's leaves. Each
 leaf occupies one filesystem block and contains a header and an array of extents
@@ -242,69 +243,8 @@ standard 256 byte inode before a new level of nodes is added between the root
 and the leaves. This will be less if +di_forkoff+ is not zero (i.e. attributes
 are in use on the inode).
 
-[[Long_Format_Btrees]]
-=== Long Format B+trees
-
-The subsequent nodes and leaves of the B+tree use the +xfs_btree_lblock+
-declaration:
-
-[source, c]
-----
-struct xfs_btree_lblock {
-     __be32                    bb_magic;
-     __be16                    bb_level;
-     __be16                    bb_numrecs;
-     __be64                    bb_leftsib;
-     __be64                    bb_rightsib;
-
-     /* version 5 filesystem fields start here */
-     __be64                    bb_blkno;
-     __be64                    bb_lsn;
-     uuid_t                    bb_uuid;
-     __be64                    bb_owner;
-     __le32                    bb_crc;
-     __be32                    bb_pad;
-};
-----
-
-*bb_magic*::
-Specifies the magic number for the BMBT block: ``BMAP'' (0x424d4150).
-On a v5 filesystem, this is ``BMA3'' (0x424d4133).
-
-*bb_level*::
-The level of the tree in which this block is found.  If this value is 0, this
-is a leaf block and contains records; otherwise, it is a node block and
-contains keys and pointers.
-
-*bb_numrecs*::
-Number of records in this block.
-
-*bb_leftsib*::
-FS block number of the left sibling of this B+tree node.
-
-*bb_rightsib*::
-FS block number of the right sibling of this B+tree node.
-
-*bb_blkno*::
-FS block number of this B+tree block.
-
-*bb_lsn*::
-Log sequence number of the last write to this block.
-
-*bb_uuid*::
-The UUID of this block, which must match either +sb_uuid+ or +sb_meta_uuid+
-depending on which features are set.
-
-*bb_owner*::
-The AG number that this B+tree block ought to be in.
-
-*bb_crc*::
-Checksum of the B+tree block.
-
-*bb_pad*::
-Pads the structure to 64 bytes.
-
-// force-split the lists
+* The magic number for a BMBT block is ``BMAP'' (0x424d4150).  On a v5
+filesystem, this is ``BMA3'' (0x424d4133).
 
 * For intermediate nodes, the data following +xfs_btree_lblock+ is the same as
 the root node: array of +xfs_bmbt_key+ value followed by an array of
diff --git a/design/XFS_Filesystem_Structure/xfs_filesystem_structure.asciidoc b/design/XFS_Filesystem_Structure/xfs_filesystem_structure.asciidoc
index f580aab..62502b3 100644
--- a/design/XFS_Filesystem_Structure/xfs_filesystem_structure.asciidoc
+++ b/design/XFS_Filesystem_Structure/xfs_filesystem_structure.asciidoc
@@ -62,6 +62,8 @@ Global Structures
 
 :leveloffset: 1
 
+include::btrees.asciidoc[]
+
 include::allocation_groups.asciidoc[]
 
 include::journaling_log.asciidoc[]

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

  parent reply	other threads:[~2016-08-25 23:27 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-08-25 23:26 [PATCH v8 0/7] xfs-docs: reorganize chapters, document rmap and reflink Darrick J. Wong
2016-08-25 23:27 ` [PATCH 1/7] journaling_log: fix some typos in the section about EFDs Darrick J. Wong
2016-08-25 23:27 ` [PATCH 2/7] xfsdocs: document known testing procedures Darrick J. Wong
2016-08-25 23:27 ` [PATCH 3/7] xfsdocs: update the on-disk format with changes for Linux 4.5 Darrick J. Wong
2016-08-25 23:27 ` Darrick J. Wong [this message]
2016-08-25 23:27 ` [PATCH 5/7] xfsdocs: reverse-mapping btree documentation Darrick J. Wong
2016-08-25 23:27 ` [PATCH 6/7] xfsdocs: document refcount btree and reflink Darrick J. Wong
2016-08-25 23:27 ` [PATCH 7/7] xfsdocs: document the realtime reverse mapping btree Darrick J. Wong
2016-09-08  1:38   ` Dave Chinner
2016-09-08  2:03     ` Darrick J. Wong

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=147216764310.32447.6110430430725995824.stgit@birch.djwong.org \
    --to=darrick.wong@oracle.com \
    --cc=david@fromorbit.com \
    --cc=linux-xfs@vger.kernel.org \
    --cc=xfs@oss.sgi.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.